CS451
Go find the latency website: "Latency Numbers Every Programmer Should Know".
https://student.cs.uwaterloo.ca/~cs451/syllabus.html
Really great teacher.
Concepts
A3:
- Need a more efficient reducer? Like emitting (A, B): C instead of A: (B, C)?
- But aggregating the counts seems like a pain. It's like doing pairs instead of stripes.
Running the reference checks
hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BuildInvertedIndex \
-input data/Shakespeare.txt -output cs451-s36gong-a3-index-shakespeare
Query
hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrieval \
-index cs451-s36gong-a3-index-shakespeare -collection data/Shakespeare.txt \
-query "outrageous fortune AND"
hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrieval \
-index cs451-s36gong-a3-index-shakespeare -collection data/Shakespeare.txt \
-query "white red OR rose AND pluck AND"
First output
Query: outrageous fortune AND
1073319 The slings and arrows of outrageous fortune
Second output
Query: white red OR rose AND pluck AND
1713274 From off this brier pluck a white rose with me.
1713428 Pluck a red rose from off this thorn with me.
1713566 I pluck this white rose with Plantagenet.
1713612 SUFFOLK. I pluck this red rose with young Somerset,
1714623 In sign whereof I pluck a white rose too.
Chapter 4: Dealing with Text
The goal is to generate text. We can use a probabilistic model of word sequences:
P(“I saw a van”) = P(“I”) x P(“saw” | “I”) x P(“a” | “I saw”) x P(“van” | “I saw a”)
But how feasible is this?
- Not very: the conditioning histories keep growing, so these probabilities can't be estimated directly. We use a smaller limit: the n-gram.
- i.e., we only condition on a limited window of previous words
Basic Idea: Probability of next word only depends on the previous (N – 1) words
- N = 1: Unigram Model – P(w1, w2, …, wk) = P(w1) P(w2) … P(wk)
- N = 2: Bigram Model – P(w1, w2, …, wk) = P(w1) P(w2 | w1) … P(wk | wk−1)
State of the art rarely goes above 5-gram.
We apply Laplace smoothing to the bigram probabilities.
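As a rough sketch of what that smoothing looks like (not the assignment code; the class and counts here are made up for illustration), add-one smoothing turns P(cur | prev) into (count(prev, cur) + 1) / (count(prev) + V), where V is the vocabulary size:

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of Laplace (add-one) smoothing for bigram probabilities.
// Assumes the counts were collected elsewhere (e.g., by a counting job).
public class BigramLaplace {
    private final Map<String, Integer> bigramCounts = new HashMap<>();  // "w1 w2" -> count
    private final Map<String, Integer> unigramCounts = new HashMap<>(); // "w1" -> count
    private final int vocabSize; // V: number of distinct word types

    public BigramLaplace(int vocabSize) {
        this.vocabSize = vocabSize;
    }

    public void addBigram(String prev, String cur) {
        bigramCounts.merge(prev + " " + cur, 1, Integer::sum);
        unigramCounts.merge(prev, 1, Integer::sum);
    }

    // P(cur | prev) = (count(prev, cur) + 1) / (count(prev) + V)
    public double probability(String prev, String cur) {
        int bigram = bigramCounts.getOrDefault(prev + " " + cur, 0);
        int unigram = unigramCounts.getOrDefault(prev, 0);
        return (bigram + 1.0) / (unigram + vocabSize);
    }
}

The +1 in the numerator guarantees unseen bigrams still get nonzero probability; the +V in the denominator keeps the distribution normalized.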
A posting is a record that associates a specific word (or term) with a document in which it appears; a term's postings list collects the set of all such documents.
Terminology:
- Inverted Index. Maps content (terms) to documents.
- Forward Index. Maps documents to content. (Seems strange to me; so a book's index is "inverted"?)
- So each word is simply mapped to the documents it appears in (see the sketch below).
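A toy in-memory version of that mapping (just a sketch of the idea, not the A3 MapReduce implementation; names are illustrative):

import java.util.*;

// Toy inverted index: term -> sorted set of document IDs.
public class TinyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // We are handed a document and its terms (the forward direction)...
    public void addDocument(int docId, String text) {
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            // ...and invert it: each term points back at the documents containing it.
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    public Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }
}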
Postings sizes follow Zipf's Law: f(k; s, N) = (1/k^s) / Σ_{n=1..N} (1/n^s), i.e., frequency is proportional to 1/k^s, where
- N is the number of elements
- k is the rank, anywhere between 1 and N
- s is the characteristic exponent
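A tiny illustration of the formula above (the parameters are arbitrary, not from the course):

// Prints the Zipf-predicted relative frequency f(k; s, N) for each rank k.
public class ZipfDemo {
    public static void main(String[] args) {
        int n = 10;       // N: number of elements
        double s = 1.0;   // characteristic exponent
        double norm = 0.0;
        for (int k = 1; k <= n; k++) norm += 1.0 / Math.pow(k, s);
        for (int k = 1; k <= n; k++) {
            double freq = (1.0 / Math.pow(k, s)) / norm;
            System.out.printf("rank %2d -> %.3f%n", k, freq);
        }
    }
}

Rank 1 gets roughly a third of the mass while the tail ranks get a few percent each, which is exactly the skew described next.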
Zipf’s Law: (also) linear in log-log space
- A few elements occur very frequently
- Many elements occur very infrequently
- Ohh, it’s like the 80-20 Rule - https://www.youtube.com/watch?v=fCn8zs912OE&t=253s&ab_channel=Vsauce
Instead of using that structure, use a linked list!!
Boolean Retrieval
Hits: Set of documents for which the query is true
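The A3 queries above are written in postfix form ("outrageous fortune AND"), so one natural way to compute the hits is a stack of postings sets: a term pushes its postings, and AND/OR pops two sets and pushes their intersection/union. This is only a sketch of the idea, not the actual BooleanRetrieval code:

import java.util.*;

// Evaluates a postfix boolean query over an index mapping term -> set of doc IDs.
public class PostfixBoolean {
    public static Set<Integer> evaluate(String query, Map<String, Set<Integer>> index) {
        Deque<Set<Integer>> stack = new ArrayDeque<>();
        for (String token : query.split("\\s+")) {
            if (token.equals("AND") || token.equals("OR")) {
                Set<Integer> right = stack.pop();
                Set<Integer> left = stack.pop();
                Set<Integer> result = new TreeSet<>(left);
                if (token.equals("AND")) result.retainAll(right); // intersection
                else result.addAll(right);                        // union
                stack.push(result);
            } else {
                // Operand: push this term's postings (empty if the term never occurs).
                stack.push(new TreeSet<>(index.getOrDefault(token, Collections.emptySet())));
            }
        }
        return stack.pop(); // the hits
    }
}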
Ranked retrieval is introduced.
- The relevance of a document is the sum of the w_i values (term weights) for the query terms it contains
Chapter 5: Graphs
How do we do MapReduce on graphs?
Data Mining: Part 1
A/B testing for ML:
- Step 1: Offline training and evaluation (holdout, cross-validation, etc.)
- Step 2: A/B testing vs. other methods
Data Mining: Part 2
Problem:
- S – Set of Objects
- D(a,b) – Distance from object a to b
- t – maximum distance threshold
Goal: Find all unordered pairs (a,b) s.t. D(a,b) ≤ t
- Find the top 42 matching images, and then fill it in
Capturing similarity is difficult
- Images
- Protein
- Text
Given high-dimensional datapoints and some distance function D(a, b): find all pairs of datapoints (a, b) s.t. D(a, b) ≤ t.
The naïve approach: just compute D(a, b) for all pairs – that's quadratic, not very big data. Magic: near-linear time – normal to want, and possible to achieve.
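For reference, the quadratic baseline is easy to write down (a sketch; the generic parameters are just for illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Naive all-pairs search: compute D(a, b) for every unordered pair - O(n^2) distance calls.
public class NaiveAllPairs {
    public static <T> List<int[]> pairsWithin(List<T> s, ToDoubleBiFunction<T, T> d, double t) {
        List<int[]> hits = new ArrayList<>();
        for (int i = 0; i < s.size(); i++) {
            for (int j = i + 1; j < s.size(); j++) {
                if (d.applyAsDouble(s.get(i), s.get(j)) <= t) {
                    hits.add(new int[] {i, j}); // record the matching pair by index
                }
            }
        }
        return hits;
    }
}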
- Use LSH
Why do we care? The core technique applies to ML!
(Scalar) LSH to find Near Neighbours
- Make a Hash Table
- Use buckets that are wide (and overlapped, so items go in multiple buckets)
- Most pairs of values with distance < c will have at least 1 bucket in common
- Most pairs with distance > c will NOT share a common bucket (see the sketch below)
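One way to realize "wide, overlapping" buckets is to hash the value in several tables, each shifted by a random offset: values closer than the bucket width usually collide in at least one table, and values farther apart than the width never do. A hedged sketch, not taken from the course materials:

import java.util.Random;

// Scalar LSH sketch: k tables, each bucketing x by width w with its own random offset.
public class ScalarLsh {
    private final double width;
    private final double[] offsets;

    public ScalarLsh(double width, int numTables, long seed) {
        this.width = width;
        this.offsets = new double[numTables];
        Random rng = new Random(seed);
        for (int i = 0; i < numTables; i++) {
            offsets[i] = rng.nextDouble() * width; // random shift in [0, width)
        }
    }

    // Bucket id of x in table i.
    public long bucket(double x, int i) {
        return (long) Math.floor((x + offsets[i]) / width);
    }

    // Candidate near-neighbours: values that share at least one bucket.
    public boolean maybeNear(double a, double b) {
        for (int i = 0; i < offsets.length; i++) {
            if (bucket(a, i) == bucket(b, i)) return true;
        }
        return false;
    }
}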
Data Mining: Part 2 (continued)
Measuring distance between text documents: Remember, one goal is detecting duplicate news stories
- Diff: Too sensitive to rearrangement
- Word Count / Bag of Words: not sensitive enough to rearrangement
- Edit Distance: too difficult to compute
- doc2vec cosine distance: (too advanced)
- N-Grams: Just right
Jaccard
How do n-grams give us a distance measure? Jaccard distance! This is used for sets.
Do we have sets? Yes: A document embedding is the set of n-grams it contains.
Here's how we compute the similarity metric: J(A, B) = |A ∩ B| / |A ∪ B|
The Jaccard distance is then d(A, B) = 1 − J(A, B).
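A minimal sketch of that computation, assuming each document is represented as its set of word-level n-grams (byte-level shingles work the same way):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Jaccard similarity/distance between documents represented as sets of word-level n-grams.
public class JaccardNgrams {
    public static Set<String> wordNgrams(String text, int n) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= words.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(words, i, i + n)));
        }
        return grams;
    }

    // J(A, B) = |A ∩ B| / |A ∪ B|
    public static double jaccardSimilarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    // d(A, B) = 1 - J(A, B)
    public static double jaccardDistance(Set<String> a, Set<String> b) {
        return 1.0 - jaccardSimilarity(a, b);
    }
}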
What n should I use?
There are a few factors:
- Depends on if you’re doing byte-level or word-level n-grams
- Depends on what size of documents
For byte-level:
- 4-5 for short documents (tweets, sms, emails)
- 8-10 for long documents (webpages, books, papers)
For word-level:
- 2-3 for short documents
- 3-5 for long documents
Min-hashing (two views)
Imagine you have a random permutation π of the rows of the characteristic matrix (i.e., of all possible n-grams).
- Do this exercise and you will understand how the signature matrix is calculated; it's pretty straightforward (see the sketch below).
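A hedged sketch of the hash-function view (random affine hashes stand in for random permutations; this is not the exercise's notation):

import java.util.Arrays;
import java.util.Random;
import java.util.Set;

// Min-hash signatures: each hash h_i simulates a random permutation; a set's signature entry i
// is the minimum h_i over its elements. Two signatures agree in a row with probability equal
// to the sets' Jaccard similarity.
public class MinHash {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a;
    private final long[] b;

    public MinHash(int numHashes, long seed) {
        Random rng = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rng.nextInt(Integer.MAX_VALUE - 1); // nonzero slope
            b[i] = rng.nextInt(Integer.MAX_VALUE);
        }
    }

    // One column of the signature matrix, for a document's set of n-grams.
    public long[] signature(Set<String> ngrams) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (String gram : ngrams) {
            long x = gram.hashCode() & 0x7fffffffL; // map the n-gram to a nonnegative int
            for (int i = 0; i < a.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // Estimated Jaccard similarity: fraction of rows where the two signature columns agree.
    public static double estimateJaccard(long[] s1, long[] s2) {
        int matches = 0;
        for (int i = 0; i < s1.length; i++) if (s1[i] == s2[i]) matches++;
        return (double) matches / s1.length;
    }
}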