CS451: Data-Intensive Distributed Computing

https://student.cs.uwaterloo.ca/~cs451/syllabus.html

For Steven: You’re probably looking for Latency numbers every programmer should know.

Really great teacher, though the course went downhill as the term progressed and became boring. The slides are pretty good.

Things to master for final:

  • LSH (locality-sensitive hashing)
  • DStream??
  • Syntax for Spark (things like reduceByKey; what does that do?)
  • Bigram frequency, boolean retrieval, PageRank: how is the logic actually implemented?? Because I forgot

Concepts

A3:

  • Need a more efficient reducer? Like emitting (A, B): C instead of A: (B, C)?
    • But aggregating the counts seems like a pain. It's like doing pairs instead of stripes.

Running the reference checks

hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BuildInvertedIndex \
   -input data/Shakespeare.txt -output cs451-s36gong-a3-index-shakespeare

Query

hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrieval \
   -index cs451-s36gong-a3-index-shakespeare -collection data/Shakespeare.txt \
   -query "outrageous fortune AND"
   
hadoop jar target/assignments-1.0.jar ca.uwaterloo.cs451.a3.BooleanRetrieval \
   -index cs451-s36gong-a3-index-shakespeare -collection data/Shakespeare.txt \
   -query "white red OR rose AND pluck AND"

First output

Query: outrageous fortune AND
1073319     The slings and arrows of outrageous fortune

Second output

Query: white red OR rose AND pluck AND
1713274     From off this brier pluck a white rose with me.
1713428     Pluck a red rose from off this thorn with me.
1713566     I pluck this white rose with Plantagenet.
1713612   SUFFOLK. I pluck this red rose with young Somerset,
1714623     In sign whereof I pluck a white rose too.

Chapter 2 and 3

Those notes are in MapReduce, RDD and Spark.

Chapter 4: Dealing with Text

The goal is to model text probabilistically: assign a probability to a word sequence (and then we can also generate text).

P(“I saw a van”) = P(“I”) x P(“saw” | “I”) x P(“a” | “I saw”) x P(“van” | “I saw a”)

But how feasible is this?

  • Not really: the full history grows with the sentence, so most histories never appear in the training data.
  • Instead, limit the context to a fixed number of preceding words: the n-gram.

Basic Idea: Probability of next word only depends on the previous (N – 1) words

  • N = 1 (unigram model): P(w1, w2, …, wk) = P(w1) P(w2) … P(wk)
  • N = 2 (bigram model): P(w1, w2, …, wk) = P(w1) P(w2 | w1) P(w3 | w2) … P(wk | wk−1)

State of the art rarely goes above 5-gram.

We apply Laplace (add-one) smoothing to the bigram probabilities: P(wi | wi−1) = (count(wi−1 wi) + 1) / (count(wi−1) + V), where V is the vocabulary size.
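A minimal Spark sketch of those counts (my own example, not the course code; assumes the Shakespeare file from A3 and plain whitespace tokenization):

// Laplace-smoothed bigram probabilities: P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("BigramSketch").getOrCreate()
val sc = spark.sparkContext

val tokens = sc.textFile("data/Shakespeare.txt")
  .map(_.toLowerCase.split("\\s+").filter(_.nonEmpty))

val bigramCounts = tokens
  .flatMap(ws => ws.sliding(2).filter(_.length == 2).map(p => ((p(0), p(1)), 1L)))
  .reduceByKey(_ + _)
val unigramCounts = tokens.flatMap(_.map(w => (w, 1L))).reduceByKey(_ + _)
val vocabSize = unigramCounts.count()                   // V

val smoothed = bigramCounts
  .map { case ((w1, w2), c12) => (w1, (w2, c12)) }
  .join(unigramCounts)                                  // pair each bigram with count(w1)
  .map { case (w1, ((w2, c12), c1)) => ((w1, w2), (c12 + 1.0) / (c1 + vocabSize)) }

This also answers the reduceByKey question above: it merges all values sharing a key with the given function, doing partial (map-side) aggregation before the shuffle, like a built-in combiner.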

A postings list is a record that associates a specific word (or term) with the set of documents in which that word appears; each individual entry in the list (a document id, possibly with a payload such as term frequency) is a posting.

Terminology:

  • Inverted Index. Maps content (terms) to documents.
  • Forward Index. Maps documents to content. (Seems strange to me, so a book’s index is “inverted”?)
  • So a word simply maps to the documents it appears in.
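A toy Spark sketch of that idea (not the A3 BuildInvertedIndex job; hypothetical (docId, text) pairs, sc from the bigram sketch above):

// Build term -> sorted list of docIds containing the term.
val docs = sc.parallelize(Seq(
  (1L, "to be or not to be"),
  (2L, "the slings and arrows of outrageous fortune")
))

val invertedIndex = docs
  .flatMap { case (docId, text) => text.split("\\s+").distinct.map(term => (term, docId)) }
  .groupByKey()
  .mapValues(_.toList.sorted)        // the postings list for each term

invertedIndex.collect().foreach(println)   // e.g. (fortune, List(2))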

Postings sizes follow Zipf’s Law: f(k; s, N) = (1 / k^s) / (Σ_{n=1..N} 1 / n^s), where

  • N is the number of elements
  • k is the rank, anywhere between 1 and N
  • s is the characteristic exponent

Zipf’s Law: (also) linear in log-log space

Instead of using that structure, use a linked list!!

Boolean Retrieval

Hits: Set of documents for which the query is true
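The postfix queries above can be evaluated with a stack of postings sets; a minimal plain-Scala sketch (toy postings built from docids in the sample output, not the A3 BooleanRetrieval code):

// Evaluate a postfix boolean query ("term term OP ...") with a stack of postings sets.
val postings: Map[String, Set[Long]] = Map(          // toy index: term -> docids
  "white" -> Set(1713274L, 1713566L, 1714623L),
  "red"   -> Set(1713428L, 1713612L),
  "rose"  -> Set(1713274L, 1713428L, 1713566L, 1713612L, 1714623L),
  "pluck" -> Set(1713274L, 1713428L, 1713566L, 1713612L, 1714623L)
)

def retrieve(query: String): Set[Long] = {
  val stack = scala.collection.mutable.Stack[Set[Long]]()
  for (token <- query.split("\\s+")) token match {
    case "AND" => val b = stack.pop(); val a = stack.pop(); stack.push(a intersect b)
    case "OR"  => val b = stack.pop(); val a = stack.pop(); stack.push(a union b)
    case term  => stack.push(postings.getOrElse(term, Set.empty[Long]))
  }
  stack.pop()                                         // the hits for the whole query
}

println(retrieve("white red OR rose AND pluck AND"))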

Ranked retrieval is introduced.

  • The relevance of a document is the sum of the w_i (term weight) values over the query terms it contains

Chapter 5: Graphs

How do we do mapreduce on graphs?

See
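One classic answer is PageRank: each iteration is a map (spread a node's rank along its out-edges) followed by a reduce (sum the incoming contributions). A simplified Spark sketch with a toy graph (no dangling-node handling; sc as in the earlier sketches):

// Simplified PageRank: 10 iterations over an adjacency-list RDD.
val links = sc.parallelize(Seq(
  (1L, Seq(2L, 3L)),
  (2L, Seq(3L)),
  (3L, Seq(1L))
)).cache()

var ranks = links.mapValues(_ => 1.0)                  // initial rank mass

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).flatMap {
    case (_, (neighbours, rank)) => neighbours.map(n => (n, rank / neighbours.size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(sum => 0.15 + 0.85 * sum)  // random jump
}

ranks.collect().foreach(println)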

Data Mining: Part 1

A/B testing for ML:

  • Step 1: Offline training and evaluation (holdout, cross-validate, etc.)
  • Step 2: A/B testing vs other methods

Data Mining: Part 2

Problem:

  • S – Set of Objects
  • D(a,b) – Distance from object a to b
  • t – maximum distance threshold

Goal: Find all unordered pairs (a,b) s.t. D(a,b) ≤ t

  • Find the top 42 matching images, and then fill it in

Capturing similarity is difficult

  • Images
  • Protein
  • Text

Given high-dimensional datapoints x1, …, xn and some distance function d(·, ·): find all pairs of datapoints (xi, xj) s.t. d(xi, xj) ≤ t.

The naïve approach: just compute d(xi, xj) for all O(n²) pairs – not very “big data”. Magic: roughly O(n) work – what we normally want, and possible to achieve.

  • Use LSH

Why do we care? The core technique applies to ML!

(Scalar) LSH to find Near Neighbours

  1. Make a Hash Table
  2. Use buckets that are wide (and overlapping, so items go in multiple buckets)
  3. Most pairs of values at distance < c will have at least 1 bucket in common
  4. Most pairs of values at distance > c will NOT share a common bucket
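A minimal sketch of the scalar case, using two grids of bucket width c offset by c/2 as the overlapping buckets (my own toy construction, not necessarily the lecture's exact scheme):

// Scalar LSH: each value lands in one bucket per grid; the grids overlap,
// so values within c/2 always share a bucket and values more than c apart never do.
val c = 10.0

def buckets(x: Double): Seq[(Int, Long)] = Seq(
  (0, math.floor(x / c).toLong),                 // grid 0
  (1, math.floor((x + c / 2) / c).toLong)        // grid 1, shifted by c/2
)

val values = Seq(3.0, 7.0, 12.0, 55.0, 58.0)     // toy data
val candidatePairs = values.combinations(2).filter {
  case Seq(a, b) => buckets(a).intersect(buckets(b)).nonEmpty
}
candidatePairs.foreach(println)                  // pairs sharing a bucket are the candidates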

  • Sets as vectors: measure similarity by the cosine distance
  • Sets as sets: measure similarity by the Jaccard distance
  • Sets as points: measure similarity by the Euclidean distance

Part 2 of data mining

Measuring distance between text documents: Remember, one goal is detecting duplicate news stories

  • Diff: Too sensitive to rearrangement
  • Word Count / Bag of Words: not sensitive enough to rearrangement
  • Edit Distance: too difficult to compute
  • doc2vec cosine distance: (too advanced)
  • N-Grams: Just right

Jaccard

How do n-grams give us a distance measure? Jaccard distance! This is used for sets.

Do we have sets? Yes: A document embedding is the set of n-grams it contains.

Here’s how we compute the similarity metric: Jaccard similarity J(A, B) = |A ∩ B| / |A ∪ B|, and the Jaccard distance is 1 − J(A, B). (Sketch below, after the n-gram size guidelines.)

What n should I use?

There are a few factors:

  • Depends on whether you’re doing byte-level or word-level n-grams
  • Depends on the size of the documents

For byte-level:

  • 4-5 for short documents (tweets, sms, emails)
  • 8-10 for long documents (webpages, books, papers)

For word-level:

  • 2-3 for short documents
  • 3-5 for long documents
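A plain-Scala sketch of the word-level version: each document becomes the set of its word n-grams, compared with Jaccard (sentences borrowed from the query output above; n = 3 per the word-level guideline):

// Word-level n-gram set per document, plus Jaccard similarity / distance.
def nGrams(text: String, n: Int): Set[Seq[String]] =
  text.toLowerCase.split("\\s+").filter(_.nonEmpty).sliding(n).map(_.toSeq).toSet

def jaccardSimilarity(a: Set[Seq[String]], b: Set[Seq[String]]): Double =
  if (a.isEmpty && b.isEmpty) 1.0
  else (a intersect b).size.toDouble / (a union b).size

val d1 = nGrams("Pluck a red rose from off this thorn with me", 3)
val d2 = nGrams("From off this brier pluck a white rose with me", 3)
println(jaccardSimilarity(d1, d2))        // similarity in [0, 1]
println(1.0 - jaccardSimilarity(d1, d2))  // Jaccard distance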

Min-Hashing
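Rough idea, as a sketch (the standard construction, which may differ in details from the lecture's): hash each set with k random hash functions and keep the minimum hash under each; the fraction of positions where two signatures agree estimates the Jaccard similarity, so we can compare short signatures instead of whole n-gram sets.

// MinHash signatures: k minima per set; matching positions estimate Jaccard similarity.
val k = 64
val seeds = (1 to k).map(_ * 451)                      // arbitrary seeds

def signature(set: Set[String]): Seq[Int] =
  seeds.map(seed => set.map(x => scala.util.hashing.MurmurHash3.stringHash(x, seed)).min)

def estimatedJaccard(sigA: Seq[Int], sigB: Seq[Int]): Double =
  sigA.zip(sigB).count { case (a, b) => a == b }.toDouble / k

val sigA = signature(Set("a", "b", "c", "d"))
val sigB = signature(Set("b", "c", "d", "e"))
println(estimatedJaccard(sigA, sigB))                  // true Jaccard is 3/5 = 0.6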

Chapter 7: Relational Data

Database Workloads

OLTP (Online Transaction Processing)

  • Most Applications:
    • E-Commerce, Banking, Reddit, etc.
  • User Facing: Must be fast, low latency, concurrent (many users)
  • Tasks: small set of common queries
  • Access: Random reads, small writes

OLAP (Online Analytical Processing)

  • BI and Data Mining
  • Back-End: Batch workloads, low concurrency
  • Tasks: Complex Analytics (Ad Hoc)
  • Access: Full Table Scans, Big Data

Two patterns, one database

To satisfy both needs, we use a data warehouse

So how do we do all these operations in mapreduce?

  • UNION: Hadoop MapReduce has a MultipleInputs class (different input paths can get their own mappers)

SUBTRACT

  • MultipleInputs again – each mapper emits:
    • Key: An Entire Tuple, plus “Which Mapper Sent Me”
    • Value: Not used
    • Sort “RHS” tuples before equal “LHS” tuples
  • Reducer
    • Remember last RHS tuple
    • Emit LHS tuples only if they do not equal the last RHS
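The same SUBTRACT can also be written in Spark by tagging each tuple with its side and grouping by the whole tuple (a grouping formulation rather than the secondary-sort trick above; sc as in the earlier sketches):

// LHS minus RHS: keep tuples that never appear on the RHS.
val lhsTuples = sc.parallelize(Seq(("alice", 1), ("bob", 2), ("carol", 3)))
val rhsTuples = sc.parallelize(Seq(("bob", 2)))

val difference = lhsTuples.map(t => (t, "L"))
  .union(rhsTuples.map(t => (t, "R")))
  .groupByKey()                                        // key = the entire tuple
  .filter { case (_, sides) => !sides.exists(_ == "R") }
  .keys

difference.collect().foreach(println)                  // (alice,1), (carol,3)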

CROSS PRODUCT

  • NO, GOD NO

Inner joins
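A sketch of the reduce-side version: group both relations by the join key, then emit the cross product of the two groups for each key (Spark's built-in join does the same thing). Toy relations below, sc as before:

// Reduce-side inner join via cogroup.
val r = sc.parallelize(Seq((1, "alice"), (2, "bob")))            // (joinKey, r-attribute)
val s = sc.parallelize(Seq((1, "eng"), (1, "math"), (3, "cs")))  // (joinKey, s-attribute)

val joined = r.cogroup(s).flatMap {
  case (key, (rVals, sVals)) =>
    for (rv <- rVals; sv <- sVals) yield (key, (rv, sv))         // cross product per key
}

joined.collect().foreach(println)   // (1,(alice,eng)), (1,(alice,math))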

Chapter 10: Mutable State

https://student.cs.uwaterloo.ca/~cs451/slides/09%20-%20Mutable%20State%20-%20Combined.pdf

Chapter 9

Bloom Filter
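As a sketch of the idea (the standard textbook Bloom filter, not the lecture's code): a bit array plus k hash functions; membership tests can give false positives but never false negatives, which is why it works as a compact pre-filter.

// Minimal Bloom filter: set k bit positions per item; a query checks all k.
class BloomFilter(numBits: Int, numHashes: Int) {
  private val bits = new java.util.BitSet(numBits)

  private def positions(item: String): Seq[Int] =
    (0 until numHashes).map { seed =>
      Math.floorMod(scala.util.hashing.MurmurHash3.stringHash(item, seed), numBits)
    }

  def add(item: String): Unit = positions(item).foreach(p => bits.set(p))
  def mightContain(item: String): Boolean = positions(item).forall(p => bits.get(p))
}

val bf = new BloomFilter(numBits = 1024, numHashes = 3)
bf.add("outrageous"); bf.add("fortune")
println(bf.mightContain("fortune"))   // true
println(bf.mightContain("slings"))    // almost certainly false (small false-positive rate)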