MapReduce
So this existed before CUDA times.
It's a way to solve embarrassingly parallel problems.
Hemal Shah is telling me that this was one of the revolutionary ideas that enabled us to train neural networks at scale. And the stuff that the Google Brain team was doing was super duper cool.
From CS451
MapReduce is based around key-value pairs. This is a common way to break things down!
If the input is a text file:
- Key – Position of a line
- Value – Text of a line.
The programmer defines two functions:
- map: $(k_1, v_1) \to [(k_2, v_2)]$
- reduce: $(k_2, [v_2]) \to [(k_3, v_3)]$
A simple word count example
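Roughly what that looks like in Hadoop Java (my own sketch, not the actual CS451/Bespin code; class names are made up):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Mapper: (line offset, line text) -> (word, 1) for every word in the line.
  public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: (word, [1, 1, ...]) -> (word, total count).
  public static class WordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable v : values) {
        total += v.get();
      }
      sum.set(total);
      context.write(key, sum);
    }
  }
}
```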
Bigram implementations https://git.uwaterloo.ca/cs451/bespin/-/blob/master/src/main/java/io/bespin/java/mapreduce/bigram/BigramCount.java
In-Mapper Combiner (IMC)
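Rough idea, sketched for word count (my own sketch, not the Bespin version): instead of emitting (word, 1) for every token, the mapper buffers partial counts in an in-memory map and emits them once in cleanup().

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: aggregate inside the mapper, emit once per distinct word per task.
public class ImcWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      counts.merge(itr.nextToken(), 1, Integer::sum);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit the buffered counts once the mapper has seen all of its input.
    Text word = new Text();
    IntWritable count = new IntWritable();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      word.set(e.getKey());
      count.set(e.getValue());
      context.write(word, count);
    }
  }
}
```

The upside over a regular combiner is that the aggregation is guaranteed to happen; the downside is the mapper now holds state in memory, so the map has to fit.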
Counting co-occurrence (Pairs)
Term: Co-Occurrence
- $N(a, b)$: number of times word $a$ and word $b$ co-occur in some context
- E.g. how many times $a$ is followed immediately by $b$ in a sentence. The co-occurrence matrix is $|V| \times |V|$, where $V$ is the vocabulary.
How do you deal with keys that are pairs?
Two options:
- Pair – We’ll be computing individual Cells
- Stripe – We’ll be computing individual Rows
First option: combine the two words into a single pair key.
Input: Sentence
Output: ((a, b), 1) for all pairs of words a, b in the sentence.
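A rough sketch of the pairs mapper (my own simplification: the pair key is packed into a single tab-separated Text rather than a proper pair writable like Bespin uses; the reducer is the same summing reducer as word count):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Pairs approach: for every ordered pair of distinct words in the line, emit ((a, b), 1).
public class CooccurrencePairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text pair = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().trim();
    if (line.isEmpty()) return;
    String[] words = line.split("\\s+");
    for (int i = 0; i < words.length; i++) {
      for (int j = 0; j < words.length; j++) {
        if (i == j) continue;
        pair.set(words[i] + "\t" + words[j]);   // packed pair key "a<TAB>b"
        context.write(pair, ONE);
      }
    }
  }
}
```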
Disadvantage
That's a lot of pairs, to be fair (order matters). Does the combiner even do much?
- The combiner won't do much because there are N × N potential keys. Most keys will have few entries, so there are few cases where the combiner actually reduces the number of pairs.
Second Option: Use Stripes
Input: Sentence
Output: $(a, \{b_1: c_1, b_2: c_2, \ldots\})$, where:
- $a$ is a word from the input
- $b_1, b_2, \ldots$ are all words that co-occur with $a$, and $c_i$ is the number of times $a$ and $b_i$ co-occur
- $\{\ldots\}$ means a map (aka a dictionary, associative array, etc.)
- Fewer key-value pairs to send
- Combiners will do more work
- A map is a heavier object than a single count.
- More computationally intensive for the mappers and combiners.
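A sketch of the stripes version using Hadoop's built-in MapWritable as the stripe (Bespin uses its own map writable; the class names here are mine):

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CooccurrenceStripes {
  // Mapper: for each word a in the line, emit (a, stripe) where the stripe maps
  // every other word b in the line to its co-occurrence count with a.
  public static class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString().trim();
      if (line.isEmpty()) return;
      String[] words = line.split("\\s+");
      for (int i = 0; i < words.length; i++) {
        MapWritable stripe = new MapWritable();
        for (int j = 0; j < words.length; j++) {
          if (i == j) continue;
          Text neighbour = new Text(words[j]);
          IntWritable current = (IntWritable) stripe.get(neighbour);
          stripe.put(neighbour, new IntWritable(current == null ? 1 : current.get() + 1));
        }
        context.write(new Text(words[i]), stripe);
      }
    }
  }

  // Reducer (also usable as the combiner): element-wise sum of all stripes for a word.
  public static class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text key, Iterable<MapWritable> stripes, Context context)
        throws IOException, InterruptedException {
      MapWritable merged = new MapWritable();
      for (MapWritable stripe : stripes) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          Text neighbour = new Text((Text) e.getKey());   // copy: Hadoop reuses objects
          IntWritable existing = (IntWritable) merged.get(neighbour);
          int add = ((IntWritable) e.getValue()).get();
          merged.put(neighbour, new IntWritable(existing == null ? add : existing.get() + add));
        }
      }
      context.write(key, merged);
    }
  }
}
```

Note how the reducer here is doing real aggregation work (merging whole rows), which is exactly why the combiner is more useful with stripes than with pairs.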
Intuition
Stripes give better CPU utilization: each key-value pair carries a whole row of aggregated counts, so less time goes to per-pair serialization, sorting, and shuffling.
Why doesn’t it grow polynomially? Like don’t you have O(n^2) different pairs?
In a practical MapReduce setup, each mapper processes smaller subsets of the dataset (for example, one document, basket, or chunk of data at a time).
- Suppose there are n words and k sentences
- Each worker processes m = n/k words
- Each mapper does O(m^2) work.
In most real-world datasets, $m$ (the number of items per basket/document) is relatively constant or small compared to $k$, the total number of baskets/documents.
Thus, the total work done by all mappers across the dataset is roughly proportional to $k \cdot m^2$, which is linear in the number of documents.
So the trick is that the quadratic blow-up is confined to within each document, where it acts like a constant factor.
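To make that concrete (numbers made up just for scale): with $k = 10^6$ sentences of $m = 20$ words each, the mappers do about $k \cdot m^2 = 4 \times 10^8$ pair emissions in total, whereas squaring the whole corpus would be $n^2 = (2 \times 10^7)^2 = 4 \times 10^{14}$.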
Should you always use stripes?
It’s a tradeoff. “Easier to understand and implement” is NOT bad. For English words and normal sentence lengths, the stripe fits in memory easily. It won’t always work out that way.
Another Problem: Relative Frequencies
$f(b \mid a) = \dfrac{N(a, b)}{\sum_{b'} N(a, b')}$, where:
- $N(a, b)$ is the number of co-occurrences of $a$ and $b$, and
- $\sum_{b'} N(a, b')$ is the sum of $N(a, b')$ over all $b'$ (the marginal count of $a$).
Why do we want to do this? How do we make it fit into MapReduce?
#todo CHECK about the two passes thing
We want to do this, but we don't have access to freq($a$), the marginal count of $a$.
:( This is only known at the reducer level.
So the naive approach DOESN'T WORK.
We need to do something elaborate.
- The reducer works! That's because Hadoop sorts the keys. Pairs sort lexicographically by the first element, then by the second. So $(a, *)$ comes before any $(a, b)$ because the ASCII code of `*` is 42, which is lower than any letter.
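A rough sketch of that order-inversion trick (my own names, pair keys packed into tab-separated Text again; a summing combiner still works since the values are plain counts, and the partitioner below has to be set on the job so $(a, *)$ lands on the same reducer as every $(a, b)$):

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;

public class RelativeFrequencies {
  // Mapper: for each co-occurring pair, emit both ((a, b), 1) and the special
  // ((a, *), 1), which accumulates the marginal count freq(a).
  public static class RFMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text pair = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString().trim();
      if (line.isEmpty()) return;
      String[] words = line.split("\\s+");
      for (int i = 0; i < words.length; i++) {
        for (int j = 0; j < words.length; j++) {
          if (i == j) continue;
          pair.set(words[i] + "\t" + words[j]);
          context.write(pair, ONE);
          pair.set(words[i] + "\t*");          // marginal contribution for words[i]
          context.write(pair, ONE);
        }
      }
    }
  }

  // All keys with the same left word must go to the same reducer,
  // otherwise the reducer handling (a, b) never sees (a, *).
  public static class LeftWordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      String left = key.toString().split("\t", 2)[0];
      return (left.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Reducer: (a, *) sorts before every (a, b), so the marginal is already known
  // by the time the individual counts arrive.
  public static class RFReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
    private double marginal = 1.0;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();

      if (key.toString().endsWith("\t*")) {
        marginal = sum;                        // remember freq(a) for the keys that follow
      } else {
        context.write(key, new DoubleWritable(sum / marginal));
      }
    }
  }
}
```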