Relational Operations in MapReduce
Learned in CS451.
How do you implement the operators of Relational Algebra when your data lives across a cluster and your only primitives are map and reduce?
UNION
Hadoop MapReduce has a MultipleInputs class: feed both relations into the same job and the mappers emit every tuple unchanged. That gives bag semantics (UNION ALL); a set union needs a reduce step to drop duplicates.
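A minimal sketch of the idea in plain Python rather than Hadoop, simulating the job in memory. The names `map_identity`, `union_all`, and `union_distinct` are made up for illustration, and the sort plus `groupby` is only a stand-in for the framework's shuffle.

```python
from itertools import groupby

def map_identity(relation):
    """Mapper: emit each tuple as the key; the value is unused."""
    for t in relation:
        yield t, None

def union_all(a, b):
    """Map-only job over both inputs: every tuple passes through unchanged."""
    return [key for rel in (a, b) for key, _ in map_identity(rel)]

def union_distinct(a, b):
    """Add a reduce step: group equal tuples together and emit each once."""
    shuffled = sorted(union_all(a, b))  # stand-in for shuffle and sort
    return [key for key, _ in groupby(shuffled)]
```

For example, `union_all([(1,), (2,)], [(2,), (3,)])` keeps both copies of `(2,)`, while `union_distinct` on the same inputs keeps one.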
SUBTRACT (A − B)
Use MultipleInputs again:
- Each mapper emits:
  - Key: the entire tuple, plus a tag for which relation (mapper) it came from
  - Value: (unused)
- Sort so that RHS (B) tuples come before equal LHS (A) tuples; partition on the tuple alone so equal tuples reach the same reducer.
- Reducer:
  - Remember the last RHS tuple seen.
  - Emit an LHS tuple only if it does not equal the last RHS tuple.
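The steps above can be sketched in plain Python, again simulating the job in memory instead of running Hadoop. The tags `RHS = 0` and `LHS = 1` and all function names are assumptions for illustration; the tag values are chosen so that an ordinary sort on (tuple, tag) puts RHS tuples first. A single `sorted` call stands in for the shuffle, so this models one reducer.

```python
RHS, LHS = 0, 1  # hypothetical source tags; RHS < LHS so B sorts before equal A

def map_relation(relation, tag):
    """Mapper: key is (tuple, source tag); the value is unused."""
    for t in relation:
        yield (t, tag), None

def shuffle(pairs):
    """Stand-in for the framework's sort: by tuple first, then by tag."""
    return sorted(pairs, key=lambda kv: kv[0])

def reduce_subtract(sorted_pairs):
    """Reducer: remember the last RHS tuple, emit LHS tuples that differ."""
    last_rhs = None
    for (t, tag), _ in sorted_pairs:
        if tag == RHS:
            last_rhs = t
        elif t != last_rhs:
            yield t

def subtract(a, b):
    """Compute A - B by tagging both inputs and running one reduce pass."""
    pairs = list(map_relation(a, LHS)) + list(map_relation(b, RHS))
    return list(reduce_subtract(shuffle(pairs)))
```

With `a = [(1, 'x'), (2, 'y'), (3, 'z')]` and `b = [(2, 'y'), (4, 'w')]`, the reducer sees `(2, 'y')` from B immediately before the equal tuple from A, so only `(1, 'x')` and `(3, 'z')` are emitted. Duplicate tuples in A that have no match in B are emitted once per copy.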
CROSS PRODUCT
See Cross Product. NO, GOD NO — don’t do this on big data.
Inner Joins


