Relational Operations in MapReduce

Learned in CS451.

How do you implement the operators of Relational Algebra when your data lives across a cluster and your only primitives are map and reduce?

UNION

Hadoop MapReduce has a MultipleInputs class: feed both relations into the same job and have the mappers emit every tuple unchanged. With no reducer, this gives a bag union (duplicates are kept).
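If set semantics are wanted instead, one option is to emit each tuple as the key and let a reducer output each distinct key once. The following is a minimal plain-Python simulation of that idea, not Hadoop code. The names map_union, reduce_union and run_union are hypothetical, and the dictionary stands in for the shuffle.

```python
from itertools import chain

def map_union(record):
    # Identity mapper: the whole tuple becomes the key, the value is unused.
    yield (record, None)

def reduce_union(key, values):
    # Each distinct key reaches the reducer once, so emitting it once deduplicates.
    yield key

def run_union(a, b):
    # Simulated shuffle: group mapper output by key.
    groups = {}
    for record in chain(a, b):
        for key, value in map_union(record):
            groups.setdefault(key, []).append(value)
    out = []
    for key in sorted(groups):
        out.extend(reduce_union(key, groups[key]))
    return out

A = [(1, "x"), (2, "y")]
B = [(2, "y"), (3, "z")]
union_result = run_union(A, B)
```

Dropping the reduce step and returning the mapper output directly would correspond to the map-only bag union.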

SUBTRACT (A − B)

Use MultipleInputs again:

  • Each mapper emits:
    • Key: the entire tuple, plus a tag saying which relation (which mapper) it came from
    • Value: (unused)
  • Sort so that an RHS (B) tuple arrives before any equal LHS (A) tuple.
  • Reducer:
    • Remember the last RHS tuple seen.
    • Emit an LHS tuple only if it does not equal the last RHS tuple.
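The steps above can be sketched as a plain-Python simulation of a single reducer. This is an illustration under assumptions, not Hadoop code: the tags RHS = 0 and LHS = 1 are chosen so that an ordinary sort puts B's copy first, and map_tagged, reduce_subtract and run_subtract are hypothetical names.

```python
RHS, LHS = 0, 1  # assumed tags; RHS < LHS so equal tuples sort RHS-first

def map_tagged(record, tag):
    # Key is (tuple, source tag); the value is unused.
    yield ((record, tag), None)

def reduce_subtract(sorted_keys):
    last_rhs = None  # sentinel; assumes no real tuple is None
    for record, tag in sorted_keys:
        if tag == RHS:
            # Remember the last RHS tuple seen.
            last_rhs = record
        elif record != last_rhs:
            # LHS tuple with no matching RHS tuple just before it.
            yield record

def run_subtract(a, b):
    keys = []
    for record in a:
        keys.extend(k for k, _ in map_tagged(record, LHS))
    for record in b:
        keys.extend(k for k, _ in map_tagged(record, RHS))
    keys.sort()  # stands in for the shuffle's sort phase
    return list(reduce_subtract(keys))

A = [(1, "x"), (2, "y"), (3, "z")]
B = [(2, "y"), (4, "w")]
difference = run_subtract(A, B)
```

On a real cluster with several reducers, the job would also need to partition on the tuple alone rather than on the tuple plus tag. Otherwise the A and B copies of the same tuple can land on different reducers and never meet.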

CROSS PRODUCT

See Cross Product. NO, GOD NO — don’t do this on big data.
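A quick back-of-the-envelope check shows why: the output has |A| × |B| tuples, so it grows multiplicatively with the input sizes. The snippet below is a small illustration in plain Python. The helper cross_product_size and the sample relations are hypothetical.

```python
from itertools import product

def cross_product_size(n_a, n_b):
    # Every tuple of A is paired with every tuple of B.
    return n_a * n_b

A = [(1,), (2,), (3,)]
B = [("x",), ("y",)]

# Concatenate each pair of tuples into one output tuple.
pairs = [a + b for a, b in product(A, B)]

# Two relations of a million tuples each would produce a trillion output tuples.
big = cross_product_size(10**6, 10**6)
```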

Inner Joins