Batch Processing

Learned in SE464.

In batch processing, the full input data set already exists before the job starts, and you process it in large chunks (batches) rather than one record at a time as it arrives.

When batch is appropriate

Hard scalability problems, e.g., building a recommendation system from Amazon order history:

  • Perform a calculation using all the data from millions of customers
  • Requires lots of coordination, and the data has to be housed on the compute nodes
  • Typically needs a distributed framework such as MapReduce or Spark
  • Luckily, it’s OK if the calculation takes several hours
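
To make the MapReduce idea concrete, here is a minimal single-machine sketch of the map, shuffle, and reduce steps, counting which items are bought together. The order data and the function names (map_order, shuffle, reduce_counts, co_purchase_counts) are made up for illustration; a real job would run on a framework like MapReduce or Spark across many nodes rather than in one Python process.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical order history: each order is the list of items bought together.
ORDERS = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["pen", "book"],
]


def map_order(order):
    """Map step: emit ((item_a, item_b), 1) for every item pair in one order."""
    for pair in combinations(sorted(set(order)), 2):
        yield pair, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one item pair."""
    return key, sum(values)


def co_purchase_counts(orders):
    """Run map, shuffle, and reduce over the whole data set in one pass."""
    emitted = (kv for order in orders for kv in map_order(order))
    return dict(reduce_counts(k, v) for k, v in shuffle(emitted).items())


counts = co_purchase_counts(ORDERS)
print(counts)
```

In a distributed setting the map calls would run in parallel on the nodes that hold each slice of the orders, and the shuffle is where most of the coordination cost shows up.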

Batch vs Stream

Key differentiators:

  • Purpose
  • Data handling
  • Processing requirements

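Purpose: batch is for working through large data volumes, stream is for real-time analytics. Data handling: batch takes large data in batches, stream takes a continuous flow of data. Processing requirements: stream needs low latency, while batch can tolerate long run times.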
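
A small sketch of the difference in data handling, assuming a made-up list of order amounts: the batch version needs the whole data set before it produces one answer, while the stream version updates its answer after every record. The names batch_total and stream_totals are hypothetical, not taken from any framework.

```python
from typing import Iterable, Iterator, List


def batch_total(amounts: List[float]) -> float:
    """Batch: all the data is available up front; produce one result at the end."""
    return sum(amounts)


def stream_totals(amounts: Iterable[float]) -> Iterator[float]:
    """Stream: handle each record as it arrives and emit an updated result."""
    running = 0.0
    for amount in amounts:
        running += amount
        yield running


# Hypothetical order amounts.
amounts = [10.0, 5.0, 2.5]

batch_result = batch_total(amounts)
stream_results = list(stream_totals(iter(amounts)))

print(batch_result)
print(stream_results)
```

The final stream value matches the batch result, but the stream version had a usable answer after the first record, which is the low-latency property.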