Batch Processing
Learned in SE464.
In batch processing, the full dataset has already been collected before the job starts, and you process it in large chunks (batches) rather than record by record as it arrives.
When batch is appropriate
Hard scalability problems, e.g., building a recommendation system from Amazon order history:
- Perform a calculation using all the data from millions of customers
- Requires a lot of coordination, with the data housed on the compute nodes themselves
- Needs a distributed framework such as MapReduce or Spark
- Luckily, it’s OK if the calculation takes several hours
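
To make the recommendation example concrete, here is a minimal single-machine sketch of the map / shuffle / reduce pattern that frameworks like MapReduce distribute across nodes. It counts how often two items appear in the same order, which is one possible input to a recommender. The function names (`map_order`, `shuffle`, `reduce_counts`, `co_purchase_counts`) and the sample orders are made up for illustration; this is not how Amazon or any particular framework does it.

```python
from collections import defaultdict
from itertools import combinations


def map_order(order):
    """Map step: emit ((item_a, item_b), 1) for each pair of items in one order."""
    items = sorted(set(order))  # dedupe and sort so each pair has one canonical key
    return [(pair, 1) for pair in combinations(items, 2)]


def shuffle(mapped):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: sum the values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def co_purchase_counts(orders):
    """Run the whole batch job over every order at once."""
    mapped = [kv for order in orders for kv in map_order(order)]
    return reduce_counts(shuffle(mapped))


# Hypothetical order history: each order is a list of item names.
orders = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["pen", "book"],
]
counts = co_purchase_counts(orders)
```

In a real cluster the map step would run on the nodes that already hold each slice of the orders, and the shuffle would move the grouped pairs over the network to the reducers, which is where the coordination cost comes from.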
Batch vs Stream
Key differentiators:
- Purpose
- Data handling
- Processing requirements
Batch is for processing large volumes of data that have already been collected, where it is acceptable for results to take hours. Stream is for real-time analytics on data that arrives continuously, where low latency is required.
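
As a rough illustration of the data-handling difference, the sketch below computes the same average two ways: once over the complete dataset (batch), and once by updating a running result as each value arrives (stream). `batch_mean`, `StreamingMean`, and the sample `latencies` values are hypothetical names and data chosen for this example only.

```python
def batch_mean(values):
    """Batch: the whole dataset is available, so compute over all of it at once."""
    return sum(values) / len(values)


class StreamingMean:
    """Stream: keep a small running state and update it per incoming record."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean update; no need to store earlier values.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


latencies = [120, 80, 100, 140]

batch_result = batch_mean(latencies)

stream = StreamingMean()
running = [stream.update(v) for v in latencies]  # a result after every record
```

The batch version produces one answer after seeing everything, while the streaming version has an up-to-date answer after every record. Once all records have been seen, the two agree.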