Spark
Resilient Distributed Dataset (RDD)
Write programs in terms of operations on distributed datasets.
Properties:
- Collections of objects spread across a cluster, stored in RAM or on disk
- Built through parallel transformations
- Automatically rebuilt on failure
Cache the RDD at a branching point in the lineage, i.e. one that several downstream computations reuse, so it is not recomputed for every action.
Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data.
Creating RDDs
Basic Transformations
Basic Actions
Counts
nums.count()
→ each worker counts its own partitions, and the driver just adds up the per-worker totals
len(nums.collect())
→ this ships every element to the driver, which then counts them locally (wasteful for large datasets)