Spark
Resilient Distributed Dataset (RDD)
RDDs allow us to write programs in terms of operations on distributed datasets.
Learned in CS451.
How is an RDD different from HDFS?
HDFS is a distributed file system designed to store data across a cluster (so primarily for data storage).
RDD is a processing abstraction for performing distributed computation, optimized for in-memory operations.
- HDFS provides the persistent data layer, while RDD offers the high-performance computation layer.
This is why you load data from HDFS into an RDD:
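A minimal sketch of that split, assuming a running SparkContext; the HDFS address and path below are placeholders:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# HDFS keeps the file durably across the cluster (storage layer);
# textFile just builds an RDD over those blocks (computation layer).
lines = sc.textFile("hdfs://namenode:9000/data/logs.txt")

# Nothing is read yet: the RDD is only a recipe until an action runs.
errors = lines.filter(lambda line: "ERROR" in line)
print(errors.count())
```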
Properties:
- Collections of objects spread across a cluster, stored in RAM or on disk
- Built through parallel transformations
- Automatically rebuilt on failure
Operations
- Transformations (e.g. map, filter, groupBy)
- Actions (e.g. count, collect, save)
Always cache the RDD at a branching point, i.e. an RDD that several downstream computations reuse, so it isn't recomputed for every branch (see the sketch below).
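A sketch of why: here `errors` feeds two different actions, so caching it avoids re-running the filter (data and names are made up):

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()

logs = sc.parallelize([
    "ERROR disk full", "INFO started", "ERROR timeout", "WARN slow",
])

# Branching point: `errors` is reused by both computations below.
errors = logs.filter(lambda line: line.startswith("ERROR")).cache()
# errors.persist(StorageLevel.MEMORY_AND_DISK) would also allow spilling to disk.

total = errors.count()                                      # computes and caches `errors`
timeouts = errors.filter(lambda l: "timeout" in l).count()  # reuses the cached data
print(total, timeouts)
```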
Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data.
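A small sketch of inspecting that lineage with toDebugString(), assuming a local PySpark session:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize(range(100), 4)
evens = nums.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# The lineage (parallelize -> map -> filter) is exactly what Spark replays
# to rebuild a lost partition if a worker fails.
print(evens.toDebugString())
```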
Creating RDDs
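The two usual ways, sketched in PySpark (paths are placeholders):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# From a driver-side collection, split into 4 partitions.
nums = sc.parallelize([1, 2, 3, 4, 5], 4)

# From a file on local disk, HDFS, S3, ... (placeholder path).
lines = sc.textFile("data/input.txt")
```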
Basic Transformations
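For example (these are all lazy: they only define new RDDs):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
nums = sc.parallelize([1, 2, 3, 4])

squares = nums.map(lambda x: x * x)           # [1, 4, 9, 16]
evens = squares.filter(lambda x: x % 2 == 0)  # [4, 16]
pairs = nums.flatMap(lambda x: [x, x * 10])   # [1, 10, 2, 20, 3, 30, 4, 40]
```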
Basic Actions
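For example (actions actually run the job and return results or write output; the output path is a placeholder):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
nums = sc.parallelize([1, 2, 3, 4])

nums.collect()                   # [1, 2, 3, 4]  (brings everything to the driver)
nums.take(2)                     # [1, 2]
nums.count()                     # 4
nums.reduce(lambda x, y: x + y)  # 10
nums.saveAsTextFile("out/nums")  # one output file per partition
```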
Counts
nums.count()
→ each worker counts its own partition, and the driver just adds the partial counts together
len(nums.collect())
→ this ships every element to the driver, which then counts them itself (expensive for large RDDs)
Working with key-value pairs:
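A sketch of a pair RDD, i.e. an RDD of (key, value) tuples; the `pets` data is made up to match the note below:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y).collect()  # [('cat', 3), ('dog', 1)]
pets.groupByKey().mapValues(list).collect()     # [('cat', [1, 2]), ('dog', [1])]
pets.sortByKey().collect()                      # [('cat', 1), ('cat', 2), ('dog', 1)]
```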
NOTE: In Scala you can write pets.reduceByKey(_ + _) instead of the PySpark pets.reduceByKey(lambda x, y: x + y).
This is how you can do word count in 2 lines of code, as opposed to the wordy MapReduce implementation!
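A sketch of that word count (the input path is a placeholder):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

lines = sc.textFile("data/input.txt")  # placeholder path

counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda x, y: x + y)

print(counts.take(5))
```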
Other Key-Value Operations
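For example, join and cogroup on two made-up pair RDDs:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

visits = sc.parallelize([("index.html", "1.2.3.4"),
                         ("about.html", "3.4.5.6"),
                         ("index.html", "1.3.3.1")])
names = sc.parallelize([("index.html", "Home"), ("about.html", "About")])

# join: inner join on the key -> (key, (left_value, right_value))
visits.join(names).collect()
# e.g. [('index.html', ('1.2.3.4', 'Home')), ('index.html', ('1.3.3.1', 'Home')),
#       ('about.html', ('3.4.5.6', 'About'))]

# cogroup: groups all values from both RDDs per key
visits.cogroup(names).mapValues(lambda v: (list(v[0]), list(v[1]))).collect()
# e.g. [('index.html', (['1.2.3.4', '1.3.3.1'], ['Home'])),
#       ('about.html', (['3.4.5.6'], ['About']))]
```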
Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for the number of tasks.
If a file is split into 16 blocks on HDFS and you open it with textFile(PathString, 10), you’ll still get 16 partitions, not 10: the second argument is only a minimum number of partitions, and since input splits don’t span HDFS blocks you end up with at least one partition per block.
What is the default level of parallelism?
The destination RDD gets the same number of partitions as its source RDD.
- Ex: a reduceByKey on an RDD with 8 partitions will result in another RDD with 8 partitions.
If you specify spark.default.parallelism, Spark uses that as the default instead! (For shuffles only, not for parallelize, textFile, or other base RDDs.)
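A sketch of both behaviours (exact defaults can depend on the cluster config):

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

words = sc.parallelize([("to", 1), ("be", 1), ("or", 1), ("to", 1)], 8)

# Default: the result keeps the source's 8 partitions.
counts = words.reduceByKey(lambda x, y: x + y)
print(counts.getNumPartitions())   # typically 8

# Optional second parameter: run the shuffle with 4 reduce tasks instead.
counts4 = words.reduceByKey(lambda x, y: x + y, 4)
print(counts4.getNumPartitions())  # 4
```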
Since Spark is lazily evaluated, transformations only build up an execution graph (DAG) under the hood; nothing runs until an action is called, which lets Spark optimize the whole job (e.g. pipeline map and filter into one pass).
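A tiny sketch of that laziness:

```python
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

nums = sc.parallelize(range(1_000_000))

# These return immediately: they only extend the DAG of transformations.
squares = nums.map(lambda x: x * x)
big = squares.filter(lambda x: x > 10)

# Only this action triggers a job; Spark can pipeline the map and filter
# together within each partition.
print(big.count())
```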