Apache Spark
Apache Spark is an open-source unified analytics engine for large-scale data processing. It was used for Aquantix AI.
Resilient Distributed Dataset (RDD) – Spark's core abstraction: a fault-tolerant collection of elements partitioned across the cluster that can be operated on in parallel.
What is Spark?
Efficient
- General execution graphs
- In-memory storage
Usable
- Rich APIs in Java, Scala, Python
- Interactive shell
Spark is lazily evaluated.
This helps Spark optimize the execution plan and load only the data that is actually needed for evaluation.
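As a minimal sketch of lazy evaluation in the PySpark shell (where sc is predefined; the file name and filter predicate are illustrative), transformations only build up a plan and nothing runs until an action is called:

    lines = sc.textFile("data.txt")                  # transformation: nothing is read yet
    errors = lines.filter(lambda l: "ERROR" in l)    # transformation: still nothing executed
    n = errors.count()                               # action: Spark now reads and filters only what it needs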
Note that these shuffle operations do NOT sort: they use in-memory hash tables for the shuffle, not sorted files. (Spark's design assumes a lot more RAM than MapReduce does.)
groupByKey
– like MapReduce's shuffle. NOT the reduce part: this is JUST the shuffle that brings the pairs for a given key to a single place. You'd then use map, flatMap, etc. to perform the reduce step itself (contrast with reduceByKey in the word-count sketch below).
reduceByKey
– like MapReduce’s shuffle + combine + reduce.
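A small word-count sketch (PySpark shell assumed, data illustrative) contrasting the two calls above: groupByKey only shuffles, so the reduce has to be done afterwards with map, while reduceByKey does shuffle, combine, and reduce in one call.

    pairs = sc.parallelize(["a", "b", "a"]).map(lambda w: (w, 1))

    # groupByKey: just the shuffle; the reduce is a separate map over the grouped values
    counts_grouped = pairs.groupByKey().map(lambda kv: (kv[0], sum(kv[1])))

    # reduceByKey: shuffle + combine + reduce in one call
    counts_reduced = pairs.reduceByKey(lambda a, b: a + b)

    print(counts_grouped.collect())   # [('a', 2), ('b', 1)] (order may vary)
    print(counts_reduced.collect())   # same result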
What does a worker do to perform reduceByKey(f(a,b))?
1. Create a hash table called HT.
2. For each (k, v) pair in the input RDD:
   a. If k is not a key in HT, associate k with v in HT.
   b. Otherwise, retrieve the old value u from HT, and replace it with f(u, v).
3. Perform a shuffle – each reducer-like worker will receive key-value pairs. It will then repeat step 2 for all received pairs (a plain-Python sketch of these steps follows).
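The following is a plain-Python sketch of those steps (not Spark's actual internals; all names are illustrative). It builds the hash table on each map-side worker and then repeats the combine on the reducer side:

    def local_combine(pairs, f):
        ht = {}                       # the hash table HT
        for k, v in pairs:
            if k not in ht:
                ht[k] = v             # first time we see k: store v
            else:
                ht[k] = f(ht[k], v)   # otherwise replace the old value with f(old, v)
        return ht

    add = lambda a, b: a + b

    # each map-side worker combines its own pairs first...
    worker1 = local_combine([("a", 1), ("b", 1), ("a", 1)], add)
    worker2 = local_combine([("a", 1), ("c", 1)], add)

    # ...then the shuffle routes each key to one reducer-like worker,
    # which repeats the same combine over everything it received
    combined = local_combine(list(worker1.items()) + list(worker2.items()), add)
    print(combined)                   # {'a': 3, 'b': 1, 'c': 1}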
aggregateByKey – a more complicated reduceByKey
aggregateByKey(zero, insert, merge)
1. Create a hash table called HT.
2. For each (k, v) pair in the input RDD:
   a. If k is not a key in HT, associate k with insert(zero, v) in HT.
   b. Otherwise, retrieve the accumulator u from HT, and replace it with insert(u, v).
3. Perform a shuffle – each reducer-like worker will receive key-value pairs. The third parameter, merge, is used to combine accumulators.
(There's also combineByKey, which is the same except that instead of a zero-value you give it a function that creates an accumulator out of a single value v. Technically combineByKey is the only “real” function: aggregateByKey(zero, insert, merge) calls combineByKey(lambda v: insert(zero, v), insert, merge), and reduceByKey(f) calls combineByKey(identity, f, f). Both equivalences are sketched below.)
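A short PySpark sketch of these equivalences (shell assumed, data illustrative), building a (count, sum) accumulator per key:

    pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])

    zero = (0, 0)                                           # empty accumulator: (count, sum)
    insert = lambda acc, v: (acc[0] + 1, acc[1] + v)        # fold one value into an accumulator
    merge = lambda u1, u2: (u1[0] + u2[0], u1[1] + u2[1])   # combine two accumulators

    agg = pairs.aggregateByKey(zero, insert, merge)

    # the same thing via combineByKey: its first argument builds an
    # accumulator from a single value instead of starting from zero
    comb = pairs.combineByKey(lambda v: insert(zero, v), insert, merge)

    print(agg.collect())    # [('a', (2, 3)), ('b', (1, 3))] (order may vary)
    print(comb.collect())   # same result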
SparkContext
- Main entry point to Spark functionality
- Available in the shell as the variable sc
- In standalone programs, you'd make your own (sketched below)
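A minimal standalone-program sketch (the app name and local master are arbitrary choices for illustration):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("MyApp").setMaster("local[*]")
    sc = SparkContext(conf=conf)      # your own SparkContext instead of the shell's

    rdd = sc.parallelize(range(10))
    print(rdd.sum())                  # 45

    sc.stop()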
Setting the Level of Parallelism
All the pair-RDD operations take an optional second parameter that sets the number of tasks (partitions) to use for the shuffle.
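For example (shell assumed; the value 8 is arbitrary), reduceByKey accepts the number of partitions for the shuffle as its second argument:

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    counts = pairs.reduceByKey(lambda a, b: a + b, 8)   # shuffle into 8 tasks/partitions
    print(counts.getNumPartitions())                    # 8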