Spark

Resilient Distributed Dataset (RDD)

Write programs in terms of operations on distributed datasets.

Properties:

  • Collections of objects spread across a cluster, stored in RAM or on disk
  • Built through parallel transformations
  • Automatically rebuilt on failure
lines = sc.textFile("hdfs://...")
errors = lines.filter(lambda s: s.startswith("ERROR"))
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache() # cache so both queries below reuse the parsed messages
 
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()

Always cache the RDD at the branching point: here messages feeds both queries, so caching it avoids re-reading and re-parsing the log file for the second count.
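A minimal sketch of the full pattern, assuming a SparkContext named sc and a hypothetical log path; is_cached and unpersist() are standard RDD methods for checking and releasing the cache:

# Sketch only: the path and tab-separated log layout are assumptions
logs = sc.textFile("hdfs://namenode:9000/path/logs")        # hypothetical path
msgs = logs.filter(lambda s: s.startswith("ERROR")) \
           .map(lambda s: s.split("\t")[2])                 # branching point
msgs.cache()                       # mark for in-memory storage
print(msgs.is_cached)              # => True (flag set now; data stored on first action)
mysql_hits = msgs.filter(lambda s: "mysql" in s).count()    # first action fills the cache
php_hits = msgs.filter(lambda s: "php" in s).count()        # reuses cached partitions
msgs.unpersist()                   # release the cached partitions when done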

Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data.
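A small sketch of inspecting that lineage, assuming a SparkContext named sc; toDebugString() returns the chain of transformations Spark would replay to rebuild a lost partition:

nums = sc.parallelize([1, 2, 3, 4])
doubled = nums.map(lambda x: x * 2).filter(lambda x: x > 4)
# Prints the recursive dependency chain (filter <- map <- parallelize)
print(doubled.toDebugString().decode())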

Creating RDDs

# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")
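
A short sketch of controlling partitioning at creation time (assumes a SparkContext named sc; the HDFS path is hypothetical):

data = sc.parallelize(range(10), 4)       # second argument: number of partitions
print(data.getNumPartitions())            # => 4
# textFile accepts an optional minimum-partition hint
logs = sc.textFile("hdfs://namenode:9000/path/file", minPartitions=8)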

Basic Transformations

nums = sc.parallelize([1, 2, 3])
 
# Pass each element through a function
squares = nums.map(lambda x: x * x)  # => [1, 4, 9]
 
# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)  # => [4]
 
# Map each element to zero or more others
nums.flatMap(lambda x: range(x))
# => [0, 0, 1, 0, 1, 2]
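
A short sketch chaining the three transformations above (assumes a SparkContext named sc); note that nothing runs until an action such as collect() is called:

words = sc.parallelize(["spark is fast", "rdds are lazy"])
tokens = words.flatMap(lambda line: line.split(" "))   # one element per word
short = tokens.filter(lambda w: len(w) <= 4)           # keep only short words
upper = short.map(lambda w: w.upper())                 # transform each word
# Nothing has executed yet: transformations are lazy until an action runs
print(upper.collect())   # => ['IS', 'FAST', 'RDDS', 'ARE', 'LAZY']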

Basic Actions

nums = sc.parallelize([1, 2, 3])
# Retrieve RDD contents as a local list (only safe for small RDDs: every element is sent to the driver)
nums.collect() # => [1, 2, 3]
 
# Return first K elements
nums.take(2) # => [1, 2]
 
# Count number of elements
nums.count() # => 3
 
# Merge elements with an associative function
nums.reduce(lambda x, y: x + y) # => 6
 
# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
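
A brief follow-up sketch (assumes a SparkContext named sc and a hypothetical, writable output path); saveAsTextFile creates a directory of part-* files, one per partition, rather than a single file:

nums = sc.parallelize([1, 2, 3, 4])
total = nums.reduce(lambda x, y: x + y)                 # => 10
largest = nums.reduce(lambda x, y: x if x > y else y)   # => 4; any associative function works
nums.saveAsTextFile("/tmp/nums_out")                    # writes /tmp/nums_out/part-00000, ...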

Counts

  • nums.count(): each worker counts the elements in its own partitions, and the driver only sums the partial counts (see the sketch below)
  • len(nums.collect()): ships every element to the driver, which then counts them locally; avoid this on large RDDs
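
A quick sketch contrasting the two (assumes a SparkContext named sc):

nums = sc.parallelize(range(1000000), 8)
distributed = nums.count()          # each of the 8 partitions is counted on the workers
local = len(nums.collect())         # all 1,000,000 elements are shipped to the driver first
assert distributed == local == 1000000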