Data Stream

Spark Streaming

Spark Streaming Context

Usually a variable called ssc.

  • Specify the time slice (batch interval) when creating it. There is no built-in default; examples commonly use 1 second.

ssc is used to create DStreams.

A DStream supports most of the RDD transformations. It also has its own transformations (such as the windowed ones) and output operations, which play the role of actions.

Discrete Streams (DStream)

DStream – sequence of RDDs representing a stream of data
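
To make the "sequence of RDDs" idea concrete, here is a minimal plain-Python sketch (no Spark involved) that cuts a timestamped stream into one batch per time slice. The function name `slice_into_batches` and the sample events are hypothetical, and an ordinary list stands in for an RDD.

```python
from collections import defaultdict

def slice_into_batches(events, batch_interval):
    """Group (timestamp_seconds, value) events into consecutive time slices.

    Element i of the result holds the values that arrived in
    [i * batch_interval, (i + 1) * batch_interval). Empty slices are kept,
    because a slice with no arrivals still produces an (empty) batch.
    """
    if not events:
        return []
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // batch_interval)].append(value)
    last = max(buckets)
    # .get() avoids creating new keys in the defaultdict for empty slices.
    return [buckets.get(i, []) for i in range(last + 1)]

events = [(0.2, "#spark"), (0.9, "#scala"), (1.1, "#spark"), (3.5, "#kafka")]
batches = slice_into_batches(events, batch_interval=1)
print(batches)
```

Each inner list corresponds to one time slice; the window operations in these notes work on runs of consecutive slices.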

We can do a window-based reduce

val tagCounts = hashtags.countByValueAndWindow(Minutes(10), Seconds(1))
 
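val tagCounts = hashtags.map(tag => (tag, 1)).reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(1))  // Scala: needs (key, value) pairs
tagCounts = hashtags.map(lambda tag: (tag, 1)).reduceByKeyAndWindow(lambda x, y: x + y, lambda x, y: x - y, 600, 1)  # Python: durations are in seconds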
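  • The second argument is an inverse reduce function.
  • How does that help?
    • The reduce runs over a sliding window. Each time the window slides, the new result can be derived from the previous one: add the values that entered the window and use the inverse function to subtract the values that left it, instead of re-reducing all 10 minutes of data every second.
    • Spark requires checkpointing to be enabled for this variant.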
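
The add-and-subtract idea can be sketched in plain Python (this is a model of the technique, not Spark's implementation). The names `windowed_counts` and `recomputed_counts` are hypothetical; each batch is a list of keys, and the window length is measured in batches rather than in time.

```python
from collections import Counter, deque

def windowed_counts(batches, window_len):
    """Yield per-key counts over the last `window_len` batches, incrementally."""
    window = deque()
    counts = Counter()
    for batch in batches:
        window.append(batch)
        counts.update(batch)              # reduce: add the batch entering the window
        if len(window) > window_len:
            for key in window.popleft():  # inverse reduce: subtract the batch leaving
                counts[key] -= 1
                if counts[key] == 0:
                    del counts[key]
        yield dict(counts)

def recomputed_counts(batches, window_len):
    """Same result, but re-reducing the whole window at every step."""
    for i in range(len(batches)):
        merged = Counter()
        for batch in batches[max(0, i - window_len + 1):i + 1]:
            merged.update(batch)
        yield dict(merged)

batches = [["#a", "#b"], ["#a"], [], ["#c", "#a"]]
incremental = list(windowed_counts(batches, window_len=2))
# Both strategies agree; the incremental one touches only two batches per step.
assert incremental == list(recomputed_counts(batches, window_len=2))
```

The incremental version does work proportional to the batches entering and leaving, which is the saving the inverse function buys when the window is much longer than the slide interval.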

Dude this is literally Sliding Window.

Assume we have a DStream[String] called “words” and we want, every minute, to print the LONGEST word seen in the last minute.

words.reduceByWindow(f1, x, y).foreachRDD(f2)

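import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Minutes

val f1 = (a: String, b: String) => if (a.length > b.length) a else b  // keep the longer word
val x = Minutes(1)  // window length
val y = Minutes(1)  // slide interval
val f2 = (rdd: RDD[String]) => rdd.take(1).foreach(println)  // print the single reduced value, if any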
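
Because the window length equals the slide interval here, the windows do not overlap. "Keep the longer word" also has no inverse (a maximum cannot be undone by subtraction), so the plain reduce form is the natural fit. A plain-Python sketch of the same logic, with no Spark involved; `longer`, `longest_per_window`, and the sample batches are hypothetical:

```python
from functools import reduce

def longer(a, b):
    """Reduce function: keep the longer of two words (a tie keeps the second)."""
    return a if len(a) > len(b) else b

def longest_per_window(batches, batches_per_window):
    """Reduce each non-overlapping group of batches to its longest word."""
    results = []
    for start in range(0, len(batches), batches_per_window):
        group = batches[start:start + batches_per_window]
        words = [w for batch in group for w in batch]
        # An empty window has nothing to reduce, so it produces no output.
        if words:
            results.append(reduce(longer, words))
    return results

batches = [["to", "stream"], ["a"], ["window", "slides"], ["by", "batch"]]
longest = longest_per_window(batches, batches_per_window=2)
print(longest)
```

In this sketch ties resolve by list order; in a distributed reduce the combination order is not fixed, so which of two equally long words wins may vary.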