Spark Streaming
Spark Streaming Context
Usually a variable called ssc.
- Specify the time slice when creating. Default is 1 second
ssc can be used to create DStreams
A DStream has the RDD transforms. It also has its own transforms and actions.
Discrete Streams (DStream)
DStream – sequence of RDDs representing a stream of data
We can do a window-based reduce
- The second argument is an inverse reduce function
- How does that help?
- Well remember that we are doing it over a window. We need to be able to subtract the value of the first item of the window, and add the last item of the window!!
Dude this is literally Sliding Window.
Assume we have an DStream[String] called “words” and we want, every minute, to print the LONGEST word seen in the last minute.
words.reduceByWindow(f1, x, y).foreachRDD(f2)
val f1 = (x: String, y: String) => {(if x.length > y.length) x else y} val x = Minutes(1) val y = Minutes(1) val f2 = (rdd: RDD[String]) => {rdd.take(1).foreach(println)}