Data Stream
Learned in CS451.
- Apache Kafka – A distributed pub-sub broker designed for resilience and availability
- Apache Flume – A distributed log aggregator (designed for resilience and availability)
- Stream-processing engines – the topic of this module
- HBase – Distributed key-value datastore, modelled after Google’s BigTable and built on top of HDFS
A Data Stream
A data stream is a sequence of items (tuples) with these properties:
- Structured
- Ordered (either a timestamp, or implicitly by arrival time)
- Continuously arriving
- High volume
- Might not be possible to store all of it
- Might not be possible to even examine all of it
The main question is: how do you process a stream of data when you don’t have all of it?
- The problem is that “reduce-like” tasks – grouping, aggregating, and joining – rely on having all of the data. How can we define them continuously?
Easy: You can just do your reduce operations over a window.
There are two implementations:
- Sliding – consecutive windows overlap; each new window advances by a step smaller than the window size
- Tumbling – consecutive windows are disjoint; each record belongs to exactly one window
Windows: define all “reduce-like” transformations over a bounded window of the stream rather than the whole stream (see the sketch below).
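A minimal Python sketch of the idea, using count-based windows and a per-window sum as the reduce-like operation. The window sizes and sample data are illustrative, not from the course:

```python
from collections import deque
from typing import Iterable, Iterator, List

def tumbling_windows(stream: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield non-overlapping windows of `size` items each."""
    window: List[int] = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield window
            window = []  # start a fresh window; no overlap

def sliding_windows(stream: Iterable[int], size: int, slide: int = 1) -> Iterator[List[int]]:
    """Yield overlapping windows; each new window advances by `slide` items."""
    window: deque = deque(maxlen=size)  # old items fall off the left automatically
    for i, item in enumerate(stream, start=1):
        window.append(item)
        if len(window) == size and (i - size) % slide == 0:
            yield list(window)

# A "reduce-like" aggregation (sum) defined per window instead of over the whole stream:
stream = [3, 1, 4, 1, 5, 9, 2, 6]
print([sum(w) for w in tumbling_windows(stream, 4)])  # [9, 22]
print([sum(w) for w in sliding_windows(stream, 4)])   # [9, 11, 19, 17, 22]
```

Note the tradeoff: tumbling windows touch each record once, while sliding windows recompute overlapping records in exchange for more frequent output.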
There are many ways to window:
- Based on an ordering attribute (timestamp) – see the sketch after this list
- Based on counts (e.g., the last X records)
- Based on explicit markers (e.g., punctuation)
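A sketch of windowing by the ordering attribute: bucket records into tumbling windows keyed by their timestamp, then aggregate per bucket. This is written batch-style for clarity; a real stream-processing engine would update the buckets incrementally as records arrive. The 10-second window and the records are assumptions for illustration:

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # assumed window length, for illustration only

def window_by_timestamp(records):
    """Group (timestamp, value) records into tumbling windows by timestamp."""
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[ts // WINDOW_SECONDS].append(value)
    # One aggregate per window: here, the sum of values in each bucket.
    return {w * WINDOW_SECONDS: sum(vs) for w, vs in sorted(buckets.items())}

records = [(1, 5), (4, 7), (12, 1), (13, 2), (27, 9)]
print(window_by_timestamp(records))  # {0: 12, 10: 3, 20: 9}
```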
When the data is too bursty (or too voluminous) to process every record, apply reservoir sampling: keep a fixed-size sample in which every item seen so far is equally likely to appear.
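A minimal sketch of the classic reservoir-sampling procedure (Vitter’s Algorithm R); the stream and sample size are illustrative:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir with the first k items
        else:
            # Replace a random slot with probability k / (i + 1); this keeps
            # every item seen so far in the sample with equal probability.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(1_000_000), 10))  # 10 items, uniformly sampled
```

The key property is that it uses O(k) memory and one pass, so it works even when the stream is too large to store or examine twice.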