Catallaxy Services | Getting Started with Spark Streaming

ABSTRACT

With the broader adoption of message brokers like Apache Kafka, as well as distributed, message-sending architectures, the need for tools that can process vast amounts of data quickly has become critical. Several competing products fill this need, including Spark Streaming. In this talk, we will look at the use cases for stream processing and see how Spark's concept of distributed batch processing reduces to micro-batches in the streaming case. We will cover the two streaming models for Spark, DStreams and Structured Streaming with DataFrames, and will see examples of streaming applications in Scala and F#.

ADDITIONAL MEDIA

No recordings or additional media are available for this talk.

SLIDES

Click here to access the slides for this presentation.

The slides are licensed under Creative Commons Attribution-ShareAlike.

DEMO CODE

Click here to access demo code for this presentation.

The source code is licensed under the terms offered by the GPL.

LINKS & FURTHER INFO

Additional Resources

Arush Kharbanda provides a guide to Spark Streaming.
Data Flair has a tutorial for beginners.
Sandeep Dayananda walks us through a sentiment analysis demo using Spark Streaming.
Yaroslav Tkachenko contrasts DStreams and DataFrames.
The Spark Streaming Programming Guide is another good starting point for learning.
The Microsoft.Spark project includes several examples in C# and F# for us.
Kundan Kumarr walks us through an example of combining Apache Kafka and Cassandra via Spark Streaming.
Microsoft Docs gives a good account of Structured Streaming using DataFrames.
The Databricks documentation also gives a good account of Structured Streaming.

The Databricks documentation has things to consider before going to production with a Structured Streaming product.
Spark By Examples shows how to read data from a TCP socket with Spark Streaming.
Sarfaraz Hussain has a four-part series on Spark Structured Streaming. Part one is an introduction to the topic. Part two covers some of the basics of query structure and checkpointing. Part three introduces the concept of stateful streaming. Part four covers late-arriving data.
Ligh-rain explains the two window types for Spark Streaming: tumbling and sliding. This paper treats sliding windows and hopping windows as the same thing, but there is a minor difference between the two and, properly speaking, Spark Streaming windows are sliding, not hopping.
Microsoft explains different window types. This is specifically for Azure Stream Analytics, so Spark Streaming doesn't support all of these, but it does give you a good idea of the sorts of windows you might find in products.
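
To make the window discussion above a little more concrete, here is a minimal sketch in plain Python (no Spark required) of how a single event timestamp maps to windows. It assumes integer timestamps, and it uses "sliding" in the sense of the Spark documentation: a window length plus a slide interval. The function names tumbling_window and sliding_windows are hypothetical helpers written for this illustration, not part of any Spark API, and the logic is a simplification rather than Spark's actual implementation.

```python
def tumbling_window(ts, length):
    """Return the single [start, end) window of the given length that contains ts."""
    start = (ts // length) * length
    return (start, start + length)


def sliding_windows(ts, length, slide):
    """Return every [start, end) window that contains ts.

    Assumption for this sketch: window starts fall on multiples of `slide`.
    """
    # Latest window start at or before ts.
    start = (ts // slide) * slide
    windows = []
    # Walk backwards while the window [start, start + length) still covers ts.
    while start > ts - length:
        windows.append((start, start + length))
        start -= slide
    return sorted(windows)


# Hypothetical event timestamps, in seconds.
for ts in [1, 4, 7, 12]:
    print(ts, tumbling_window(ts, 10), sliding_windows(ts, 10, 5))
```

With a window length of 10 and a slide of 5, each event lands in exactly one tumbling window but in two overlapping sliding windows. When the slide equals the window length, the two functions agree.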