Spark started as a research project at the University of California, Berkeley's Algorithms, Machines, and People Lab (AMPLab) in 2009. The project's goal was to develop in-memory cluster computing, avoiding MapReduce's heavy reliance on disk I/O.
The first open source release of Spark came in 2010, concurrent with a paper from Matei Zaharia, et al.
In 2012, Zaharia, et al. released a paper on Resilient Distributed Datasets.
The Resilient Distributed Dataset (RDD) forms the core of Apache Spark. It is:
Resilient: fault-tolerant, able to rebuild lost partitions from lineage information.
Distributed: partitioned across the nodes of a cluster so work happens in parallel.
A Dataset: a collection of data records.
Add all of this together and you have the key component behind Spark.
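As a quick sketch of the idea, here is a minimal RDD example in Scala; the numbers and the transformation are arbitrary choices for illustration.

import org.apache.spark.sql.SparkSession

// Build a local SparkSession; the app name and local[*] master are illustrative choices.
val spark = SparkSession.builder()
  .appName("RDDExample")
  .master("local[*]")
  .getOrCreate()

// Create an RDD from a small in-memory collection.
val numbers = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations such as map are lazy; nothing executes yet.
val doubled = numbers.map(_ * 2)

// collect is an action: it triggers execution and returns results to the driver.
println(doubled.collect().mkString(", "))  // 2, 4, 6, 8, 10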
Conceptually, Spark Streaming allows us to work with Resilient Distributed Datasets over time.
DStreams are simply time-aware RDDs. Instead of using backward-looking historical data, we use forward-looking near-present data.
To maximize performance, Spark Streaming waits a short, configurable interval (the batch interval) and builds a microbatch; this reduces per-record processing overhead by packing multiple records into each DStream batch.
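A minimal DStream sketch in Scala, assuming a text source on localhost port 9999 and a 5-second batch interval (both arbitrary choices):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each microbatch covers 5 seconds of incoming data.
val conf = new SparkConf().setAppName("DStreamExample").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

// Every batch interval produces one RDD of the lines read from the socket.
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()  // Print the number of records in each microbatch.

ssc.start()
ssc.awaitTermination()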
With Apache Spark 2.0, the model shifted from Resilient Distributed Datasets to Datasets and DataFrames.
Datasets are strongly-typed RDDs.
DataFrames are Datasets with named columns (Dataset[Row] in Scala). DataFrames are untyped in Python and R, and in all languages they slice data into named columns.
Datasets and DataFrames provide several advantages over RDDs, including query optimization through the Catalyst optimizer, more efficient memory management through Tungsten, and a higher-level, declarative API.
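A short Scala sketch of the distinction; the FireIncident case class and its field names are hypothetical stand-ins for illustration.

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().appName("DatasetExample").master("local[*]").getOrCreate()
import spark.implicits._

// A strongly-typed record: the compiler knows the fields and their types.
case class FireIncident(incidentNumber: String, incidentType: String)

// Dataset[FireIncident]: typed operations, checked at compile time.
val ds: Dataset[FireIncident] = Seq(
  FireIncident("18-001", "Medical"),
  FireIncident("18-002", "Fire")
).toDS()
val medical = ds.filter(incident => incident.incidentType == "Medical")

// DataFrame (Dataset[Row] in Scala): untyped, column names resolved at runtime.
val df: DataFrame = ds.toDF()
val byType = df.groupBy($"incidentType").count()

byType.show()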
Spark Streaming has two key types of windows: tumbling and sliding. Suppose we have events which happen over time:
In a tumbling window, we have non-overlapping intervals of events captured during a certain time frame.
In a sliding window, we have potentially-overlapping intervals. We have a window length (in units of time) and a sliding window interval (in units of time).
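In Spark Structured Streaming, both kinds of window are expressed with the window function. A sketch, assuming a streaming DataFrame named incidents with an eventTime timestamp column (hypothetical names) and arbitrarily chosen 10-minute and 5-minute durations:

import org.apache.spark.sql.functions.{col, window}

// Tumbling window: non-overlapping 10-minute intervals.
val tumbling = incidents
  .groupBy(window(col("eventTime"), "10 minutes"))
  .count()

// Sliding window: 10-minute windows that start every 5 minutes, so they overlap.
val sliding = incidents
  .groupBy(window(col("eventTime"), "10 minutes", "5 minutes"))
  .count()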
Our dataset includes data from Wake County, North Carolina fire incidents. Fire incidents occur for a variety of reasons, including fires, fire drills, medical emergencies, and even getting cats out of trees or assisting the elderly.
We have a collection of fire incidents and we will send these from a local service into Spark Structured Streaming. The easiest way to do this with Azure Databricks is to ingest the data via an Azure Event Hub.
We need to create an Event Hub namespace in Azure, as well as an event hub into which we will send data.
We need to create an Event Hub shared access policy for Databricks. This only needs Listen because we're just consuming the data, not producing new records.
Azure Databricks can store secrets internally or use Azure Key Vault. Because Key Vault is the norm for Azure secrets management, we'll store our secrets there.
In Databricks, we need to create a new secret scope in order to access Key Vault. The URL for this is https://[databricks_url]#secrets/createScope.
We will also need the Azure Event Hubs for Spark library, which is available in Maven.
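Putting the pieces together, here is a sketch of reading the stream in a Databricks notebook with the Azure Event Hubs for Spark connector; the secret scope name, secret key name, and event hub name below are placeholders.

import org.apache.spark.eventhubs.{ConnectionStringBuilder, EventHubsConf, EventPosition}

// Pull the Event Hub connection string from the Key Vault-backed secret scope.
val connectionString = ConnectionStringBuilder(
    dbutils.secrets.get(scope = "keyvault-scope", key = "eventhub-connection-string"))
  .setEventHubName("fire-incidents")
  .build

// Start reading from the end of the stream so we only see new incidents.
val ehConf = EventHubsConf(connectionString)
  .setStartingPosition(EventPosition.fromEndOfStream)

// The connector returns a streaming DataFrame; the body column is binary, so cast it to a string.
val incidents = spark.readStream
  .format("eventhubs")
  .options(ehConf.toMap)
  .load()
  .selectExpr("CAST(body AS STRING) AS body", "enqueuedTime")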
We've only scratched the surface of Spark Streaming; there are plenty of additional topics of interest to explore.
To learn more, go here:
https://csmore.info/on/sparkstreaming
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact