My goals in this talk:
Spark started as a research project at the University of California, Berkeley's Algorithms, Machines, People Lab (AMPLab) in 2009. The project's goal was to develop in-memory cluster computing, avoiding MapReduce's heavy reliance on disk I/O.
The first open source release of Spark came in 2010, concurrent with a paper from Matei Zaharia, et al.
In 2012, Zaharia, et al. released a paper on Resilient Distributed Datasets.
The Resilient Distributed Dataset (RDD) forms the core of Apache Spark. It is:
Add all of this together and you have the key component behind Spark.
One of the first additions to Spark was SQL support, first with Shark and then with Spark SQL.
With Apache Spark 2.0, Spark SQL can take advantage of Datasets (strongly typed RDDs) and DataFrames (Datasets with named columns).
Spark SQL functions are accessible within the SparkSession object, created by default as “spark” in the Spark shell.
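As a quick sketch of what that looks like (the employees table is a hypothetical name; in the Spark shell or a Databricks notebook, the spark object already exists):

```python
from pyspark.sql import SparkSession

# In the Spark shell or a Databricks notebook, `spark` already exists;
# in a standalone script, we build the SparkSession ourselves.
spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Query a (hypothetical) registered table through the SQL interface...
df_sql = spark.sql("""
    SELECT department, COUNT(*) AS headcount
    FROM employees
    GROUP BY department
""")

# ...or express the same query through the DataFrame API.
df_api = (spark.table("employees")
               .groupBy("department")
               .count()
               .withColumnRenamed("count", "headcount"))

df_sql.show()
df_api.show()
```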
Apache Spark is a versatile platform and is popular for several things:
Apache Spark has "first-class" support for three languages: Scala, Python, and Java.
In addition to this, it has "second-class" support for two more languages: SQL and R.
Beyond that, there are projects which extend language support even further, such as .NET for Apache Spark, which adds C# and F# support.
There are several ways to use Apache Spark in the Azure ecosystem, such as:
Pipelines are a common metaphor in computer science. For example:
Environment | Example |
---|---|
Unix (sh) | `grep "Aubrey" Employee.txt \| uniq \| wc -l` |
F# | `movies \|> Seq.filter(fun m -> m.Title = Title) \|> Seq.head` |
PowerShell | `gci *.sql -Recurse \| Select-String "str"` |
R | `inputs %>% filter(val > 5) %>% select(col1, col2)` |
In all of the prior examples, sections of code combine as connected pipes in a pipeline, with data flowing through them and transforming at each step.
In addition to these code-based pipelines, we have many examples of graphical pipelines.
The pipeline metaphor works extremely well:
Each segment of code is a transformer which modifies the fluid in some way.
Transformers are connected together with pipes.
Data acts as a fluid, moving from transformer to transformer via the pipes.
Some transformers are "streaming" or non-blocking, meaning that fluid pushes through without collecting. Think of a mesh filter which strains out particulate matter.
Some transformers are blocking, meaning that they need to collect all of the data before moving on. Think of a reservoir which collects all of the fluid, mixes it with something, and then turns a flow control valve to allow the fluid to continue to the next transformer.
Backpressure happens when a downstream component processes data more slowly than upstream components. With enough backpressure, the fluid stops pushing forward and backs up.
Breaking our metaphor, data doesn't move exactly like a fluid. Instead, it typically moves in buffers: discrete blocks of information held in memory and transferred from one stage to the next. We see this as a specific number of rows processed at a time or a specific amount of data processed at a time.
The way to regulate backpressure in a data system is to wait for the downstream component to ask for a buffer before sending it and processing the next one. This way, we avoid the risk of out-of-memory errors from creating too many buffers, as well as buffers being lost because the downstream component cannot collect them.
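To make the streaming, blocking, and backpressure ideas concrete outside of Spark, here is a minimal, purely illustrative Python sketch: generators only produce the next buffer when the consumer asks for it (the pull-based regulation described above), the streaming stage passes buffers straight through, and the blocking stage collects everything before releasing anything.

```python
def read_source(rows, buffer_size=3):
    """Upstream stage: yields one buffer (a small list of rows) at a time,
    but only when the downstream stage asks for the next one."""
    for i in range(0, len(rows), buffer_size):
        yield rows[i:i + buffer_size]

def streaming_transform(buffers):
    """Non-blocking transformer: each buffer passes straight through,
    like fluid through a mesh filter."""
    for buffer in buffers:
        yield [row * 10 for row in buffer]

def blocking_transform(buffers):
    """Blocking transformer: collects every buffer (the reservoir) before
    releasing anything downstream; here, to sort the full data set."""
    collected = [row for buffer in buffers for row in buffer]
    yield sorted(collected, reverse=True)

def sink(buffers):
    """Downstream consumer: pulling from the generator chain is what drives
    the pipeline, so the producer can never flood it with buffers."""
    for buffer in buffers:
        print("processed buffer:", buffer)

sink(blocking_transform(streaming_transform(read_source(list(range(10))))))
```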
Databricks, the commercial enterprise behind Apache Spark, makes available the Databricks Unified Analytics Platform in AWS and Azure. They also have a Community Edition, available for free.
First, search for "databricks" and select the Azure Databricks option.
The concept of the data lake comes from a key insight: not all relevant data fits the structure of a data warehouse. Furthermore, there is a lot of effort involved in adding new data to a warehouse.
Data is typically stored in the data lake as files, and may include delimited files, files using specialized formats (ORC, Parquet), or even binaries (images, videos, audio) depending on the need.
We usually think of three layers of a data lake, which Databricks calls "bronze," "silver," and "gold."
The process of cleaning, refining, enriching, and polishing data allows us to migrate it from bronze to silver to gold. All of this happens through data pipelines, either automated pipelines (which can feed results into a data warehouse or other analytics platform) or manual pipelines for ad hoc research.
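As a rough sketch of one bronze-to-silver step (all paths, column names, and formats here are hypothetical, and spark is the SparkSession predefined in a Databricks notebook):

```python
from pyspark.sql import functions as F

# Bronze: raw delimited files, exactly as they landed in the lake.
bronze = (spark.read
               .option("header", "true")
               .csv("/mnt/lake/bronze/sales/"))

# Silver: cleaned and conformed. Typed columns, deduplicated rows,
# and obviously bad records filtered out.
silver = (bronze
          .withColumn("sale_date", F.to_date("sale_date", "yyyy-MM-dd"))
          .withColumn("amount", F.col("amount").cast("decimal(18,2)"))
          .dropDuplicates(["sale_id"])
          .filter(F.col("amount").isNotNull()))

# Write the refined data to the silver zone in a columnar format.
silver.write.mode("overwrite").parquet("/mnt/lake/silver/sales/")
```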
Databricks has coined the term Lakehouse to represent the combination of data warehouse and data lake in one managed area.
The Microsoft Team Data Science Process is one example of a process for implementing data science and machine learning tasks.
Spark supports machine learning using the native MLlib library, which works with RDDs.
On top of this, there is a spark.ml namespace which works with DataFrames.
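Here is a minimal spark.ml sketch against a DataFrame (the feature columns, label, and values are made up for illustration): assemble the feature columns into a vector, then fit a logistic regression model as a pipeline.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Hypothetical training DataFrame with two features and a binary label.
training = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.1, 3.2, 1.0), (0.5, 0.3, 0.0)],
    ["x1", "x2", "label"])

# spark.ml works in terms of DataFrames and composable pipeline stages.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

model = Pipeline(stages=[assembler, lr]).fit(training)
model.transform(training).select("x1", "x2", "label", "prediction").show()
```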
Experiments give you an opportunity to group together different trials in solving a given problem.
Once you've landed on an answer, register your trained model.
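Under the hood, Databricks experiments and the model registry are built on MLflow, so a minimal sketch of logging a trial and registering the resulting model might look like this (the experiment path, parameter, metric value, and registered model name are all placeholders, and model is the fitted pipeline from the spark.ml sketch above):

```python
import mlflow
import mlflow.spark

# Group this trial with others attacking the same problem.
# (The experiment path is a placeholder.)
mlflow.set_experiment("/Users/someone@example.com/churn-experiment")

with mlflow.start_run(run_name="logistic-regression-baseline"):
    mlflow.log_param("maxIter", 100)    # placeholder hyperparameter
    mlflow.log_metric("auc", 0.87)      # placeholder evaluation result
    # Log the fitted spark.ml model and register it in the model registry
    # in one step (the registered name is a placeholder).
    mlflow.spark.log_model(model, "model",
                           registered_model_name="churn-model")
```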
Serving a model is easy through the UI.
Just as with data loading, we'll want to create repeatable processes to train, register, and serve models, and one answer is the same as before: notebook workflows.
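One lightweight sketch of such a workflow inside a Databricks notebook (the notebook paths, timeouts, and arguments below are made up): an orchestrator notebook calls the training and registration notebooks with dbutils.notebook.run.

```python
# Runs inside a Databricks notebook, where dbutils is predefined.
# Notebook paths, timeouts, and arguments are hypothetical.

# Run the training notebook; it returns whatever string it passes
# to dbutils.notebook.exit (for example, an MLflow run ID).
train_result = dbutils.notebook.run(
    "/Repos/pipeline/train-model", 3600, {"training_date": "2021-06-01"})

# Feed that result into the registration notebook.
dbutils.notebook.run(
    "/Repos/pipeline/register-model", 600, {"run_id": train_result})
```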
Databricks jobs allow us to schedule the execution of notebooks, Java executables (JAR files), Python scripts, and spark-submit calls.
When creating a job, it's best to use a purpose-built job cluster. Job clusters cost $0.25 per DBU-hour less than all-purpose compute clusters, which can add up to considerable savings over the course of a month.
Each job can run manually or on a fixed schedule of your choosing.
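If you prefer to script job creation rather than click through the UI, one possible sketch uses the Databricks Jobs REST API (version 2.1); the workspace URL, token, notebook path, cluster sizing, and schedule below are all placeholder values.

```python
import requests

# Placeholder values for illustration only.
workspace_url = "https://adb-1234567890123456.7.azuredatabricks.net"
token = "dapiXXXXXXXXXXXXXXXX"

job_spec = {
    "name": "nightly-lake-refresh",
    "tasks": [{
        "task_key": "refresh",
        # A purpose-built job cluster, created for this job and torn down after.
        "new_cluster": {
            "spark_version": "9.1.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2
        },
        "notebook_task": {"notebook_path": "/Repos/pipeline/refresh-silver"}
    }],
    # Quartz cron expression: run every day at 06:00 UTC.
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?",
                 "timezone_id": "UTC"}
}

resp = requests.post(f"{workspace_url}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=job_spec)
resp.raise_for_status()
print("Created job", resp.json()["job_id"])
```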
Review job results after a run to see what it did and to investigate any errors which pop up.
Over the course of this talk, we have learned about the basics of Apache Spark, data pipelines, and data lakes. Along the way, we created notebook workflows for data lake population as well as machine learning.
To learn more, go here:
https://csmore.info/on/pipeline
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact