Abstract

As companies work to gain insight from ever-increasing amounts of data, data platform practitioners need tools that can scale along with that data. Early big data solutions in the Hadoop ecosystem assumed that data sizes overwhelmed available memory, so they leaned heavily on disk to coordinate work between nodes. As the cost of memory decreases and the amount of memory available per server increases, we see a shift in the makeup of big data systems toward heavy memory usage instead of disk. Apache Spark, which focuses on in-memory operations, has taken advantage of this hardware shift to become a dominant solution for distributed data processing. In this talk, we will take an introductory look at Apache Spark. We will review where it fits in the Hadoop ecosystem, cover how to get started along with the basic functional programming concepts needed to understand Spark, and see examples of how we can use Spark to solve problems like calculating PageRank and analyzing large data sets.
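To give a feel for the PageRank example ahead of the talk, here is a minimal sketch using Spark's RDD API. It is not the exact code from the demo; the input file name, iteration count, and damping factor are illustrative assumptions.

```scala
// A minimal PageRank sketch with Spark RDDs, assuming a plain-text
// edge list ("links.txt") where each line is "sourceId targetId".
// File name, iteration count, and damping factor are illustrative.
import org.apache.spark.sql.SparkSession

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("PageRankSketch")
      .master("local[*]")          // run locally for demonstration
      .getOrCreate()
    val sc = spark.sparkContext

    // Build an adjacency list: (page, Iterable[neighbor])
    val links = sc.textFile("links.txt")
      .map { line =>
        val Array(src, dst) = line.split("\\s+")
        (src, dst)
      }
      .distinct()
      .groupByKey()
      .cache()

    // Start every page with a rank of 1.0
    var ranks = links.mapValues(_ => 1.0)

    // Iteratively distribute each page's rank across its outgoing links
    for (_ <- 1 to 10) {
      val contributions = links.join(ranks).values.flatMap {
        case (neighbors, rank) =>
          neighbors.map(dst => (dst, rank / neighbors.size))
      }
      ranks = contributions
        .reduceByKey(_ + _)
        .mapValues(0.15 + 0.85 * _)   // damping factor of 0.85
    }

    ranks.sortBy(-_._2).take(10).foreach(println)
    spark.stop()
  }
}
```

The join / flatMap / reduceByKey pattern here is the same functional core the talk builds on for the other examples.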


Demo Code

The demonstration code is available in my GitHub repository. It includes a set of Zeppelin notebooks that walk through our examples.
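For a sense of what the notebooks look like, here is a sketch of a simple word count as it might appear in a Zeppelin Spark paragraph. It assumes Zeppelin's Spark interpreter, which exposes `sc` automatically, and a hypothetical input file name.

```scala
// Illustrative of the style used in the notebooks; "shakespeare.txt" is a
// hypothetical input file. Zeppelin's Spark interpreter provides `sc`.
val counts = sc.textFile("shakespeare.txt")
  .flatMap(_.toLowerCase.split("\\W+"))   // split lines into words
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)                     // sum counts per word

counts.sortBy(-_._2).take(20).foreach(println)
```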

The source code is licensed under the GPL. The slides are licensed under the Creative Commons Attribution-ShareAlike license.


Links and Further Information

Books

Courses