Hadoop grew out of a pair of Google whitepapers: the Google File System (2003) and MapReduce (2004). Doug Cutting, while working at Yahoo, applied these concepts to search engine processing. The first public release of Hadoop came in early 2006.
Since then, Hadoop has taken off as its own ecosystem, allowing companies to process petabytes of data efficiently over thousands of machines.
The hardware paradigm during the early years (clusters of cheap commodity servers with slow spinning disks and relatively little RAM) drove technical decisions around data storage, including the Hadoop Distributed Filesystem (HDFS).
The software paradigm during the early years led to Hadoop's node types, semi-structured data storage, and MapReduce.
There are two primary node types in Hadoop: the NameNode and data nodes.
The NameNode (aka control or head node) is responsible for communication with the outside world, coordination with data nodes, and ensuring that jobs run.
Data nodes store data and execute code, making results available to the NameNode.
Hadoop follows a "semi-structured" data model: you define the data structure not when adding files to HDFS, but rather upon retrieval. You can still do ETL and data integrity checks before moving data to HDFS, but it is not mandatory.
In contrast, a relational database has a structured data model: queries can make good assumptions about data integrity and structure.
Semi-structured data helps when the structure of incoming data varies, evolves over time, or is not fully known up front.
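As a rough sketch of the idea (the record format and field names here are made up for illustration), structure is imposed only when the data is read, not when it is stored:

case class Inspection(name: String, score: Double)

// Stand-in for raw lines already sitting in storage, in whatever shape they arrived.
val rawLines = Seq("Bob's Burgers|92.5", "Taco Truck|88.0")

// The schema is applied at read time by the code doing the reading.
val inspections = rawLines.map { line =>
  val fields = line.split('|')
  Inspection(fields(0), fields(1).toDouble)
}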
MapReduce is built around two functional programming (FP) constructs: map, which transforms each element independently, and reduce, which aggregates the transformed elements into a result.
MapReduce combines map and reduce calls to transform data into desired outputs.
The nodes which perform mapping may not be the same nodes which perform reduction, allowing for large-scale performance improvement.
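As a quick illustration of the two constructs using ordinary Scala collections (not Hadoop itself):

val words = Seq("the", "quick", "brown", "fox")

// map: apply a function to every element independently
val lengths = words.map(w => w.length)     // Seq(3, 5, 5, 3)

// reduce: combine the mapped values into a single result
val totalLength = lengths.reduce(_ + _)    // 16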
Changes in the hardware and software landscape since those early years precipitated the research project which became Apache Spark.
Spark started as a research project at the University of California Berkeley's Algorithms, Machines, People Lab (AMPLab) in 2009. The project's goal was to develop in-memory cluster computing, avoiding MapReduce's heavy reliance on disk I/O.
The first open source release of Spark came in 2010, concurrent with a paper from Matei Zaharia, et al.
In 2012, Zaharia, et al. released a paper on Resilient Distributed Datasets.
The Resilient Distributed Dataset (RDD) forms the core of Apache Spark. It is resilient: if a node fails, Spark can rebuild lost partitions from lineage information. It is distributed: data is partitioned across the nodes of the cluster. And it is a dataset: an immutable, lazily evaluated collection of records.
Add all of this together and you have the key component behind Spark.
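Here is a minimal sketch in the spark-shell, where the SparkContext is already available as sc; the numbers are arbitrary:

val numbers = sc.parallelize(1 to 100)    // partitioned across the cluster

// Transformations are lazy: nothing executes yet, Spark just records lineage.
val evens = numbers.filter(_ % 2 == 0)

// An action triggers computation; if a node dies, lost partitions are rebuilt from lineage.
val sumOfEvens = evens.sum()              // 2550.0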
We have several options available to install Spark, ranging from a local standalone installation to managed cloud platforms. We will look at a standalone installation but use the Databricks Unified Analytics Platform (UAP) for demos.
Step 1: Install the Java Development Kit. I recommend getting Java Version 8. Spark is currently not compatible with JDKs after 8.
Step 2: Go to the Spark website and download a pre-built Spark binary.
You can unzip this .tgz file using a tool like 7-Zip.
Step 3: Download WinUtils. Make sure you get the 64-bit version, which should be about 110KB; there is a 32-bit version of approximately 43KB, and it will not work with 64-bit Windows! Put it somewhere like C:\spark\bin\.
Step 4: Create c:\tmp\hive and open up permissions to everybody.
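One common way to do this, assuming winutils.exe landed in C:\spark\bin, is from an administrative command prompt on the C: drive:

C:\spark\bin\winutils.exe chmod 777 \tmp\hive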
Step 5: Create environment variables:
SPARK_HOME >> C:\spark
HADOOP_HOME >> (where winutils is)
JAVA_HOME >> (where you installed Java)
PATH >> ;%SPARK_HOME%\bin;%JAVA_HOME%\bin;
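One way to set the first three, assuming the paths above (adjust JAVA_HOME to wherever your JDK actually lives), is setx from a command prompt; the PATH entry is easiest to append through the System Properties dialog:

setx SPARK_HOME "C:\spark"
setx HADOOP_HOME "C:\spark"
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0"

Note that HADOOP_HOME points at the folder above bin\, since Spark looks for %HADOOP_HOME%\bin\winutils.exe.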
Step 6: Open the conf folder and create and modify log4j.properties.
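As an example of a common tweak (a preference, not a requirement): copy log4j.properties.template to log4j.properties and turn down the root logger so the shell is less chatty:

log4j.rootCategory=ERROR, console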
Step 7: In the bin folder, run spark-shell.cmd. Type Ctrl-D to exit the shell.
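Once the shell is up, a quick sanity check (the spark and sc objects are created for you automatically):

spark.version                      // the version you downloaded
sc.parallelize(1 to 10).count()    // should come back as 10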
Spark supports Scala, Python, and Java as primary languages and R and SQL as secondaries. We will use Scala because Spark itself is written in Scala and the Scala API tends to pick up new functionality first.
If you prefer Python or Java, that’s fine.
Relevant functional programming concepts, such as immutability, lazy evaluation, and higher-order functions like map and reduce, carry over directly to RDDs. RDDs also support familiar set operations:
rdd1.distinct()
rdd1.union(rdd2)
rdd1.intersection(rdd2)
rdd1.subtract(rdd2) – akin to the EXCEPT operator in SQL
rdd1.cartesian(rdd2) – Cartesian product (CROSS JOIN in SQL)
Warning: set operations can be slow in Spark depending on data sizes and whether data needs to be shuffled across nodes.
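A small sketch of these in the spark-shell, using made-up RDDs:

val rdd1 = sc.parallelize(Seq("pizza", "tacos", "sushi", "tacos"))
val rdd2 = sc.parallelize(Seq("tacos", "ramen"))

rdd1.distinct().collect()           // pizza, tacos, sushi (in some order)
rdd1.union(rdd2).collect()          // keeps duplicates, like UNION ALL
rdd1.intersection(rdd2).collect()   // tacos
rdd1.subtract(rdd2).collect()       // pizza, sushi
rdd1.cartesian(rdd2).count()        // 4 x 2 = 8 pairs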
We will analyze food service inspection data for the city of Durham. We want to answer a number of questions about this data, including average scores and splits between classic restaurants and food trucks.
One of the first additions to Spark was SQL support, first with Shark and then with Spark SQL.
With Apache Spark 2.0, Spark SQL can take advantage of Datasets (strongly typed RDDs) and DataFrames (Datasets with named columns).
Spark SQL functions are accessible within the SparkSession object, created by default as “spark” in the Spark shell.
Functions provide us with SQL-like operators which we can chain together in Scala, similar to how we can use LINQ with C#. These functions include (but are not limited to) select(), distinct(), where(), join(), and groupBy().
There are also functions you might see in SQL Server like concat(), concat_ws(), min(), max(), row_number(), rank(), and dense_rank().
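As a sketch of chaining these together, assuming a hypothetical DataFrame of inspection results (the column names here are invented for illustration):

import org.apache.spark.sql.functions._
import spark.implicits._

val inspections = Seq(
  ("Bob's Burgers", 92.5),
  ("Taco Truck", 88.0),
  ("Bob's Burgers", 97.0)
).toDF("name", "score")

inspections
  .where(col("score") >= 90)
  .groupBy(col("name"))
  .agg(avg(col("score")).alias("avg_score"))
  .orderBy(desc("avg_score"))
  .show()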
Queries are exactly as they sound: we can write SQL queries. Spark SQL strives to be ANSI compliant with additional functionality like sampling and user-defined aggregate functions.
Spark SQL tends to lag a bit behind Hive, which lags a bit behind the major relational players in terms of ANSI compliance. That said, Spark SQL has improved greatly since version 1.0.
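Using the same hypothetical inspections DataFrame from above, we can register a temporary view and query it directly:

inspections.createOrReplaceTempView("inspections")

spark.sql("""
  SELECT name, AVG(score) AS avg_score
  FROM inspections
  GROUP BY name
  ORDER BY avg_score DESC
""").show()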
GroupLens Research has made available their MovieLens data set which includes 20 million ratings of 27K movies.
We will use Apache Spark with Spark SQL to analyze this data set, letting us look at frequently rated movies, the highest (and lowest) rated movies, and common movie genres.
Databricks, the commercial enterprise behind Apache Spark, makes available the Databricks Unified Analytics Platform in AWS and Azure. They also have a Community Edition, available for free.
Community Edition clusters are a single node with 15 GB of RAM, running on AWS spot instances.
Data sticks around after a cluster goes away, and limited data storage is free.
Zeppelin comes with a good set of built-in, interactive plotting options.
Your cluster terminates after 2 hours of inactivity. You can also terminate the cluster early.
Microsoft has official support for Spark running on .NET. They support the C# and F# languages.
With .NET code, you are limited to DataFrames and Spark SQL, so no direct access to RDDs.
We've only scratched the surface of Apache Spark; there is plenty more in the ecosystem to explore.
To learn more, go here:
https://csmore.info/on/spark
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact