Catallaxy Services | Much Ado About Hadoop

ABSTRACT

If you're interested in Hadoop but don't know where to begin, this session will give you an idea of what you can do with the open-source platform. We will see an overview of the Hadoop architecture, becoming familiar with the overall platform and its solutions for warehousing, ETL, streaming data ingest, in-memory processing, and more. We will compare Hadoop to SQL Server to help gain an understanding of when to deploy which technology.

ADDITIONAL MEDIA

I have a version of this talk on YouTube. You can watch the recording on my YouTube channel.

SLIDES

Click here to access the slides for this presentation.

The slides are licensed under Creative Commons Attribution-ShareAlike.

DEMO CODE

No demos are available for this presentation.

LINKS & FURTHER INFO

Hadoop Distributions

If you want to get started with Hadoop, there are a number of options available to you. The local sandboxes tend to be available as Azure or AWS virtual machines as well, so if you don't have a beefy machine at home, you can still get started pretty easily.

Local sandboxes:

Hortonworks Sandbox. This is probably the easiest way to get started with Hadoop. They give you a full VM which is already installed and configured with a number of tools. The Ambari user interface is pretty nice, and if you are on the .NET stack, Hortonworks tends to present a nicer experience.
Cloudera QuickStart VM. Like the Hortonworks Sandbox, Cloudera's offering is a fully-featured single-node VM. It is also available as a Docker image.
MapR Sandbox For Hadoop.

Platform-as-a-Service offerings:

Azure HDInsight. It's fairly pricey--data nodes can run you a couple hundred dollars per month apiece at the low end and upwards of $2K a month per node on the high end. If you want a fairly simple Platform-as-a-Service Hadoop experience, HDInsight is a good option, as there are good tools available for developers. There are limitations which prevent integration with services like Polybase, and the common answer to integration questions tends to be "move your output data to Azure Blob Storage and access it from there."
Elastic MapReduce. Amazon's offering ties to S3 and is less expensive than HDInsight. The marginal cost for Elastic MapReduce itself is low, but once you factor in the EC2 costs, it's no longer pennies per hour. If your company is integrated with Amazon already, this is a good service.
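
To make the pricing point concrete, here is a minimal sketch of the arithmetic in Python. Every rate and node count in it is a made-up placeholder that I picked for illustration, not an actual AWS or Azure price, and the function name is hypothetical as well. Check the current pricing pages for your region and instance type before relying on any numbers.

```python
# Rough monthly cost estimate for an always-on PaaS Hadoop cluster.
# All rates below are hypothetical placeholders, NOT real AWS or Azure prices.

HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)


def monthly_cluster_cost(node_count, compute_rate_per_hour, service_rate_per_hour=0.0):
    """Return the estimated monthly cost in dollars.

    compute_rate_per_hour: per-node VM charge (for example, the EC2 portion).
    service_rate_per_hour: per-node managed-service surcharge
                           (for example, the Elastic MapReduce portion).
    """
    per_node_hourly = compute_rate_per_hour + service_rate_per_hour
    return node_count * per_node_hourly * HOURS_PER_MONTH


if __name__ == "__main__":
    nodes = 4                # assumed cluster size
    service_rate = 0.05      # assumed surcharge per node-hour
    compute_rate = 0.20      # assumed VM charge per node-hour

    surcharge_only = monthly_cluster_cost(nodes, 0.0, service_rate)
    full_cost = monthly_cluster_cost(nodes, compute_rate, service_rate)

    print(f"Service surcharge alone: ${surcharge_only:,.2f} per month")
    print(f"Surcharge plus compute:  ${full_cost:,.2f} per month")
```

The service surcharge looks tiny on its own, but the per-node compute charge is added to it for every hour the cluster is up, and that combined figure is what to compare against a per-node HDInsight price.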

Interesting Links

A listing of the Hadoop ecosystem. It's huge. Fortunately, we don't have to know about all of these technologies!
Pokemon or Big Data? Some of these are pretty close (like Horsea versus Seahorse).
I make use of an excellent image from the MapReduceFoundation at Universität Passau. They have a few academic papers on the topic as well.
An interesting comparison of Kafka versus Flink, written by one of the primary developers of each.
A comparison of Spark Streaming versus Kafka versus Flink.
There are a number of resources comparing different Hadoop services in terms of market size and effectiveness. This particular study looks at a half-dozen vendors, including Amazon's and Microsoft's PaaS offerings. The Cloudera market share number they give is much larger than I've seen elsewhere. Typically I see Cloudera and Hortonworks neck-and-neck.

Learning Resources

Books are hard to recommend because the source material changes so frequently--a book written in 2017 can be out of date by the time it's published in 2018. These are a few books that I have on my to-read list:

Spark in Action, 2nd Edition
Kafka in Action
Kafka Streams in Action
Hadoop: The Definitive Guide. Be warned that this is quite old at this point, having been released in 2015.
Modern Big Data Processing with Hadoop

Some of the foundational papers do hold up well, as they provide information on the underpinnings of these technologies. Examples include:

I have a few other talks in which I cover elements of Hadoop in detail.

Getting Started with Apache Spark. This was written for developers (including SQL developers) and covers the basics of Apache Spark.
Peanut Butter and Chocolate: Integrating Hadoop and SQL Server. This was written for .NET and SQL Server developers and covers several techniques for moving data between Hadoop and SQL Server.
Polybase In Action. This was written for SQL Server developers to show how to use Polybase to integrate with Hadoop and Azure Blob Storage.
Using Kafka for Real-Time Data Ingestion with .NET. This was written for .NET developers to learn how to use Kafka, a distributed message broker.

I learned a good deal from the Hortonworks tutorials, which include both written and video tutorials. They are a good place to start.