Catallaxy Services | @feaselkl
Curated SQL
We Speak Linux
To get the most out of this talk, I will assume the following:
That the Division of Labour is Limited by the Extent of the Market
As it is the power of exchanging that gives occasion to the division of labour, so the extent of this division must always be limited by the extent of that power, or, in other words, by the extent of the market. When the market is very small, no person can have any encouragement to dedicate himself entirely to one employment, for want of the power to exchange all that surplus part of the produce of his own labour, which is over and above his own consumption, for such parts of the produce of other men’s labour as he has occasion for.
- Adam Smith, The Wealth of Nations, Book I, Chapter III
Specialization and division of labor. Works for people, works for data.
SQL Server as a relational engine:
Hadoop as a processing engine:
Ambari is a framework for monitoring and administering a Hadoop cluster.
It includes tools to maintain HDFS clusters, Hive tables, and much more.
Use Ambari for one-off, occasional work. This is the easiest method, but also the most time-consuming.
Scroll down for image-heavy demo.
Open HDFS Files.
Go to the tmp folder.
Create a new directory.
Set folder permissions. For production, don't indiscriminately do this, of course...
Upload a file.
Delete a file.
For Hive work, copy BattingRatings.csv to /tmp/ootp. Make sure that /tmp/ootp and the file both have write permissions enabled.
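If you prefer the command line to the Ambari file view, the same steps have a shell equivalent. This is a sketch of that alternative, not part of the demo itself:

```shell
# Command-line equivalent of the Ambari steps above.
hdfs dfs -mkdir -p /tmp/ootp
hdfs dfs -put BattingRatings.csv /tmp/ootp
hdfs dfs -chmod -R 777 /tmp/ootp   # demo only; don't do this in production
```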
Open up Hive.
Create the BattingRatings table.
Select top 100 rows from the BattingRatings table.
BattingRatings table results.
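As a minimal HiveQL sketch of what the screenshots show: the column list below is hypothetical (the actual schema comes from the demo scripts), and the table points at the /tmp/ootp folder we populated above.

```sql
-- Sketch only: the column list is an assumption, not the actual
-- BattingRatings.csv schema.  The CSV was copied to /tmp/ootp earlier.
CREATE EXTERNAL TABLE BattingRatings (
    FirstName STRING,
    LastName STRING,
    ContactRating INT,
    PowerRating INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/tmp/ootp';

-- Top 100 rows, as in the demo.
SELECT * FROM BattingRatings LIMIT 100;
```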
Write the results to an output file to load into SQL Server.
Write the Hive query results to a file. Run the BattingRatingsPerm script first.
The output consists of LOCAL files, not HDFS files; we specified that in the query.
Our output file is in /tmp/ootpoutput.
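The actual query lives in the BattingRatingsPerm script; as a hedged sketch, the write-to-local-directory statement might look something like this:

```sql
-- LOCAL writes to the node's local filesystem rather than to HDFS;
-- this is what puts the output files under /tmp/ootpoutput.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ootpoutput'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT * FROM BattingRatings;
```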
Sqoop is a quick-and-easy console program designed to ingest data from database servers (like SQL Server) into Hadoop and also push data back to database servers.
Pre-req: install the Microsoft JDBC driver.
Sqoop is good for loading entire tables/databases into Hadoop and loading staging tables into SQL Server.
Scroll down for image-heavy demo.
Create the SecondBasemen table.
Grab the Microsoft JDBC driver and put it into /usr/hadoop/[version]/sqoop/lib.
sqoop list-databases will list databases
sqoop list-tables will list tables for the default schema
Sqoop supports other important options as well, such as free-form query imports and control over the number of parallel mappers. Check out SqoopCommands.txt for examples of how to use these commands and more; a sketch of the basics follows.
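Here is a hedged sketch of those commands. The host name, database name, and credentials are hypothetical; SqoopCommands.txt has the real examples.

```shell
# List databases and tables on a SQL Server instance.
sqoop list-databases --connect 'jdbc:sqlserver://sqlserver:1433' \
    --username sqoopuser --password sqooppwd
sqoop list-tables --connect 'jdbc:sqlserver://sqlserver:1433;databaseName=OOTP' \
    --username sqoopuser --password sqooppwd

# Pull the SecondBasemen table into HDFS with a single mapper.
sqoop import --connect 'jdbc:sqlserver://sqlserver:1433;databaseName=OOTP' \
    --table SecondBasemen --num-mappers 1 \
    --username sqoopuser --password sqooppwd

# Push HDFS data back into a SQL Server staging table.
sqoop export --connect 'jdbc:sqlserver://sqlserver:1433;databaseName=OOTP' \
    --table SecondBasemenStaging --export-dir /tmp/secondbasemen \
    --username sqoopuser --password sqooppwd
```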
If you install the Microsoft.Hadoop.MapReduce NuGet package, you can connect to a Hadoop cluster and perform file maintenance.
Our example will use FSharp.Data.SqlClient to insert rows into SQL Server.
Use .NET when you have complex and custom data flows, or if most of your ETL infrastructure is in .NET.
Check out HDFSFileManagement.FSharp for the code.
Visual Studio with SQL Server Data Tools 2016 provides good Hadoop integration.
SSDT 2016 includes an HDFS file source and destination via WebHDFS or WebHCat.
If you're using older versions of VS or SSDT, you'll have limited access.
Use SSIS for Hadoop ETL if this is what you normally use for SQL Server ETL.
Check out HadoopIntegration.SSIS for the code.
Linked servers let us connect external resources to SQL Server and query them like SQL tables.
Although the most common use case for linked servers is connecting other SQL Server instances, we can also tie in Excel, flat files, and even Hive tables.
Use linked servers to connect Hive and SQL tables together in queries, particularly before SQL Server 2016.
Scroll down for image-heavy demo.
Grab the Microsoft Hive ODBC driver. Then, create a System DSN.
Add a new DSN using the Hive ODBC driver.
Configure the Hive DSN.
Make sure that the "Default string column length" setting is no more than 8000 bytes.
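With the DSN in place, a minimal T-SQL sketch of creating and querying the linked server might look like this. The linked server name, DSN name, and Hive table name are assumptions:

```sql
-- HiveDSN is the System DSN created above; the names are hypothetical.
EXEC master.dbo.sp_addlinkedserver
    @server = N'HIVE',
    @srvproduct = N'Hive',
    @provider = N'MSDASQL',
    @datasrc = N'HiveDSN';

-- Pass the query through to Hive and consume the results in T-SQL.
SELECT TOP(100) *
FROM OPENQUERY(HIVE, 'SELECT * FROM battingratings');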
PolyBase is Microsoft's latest Hadoop integration offering. First introduced in SQL Server 2012 PDW (now APS), it is now generally available in SQL Server 2016.
PolyBase has several advantages over linked servers: you write plain T-SQL (with IntelliSense support), and the engine can push predicates down to the Hadoop cluster as MapReduce jobs for better performance.
PolyBase is not perfect: when it pushes work down to the cluster, it can only generate MapReduce jobs.
Use PolyBase to join SQL tables to Hadoop data sets, particularly if you want to write T-SQL to query the data.
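A minimal T-SQL sketch of the PolyBase setup, reusing the same hypothetical column list as the Hive examples above; the cluster address, object names, and join keys are assumptions:

```sql
-- Point SQL Server 2016 at the Hadoop cluster's name node.
CREATE EXTERNAL DATA SOURCE HDP WITH
(
    TYPE = HADOOP,
    LOCATION = 'hdfs://sandbox:8020'
);

-- Describe the delimited file layout.
CREATE EXTERNAL FILE FORMAT CsvFileFormat WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);

-- Expose the HDFS folder as a table; the column list is an assumption.
CREATE EXTERNAL TABLE dbo.BattingRatings
(
    FirstName NVARCHAR(50),
    LastName NVARCHAR(50),
    ContactRating INT,
    PowerRating INT
)
WITH
(
    LOCATION = '/tmp/ootp/',
    DATA_SOURCE = HDP,
    FILE_FORMAT = CsvFileFormat
);

-- Join Hadoop data to a local SQL Server table in plain T-SQL.
-- The join keys here are hypothetical.
SELECT br.FirstName, br.LastName, br.PowerRating
FROM dbo.BattingRatings br
    INNER JOIN dbo.SecondBasemen sb
        ON sb.FirstName = br.FirstName
        AND sb.LastName = br.LastName;
```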
| Method | Good | Bad | Best Uses |
| ------ | ---- | --- | --------- |
| Ambari | Easy to use | Manual processing | One-off moves |
| Sqoop | Easy to use; easy to automate | Cannot create SQL tables; limited query support | Staging table ETL; migrating into Hadoop |
| .NET | Powerful | Harder to write; finicky | Custom processing |
| SSIS | Good ETL patterns; integrates nicely | HDFS objects not (yet) complete; joining data is slow | Established ETL program |
| Linked Server | T-SQL syntax; linked servers are common | Needs ODBC driver; can be slow; Hive only | Joining SQL data with Hive data |
| PolyBase | T-SQL; performance; IntelliSense | MapReduce only | Joining SQL data with HDFS data |
There are plenty of ways to integrate Hadoop with SQL Server. Mix and match methods as they suit your needs.
To learn more, go here: http://CSmore.info/on/hadoopsql
And for help, contact me: feasel@catallaxyservices.com | @feaselkl