The goal of this talk is to cover Microsoft's technologies for data virtualization in SQL Server 2022, including PolyBase.
We will look at installation, configuration, and a number of practical business scenarios involving the product.
PolyBase is Microsoft's technology for covering three two use cases:
PolyBase V1 supported two external data sources:
PolyBase V2 includes V1 data sources as well as ODBC-based integrations:
Note that the generic ODBC connector is Windows-only.
SQL Server 2022 eliminates V1 support.
In return, it introduces (or re-introduces):
SQL Server is a classic "scale-up" technology: if you want more power, add more RAM/CPUs/resources to the single server.
Hadoop is a great example of an MPP system: if you want more power, add more servers; the system will coordinate processing.
Prior to SQL Server 2022, PolyBase offered a feature known as Scale-Out Clusters, in which you could perform PolyBase commands against multiple SQL Server instances. It looked to be the first phase in support for Massive Parallel Processing in on-premises SQL Server.
It is no longer available in SQL Server 2022. RIP.
In order to use the generic ODBC connector, you need to install an appropriate ODBC driver on the machine hosting SQL Server. Grab the right driver and follow the normal install steps.
Three sorts of external objects exist.
An external data source tells SQL Server where it can find remote data.
That data does not migrate to SQL Server! It lives on the external data source.
An external file format tells SQL Server what type of file you intend to use. SQL Server supports delimited files (e.g., comma or tab separated), ORC, and Parquet formats.
This is not required for V2 data sources, as they do not read from files. It is required for Blob Storage/Data Lake Storage and S3.
An external table tells SQL Server the structure of your external resource. SQL Server requires structured data and will reject records which do not fit the data types and sizes you set.
Going from unstructured to structured data is a risk: conversions may fail. When that happens, the PolyBase engine rejects that row. After we meet a rejection threshold, PolyBase fails the query.
The three parameters we control are:
REJECT_TYPE
: { VALUE
, PERCENTAGE
}
REJECT_VALUE
REJECT_SAMPLE_VALUE
= Number of rows to pull in before recalculating rejection percentage.
Assumptions:
Technique: write old data to Azure Blob Storage / Hadoop and use partitioned views to join back together.
Benefits:
Assumptions:
Technique: instead of Extract-Transform-Load, use PolyBase and the Extract-Load-Transform (ELT) pattern to land and transform data.
Benefits:
Assumptions:
Technique: use PolyBase external data sources and external tables to create SQL Server representations of data sets.
Benefits:
SQL Server 2022 offers an update to data virtualization, retaining ODBC-based PolyBase and adding an API-based virtualization process.
Data virtualization in SQL Server has a number of useful business cases. It won't give you the greatest performance, but if you want a unified SQL interface and performance is not a top consideration, it does the job.
To learn more, go here:
https://csmore.info/on/polybase
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact