Azure Synapse Analytics is Microsoft's platform for modern data warehousing. It is made up of four types of analytics pools, plus integration components:
Azure Synapse Analytics dedicated SQL pools, née Azure SQL Data Warehouse, offer a Massively Parallel Processing (MPP) approach to data warehousing and work best in classic data warehousing scenarios:
Azure Synapse Analytics Spark pools allow you to spin up Apache Spark clusters. Key use cases for these Spark clusters include:
Azure Synapse Analytics serverless SQL pools are a bit different from dedicated SQL pools in the following ways:
Data Explorer pools allow you to perform real-time analysis on large volumes of data. The initial use case of this was to process log data, but using the Kusto Query Language (KQL), we can also perform detailed time series analysis, whether real-time or off of already-stored data.
NOTE: because Data Explorer pools are still in public preview, we won't get into the security aspects of them yet, as guidance is not yet solidified.
Synapse Pipelines are based on Azure Data Factory pipelines, making it easy to perform Extract-Load-Transform (ELT) operations.
Synapse Link allows us to connect data sources like Cosmos DB, Dataverse, and SQL Server 2022 to Azure Synapse Analytics, making it easier to move data without needing to build ETL or ELT jobs.
Organizations may be tied to a variety of regulatory regimes. Review Microsoft compliance offerings to see more detail.
Microsoft provides detailed compliance documentation, but achieving compliance is a team effort between Microsoft and your organization.
One key element of regulatory compliance is that we follow certain data collection principles.
Microsoft handles most aspects of High Availability for you.
Disaster Recovery is a joint effort.
There are two major options around network security: managed virtual networks and data exfiltration protection.
Microsoft creates a virtual network and controls it for you.
You must enable this at creation time if you wish to have a managed virtual network. You cannot enable this on an existing workspace or disable it later.
NOTE: there is no option for you to create your own virtual network and manage it yourself. Either you use Microsoft's managed VNet or you have no VNet for Synapse.
Data exfiltration protection "allows outbound data traffic only to approved targets."
Short version of my opinion: it works well but can be quite limiting to developers.
You must enable this at creation time if you wish to enable DEP. You cannot enable this on an existing workspace or disable it later.
You will want to grant most users Storage Blob Data Contributor against the storage account in Azure.
Azure Synapse Analytics has several RBAC roles. These are "workspace" level roles and control what actions users may take within Synapse. In addition to these, there are different RBAC roles for the dedicated and serverless SQL pools.
All traffic to workspace endpoints, whether you are using a Managed Virtual Network or not, is encrypted using TLS 1.2. There is nothing you need to do to enable this.
If you're familiar with SQL Server, you should already know about dedicated SQL pool role-based access controls:
You can control access to objects (tables, views, procedures, etc.) for various roles using three permissions: GRANT, DENY, and REVOKE.
REVOKE resets a GRANT or DENY. DENY supersedes GRANT, irrespective of how "far" the denial was in the inheritance chain, even if we explicitly grant on the user.
Suzanne is a user in the Developers Azure AD group. Developers has db_datareader on the user database. Developers has DENY SELECT on dbo.Finances. Suzanne can NOT access dbo.Finances.
Even if we GRANT SELECT for Suzanne, she still cannot access the table! DENY overrides GRANT, regardless of level.
User database roles are basically the same as SQL Server. There are three interesting master database roles:
Dedicated SQL pools are always encrypted at rest using Microsoft keys.
In addition, you may double-encrypt your workspace using a customer-managed key. This will cause dedicated SQL pool data to be double-encrypted.
If you do not double-encrypt your workspace, you may turn on Transparent Data Encryption for dedicated SQL pools.
All snapshots (backups) are written to encrypted disks.
Dedicated SQL pools allow you to perform column-level encryption using the EncryptByKey() and DecryptByKey() functions.
If you choose to use column-level encryption, use AES_256 for the symmetric key. Azure Synapse Analytics also supports AES_128 and AES_192, but none of the older, insecure options like DES or RC4.
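A minimal column-level encryption sketch, assuming a dedicated SQL pool; the certificate, key, table, and column names here are illustrative:

```sql
-- Create a certificate and an AES-256 symmetric key protected by it.
CREATE CERTIFICATE FinanceCert WITH SUBJECT = 'Protects finance data';
CREATE SYMMETRIC KEY FinanceKey
    WITH ALGORITHM = AES_256
    ENCRYPTION BY CERTIFICATE FinanceCert;

OPEN SYMMETRIC KEY FinanceKey DECRYPTION BY CERTIFICATE FinanceCert;

-- Encrypt on write...
UPDATE dbo.Finances
SET AccountNumberEnc = EncryptByKey(Key_GUID('FinanceKey'), AccountNumber);

-- ...and decrypt on read.
SELECT CONVERT(varchar(32), DecryptByKey(AccountNumberEnc)) AS AccountNumber
FROM dbo.Finances;

CLOSE SYMMETRIC KEY FinanceKey;
```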
Always Encrypted is not supported.
Dynamic data masking is NOT an alternative to encryption!
It replaces results in a SELECT operation with a mask, blocking out sensitive fields like identifiers or e-mail addresses.
People with the ability to query tables can still use the WHERE clause to filter on values, and clever users can tease out the actual values.
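A short masking sketch; the table and column names are illustrative:

```sql
-- Built-in email mask: users without UNMASK see aXXX@XXXX.com-style values.
ALTER TABLE dbo.Customers
ALTER COLUMN EmailAddress ADD MASKED WITH (FUNCTION = 'email()');

-- Partial mask: expose only the last four characters.
ALTER TABLE dbo.Customers
ALTER COLUMN TaxID ADD MASKED WITH (FUNCTION = 'partial(0, "XXX-XX-", 4)');

-- Note the weakness: WHERE TaxID LIKE '%1234' still filters on the real
-- value, which is why masking is not a substitute for encryption.
```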
Row-level security allows you to use group membership or execution context to control which rows appear to which users--these are "filter predicates."
Assuming you have a column in the table you can use to filter, create a function which returns 1 if the current user can see the row. Create a security policy to tie your single column to that function.
This WILL have a performance impact!
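The steps above can be sketched in T-SQL. This assumes a dbo.Sales table with a SalesRep column holding the user name allowed to see each row; all names here are illustrative:

```sql
CREATE SCHEMA Security;
GO
-- Returns 1 when the current user may see the row.
CREATE FUNCTION Security.fn_SalesFilter(@SalesRep sysname)
    RETURNS TABLE
    WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result
       WHERE @SalesRep = USER_NAME()
          OR USER_NAME() = 'Manager';
GO
-- Tie the filtering column to the function via a security policy.
CREATE SECURITY POLICY SalesFilterPolicy
    ADD FILTER PREDICATE Security.fn_SalesFilter(SalesRep)
    ON dbo.Sales
    WITH (STATE = ON);
```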
Data Discovery and Classification allows you to classify specific columns in tables with various sensitivity levels.
We may then audit queries on classified data.
Database objects follow the same basic security rules as with dedicated SQL pools:
The serverless SQL pool is (mostly) read-only, so we cannot insert, update, delete, merge, truncate, or otherwise modify data.
We cannot create (normal) tables in the serverless SQL pool. Either we access data via OPENROWSET or create external tables with PolyBase.
The OPENROWSET command requires (at least) the Synapse User RBAC role as well as GRANT ADMINISTER DATABASE BULK OPERATIONS in the serverless SQL pool.
This sounds like a scary role but it's not, at least in the serverless SQL pool. We can't bulk insert data, so it's really just for OPENROWSET queries.
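A serverless OPENROWSET sketch; the user, storage account, container, and path are placeholders:

```sql
-- Prerequisite inside the serverless SQL pool (plus the Synapse User RBAC role):
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [Suzanne];

-- Query Parquet files directly from the data lake.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
```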
PolyBase requires (at least) the Synapse User RBAC role.
To create or drop external tables, we need CONTROL permission on the serverless SQL pool. That's powerful.
To read from external tables, we need db_datareader/db_owner or be granted explicit SELECT access to the table.
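An external table sketch, assuming CONTROL on the serverless SQL pool; the data source, format, table, and user names are illustrative:

```sql
CREATE EXTERNAL DATA SOURCE MyLake
    WITH (LOCATION = 'https://mydatalake.dfs.core.windows.net/data');

CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.Sales (
    SaleID int,
    Amount decimal(19, 4)
)
WITH (LOCATION = '/sales/', DATA_SOURCE = MyLake, FILE_FORMAT = ParquetFormat);

-- Readers only need db_datareader/db_owner or an explicit grant:
GRANT SELECT ON dbo.Sales TO [Suzanne];
```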
The serverless pool reads data from your data lake. Use the Managed Service Identity to access the lake.
If you need to access external storage accounts, one option is to use Shared Access Signature (SAS) tokens.
SAS tokens grant limited access to specific storage account resources, including filtering down to specific containers/folders/files and granting specific permissions (e.g., List, Read, Write, Delete). You can grant access for a limited timeframe, limit allowed IP addresses, and require HTTPS using SAS tokens.
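To see how those restrictions are encoded, we can decompose a SAS token's query parameters. The token below is a fabricated example for illustration; real tokens are generated by Azure, never hand-built in code.

```python
from urllib.parse import parse_qs

# Fabricated SAS token: read+list only, one-day window, IP range, HTTPS only.
sas_token = (
    "sv=2022-11-02&ss=b&srt=co&sp=rl"
    "&st=2024-01-01T00:00:00Z&se=2024-01-02T00:00:00Z"
    "&sip=203.0.113.0-203.0.113.255&spr=https&sig=REDACTED"
)

params = {k: v[0] for k, v in parse_qs(sas_token).items()}
print(params["sp"])   # permissions: r = Read, l = List (no Write/Delete)
print(params["se"])   # expiry time: access stops afterward
print(params["sip"])  # allowed IP address range
print(params["spr"])  # HTTPS required
```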
Spark pools offer two types of isolation:
To create a Spark pool, we need one of the following roles:
To run a Spark notebook, we need one of the following roles:
If we want to use the data lake, we should also have the Storage Blob Data Contributor Azure role on the storage account.
Lake Databases share metadata between the serverless SQL pool and Spark pools. Lake databases are hosted on top of the data lake.
We secure lake databases at the storage level: to grant rights, give read on files and execute on folders.
The database creator is the owner and has all rights. In addition, accounts with the Synapse Administrator or Synapse SQL Administrator will have all permissions as well.
Spark pools come with a variety of pre-loaded libraries but you can also install Scala/Java and Python libraries after the fact.
Synapse does not support Maven or other Java package managers so we will need to install workspace packages as JAR files from the Manage menu in the Synapse workspace.
Synapse installs Python packages via pip and allows you to specify a requirements.txt file. We can also import custom WHL files from the Manage menu in the Synapse workspace.
To import libraries, we will need Storage Blob Data Contributor on the data lake but no additional permissions.
Custom package installation can introduce vulnerable packages.
Use a custom Conda channel to control and version packages. Use "versionless" package specifications in requirements.txt to prevent getting outdated versions of libraries.
Note that with data exfiltration protection enabled, you will not be able to access external websites, including Conda channels.
Spark pools allow us to query Key Vault for access to resources like SAS tokens, connection strings, etc.
DO NOT STORE SECRETS OUTSIDE OF KEY VAULT!
Use the TokenLibrary to access these secrets. To ensure it has access, the Synapse MSI should have Secret Get privileges on the Key Vault in question.
Microsoft Defender for SQL includes two key components:
Azure Policy helps enforce organizational standards and ensure compliance by defining business rules in JSON and applying them to Azure resources.
Policies can help you audit configuration drift, log resource changes, alter resources before a change, or even deny resource changes if particular circumstances are not met.
There are several built-in policy definitions for Azure Synapse Analytics.
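Policy definitions are JSON documents with an if/then rule. A minimal sketch that denies Synapse workspaces not using a managed virtual network; the field alias here is illustrative, so check the built-in definitions for the exact alias:

```json
{
  "mode": "All",
  "policyRule": {
    "if": {
      "allOf": [
        { "field": "type", "equals": "Microsoft.Synapse/workspaces" },
        {
          "field": "Microsoft.Synapse/workspaces/managedVirtualNetwork",
          "notEquals": "default"
        }
      ]
    },
    "then": { "effect": "deny" }
  }
}
```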
Over the course of this talk, we have looked at ways to secure an Azure Synapse Analytics workspace. For a full, nearly 8-hour training, check out the EC-Council CodeRed course Securing a Data Warehouse in Azure Synapse Analytics.
To learn more, go here:
https://csmore.info/on/lockdown
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact