Azure Synapse Analytics is Microsoft's platform for modern data warehousing. It is made up of four types of analytics pools, plus integration components:
Azure Synapse Analytics dedicated SQL pools, née Azure SQL Data Warehouse, offer a Massively Parallel Processing (MPP) approach to data warehousing and work best in classic data warehousing scenarios:
Azure Synapse Analytics Spark pools allow you to spin up Apache Spark clusters. Key use cases for these Spark clusters include:
Azure Synapse Analytics serverless SQL pools are a bit different from dedicated SQL pools in the following ways:
Data Explorer pools allow you to perform real-time analysis on large volumes of data. The initial use case of this was to process log data, but using the Kusto Query Language (KQL), we can also perform detailed time series analysis, whether real-time or off of already-stored data.
NOTE: because Data Explorer pools are still in public preview, we won't get into the security aspects of them yet, as guidance is not yet solidified.
Synapse Pipelines are based on Azure Data Factory pipelines, making it easy to perform Extract-Load-Transform (ELT) operations.
Synapse Link allows us to connect data sources like Cosmos DB, Dataverse, and SQL Server 2022 to Azure Synapse Analytics, making it easier to move data without needing to build ETL or ELT jobs.
Organizations may be tied to a variety of regulatory regimes. Review Microsoft compliance offerings to see more detail.
Microsoft provides detailed compliance documentation, but achieving compliance is a team effort between Microsoft and your organization.
One key element of regulatory compliance is that we follow certain data collection principles.
Microsoft handles most aspects of High Availability for you.
Disaster Recovery is a joint effort.
There are two major options around network security: managed virtual networks and data exfiltration protection.
Microsoft creates a virtual network and controls it for you.
You must enable this at creation time if you wish to have a managed virtual network. You cannot enable this on an existing workspace or disable it later.
NOTE: there is no option for you to create your own virtual network and manage it yourself. Either you use Microsoft's managed VNet or you have no VNet for Synapse.
Data exfiltration protection "allows outbound data traffic only to approved targets."
Short version of my opinion: it works well but can be quite limiting to developers.
You must enable this at creation time if you wish to enable DEP. You cannot enable this on an existing workspace or disable it later.
You will want to grant most users Storage Blob Data Contributor against the storage account in Azure.
Azure Synapse Analytics has several RBAC roles. These are "workspace" level roles and control what actions users may take within Synapse. In addition to these, there are different RBAC roles for the dedicated and serverless SQL pools.
All traffic to workspace endpoints, whether you are using a Managed Virtual Network or not, is encrypted using TLS 1.2. There is nothing you need to do to enable this.
If you're familiar with SQL Server, you should already know about dedicated SQL pool role-based access controls:
You can control access to objects (tables, views, procedures, etc.) for various roles using three permissions: GRANT, DENY, and REVOKE.
REVOKE resets a GRANT or DENY. DENY supersedes GRANT, irrespective of how "far" the denial was in the inheritance chain, even if we explicitly grant on the user.
Suzanne is a user in the Developers Azure AD group. Developers has db_datareader on the user database. Developers has DENY SELECT on dbo.Finances. Suzanne can NOT access dbo.Finances.
Even if we GRANT SELECT for Suzanne, she still cannot access the table! DENY overrides GRANT, regardless of level.
User database roles are basically the same as SQL Server. There are three interesting master database roles:
Dedicated SQL pools are always encrypted at rest using Microsoft keys.
In addition, you may double-encrypt your workspace using a customer-managed key. This will cause dedicated SQL pool data to be double-encrypted.
If you do not double-encrypt your workspace, you may turn on Transparent Data Encryption for dedicated SQL pools.
All snapshots (backups) are written to encrypted disks.
Dedicated SQL pools allow you to perform column-level encryption using the EncryptByKey() and DecryptByKey() functions.
If you choose to use column-level encryption, use AES_256 for the symmetric key. Azure Synapse Analytics also supports AES_128 and AES_192, but none of the older, insecure options like DES or RC4.
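A minimal column-level encryption sketch, assuming a dedicated SQL pool; the certificate, key, table, and column names here are illustrative:

```sql
-- Create a certificate and an AES-256 symmetric key protected by it.
CREATE CERTIFICATE FinanceCert WITH SUBJECT = 'Protects finance data';
CREATE SYMMETRIC KEY FinanceKey
    WITH ALGORITHM = AES_256
    ENCRYPTION BY CERTIFICATE FinanceCert;

OPEN SYMMETRIC KEY FinanceKey DECRYPTION BY CERTIFICATE FinanceCert;

-- Encrypt on write...
UPDATE dbo.Finances
SET AccountNumberEnc = EncryptByKey(Key_GUID('FinanceKey'), AccountNumber);

-- ...and decrypt on read.
SELECT CONVERT(varchar(32), DecryptByKey(AccountNumberEnc)) AS AccountNumber
FROM dbo.Finances;

CLOSE SYMMETRIC KEY FinanceKey;
```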
Always Encrypted is not supported.
Dynamic data masking is NOT an alternative to encryption!
It replaces results in a SELECT operation with a mask, blocking out sensitive fields like identifiers or e-mail addresses.
People with the ability to query tables can still use the WHERE clause to filter on values, and clever users can tease out the actual values.
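A short masking sketch; the table and column names are illustrative:

```sql
-- Built-in email mask: users without UNMASK see aXXX@XXXX.com-style values.
ALTER TABLE dbo.Customers
ALTER COLUMN EmailAddress ADD MASKED WITH (FUNCTION = 'email()');

-- Partial mask: expose only the last four characters.
ALTER TABLE dbo.Customers
ALTER COLUMN TaxID ADD MASKED WITH (FUNCTION = 'partial(0, "XXX-XX-", 4)');

-- Note the weakness: WHERE TaxID LIKE '%1234' still filters on the real
-- value, which is why masking is not a substitute for encryption.
```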
Row-level security allows you to use group membership or execution context to control which rows appear to which users--these are "filter predicates."
Assuming you have a column in the table you can use to filter, create a function which returns 1 if the current user can see the row. Create a security policy to tie your single column to that function.
This WILL have a performance impact!
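The steps above can be sketched in T-SQL. This assumes a dbo.Sales table with a SalesRep column holding the user name allowed to see each row; all names here are illustrative:

```sql
CREATE SCHEMA Security;
GO
-- Returns 1 when the current user may see the row.
CREATE FUNCTION Security.fn_SalesFilter(@SalesRep sysname)
    RETURNS TABLE
    WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_result
       WHERE @SalesRep = USER_NAME()
          OR USER_NAME() = 'Manager';
GO
-- Tie the filtering column to the function via a security policy.
CREATE SECURITY POLICY SalesFilterPolicy
    ADD FILTER PREDICATE Security.fn_SalesFilter(SalesRep)
    ON dbo.Sales
    WITH (STATE = ON);
```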
Data Discovery and Classification allows you to classify specific columns in tables with various sensitivity levels.
We may then audit queries on classified data.
Database objects follow the same basic security rules as with dedicated SQL pools:
The serverless SQL pool is (mostly) read-only, so we cannot insert, update, delete, merge, truncate, or otherwise modify data.
We cannot create (normal) tables in the serverless SQL pool. Either we access data via OPENROWSET or create external tables with PolyBase.
The OPENROWSET command requires (at least) the Synapse User RBAC role as well as GRANT ADMINISTER DATABASE BULK OPERATIONS in the serverless SQL pool.
This sounds like a scary role but it's not, at least in the serverless SQL pool. We can't bulk insert data, so it's really just for OPENROWSET queries.
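A serverless OPENROWSET sketch; the user, storage account, container, and path are placeholders:

```sql
-- Prerequisite inside the serverless SQL pool (plus the Synapse User RBAC role):
GRANT ADMINISTER DATABASE BULK OPERATIONS TO [Suzanne];

-- Query Parquet files directly from the data lake.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/data/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
```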
PolyBase requires (at least) the Synapse User RBAC role.
To create or drop external tables, we need CONTROL permission on the serverless SQL pool. That's powerful.
To read from external tables, we need db_datareader/db_owner or be granted explicit SELECT access to the table.
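An external table sketch, assuming CONTROL on the serverless SQL pool; the data source, format, table, and user names are illustrative:

```sql
CREATE EXTERNAL DATA SOURCE MyLake
    WITH (LOCATION = 'https://mydatalake.dfs.core.windows.net/data');

CREATE EXTERNAL FILE FORMAT ParquetFormat
    WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.Sales (
    SaleID int,
    Amount decimal(19, 4)
)
WITH (LOCATION = '/sales/', DATA_SOURCE = MyLake, FILE_FORMAT = ParquetFormat);

-- Readers only need db_datareader/db_owner or an explicit grant:
GRANT SELECT ON dbo.Sales TO [Suzanne];
```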
The serverless pool reads data from your data lake. Use the Managed Service Identity to access the lake.
If you need to access external storage accounts, one option is to use Shared Access Signature (SAS) tokens.
SAS tokens grant limited access to specific storage account resources, including filtering down to specific containers/folders/files and granting specific permissions (e.g., List, Read, Write, Delete). You can grant access for a limited timeframe, limit allowed IP addresses, and require HTTPS using SAS tokens.
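To see how those restrictions are encoded, we can decompose a SAS token's query parameters. The token below is a fabricated example for illustration; real tokens are generated by Azure, never hand-built in code.

```python
from urllib.parse import parse_qs

# Fabricated SAS token: read+list only, one-day window, IP range, HTTPS only.
sas_token = (
    "sv=2022-11-02&ss=b&srt=co&sp=rl"
    "&st=2024-01-01T00:00:00Z&se=2024-01-02T00:00:00Z"
    "&sip=203.0.113.0-203.0.113.255&spr=https&sig=REDACTED"
)

params = {k: v[0] for k, v in parse_qs(sas_token).items()}
print(params["sp"])   # permissions: r = Read, l = List (no Write/Delete)
print(params["se"])   # expiry time: access stops afterward
print(params["sip"])  # allowed IP address range
print(params["spr"])  # HTTPS required
```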
Spark pools offer two types of isolation:
To create a Spark pool, we need one of the following roles:
To run a Spark notebook, we need one of the following roles:
If we want to use the data lake, we should also have the Storage Blob Data Contributor Azure role on the storage account.
Lake Databases share metadata between the serverless SQL pool and Spark pools. Lake databases are hosted on top of the data lake.
We secure lake databases at the storage level: to grant rights, give read on files and execute on folders.
The database creator is the owner and has all rights. In addition, accounts with the Synapse Administrator or Synapse SQL Administrator will have all permissions as well.
Spark pools come with a variety of pre-loaded libraries but you can also install Scala/Java and Python libraries after the fact.
Synapse does not support Maven or other Java package managers so we will need to install workspace packages as JAR files from the Manage menu in the Synapse workspace.
Synapse installs Python packages via pip and allows you to specify a requirements.txt file. We can also import custom WHL files from the Manage menu in the Synapse workspace.
To import libraries, we will need Storage Blob Data Contributor on the data lake but no additional permissions.
Custom package installation can introduce vulnerable packages.
Use a custom Conda channel to control and version packages. Use "versionless" package specifications in requirements.txt to prevent getting outdated versions of libraries.
Note that with data exfiltration protection enabled, you will not be able to access external websites, including Conda channels.
Spark pools allow us to query Key Vault for access to resources like SAS tokens, connection strings, etc.
DO NOT STORE SECRETS OUTSIDE OF KEY VAULT!
Use the TokenLibrary to access these secrets. To ensure it has access, the Synapse MSI should have Secret Get privileges on the Key Vault in question.
Microsoft Defender for SQL includes two key components:
Azure Policy helps enforce organizational standards and ensure compliance by defining business rules in JSON and applying them to Azure resources.
Policies can help you audit configuration drift, log resource changes, alter resources before a change, or even deny resource changes if particular circumstances are not met.
There are several built-in policy definitions for Azure Synapse Analytics.
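Policy definitions are JSON documents with an if/then rule. A minimal sketch that denies Synapse workspaces not using a managed virtual network; the field alias here is illustrative, so check the built-in definitions for the exact alias:

```json
{
  "mode": "All",
  "policyRule": {
    "if": {
      "allOf": [
        { "field": "type", "equals": "Microsoft.Synapse/workspaces" },
        {
          "field": "Microsoft.Synapse/workspaces/managedVirtualNetwork",
          "notEquals": "default"
        }
      ]
    },
    "then": { "effect": "deny" }
  }
}
```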
Over the course of this talk, we have looked at ways to secure an Azure Synapse Analytics workspace. For a full, nearly 8-hour training, check out the EC-Council CodeRed course Securing a Data Warehouse in Azure Synapse Analytics.
To learn more, go here:
https://csmore.info/on/lockdown
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact