Merge pull request #105 from databricks/main
Update to previous PR
AleksCallebat authored Oct 15, 2024
2 parents 3c39f43 + 94e8e6d commit 88e1dbf
Showing 128 changed files with 4,919 additions and 647 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -1,10 +1,14 @@
# Security Reference Architectures (SRA) - Terraform Templates
<p align="center">
<img src="https://i.ibb.co/NrfH2qc/Screenshot-2024-09-17-at-1-02-06-PM.png" />
</p>

## Project Overview

Security Reference Architecture (SRA) with Terraform templates makes it easy to deploy workspaces that follow security best practices. You can programmatically deploy workspaces and the required cloud infrastructure using the official Databricks Terraform provider. These unified Terraform templates are pre-configured with hardened security settings similar to those used by our most security-conscious customers. The initial templates are based on [Databricks Security Best Practices](https://www.databricks.com/trust/security-features#best-practices):

- [AWS](https://github.com/databricks/terraform-databricks-sra/tree/main/aws)
- [AWS Govcloud](https://github.com/databricks/terraform-databricks-sra/tree/main/aws-gov)
- [Azure](https://github.com/databricks/terraform-databricks-sra/tree/main/azure)
- [GCP](https://github.com/databricks/terraform-databricks-sra/tree/main/gcp)

43 changes: 43 additions & 0 deletions aws-gov/.gitignore
@@ -0,0 +1,43 @@
# Local .terraform directories
*/.terraform/*
*/.terraform
.terraform.lock.hcl

# .tfstate files
*.tfstate
*.tfstate.*

# environment file
aws/example.tvars

# Crash log files
crash.log

# Ignore CLI configuration files
.terraformrc
terraform.rc

# Ignore any .tfvars files that are generated automatically for each Terraform run. Most
# .tfvars files are managed as part of configuration and so should be included in
# version control.
#
*auto.tfvars

# Ignore override files as they are usually used to override resources locally and so
# are not checked in
override.tf
override.tf.json
*_override.tf
*_override.tf.json

# Include override files you do wish to add to version control using negated pattern
#
# !example_override.tf

# Include tfplan files to ignore the plan output of command: terraform plan -out=tfplan
# example: *tfplan*

# macOS
.DS_Store

# IntelliJ
.idea/
126 changes: 126 additions & 0 deletions aws-gov/README.md
@@ -0,0 +1,126 @@
# Security Reference Architectures (SRA) - Terraform Templates


## Introduction

Databricks has worked with thousands of customers to securely deploy the Databricks platform with appropriate security features to meet their architecture requirements.

This Security Reference Architecture (SRA) repository implements the common security features that our most security-conscious customers typically deploy, packaged as a set of unified Terraform templates.


## Component Breakdown and Description

In this section, we break down each of the components that we've included in this Security Reference Architecture.

In various `.tf` scripts, we have included direct links to the relevant Databricks Terraform documentation. The full [official documentation](https://registry.terraform.io/providers/databricks/databricks/latest/docs) is also available.


## Operation Mode

There are four separate operation modes you can choose for the underlying network configurations of your workspaces: **sandbox**, **firewall**, **isolated**, and **custom**.

- **Sandbox**: Sandbox or open egress. Selecting 'sandbox' as the operation mode allows traffic to flow freely to the public internet. This mode is suitable for sandbox or development scenarios where data exfiltration protection is of minimal concern, and developers need to access public APIs, packages, and more.

- **Firewall**: Firewall or limited egress. Choosing 'firewall' as the operation mode permits traffic flow only to a selected list of public addresses. This mode is applicable in situations where open internet access is necessary for certain tasks, but unfiltered traffic is not an option due to the sensitivity of the workloads or data.
- **WARNING**: Due to a limitation in AWS Network Firewall's support for fully qualified domain names (FQDNs) in non-HTTP/HTTPS traffic, an IP address is required to allow communication with the Hive Metastore. This dependency on a static IP introduces the potential for downtime if the Hive Metastore's IP changes. For sensitive production workloads, it is recommended to explore the isolated operation mode or consider alternative firewall solutions that provide better handling of dynamic IPs or FQDNs.

- **Isolated**: Isolated or no egress. Opting for 'isolated' as the operation mode prevents any traffic to the public internet. Traffic is limited to AWS private endpoints, either to AWS services or the Databricks control plane. This mode should be used in cases where access to the public internet is completely unsupported. **NOTE**: Apache Derby Metastore will be required for clusters and non-serverless SQL Warehouses. For more information, please view this [knowledge article](https://kb.databricks.com/metastore/set-up-embedded-metastore).

- **Custom**: Custom or bring your own network. Selecting 'custom' allows you to input your own details for a VPC ID, subnet IDs, security group IDs, and PrivateLink endpoint IDs. This mode is recommended when networking assets are created in different pipelines or are pre-assigned to a team by a centralized infrastructure team.

See the networking diagrams below for more information.
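
As a minimal sketch of how the mode is selected, assuming the templates expose it through a variable named `operation_mode` (check `variables.tf` in your clone for the exact name and allowed values), the choice would be made in your `.tfvars` file:

```hcl
# Hypothetical excerpt from example.tfvars -- the variable name may differ in your clone.
# Allowed values (per the modes above): "sandbox", "firewall", "isolated", "custom"
operation_mode = "isolated"
```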


## Infrastructure Deployment

- **Customer-managed VPC**: A [customer-managed VPC](https://docs.databricks.com/administration-guide/cloud-configurations/aws/customer-managed-vpc.html) allows Databricks customers to exercise more control over network configuration to comply with specific cloud security and governance standards that a customer's organization may require.

- **AWS VPC Endpoints for S3, STS, and Kinesis**: Using AWS PrivateLink technology, a VPC endpoint connects a customer's VPC to AWS services without traversing public IP addresses. [S3, STS, and Kinesis endpoints](https://docs.databricks.com/administration-guide/cloud-configurations/aws/privatelink.html#step-5-add-vpc-endpoints-for-other-aws-services-recommended-but-optional) are best practices for standard enterprise Databricks deployments. Additional endpoints can be configured depending on the use case (e.g. Amazon DynamoDB and AWS Glue).

- **Back-end AWS PrivateLink Connectivity**: AWS PrivateLink provides a private network route from one AWS environment to another. [Back-end PrivateLink](https://docs.databricks.com/administration-guide/cloud-configurations/aws/privatelink.html#overview) is configured so that communication between the customer's data plane and the Databricks control plane does not traverse public IP addresses. This is accomplished through Databricks-specific interface VPC endpoints (see the sketch after this list). Front-end PrivateLink is also available to ensure that user traffic remains on the AWS backbone; however, front-end PrivateLink is not included in this Terraform template.

- **Scoped-down IAM Policy for the Databricks cross-account role**: A [cross-account role](https://docs.databricks.com/administration-guide/account-api/iam-role.html) is needed for users, jobs, and other third-party tools to spin up Databricks clusters within the customer's data plane environment. This cross-account role can be scoped down to only function within the parameters of the data plane's VPC, subnets, and security group.

- **Restrictive Root Bucket**: Prior to creation, each workspace registers a [dedicated S3 bucket](https://docs.databricks.com/administration-guide/account-api/aws-storage.html) for workspace assets. On AWS, S3 bucket policies can be applied to limit access to the Databricks control plane and the customer data plane.

- **Unity Catalog**: [Unity Catalog](https://docs.databricks.com/data-governance/unity-catalog/index.html) is a unified governance solution for all data and AI assets including files, tables, and machine learning models. Unity Catalog provides a modern approach to granular access controls with centralized policy, auditing, and lineage tracking - all integrated into your Databricks workflow. **NOTE**: SRA creates a workspace-specific catalog that is isolated to that individual workspace. To change these settings, please update `uc_catalog.tf` under the workspace_security_modules.
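
To make the customer-managed VPC and back-end PrivateLink pieces above concrete, the sketch below shows how such a network is typically registered with the Databricks account using the provider's `databricks_mws_vpc_endpoint` and `databricks_mws_networks` resources. The resource types are real provider resources, but the variable names and labels are placeholders; the SRA modules create the equivalent resources for you.

```hcl
# Hypothetical sketch -- placeholder names; the SRA modules wire this up for you.
resource "databricks_mws_vpc_endpoint" "backend_rest" {
  account_id          = var.databricks_account_id
  aws_vpc_endpoint_id = aws_vpc_endpoint.backend_rest.id  # interface endpoint to the Databricks REST API service
  vpc_endpoint_name   = "backend-rest-vpce"
  region              = var.region
}

resource "databricks_mws_vpc_endpoint" "backend_relay" {
  account_id          = var.databricks_account_id
  aws_vpc_endpoint_id = aws_vpc_endpoint.backend_relay.id # interface endpoint to the secure cluster connectivity relay
  vpc_endpoint_name   = "backend-relay-vpce"
  region              = var.region
}

resource "databricks_mws_networks" "this" {
  account_id         = var.databricks_account_id
  network_name       = "sra-network"
  vpc_id             = var.vpc_id              # customer-managed VPC
  subnet_ids         = var.subnet_ids          # private subnets for the data plane
  security_group_ids = var.security_group_ids

  # Back-end PrivateLink: register the Databricks-specific interface endpoints
  vpc_endpoints {
    rest_api        = [databricks_mws_vpc_endpoint.backend_rest.vpc_endpoint_id]
    dataplane_relay = [databricks_mws_vpc_endpoint.backend_relay.vpc_endpoint_id]
  }
}
```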


## Optional Deployment Configurations

- **Audit and Billable Usage Logs**: Databricks delivers logs to your S3 buckets. [Audit logs](https://docs.databricks.com/administration-guide/account-settings/audit-logs.html) contain two levels of events: workspace-level audit logs with workspace-level events, and account-level audit logs with account-level events. In addition to these logs, you can generate additional events by enabling verbose audit logs. [Billable usage logs](https://docs.databricks.com/administration-guide/account-settings/billable-usage-delivery.html) are delivered daily to an AWS S3 bucket, with a separate CSV file for each workspace. This file contains historical data about the workspace's cluster usage in Databricks Units (DBUs).
- **System Tables Schemas**: System tables provide visibility into access, billing, compute, Lakeflow, and storage logs. These tables can be found within the system catalog in Unity Catalog.

- **Cluster Example**: An example of a cluster and a cluster policy has been included. **NOTE:** Please be aware this will create a cluster within your Databricks workspace including the underlying EC2 instance.

- **IP Access Lists**: IP access lists can be enabled to allow only a subset of IPs to access the Databricks workspace console (see the sketch after this list). **NOTE:** Please verify that all of the IPs are correct prior to enabling this feature to prevent a lockout scenario.

- **Read Only External Location**: This creates a read-only external location in Unity Catalog for a given bucket as well as the corresponding AWS IAM role.

- **Restrictive Root Bucket**: A restrictive policy can be applied to the workspace's root bucket. **NOTE:** Please be aware that this policy is updated frequently; however, it may not contain prefixes for the latest product releases.

- **Restrictive Kinesis, STS, and S3 Endpoint Policies**: Restrictive policies for the Kinesis, STS, and S3 endpoints can be added for Databricks-specific assets. **NOTE:** Please be aware that these policies may be updated over time, which could result in breaking changes. If this occurs, we recommend removing the policy.

- **System Tables**: System tables are a Databricks-hosted analytical store of your account’s operational data, found in the system catalog. System tables can be used for historical observability across your account. This feature is currently in public preview, so enabling it is optional.

- **Workspace Admin. Configurations**: Workspace administration configurations can be enabled that align with security best practices. The underlying Terraform resource is experimental, which is why this is optional. Documentation on each configuration is provided in the Terraform file.
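
As one concrete illustration of these optional settings, the sketch below shows how IP access lists can be enabled with the provider's `databricks_workspace_conf` and `databricks_ip_access_list` resources. The CIDR range is a placeholder; substitute your own ranges before applying to avoid a lockout.

```hcl
# Hypothetical sketch -- replace the placeholder CIDR with your real corporate ranges.
resource "databricks_workspace_conf" "ip_access" {
  custom_config = {
    "enableIpAccessLists" = true
  }
}

resource "databricks_ip_access_list" "allowed" {
  label        = "corporate-network"
  list_type    = "ALLOW"
  ip_addresses = ["203.0.113.0/24"]  # placeholder range
  depends_on   = [databricks_workspace_conf.ip_access]
}
```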


## Solution Accelerators

- **Security Analysis Tool (SAT)**: The Security Analysis Tool analyzes a customer's Databricks account and workspace security configurations and provides recommendations to help follow Databricks' security best practices. It can be enabled in the workspace that is being created. **NOTE:** Please be aware this creates a cluster, a job, and a dashboard within your environment.

- **Audit Log Alerting**: Audit Log Alerting, based on this [blog post](https://www.databricks.com/blog/improve-lakehouse-security-monitoring-using-system-tables-databricks-unity-catalog), creates 40+ SQL alerts to monitor for incidents based on a Zero Trust Architecture (ZTA) model. **NOTE:** Please be aware this creates a cluster, a job, and queries within your environment.


## Additional Security Recommendations and Opportunities

In this section, we break down additional security recommendations and opportunities to maintain a strong security posture that either cannot be configured through these Terraform templates or are very specific to individual customers (e.g., SCIM, SSO, front-end PrivateLink).

- **Segment Workspaces for Various Levels of Data Separation**: While Databricks has numerous capabilities for isolating different workloads, such as table ACLs and IAM passthrough for very sensitive workloads, the primary isolation method is to move sensitive workloads to a different workspace. This sometimes happens when a customer has very different teams (for example, a security team and a marketing team) who must both analyze different data in Databricks.

- **Avoid Storing Production Datasets in Databricks File Store**: Because the DBFS root is accessible to all users in a workspace, all users can access any data stored here. It is important to instruct users to avoid using this location for storing sensitive data. The default location for managed tables in the Hive metastore on Databricks is the DBFS root; to prevent end users who create managed tables from writing to the DBFS root, declare a location on external storage when creating databases in the Hive metastore.

- **Single Sign-On, Multi-factor Authentication, SCIM Provisioning**: Most production or enterprise deployments enable their workspaces to use [Single Sign-On (SSO)](https://docs.databricks.com/administration-guide/users-groups/single-sign-on/index.html) and multi-factor authentication (MFA). As users are added, changed, and deleted, we recommend that customers integrate [SCIM (System for Cross-domain Identity Management)](https://docs.databricks.com/dev-tools/api/latest/scim/index.html) with their account console to sync these actions.

- **Backup Assets from the Databricks Control Plane**: While Databricks does not offer disaster recovery services, many customers use Databricks capabilities, including the Account API, to create a cold (standby) workspace in another region. This can be done using various tools such as the Databricks [migration tool](https://github.com/databrickslabs/migrate), [Databricks sync](https://github.com/databrickslabs/databricks-sync), or the [Terraform exporter](https://registry.terraform.io/providers/databricks/databricks/latest/docs/guides/experimental-exporter).

- **Regularly Restart Databricks Clusters**: When you restart a cluster, it gets the latest images for the compute resource containers and the VM hosts. It is particularly important to schedule regular restarts for long-running clusters such as those used for processing streaming data. If you enable the compliance security profile for your account or your workspace, long-running clusters are automatically restarted after 25 days. Databricks recommends that admins restart clusters manually during a scheduled maintenance window. This reduces the risk of an auto-restart disrupting a scheduled job.

- **Evaluate Whether your Workflow Requires Git Repos or CI/CD**: Mature organizations often build production workloads by using CI/CD to integrate code scanning, better control permissions, perform linting, and more. When highly sensitive data is analyzed, a CI/CD process can also allow scanning for known issues such as hard-coded secrets.


## Getting Started

1. Clone this Repo
2. Install [Terraform](https://developer.hashicorp.com/terraform/downloads)
3. Decide which [operation](https://github.com/databricks/terraform-databricks-sra/tree/main/aws-gov/tf#operation-mode) mode you'd like to use.
4. Fill out `sra.tf` in place
5. Fill out `template.tfvars.example` and remove the `.example` suffix from the file name
6. Configure the [AWS](https://registry.terraform.io/providers/hashicorp/aws/latest/docs#authentication-and-configuration) and [Databricks](https://registry.terraform.io/providers/databricks/databricks/latest/docs#authentication) provider authentication (a provider configuration sketch follows this list)
7. `cd` into `tf`
8. Run `terraform init`
9. Run `terraform validate`
10. From the `tf` directory, run `terraform plan -var-file ../example.tfvars`
11. Run `terraform apply -var-file ../example.tfvars`
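
For step 6, a minimal provider configuration sketch might look like the following. The account console URL shown is the one commonly used for Databricks on AWS GovCloud, and the service-principal OAuth variables are illustrative; adjust to whatever authentication method your organization uses.

```hcl
# Hypothetical excerpt -- authentication details vary; verify the account host for your environment.
provider "aws" {
  region = var.region  # e.g. "us-gov-west-1"; AWS credentials come from the environment or a named profile
}

# Account-level provider, used for account APIs such as workspace creation
provider "databricks" {
  alias         = "mws"
  host          = "https://accounts.cloud.databricks.us"  # Databricks account console for AWS GovCloud
  account_id    = var.databricks_account_id
  client_id     = var.client_id      # service principal OAuth client ID
  client_secret = var.client_secret  # service principal OAuth secret
}
```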


## Network Diagram - Sandbox
![Architecture Diagram](https://github.com/databricks/terraform-databricks-sra/blob/main/aws-gov/img/Sandbox%20-%20Network%20Topology.png)


## Network Diagram - Firewall
![Architecture Diagram](https://github.com/databricks/terraform-databricks-sra/blob/main/aws-gov/img/Firewall%20-%20Network%20Topology.png)


## Network Diagram - Isolated
![Architecture Diagram](https://github.com/databricks/terraform-databricks-sra/blob/main/aws-gov/img/Isolated%20-%20Network%20Topology.png)


## FAQ

- **I've cloned the GitHub repo; what's the recommended way to add additional Databricks resources to it?**

If you'd like to add additional resources to the repository, the first step is to identify whether the resource uses the **account** or the **workspace** provider.

For example, if it uses the **account** provider, then we'd recommend creating a new module under the [modules/sra/databricks_account](https://github.com/databricks/terraform-databricks-sra/tree/main/aws-gov/tf/modules/sra/databricks_account) folder. That module can then be called from the top-level [databricks_account.tf](https://github.com/databricks/terraform-databricks-sra/blob/main/aws-gov/tf/modules/sra/databricks_account.tf) file. The process is the same for the workspace provider: place a new module in the [modules/sra/databricks_workspace](https://github.com/databricks/terraform-databricks-sra/tree/main/aws-gov/tf/modules/sra/databricks_workspace) folder and call it from the [databricks_workspace.tf](https://github.com/databricks/terraform-databricks-sra/blob/main/aws-gov/tf/modules/sra/databricks_workspace.tf) file, as sketched below.
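
For instance, a new account-level module (hypothetical name `my_account_feature`) could be wired in roughly like this, mirroring how the existing modules are called:

```hcl
# Hypothetical sketch -- added to databricks_account.tf; module and variable names are illustrative.
module "my_account_feature" {
  source = "./modules/sra/databricks_account/my_account_feature"

  providers = {
    databricks = databricks.mws  # match whatever account-level provider alias the existing modules use
  }

  databricks_account_id = var.databricks_account_id
}
```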
Binary file added aws-gov/img/Firewall - Network Topology.png
Binary file added aws-gov/img/Isolated - Network Topology.png
Binary file added aws-gov/img/Sandbox - Network Topology.png
