From 1342737697e30fb740e9f5d520e509b3272bc2d6 Mon Sep 17 00:00:00 2001
From: Jean Cochrane
Date: Fri, 18 Aug 2023 13:36:52 -0500
Subject: [PATCH] Update data catalog design doc to sketch out the relationship between dbt and Glue

---
 documentation/design-docs/data-catalog.md | 79 ++++++++++++++++++++++-
 1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/documentation/design-docs/data-catalog.md b/documentation/design-docs/data-catalog.md
index 6563b7fab..a27447601 100644
--- a/documentation/design-docs/data-catalog.md
+++ b/documentation/design-docs/data-catalog.md
@@ -241,7 +241,7 @@
 As such, we think it would be more prudent for us to build with dbt Core and
 design our own orchestration/monitoring/authentication integrations on top.
 Hence, when this doc refers to "dbt", we are actually referring to dbt Core.
 
-The downside of this choice is that we would have to choose a separate tool for
+One downside of this choice is that we would have to choose a separate tool for
 orchestrating and monitoring our DAGs if we move forward with dbt. This is an
 important fact to note in our decision, because [orchestrators are notoriously
 controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
@@ -254,6 +254,21 @@
 As such, we evaluate this choice with an eye towards options for third-party
 orchestration and monitoring.
 
+Another downside is that dbt does not have robust support for the types of
+non-SQL scripted transformations we sometimes need to produce, like our
+[sales value flagging script](https://github.com/ccao-data/model-sales-val).
+There is an effort underway to provide better support for [Python
+models](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models)
+in dbt, but only [three data
+platforms](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models#specific-data-platforms)
+have gained support since Python models launched in late 2022, and there is
+[not yet a clear
+roadmap](https://github.com/dbt-labs/dbt-core/discussions/5742) for their
+future development. As such, we will need a separate system to keep track of
+our scripted transformations. We provide a brief sketch of the design of such
+a system in the [Tracking raw data and ML
+transformations](#tracking-raw-data-and-ml-transformations) section below.
+
 ### Demo
 
 See the
@@ -339,6 +354,66 @@ and validating our data using dbt:
 failures](https://docs.getdbt.com/reference/resource-configs/store_failures)
 and [AWS SNS](https://aws.amazon.com/sns/) for notification management.
 
+#### Tracking raw data and ML transformations
+
+* We will keep our raw data extraction scripts separate from the dbt DAG, per
+  [dbt's recommendation](https://docs.getdbt.com/terms/data-extraction).
+* Raw data will be referenced in dbt transformations using the [`source()`
+  function](https://docs.getdbt.com/reference/dbt-jinja-functions/source).
+* [Source freshness
+  checks](https://docs.getdbt.com/docs/deploy/source-freshness) will be used
+  to ensure that raw data is updated appropriately prior to transformation.
+* Where possible, the intermediate transformations defined in the
+  `aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be
+  rewritten in SQL and moved into the dbt DAG. During the transition period,
+  while some transformations are still written in R, we will treat their
+  output as if it were raw data and reference it using `source()`. Any
+  transformations that can't easily be rewritten in SQL will continue to be
+  defined this way in the long term.
+* Intermediate transformations that require CPU- or memory-intensive
+  operations, like running machine learning models, will be defined in
+  Python, run as AWS Glue jobs, and defined as [ephemeral
+  models](https://docs.getdbt.com/docs/build/materializations#ephemeral) in
+  the dbt DAG. This will be true even in cases where the Glue jobs depend on
+  models produced by the dbt DAG, e.g. the tables produced by
+  [`model-sales-val`](https://github.com/ccao-data/model-sales-val).
+* Glue jobs will be kept under version control and deployed to AWS using
+  [Terraform](https://www.terraform.io/) run in GitHub Actions on commits to
+  the main branch of the repo that defines them. We will write a reusable
+  [composite
+  action](https://docs.github.com/en/actions/creating-actions/creating-a-composite-action)
+  that performs the following operations:
+  1. Runs `terraform apply` to recreate the Glue job definition in AWS
+     * In doing so, inserts the current Git SHA as a command argument in the
+       Glue job definition so that the job script can read the SHA and use
+       it for versioning (see the first sketch below)
+     * Supports creating staging jobs that we can use for testing during CI
+  2. Uploads the newest version of the script to the proper bucket in S3
+* There will be three expected ways of handling dependencies between dbt and
+  Glue, depending on the direction of the dependency:
+  * In cases where dbt depends on the output of a Glue job (Glue -> dbt), we
+    will treat the Glue job output as a `source()` in the dependent dbt
+    models and schedule the job as necessary to maintain freshness.
+    * If we would like to rebuild the dbt models every time the Glue source
+      data updates, we can schedule the job via GitHub Actions instead of
+      the Glue job scheduler and configure GitHub Actions to rerun dbt after
+      a successful Glue job run.
+  * In cases where a Glue job depends on the output of dbt (dbt -> Glue), we
+    will write a wrapper script around `dbt run` that uses the Glue
+    `StartJobRun` API
+    ([docs](https://docs.aws.amazon.com/glue/latest/webapi/API_StartJobRun.html))
+    to trigger job runs once the dbt build completes successfully (see the
+    second sketch below).
+  * In case of a circular dependency between dbt and Glue (dbt -> Glue ->
+    dbt), we will separate the dbt config into two targets, use the approach
+    from the previous bullet (dbt -> Glue) to trigger the Glue job once the
+    first target has finished building, and update the dbt wrapper script to
+    initiate the second dbt target build once the Glue job has completed.
+    * This wrapper script should also give the caller the option to skip
+      running the Glue job if the AWS CLI can determine that the output of
+      the Glue job already exists.
+    * The opposite circular dependency (Glue -> dbt -> Glue) should not
+      require a special solution, since it is just a combination of the
+      first two bullets above.
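+
+To make the composite action's versioning step concrete, here is a minimal,
+hypothetical sketch of a Glue job script reading the commit SHA that
+Terraform inserts into the job definition. The `commit_sha` argument name,
+bucket, and output path are illustrative assumptions, not settled
+conventions:
+
+```python
+import sys
+
+# getResolvedOptions is provided by the Glue runtime; it parses arguments
+# set in the job definition or passed to the StartJobRun API
+from awsglue.utils import getResolvedOptions
+
+# Assumes the Terraform job definition sets a --commit_sha default argument
+args = getResolvedOptions(sys.argv, ["JOB_NAME", "commit_sha"])
+
+# Version the job's output by SHA so that each deployed revision of the
+# script writes to a distinct, traceable location
+output_path = f"s3://ccao-example-bucket/output/{args['commit_sha']}/"
+```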
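+
+The second sketch shows what the wrapper script described above might look
+like, combining the dbt -> Glue trigger, the skip-if-output-exists option,
+and the second target build for the circular case. The target names, Glue
+job name, and S3 locations are all illustrative assumptions:
+
+```python
+import subprocess
+import time
+
+import boto3
+
+GLUE_JOB_NAME = "sales-val-flagging"   # illustrative job name
+OUTPUT_BUCKET = "ccao-example-bucket"  # illustrative output location
+OUTPUT_PREFIX = "sale/flag/"
+
+
+def run_dbt(target: str) -> None:
+    """Run dbt against the given target, raising if the build fails."""
+    subprocess.run(["dbt", "run", "--target", target], check=True)
+
+
+def glue_output_exists(s3) -> bool:
+    """Check whether the Glue job's output already exists in S3."""
+    resp = s3.list_objects_v2(Bucket=OUTPUT_BUCKET, Prefix=OUTPUT_PREFIX,
+                              MaxKeys=1)
+    return resp["KeyCount"] > 0
+
+
+def run_glue_job(glue) -> None:
+    """Start the Glue job and poll until it finishes."""
+    run_id = glue.start_job_run(JobName=GLUE_JOB_NAME)["JobRunId"]
+    while True:
+        run = glue.get_job_run(JobName=GLUE_JOB_NAME, RunId=run_id)
+        state = run["JobRun"]["JobRunState"]
+        if state == "SUCCEEDED":
+            return
+        if state in ("FAILED", "ERROR", "STOPPED", "TIMEOUT"):
+            raise RuntimeError(f"Glue job finished in state {state}")
+        time.sleep(30)
+
+
+def main(skip_existing_glue_output: bool = False) -> None:
+    s3 = boto3.client("s3")
+    glue = boto3.client("glue")
+    # First dbt build: the models that the Glue job depends on (dbt -> Glue)
+    run_dbt("pre_glue")
+    # Optionally skip the Glue run if its output already exists
+    if not (skip_existing_glue_output and glue_output_exists(s3)):
+        run_glue_job(glue)
+    # Second dbt build: models that depend on the Glue output (Glue -> dbt)
+    run_dbt("post_glue")
+
+
+if __name__ == "__main__":
+    main()
+```
+
+A production version might use dbt's `--select` flag rather than two
+separate targets to split the DAG into pre- and post-Glue builds; the sketch
+follows the two-target framing above for simplicity.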
+
 ### Pros
 
 * Designed around software engineering best practices (version control, reproducibility, testing, etc.)
@@ -348,7 +423,7 @@ and validating our data using dbt:
 ### Cons
 
-* No native support for R scripting as a means of building models, so we would have to either rewrite our raw data extraction scripts or use some kind of hack like running our R scripts from a Python function
+* No native support for Python or R scripting as a means of building models, so we can't incorporate our raw data extraction scripts into the dbt DAG
 * We would need to use a [community plugin](https://dbt-athena.github.io/) for Athena support; this plugin is not supported on dbt Cloud if we ever decide to move to that platform
 * Requires a separate orchestrator for automation, monitoring, and alerting
 * Tests currently do not support the same rich documentation descriptions that other entities do (see [this GitHub issue](https://github.com/dbt-labs/dbt-core/issues/2578))