Update data catalog design doc to sketch out the relationship between dbt and Glue
jeancochrane committed Aug 18, 2023
1 parent 90ee7f1 commit 1342737
Showing 1 changed file with 77 additions and 2 deletions: documentation/design-docs/data-catalog.md
As such, we think it would be more prudent for us to build with dbt Core
and design our own orchestration/monitoring/authentication integrations on top.
Hence, when this doc refers to "dbt", we are actually referring to dbt Core.

One downside of this choice is that we would have to choose a separate tool for
orchestrating and monitoring our DAGs if we move forward with dbt. This is an
important fact to note in our decision, because [orchestrators are notoriously
controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
As such, we evaluate this choice with an eye towards options for third-party
orchestration and monitoring.

Another downside is that dbt does not have robust support for the types of
non-SQL scripted transformations we sometimes want to produce, like our
[sales value flagging script](https://github.com/ccao-data/model-sales-val).
There is currently an effort underway to provide better support for [Python
models](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models)
in dbt, but only [three data
platforms](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models#specific-data-platforms)
have been supported since Python models launched in late 2022, and there is
[not yet a clear
roadmap](https://github.com/dbt-labs/dbt-core/discussions/5742) for their
future development. As such, we will need to use a separate system to
keep track of our scripted transformations. We provide a brief sketch of the
design of such a system in the [Tracking raw data and ML
transformations](#tracking-raw-data-and-ml-transformations) section below.
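
For reference, a dbt Python model is just a function named `model` that
returns a data frame. Here is a minimal sketch; the upstream model name is a
hypothetical placeholder:

```python
# models/example_python_model.py -- minimal sketch of a dbt Python model
def model(dbt, session):
    # Read an upstream dbt model as a data frame (model name is hypothetical)
    upstream = dbt.ref("stg_parcel_sales")

    # Scripted transformation logic would go here; for now, pass data through
    return upstream
```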

### Demo

See the
[…]

and validating our data using dbt:
failures](https://docs.getdbt.com/reference/resource-configs/store_failures)
and [AWS SNS](https://aws.amazon.com/sns/) for notification management.

#### Tracking raw data and ML transformations

* We will keep our raw data extraction scripts separate from the dbt DAG, per
[dbt's recommendation](https://docs.getdbt.com/terms/data-extraction).
* Raw data will be referenced in dbt transformations using the [`source()`
function](https://docs.getdbt.com/reference/dbt-jinja-functions/source).
* [Source freshness
checks](https://docs.getdbt.com/docs/deploy/source-freshness) will be used to
ensure that raw data is updated appropriately prior to transformation (see the
example sources file following this list).
* Where possible, the intermediate transformations defined in the
`aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be rewritten
in SQL and moved into the dbt DAG. During the transition period, while some
transformations are still written in R, we will treat their output as if it
were raw data and reference it using `source()`. Any transformations that
can't be easily rewritten in SQL will continue to be defined this way in the
long term.
* Intermediate transformations that require CPU- or memory-intensive operations
like running machine learning models will be defined in Python, run as
AWS Glue jobs, and defined as [ephemeral
models](https://docs.getdbt.com/docs/build/materializations#ephemeral) in the
dbt DAG.
This will be true even in cases where the Glue jobs depend on models produced
by the dbt DAG, e.g. the tables produced by
[`model-sales-val`](https://github.com/ccao-data/model-sales-val).
* Glue jobs will be kept under version control and deployed to AWS using
[Terraform](https://www.terraform.io/) run in GitHub Actions on
commits to their repo's main branch. We will write a reusable [composite
action](https://docs.github.com/en/actions/creating-actions/creating-a-composite-action)
that performs the following operations (sketched after this list):
1. Runs `terraform apply` to recreate the Glue job definition in AWS
* In doing so, inserts the current Git SHA as a command argument in the
Glue job definition so that the job script can read the SHA and use it
for versioning
* Supports creating staging jobs that we can use for testing during CI
2. Uploads the newest version of the script to the proper bucket in S3
* There are three ways in which we expect to handle dependencies between dbt
and Glue, depending on the direction of the dependency graph:
* In cases where dbt depends on the output of a Glue job (Glue -> dbt), we
will treat the Glue job output as a `source()` in the dependent dbt
models and schedule the job as necessary to maintain freshness.
* If we would like to rebuild the dbt models every time the Glue
source data updates, we can schedule the job via GitHub Actions
instead of the Glue job scheduler and configure GitHub Actions to
rerun dbt in case of a successful Glue job run.
* In cases where a Glue job depends on the output of dbt (dbt -> Glue),
we will write a wrapper script around `dbt run` that uses the Glue
`StartJobRun` API
([docs](https://docs.aws.amazon.com/glue/latest/webapi/API_StartJobRun.html))
to trigger job runs once the dbt build completes successfully (see the
wrapper sketch following this list).
* In case of a circular dependency between dbt and Glue (dbt -> Glue ->
dbt), we will separate the dbt config into two targets, use the approach
from the second bullet (dbt -> Glue) to trigger the Glue job once the first
target has completed, and update the dbt wrapper script to initiate
the second dbt target build once the Glue job has completed.
* This wrapper script should also provide the caller with the option
to skip running the Glue job if the AWS CLI can determine that
the output of the Glue job already exists.
* The opposite circular dependency (Glue -> dbt -> Glue) should not
require a special solution since it is just a combination of the
first and second bullets above.
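
To make the `source()` and freshness bullets above concrete, here is a minimal
sketch of a dbt sources file; the path, schema, table, and column names are
hypothetical placeholders rather than our actual warehouse layout:

```yaml
# models/staging/sources.yml (hypothetical path and names)
version: 2

sources:
  - name: sales
    schema: raw_sales            # placeholder schema populated by our extraction scripts
    loaded_at_field: loaded_at   # placeholder timestamp column used for freshness checks
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: parcel_sales
```

A downstream model would reference this table as
`{{ source('sales', 'parcel_sales') }}`, and running `dbt source freshness`
before a build would enforce the staleness thresholds above.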
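
Similarly, here is a rough sketch of the composite action for deploying Glue
jobs. The action path, input names, and Terraform variable names are all
assumptions for the sake of illustration, not settled choices:

```yaml
# .github/actions/deploy-glue-job/action.yaml (hypothetical path and inputs)
name: Deploy Glue job
description: Recreate a Glue job definition with Terraform and upload its script to S3
inputs:
  job-name:
    description: Name of the Glue job to deploy
    required: true
  script-path:
    description: Local path to the Glue job script
    required: true
  script-bucket:
    description: S3 bucket that stores Glue job scripts
    required: true
  environment:
    description: Set to "staging" to create a staging job for testing during CI
    required: false
    default: prod
runs:
  using: composite
  steps:
    - name: Recreate the Glue job definition, embedding the current commit SHA
      shell: bash
      run: |
        terraform init
        terraform apply -auto-approve \
          -var "job_name=${{ inputs.job-name }}" \
          -var "environment=${{ inputs.environment }}" \
          -var "commit_sha=${{ github.sha }}"
    - name: Upload the newest version of the script to S3
      shell: bash
      run: |
        aws s3 cp "${{ inputs.script-path }}" \
          "s3://${{ inputs.script-bucket }}/${{ inputs.job-name }}/${{ github.sha }}.py"
```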
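
Finally, here is a minimal Python sketch of the `dbt run` wrapper described in
the dbt -> Glue and circular-dependency bullets, using boto3 rather than the
AWS CLI to check for existing Glue output. The job name, bucket, prefix, and
dbt target names are hypothetical placeholders:

```python
"""Sketch of a wrapper that runs dbt, triggers a Glue job, then runs dbt again."""
import subprocess
import sys
import time

import boto3

GLUE_JOB_NAME = "example-sales-val-job"     # hypothetical Glue job name
OUTPUT_BUCKET = "example-warehouse-bucket"  # hypothetical bucket for Glue output
OUTPUT_PREFIX = "sale/flagged/"             # hypothetical prefix for Glue output


def dbt_build(target: str) -> None:
    """Build one dbt target, raising if the build fails."""
    subprocess.run(["dbt", "build", "--target", target], check=True)


def glue_output_exists(s3) -> bool:
    """Return True if the Glue job's output already exists in S3."""
    resp = s3.list_objects_v2(Bucket=OUTPUT_BUCKET, Prefix=OUTPUT_PREFIX, MaxKeys=1)
    return resp["KeyCount"] > 0


def run_glue_job(glue) -> None:
    """Start the Glue job and poll until it reaches a terminal state."""
    run_id = glue.start_job_run(JobName=GLUE_JOB_NAME)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=GLUE_JOB_NAME, RunId=run_id)
        state = run["JobRun"]["JobRunState"]
        if state == "SUCCEEDED":
            return
        if state in ("FAILED", "ERROR", "STOPPED", "TIMEOUT"):
            raise RuntimeError(f"Glue job {GLUE_JOB_NAME} ended in state {state}")
        time.sleep(30)


def main(skip_existing_output: bool) -> None:
    s3, glue = boto3.client("s3"), boto3.client("glue")
    dbt_build("first")  # models the Glue job depends on (dbt -> Glue)
    if skip_existing_output and glue_output_exists(s3):
        print("Glue output already exists; skipping job run")
    else:
        run_glue_job(glue)
    dbt_build("second")  # models that depend on Glue output (-> Glue -> dbt)


if __name__ == "__main__":
    main(skip_existing_output="--skip-existing-output" in sys.argv)
```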

### Pros

* Designed around software engineering best practices (version control, reproducibility, testing, etc.)

### Cons

* No native support for R scripting as a means of building models, and dbt's nascent Python models do not support our data platform, so we can't incorporate our raw data extraction scripts
* We would need to use a [community plugin](https://dbt-athena.github.io/) for Athena support; this plugin is not supported on dbt Cloud, if we ever decided to move to that
* Requires a separate orchestrator for automation, monitoring, and alerting
* Tests currently do not support the same rich documentation descriptions that other entities do (see [this GitHub issue](https://github.com/dbt-labs/dbt-core/issues/2578))
