Update data catalog design doc to explain how non-SQL scripting fits into the catalog (#81)

* Update data catalog design doc to sketch out the relationship between dbt and Glue

* Update Glue strategy in data-catalog.md design doc to distinguish sources from exposures

* Fix typo in ephemeral models link in data-catalog.md

* Fix small typos in data-catalog.md

* Adjust note on managing dbt -> Glue dependencies in data-catalog.md
jeancochrane authored Aug 25, 2023
1 parent 1a4424a commit 5603057
Showing 1 changed file with 91 additions and 2 deletions: documentation/design-docs/data-catalog.md
@@ -241,7 +241,7 @@ As such, we think it would be more prudent for us to build with dbt Core
and design our own orchestration/monitoring/authentication integrations on top.
Hence, when this doc refers to "dbt", we are actually referring to dbt Core.

-The downside of this choice is that we would have to choose a separate tool for
+One downside of this choice is that we would have to choose a separate tool for
orchestrating and monitoring our DAGs if we move forward with dbt. This is an
important fact to note in our decision, because [orchestrators are notoriously
controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
@@ -254,6 +254,21 @@ controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
As such, we evaluate this choice with an eye towards options for third-party
orchestration and monitoring.

Another downside is that dbt does not have robust support for the types of
non-SQL scripted transformations we sometimes want to produce, like our
[sales value flagging script](https://github.com/ccao-data/model-sales-val).
There is currently an effort underway to provide better support for [Python
models](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models)
in dbt, but only [three data
platforms](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models#specific-data-platforms)
have gained support since Python models launched in late 2022, and there
is [not yet a clear
roadmap](https://github.com/dbt-labs/dbt-core/discussions/5742) for their
future development. Consequently, we will need to use a separate system to
keep track of our scripted transformations. We provide a brief sketch of the
design of such a system in the [Tracking raw data and ML
transformations](#tracking-raw-data-and-ml-transformations) section below.
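
To ground this, here is a minimal sketch of the Python model contract on one
of the currently supported platforms (Snowflake, Databricks, or BigQuery);
the `stg_sales` and `sale_price` names are hypothetical, and the exact
DataFrame API varies by adapter (Snowpark, PySpark, or pandas):

```python
# models/flagged_sales.py -- a hypothetical dbt Python model
def model(dbt, session):
    # Configure materialization, as a SQL model would do via Jinja
    dbt.config(materialized="table")

    # dbt.ref() declares a DAG dependency and returns a platform-native
    # DataFrame (e.g. Snowpark or PySpark, depending on the adapter)
    sales = dbt.ref("stg_sales")

    # Arbitrary Python logic that would be awkward to express in SQL
    flagged = sales.filter(sales["sale_price"] > 0)

    # dbt materializes whatever DataFrame the function returns
    return flagged
```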

### Demo

See the
@@ -339,6 +354,80 @@ and validating our data using dbt:
failures](https://docs.getdbt.com/reference/resource-configs/store_failures)
and [AWS SNS](https://aws.amazon.com/sns/) for notification management.

#### Tracking raw data and ML transformations

* We will keep our raw data extraction scripts separate from the dbt DAG, per
[dbt's recommendation](https://docs.getdbt.com/terms/data-extraction).
* Raw data will be referenced in dbt transformations using the [`source()`
function](https://docs.getdbt.com/reference/dbt-jinja-functions/source).
* [Source freshness
checks](https://docs.getdbt.com/docs/deploy/source-freshness) will be used to
ensure that raw data is updated appropriately prior to transformation.
* Where possible, the intermediate transformations defined in the
`aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be rewritten
in SQL and moved into the dbt DAG. During the transition period, while some
transformations are still written in R, we will treat their output as if it
were raw data and reference it using `source()`. Any transformations that
can't be easily rewritten in SQL will continue to be defined this way in the
long term.
* Intermediate or final transformations that require CPU- or memory-intensive
operations like running machine learning models will be defined in Python,
run as AWS Glue jobs, and defined as [ephemeral
models](https://docs.getdbt.com/docs/build/materializations#ephemeral) in the
dbt DAG. This will be true even in cases where the Glue jobs depend on
models produced by the dbt DAG, e.g. the tables produced by
[`model-sales-val`](https://github.com/ccao-data/model-sales-val). A bullet
below will explain how we will manage circular dependencies between these
services.
* Glue jobs will be kept under version control and deployed to AWS using
[Terraform](https://www.terraform.io/) run in GitHub Actions on
commits to their repo's main branch. We will write a reusable [composite
action](https://docs.github.com/en/actions/creating-actions/creating-a-composite-action)
that performs the following operations:
  1. Runs `terraform apply` to recreate the Glue job definition in AWS
     1. In doing so, inserts the current Git SHA as a command argument in the
        Glue job definition so that the job script can read the SHA and use it
        for versioning (see the first sketch after this list).
     2. Supports creating staging jobs that we can use for testing during CI.
  2. Uploads the newest version of the script to the proper bucket in S3.
* We expect to handle dependencies between dbt and Glue in one of three
  ways, depending on the direction of the dependency:
* In cases where dbt depends on the output of a Glue job (Glue -> dbt), we
will treat the Glue job output as an ephemeral model or
a [source](https://docs.getdbt.com/docs/build/sources) in the DAG and
schedule the job as necessary to maintain freshness.
* If we would like to rebuild the dbt models every time the Glue
source data updates, we can schedule the job via GitHub Actions
instead of the Glue job scheduler and configure GitHub Actions to
rerun dbt in case of a successful Glue job run.
* In cases where a Glue job depends on the output of dbt (dbt -> Glue),
we will document the Glue job as an
[exposure](https://docs.getdbt.com/docs/build/exposures) in the DAG.
    Exposures should use the `depends_on` config attribute to properly
    document the lineage of the data created by Glue.
If we would like to ensure that we run the Glue job every time the
dbt source data updates, we can schedule the Glue job using a GitHub
Actions workflow and configure the workflow to check the dbt state
to see if it needs to be rerun.
* In case of a circular dependency between dbt and Glue (dbt -> Glue ->
dbt), we will document the Glue job as an [ephemeral
model](https://docs.getdbt.com/docs/build/materializations#ephemeral) in
dbt so that we can specify its dependencies using [the `depends_on`
attribute](https://docs.getdbt.com/reference/dbt-jinja-functions/ref#forcing-dependencies).
    If we would like to be able to build the entire DAG from scratch,
    including running the Glue jobs and transforming their output using
    dbt, we can separate the dbt config into two targets, use the approach
    from the second bullet above (dbt -> Glue) to trigger the Glue job once
    the first target has completed, and update the dbt wrapper script to
    initiate the second dbt target build once the Glue job has completed
    (see the second sketch after this list).
* Any such wrapper script should also provide the caller with the option
to skip running the Glue job if the AWS CLI can determine that
the output of the Glue job already exists.
* The opposite circular dependency (Glue -> dbt -> Glue) should not
require a special solution since it is just a combination of the
first and second bullets above (i.e. one Glue job acting as a source
and another acting as an exposure).
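
To make the Terraform/Glue versioning scheme above concrete, here is a
sketch of how a Glue job script might read the Git SHA that `terraform
apply` injects as a command argument. `getResolvedOptions` is part of the
AWS-provided `awsglue` library, but the `git_sha` argument name and the
output path are our own hypothetical conventions:

```python
# Sketch: reading the deploy-time Git SHA inside a Glue job script.
import sys

from awsglue.utils import getResolvedOptions

# Resolves arguments passed to the job run, e.g. --git_sha abc1234
# (the git_sha argument would be set in the Terraform job definition)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "git_sha"])

# Use the SHA to version the job's output, e.g. as a partition key
output_path = f"s3://example-bucket/sale/flag/run_sha={args['git_sha']}/"
```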
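And here is a sketch of the wrapper script for the circular (dbt -> Glue ->
dbt) case: build the first dbt target, run the Glue job (skipping it if its
output already exists in S3), then build the second target. The job, bucket,
prefix, and target names are all hypothetical; the boto3 calls
(`start_job_run`, `get_job_run`, `list_objects_v2`) are standard:

```python
# Sketch of a dbt wrapper for the dbt -> Glue -> dbt circular dependency.
import subprocess
import time

import boto3

GLUE_JOB = "sales-val"                      # hypothetical job name
OUTPUT_BUCKET = "example-warehouse-bucket"  # hypothetical bucket
OUTPUT_PREFIX = "sale/flag/"                # hypothetical output prefix


def dbt_build(target: str) -> None:
    """Build one of the two dbt targets, failing loudly on errors."""
    subprocess.run(["dbt", "build", "--target", target], check=True)


def glue_output_exists() -> bool:
    """Check S3 so the caller can skip a Glue job whose output exists."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket=OUTPUT_BUCKET, Prefix=OUTPUT_PREFIX, MaxKeys=1
    )
    return resp["KeyCount"] > 0


def run_glue_job() -> None:
    """Start the Glue job and poll until it finishes."""
    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=GLUE_JOB)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=GLUE_JOB, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state == "SUCCEEDED":
            return
        if state in ("FAILED", "ERROR", "STOPPED", "TIMEOUT"):
            raise RuntimeError(f"Glue job {GLUE_JOB} ended in state {state}")
        time.sleep(30)


if __name__ == "__main__":
    dbt_build("pre_glue")    # models the Glue job depends on
    if not glue_output_exists():
        run_glue_job()
    dbt_build("post_glue")   # models that depend on the Glue output
```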

### Pros

* Designed around software engineering best practices (version control, reproducibility, testing, etc.)
@@ -348,7 +437,7 @@

### Cons

-* No native support for R scripting as a means of building models, so we would have to either rewrite our raw data extraction scripts or use some kind of hack like running our R scripts from a Python function
+* No native support for Python or R scripting as a means of building models, so we can't incorporate our raw data extraction scripts
* We would need to use a [community plugin](https://dbt-athena.github.io/) for Athena support; this plugin is not supported on dbt Cloud, if we ever decided to move to that
* Requires a separate orchestrator for automation, monitoring, and alerting
* Tests currently do not support the same rich documentation descriptions that other entities do (see [this GitHub issue](https://github.com/dbt-labs/dbt-core/issues/2578))
