Update data catalog design doc to explain how non-SQL scripting fits into the catalog (#81)

* Update data catalog design doc to sketch out the relationship between dbt and Glue

* Update Glue strategy in data-catalog.md design doc to distinguish sources from exposures

* Fix typo in ephemeral models link in data-catalog.md

* Fix small typos in data-catalog.md

* Adjust note on managing dbt -> Glue dependencies in data-catalog.md
jeancochrane authored Aug 25, 2023
1 parent 1a4424a commit 5603057
Showing 1 changed file with 91 additions and 2 deletions: documentation/design-docs/data-catalog.md
@@ -241,7 +241,7 @@ As such, we think it would be more prudent for us to build with dbt Core
and design our own orchestration/monitoring/authentication integrations on top.
Hence, when this doc refers to "dbt", we are actually referring to dbt Core.

-The downside of this choice is that we would have to choose a separate tool for
+One downside of this choice is that we would have to choose a separate tool for
orchestrating and monitoring our DAGs if we move forward with dbt. This is an
important fact to note in our decision, because [orchestrators are notoriously
controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
@@ -254,6 +254,21 @@ controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
As such, we evaluate this choice with an eye towards options for third-party
orchestration and monitoring.

Another downside is that dbt does not have robust support for the types of
non-SQL scripted transformations we sometimes want to produce, like our
[sales value flagging script](https://github.com/ccao-data/model-sales-val).
There is currently an effort underway to provide better support for [Python
models](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models)
in dbt, but only [three data
platforms](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models#specific-data-platforms)
have gained support since Python models launched in late 2022, and there
is [not yet a clear
roadmap](https://github.com/dbt-labs/dbt-core/discussions/5742) for their
future development. Consequently, we will need to use a separate system to
keep track of our scripted transformations. We provide a brief sketch of the
design of such a system in the [Tracking raw data and ML
transformations](#tracking-raw-data-and-ml-transformations) section below.
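
To ground this, here is a minimal sketch of the Python model contract on one
of the currently supported platforms (Snowflake, Databricks, or BigQuery);
the `stg_sales` and `sale_price` names are hypothetical, and the exact
DataFrame API varies by adapter (Snowpark, PySpark, or pandas):

```python
# models/flagged_sales.py -- a hypothetical dbt Python model
def model(dbt, session):
    # Configure materialization, as a SQL model would do via Jinja
    dbt.config(materialized="table")

    # dbt.ref() declares a DAG dependency and returns a platform-native
    # DataFrame (e.g. Snowpark or PySpark, depending on the adapter)
    sales = dbt.ref("stg_sales")

    # Arbitrary Python logic that would be awkward to express in SQL
    flagged = sales.filter(sales["sale_price"] > 0)

    # dbt materializes whatever DataFrame the function returns
    return flagged
```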

### Demo

See the
@@ -339,6 +354,80 @@ and validating our data using dbt:
failures](https://docs.getdbt.com/reference/resource-configs/store_failures)
and [AWS SNS](https://aws.amazon.com/sns/) for notification management.

#### Tracking raw data and ML transformations

* We will keep our raw data extraction scripts separate from the dbt DAG, per
[dbt's recommendation](https://docs.getdbt.com/terms/data-extraction).
* Raw data will be referenced in dbt transformations using the [`source()`
function](https://docs.getdbt.com/reference/dbt-jinja-functions/source).
* [Source freshness
checks](https://docs.getdbt.com/docs/deploy/source-freshness) will be used to
ensure that raw data is updated appropriately prior to transformation.
* Where possible, the intermediate transformations defined in the
`aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be rewritten
in SQL and moved into the dbt DAG. During the transition period, while some
transformations are still written in R, we will treat their output as if it
were raw data and reference it using `source()`. Any transformations that
can't be easily rewritten in SQL will continue to be defined this way in the
long term.
* Intermediate or final transformations that require CPU- or memory-intensive
operations like running machine learning models will be defined in Python,
run as AWS Glue jobs, and defined as [ephemeral
models](https://docs.getdbt.com/docs/build/materializations#ephemeral) in the
dbt DAG. This will be true even in cases where the Glue jobs depend on
models produced by the dbt DAG, e.g. the tables produced by
[`model-sales-val`](https://github.com/ccao-data/model-sales-val). A bullet
below will explain how we will manage circular dependencies between these
services.
* Glue jobs will be kept under version control and deployed to AWS using
[Terraform](https://www.terraform.io/) run in GitHub Actions on
commits to their repo's main branch. We will write a reusable [composite
action](https://docs.github.com/en/actions/creating-actions/creating-a-composite-action)
that performs the following operations:
  1. Runs `terraform apply` to recreate the Glue job definition in AWS
     1. In doing so, inserts the current Git SHA as a command argument in the
        Glue job definition so that the job script can read the SHA and use it
        for versioning (see the first sketch after this list).
     2. Supports creating staging jobs that we can use for testing during CI.
  2. Uploads the newest version of the script to the proper bucket in S3.
* We expect to handle dependencies between dbt and Glue in one of three
  ways, depending on the direction of the dependency:
* In cases where dbt depends on the output of a Glue job (Glue -> dbt), we
will treat the Glue job output as an ephemeral model or
a [source](https://docs.getdbt.com/docs/build/sources) in the DAG and
schedule the job as necessary to maintain freshness.
* If we would like to rebuild the dbt models every time the Glue
source data updates, we can schedule the job via GitHub Actions
instead of the Glue job scheduler and configure GitHub Actions to
rerun dbt in case of a successful Glue job run.
* In cases where a Glue job depends on the output of dbt (dbt -> Glue),
we will document the Glue job as an
[exposure](https://docs.getdbt.com/docs/build/exposures) in the DAG.
    Exposures should use the `depends_on` config attribute to properly
    document the lineage of the data created by Glue.
If we would like to ensure that we run the Glue job every time the
dbt source data updates, we can schedule the Glue job using a GitHub
Actions workflow and configure the workflow to check the dbt state
to see if it needs to be rerun.
* In case of a circular dependency between dbt and Glue (dbt -> Glue ->
dbt), we will document the Glue job as an [ephemeral
model](https://docs.getdbt.com/docs/build/materializations#ephemeral) in
dbt so that we can specify its dependencies using [the `depends_on`
attribute](https://docs.getdbt.com/reference/dbt-jinja-functions/ref#forcing-dependencies).
    If we would like to be able to build the entire DAG from scratch,
    including running the Glue jobs and transforming their output using
    dbt, we can separate the dbt config into two targets, use the approach
    from the second bullet above (dbt -> Glue) to trigger the Glue job once
    the first target has completed, and update the dbt wrapper script to
    initiate the second dbt target build once the Glue job has completed
    (see the second sketch after this list).
* Any such wrapper script should also provide the caller with the option
to skip running the Glue job if the AWS CLI can determine that
the output of the Glue job already exists.
* The opposite circular dependency (Glue -> dbt -> Glue) should not
require a special solution since it is just a combination of the
first and second bullets above (i.e. one Glue job acting as a source
and another acting as an exposure).
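
To make the Terraform/Glue versioning scheme above concrete, here is a
sketch of how a Glue job script might read the Git SHA that `terraform
apply` injects as a command argument. `getResolvedOptions` is part of the
AWS-provided `awsglue` library, but the `git_sha` argument name and the
output path are our own hypothetical conventions:

```python
# Sketch: reading the deploy-time Git SHA inside a Glue job script.
import sys

from awsglue.utils import getResolvedOptions

# Resolves arguments passed to the job run, e.g. --git_sha abc1234
# (the git_sha argument would be set in the Terraform job definition)
args = getResolvedOptions(sys.argv, ["JOB_NAME", "git_sha"])

# Use the SHA to version the job's output, e.g. as a partition key
output_path = f"s3://example-bucket/sale/flag/run_sha={args['git_sha']}/"
```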
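And here is a sketch of the wrapper script for the circular (dbt -> Glue ->
dbt) case: build the first dbt target, run the Glue job (skipping it if its
output already exists in S3), then build the second target. The job, bucket,
prefix, and target names are all hypothetical; the boto3 calls
(`start_job_run`, `get_job_run`, `list_objects_v2`) are standard:

```python
# Sketch of a dbt wrapper for the dbt -> Glue -> dbt circular dependency.
import subprocess
import time

import boto3

GLUE_JOB = "sales-val"                      # hypothetical job name
OUTPUT_BUCKET = "example-warehouse-bucket"  # hypothetical bucket
OUTPUT_PREFIX = "sale/flag/"                # hypothetical output prefix


def dbt_build(target: str) -> None:
    """Build one of the two dbt targets, failing loudly on errors."""
    subprocess.run(["dbt", "build", "--target", target], check=True)


def glue_output_exists() -> bool:
    """Check S3 so the caller can skip a Glue job whose output exists."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(
        Bucket=OUTPUT_BUCKET, Prefix=OUTPUT_PREFIX, MaxKeys=1
    )
    return resp["KeyCount"] > 0


def run_glue_job() -> None:
    """Start the Glue job and poll until it finishes."""
    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=GLUE_JOB)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=GLUE_JOB, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state == "SUCCEEDED":
            return
        if state in ("FAILED", "ERROR", "STOPPED", "TIMEOUT"):
            raise RuntimeError(f"Glue job {GLUE_JOB} ended in state {state}")
        time.sleep(30)


if __name__ == "__main__":
    dbt_build("pre_glue")    # models the Glue job depends on
    if not glue_output_exists():
        run_glue_job()
    dbt_build("post_glue")   # models that depend on the Glue output
```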

### Pros

* Designed around software engineering best practices (version control, reproducibility, testing, etc.)
@@ -348,7 +437,7 @@

### Cons

-* No native support for R scripting as a means of building models, so we would have to either rewrite our raw data extraction scripts or use some kind of hack like running our R scripts from a Python function
+* No native support for Python or R scripting as a means of building models, so we can't incorporate our raw data extraction scripts
* We would need to use a [community plugin](https://dbt-athena.github.io/) for Athena support; this plugin is not supported on dbt Cloud, if we ever decided to move to that
* Requires a separate orchestrator for automation, monitoring, and alerting
* Tests currently do not support the same rich documentation descriptions that other entities do (see [this GitHub issue](https://github.com/dbt-labs/dbt-core/issues/2578))
