Update data catalog design doc to explain how non-SQL scripting fits into the catalog #81

Merged · 6 commits · Aug 25, 2023
`documentation/design-docs/data-catalog.md`: 93 changes (91 additions, 2 deletions)
@@ -241,7 +241,7 @@ As such, we think it would be more prudent for us to build with dbt Core
and design our own orchestration/monitoring/authentication integrations on top.
Hence, when this doc refers to "dbt", we are actually referring to dbt Core.

- The downside of this choice is that we would have to choose a separate tool for
+ One downside of this choice is that we would have to choose a separate tool for
orchestrating and monitoring our DAGs if we move forward with dbt. This is an
important fact to note in our decision, because [orchestrators are notoriously
controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
@@ -254,6 +254,21 @@ controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
As such, we evaluate this choice with an eye towards options for third-party
orchestration and monitoring.

Another downside is that dbt does not have robust support for the types of
non-SQL scripted transformations we sometimes want to produce, like our
[sales value flagging script](https://github.com/ccao-data/model-sales-val).
There is currently an effort underway to provide better support for [Python
models](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models)
in dbt, but only [three data
platforms](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models#specific-data-platforms)
have been supported since the launch of Python models in late 2022, and there
is [not yet a clear
roadmap](https://github.com/dbt-labs/dbt-core/discussions/5742) for the future
development of Python models. As such, we will need to use a separate system to
keep track of our scripted transformations. We provide a brief sketch of the
design of such a system in the [Tracking raw data and ML
transformations](#tracking-raw-data-and-ml-transformations) section below.

### Demo

See the
@@ -339,6 +354,80 @@ and validating our data using dbt:
failures](https://docs.getdbt.com/reference/resource-configs/store_failures)
> **Member:** Unrelated to this PR, but this looks really neat and I'm extremely curious what is actually stored and how.

> **Contributor Author:** Yeah, I think this will be a useful feature when we start engineering a pipeline to report test output to different stakeholders! I've played around with it a bit; at a high level, each test gets its own view with a set of rows that failed for the test. I think it would take some engineering effort to transform the raw set of rows into a format that highlights the error at hand and helps a person make a decision about how to fix it, but it's nice to know we have this option as we start to think about the notification stream for our tests.

and [AWS SNS](https://aws.amazon.com/sns/) for notification management.

#### Tracking raw data and ML transformations

* We will keep our raw data extraction scripts separate from the dbt DAG, per
[dbt's recommendation](https://docs.getdbt.com/terms/data-extraction).
* Raw data will be referenced in dbt transformations using the [`source()`
function](https://docs.getdbt.com/reference/dbt-jinja-functions/source).
* [Source freshness
checks](https://docs.getdbt.com/docs/deploy/source-freshness) will be used to
ensure that raw data is updated appropriately prior to transformation (see
the example `schema.yml` sketch after this list).
* Where possible, the intermediate transformations defined in the
`aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be rewritten
in SQL and moved into the dbt DAG. During the transition period, while some
> **Contributor Author** (on lines +366 to +368): Transitioning our R transformation scripts to SQL is a big new step that we haven't talked about yet. Based on my read of `aws-s3/scripts-ccao-data-warehouse-us-east-1`, it seems like everything can either be transitioned to SQL or comfortably considered a "source" instead of a data transformation; does that match your understanding? It will be a lift to move all this stuff but I think our future selves will thank us if we can move as much stuff as possible into the DAG.
>
> This also reminds me that we don't really have an issue yet to track the work to transition CTAs to the DAG; perhaps I'll go ahead and open that in the Architecture Upgrades milestone and we can prioritize it later?

> **Member:** I agree that this is the correct direction to go, but it's going to require a ton of work and probably be much harder than we think. Many of the scripts in `aws-s3/scripts-ccao-data-warehouse-us-east-1` are doing complex transformations, using APIs, or doing ML of some sort, all of which would be nearly impossible to replicate in SQL. There's also a mix of scripts that are loading data from outside sources (via our own raw S3 bucket). I made a preliminary issue for tracking the warehouse script work here: #99, will leave the CTA --> DAG issue creation to you!

> **Contributor Author:** Thanks for that, I added the CTA ➡️ DAG issue here! #101

transformations are still written in R, we will treat their output as if it
were raw data and reference it using `source()`. Any transformations that
can't be easily rewritten in SQL will continue to be defined this way in the
> **Member:** It might be useful long-term to rewrite some of them as Python, especially if we expect dbt to add better Python support in the near future. Certainly we can write any new transform scripts that don't work as SQL using Python.

> **Contributor Author:** Definitely aligned with this! Should I add it as a note to the doc, or are we fine with keeping this as tacit knowledge?

> **Member:** Let's keep it as tacit knowledge for the time being.

long term.
* Intermediate or final transformations that require CPU- or memory-intensive
operations like running machine learning models will be defined in Python,
run as AWS Glue jobs, and defined as [ephemeral
models](https://docs.getdbt.com/docs/build/materializations#ephemeral) in the
dbt DAG. This will be true even in cases where the Glue jobs depend on
models produced by the dbt DAG, e.g. the tables produced by
[`model-sales-val`](https://github.com/ccao-data/model-sales-val). A bullet
below will explain how we will manage circular dependencies between these
services.
* Glue jobs will be kept under version control and deployed to AWS using
[Terraform](https://www.terraform.io/) run in GitHub Actions on
commits to their repo's main branch. We will write a reusable [composite
action](https://docs.github.com/en/actions/creating-actions/creating-a-composite-action)
that performs the following operations (a rough sketch of such an action
appears after this list):
  1. Runs `terraform apply` to recreate the Glue job definition in AWS
     1. In doing so, inserts the current Git SHA as a command argument in the
        Glue job definition so that the job script can read the SHA and use it
        for versioning.
     2. Supports creating staging jobs that we can use for testing during CI.
  2. Uploads the newest version of the script to the proper bucket in S3
* We expect to handle dependencies between dbt and Glue in one of three
ways, depending on the direction of the dependency graph:
  * In cases where dbt depends on the output of a Glue job (Glue -> dbt), we
    will treat the Glue job output as an ephemeral model or
    a [source](https://docs.getdbt.com/docs/build/sources) in the DAG and
    schedule the job as necessary to maintain freshness.
    * If we would like to rebuild the dbt models every time the Glue
      source data updates, we can schedule the job via GitHub Actions
      instead of the Glue job scheduler and configure GitHub Actions to
      rerun dbt in case of a successful Glue job run.
  * In cases where a Glue job depends on the output of dbt (dbt -> Glue),
    we will document the Glue job as an
    [exposure](https://docs.getdbt.com/docs/build/exposures) in the DAG.
    Exposures should make use of the `depends_on` config attribute in order
    to properly document the lineage of the data created by Glue (see the
    example `schema.yml` sketch after this list).
    If we would like to ensure that we run the Glue job every time the
    dbt source data updates, we can schedule the Glue job using a GitHub
    Actions workflow and configure the workflow to check the dbt state
    to see if it needs to be rerun.
  * In case of a circular dependency between dbt and Glue (dbt -> Glue ->
    dbt), we will document the Glue job as an [ephemeral
    model](https://docs.getdbt.com/docs/build/materializations#ephemeral) in
> **Member** (on lines +413 to +414): question (blocking): Will this work with ephemeral models? We will soon need to include `sale.flag` in `default.vw_pin_sale` (see #97), but ephemeral models can't be queried directly if I'm reading the docs correctly.
>
> Maybe the sales transformation doesn't apply here since the `sale.flag` output is reincorporated into the DAG automatically via inclusion in a view?

> **Contributor Author:** You actually can query from ephemeral models in the context of a view! As an example, I tested with two dummy models, one of which is ephemeral and one of which is a view:
>
> ```sql
> -- models/iasworld/valclass_ephemeral.sql
> {{ config(materialized='ephemeral') }}
>
> select * from {{ source('iasworld', 'valclass') }}
> ```
>
> ```sql
> -- models/iasworld/valclass_view.sql
> {{ config(materialized='view') }}
>
> select * from {{ ref('valclass_ephemeral') }}
> ```
>
> You can see the definition of the view in Athena under `dev_jecochr_iasworld.valclass_view` and test a query against it:
>
> ```sql
> CREATE OR REPLACE VIEW "valclass_view" AS
> WITH
>   __dbt__cte__valclass_ephemeral AS (
>    SELECT *
>    FROM
>      "awsdatacatalog"."iasworld"."valclass"
> )
> SELECT *
> FROM
>   __dbt__cte__valclass_ephemeral
> ```
>
> ```sql
> select * from "dev_jecochr_iasworld"."valclass_view" limit 10
> ```
>
> I'll delete this dummy view from Athena when I pull in this PR, but in the meantime it's up there in case you'd like to confirm with your own testing.

> **Member:** Got it, so it just ends up as a queryable CTE. That works for me!

    dbt so that we can specify its dependencies using [the `depends_on`
    attribute](https://docs.getdbt.com/reference/dbt-jinja-functions/ref#forcing-dependencies).
    If we would like to be able to build the entire DAG from scratch,
    including running the Glue jobs and transforming their output using
    dbt, we can separate the dbt config into two targets, use the second
    bullet approach above (dbt -> Glue) to trigger the Glue job once the first
    target has completed, and update the dbt wrapper script to initiate
    the second dbt target build once the Glue job has completed.
> **Contributor Author** (on lines +417 to +422): I think we should only do this in cases where we absolutely need to be able to build the full DAG from scratch, since it would introduce a lot of complexity into the build process. But I wanted to sketch it out here to make it clear that we should be able to tie together basically any Glue job with our dbt catalog given enough engineering effort.

> **Member:** IMO we should avoid doing this if at all possible. I'll try to think more about some clever ways to get around this.

> **Contributor Author:** That's reasonable; should we strike this section of the doc until we make a decision on that, then?

> **Member:** Fine to keep it for now! Let's wait until one of us actually (hopefully) finds something clever.

    * Any such wrapper script should also provide the caller with the option
      to skip running the Glue job if the AWS CLI can determine that
      the output of the Glue job already exists.
  * The opposite circular dependency (Glue -> dbt -> Glue) should not
    require a special solution since it is just a combination of the
    first and second bullets above (i.e. one Glue job acting as a source
    and another acting as an exposure).
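
As an illustration of the source, freshness, and exposure bullets above, here
is a minimal `schema.yml` sketch. The table name, owner, timestamp column, and
freshness thresholds are placeholder assumptions rather than settled
configuration; only the general shape (a source with a `freshness` block, and
an exposure whose `depends_on` documents a Glue job's inputs) reflects the
design described above.

```yaml
version: 2

sources:
  - name: sale                      # hypothetical source grouping
    tables:
      - name: flag                  # model-sales-val output treated as raw data
        loaded_at_field: loaded_at  # assumed timestamp column used for freshness checks
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}

exposures:
  - name: model_sales_val           # documents the sales value flagging Glue job
    type: application
    owner:
      name: Data Department         # placeholder owner
      email: data@example.com       # placeholder email (required by older dbt versions)
    depends_on:
      - ref('vw_pin_sale')          # dbt model the Glue job reads from
```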

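Similarly, here is a rough sketch of the reusable composite action described
above, assuming hypothetical input names, file paths, and Terraform variables
(`commit_sha`, `environment`); the real action would need to match however the
Glue job's Terraform module is actually parameterized.

```yaml
# .github/actions/deploy-glue-job/action.yaml (hypothetical path and inputs)
name: Deploy Glue job
description: Recreate a Glue job definition with Terraform and upload its script to S3
inputs:
  environment:
    description: Deployment target, e.g. prod or a staging environment for CI
    required: true
  script-bucket:
    description: S3 bucket that stores the Glue job script
    required: true
runs:
  using: composite
  steps:
    - name: Apply Terraform
      shell: bash
      # Pass the current commit SHA into the job definition so that the job
      # script can read it as a command argument and use it for versioning
      run: |
        terraform init
        terraform apply -auto-approve \
          -var "commit_sha=${{ github.sha }}" \
          -var "environment=${{ inputs.environment }}"
    - name: Upload job script to S3
      shell: bash
      run: |
        aws s3 cp ./glue_job.py \
          "s3://${{ inputs.script-bucket }}/scripts/glue_job_${{ github.sha }}.py"
```
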
### Pros

* Designed around software engineering best practices (version control, reproducibility, testing, etc.)
@@ -348,7 +437,7 @@ and validating our data using dbt:

### Cons

- * No native support for R scripting as a means of building models, so we would have to either rewrite our raw data extraction scripts or use some kind of hack like running our R scripts from a Python function
+ * No native support for Python or R scripting as a means of building models, so we can't incorporate our raw data extraction scripts
* We would need to use a [community plugin](https://dbt-athena.github.io/) for Athena support; this plugin is not supported on dbt Cloud, if we ever decided to move to that
* Requires a separate orchestrator for automation, monitoring, and alerting
* Tests currently do not support the same rich documentation descriptions that other entities do (see [this GitHub issue](https://github.com/dbt-labs/dbt-core/issues/2578))