# Update data catalog design doc to explain how non-SQL scripting fits into the catalog (#81)
Changes from all commits: ef5afdd, c3f2b31, 3d250fa, b92c3eb, a04e4f5, da2d0ac
```diff
@@ -241,7 +241,7 @@ As such, we think it would be more prudent for us to build with dbt Core
 and design our own orchestration/monitoring/authentication integrations on top.
 Hence, when this doc refers to "dbt", we are actually referring to dbt Core.
 
-The downside of this choice is that we would have to choose a separate tool for
+One downside of this choice is that we would have to choose a separate tool for
 orchestrating and monitoring our DAGs if we move forward with dbt. This is an
 important fact to note in our decision, because [orchestrators are notoriously
 controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
```
```diff
@@ -254,6 +254,21 @@ controversial](https://stkbailey.substack.com/p/what-exactly-isnt-dbt):
 As such, we evaluate this choice with an eye towards options for third-party
 orchestration and monitoring.
 
+Another downside is that dbt does not have robust support for the types of
+non-SQL scripted transformations we sometimes want to produce, like our
+[sales value flagging script](https://github.com/ccao-data/model-sales-val).
+There is currently an effort underway to provide better support for [Python
+models](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models)
+in dbt, but only [three data
+platforms](https://docs.getdbt.com/docs/building-a-dbt-project/building-models/python-models#specific-data-platforms)
+have been supported since the launch of Python models in late 2022, and there
+is [not yet a clear
+roadmap](https://github.com/dbt-labs/dbt-core/discussions/5742) for the future
+development of Python models. As such, we will need to use a separate system to
+keep track of our scripted transformations. We provide a brief sketch of the
+design of such a system in the [Tracking raw data and ML
+transformations](#tracking-raw-data-and-ml-transformations) section below.
+
 ### Demo
 
 See the
```
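For context on the caveat above: a dbt Python model is an ordinary Python module that defines a `model(dbt, session)` function and returns a DataFrame. The sketch below uses a hypothetical model name and upstream `ref()`; it would only run on one of the three currently supported platforms (Snowflake, Databricks, BigQuery), which is exactly why it doesn't help us on Athena.

```python
# models/sales_val_flags.py -- hypothetical name, for illustration only
def model(dbt, session):
    # Python models can only materialize as tables or incremental models
    dbt.config(materialized="table")

    # dbt.ref()/dbt.source() return a platform-native DataFrame
    sales = dbt.ref("vw_pin_sale")  # hypothetical upstream model

    # ...non-SQL flagging logic would go here...
    return sales
```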
```diff
@@ -339,6 +354,80 @@ and validating our data using dbt:
 failures](https://docs.getdbt.com/reference/resource-configs/store_failures)
 and [AWS SNS](https://aws.amazon.com/sns/) for notification management.
 
+#### Tracking raw data and ML transformations
+
+* We will keep our raw data extraction scripts separate from the dbt DAG, per
+  [dbt's recommendation](https://docs.getdbt.com/terms/data-extraction).
+* Raw data will be referenced in dbt transformations using the [`source()`
+  function](https://docs.getdbt.com/reference/dbt-jinja-functions/source).
+* [Source freshness
+  checks](https://docs.getdbt.com/docs/deploy/source-freshness) will be used to
+  ensure that raw data is updated appropriately prior to transformation.
```
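As a rough sketch of what the `source()` and freshness bullets translate to in a dbt schema file (the source, table, and column names here are illustrative, not our actual config):

```yaml
# models/staging/schema.yml
version: 2

sources:
  - name: iasworld
    # Freshness checks compare this timestamp column against the current time
    loaded_at_field: loaded_at  # hypothetical column name
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
    tables:
      - name: sales  # referenced as {{ source('iasworld', 'sales') }}
```

Running `dbt source freshness` then warns or errors if the raw data is stale.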
```diff
+* Where possible, the intermediate transformations defined in the
+  `aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be rewritten
+  in SQL and moved into the dbt DAG. During the transition period, while some
```
**Comment on lines +366 to +368:**

> Transitioning our R transformation scripts to SQL is a big new step that we haven't talked about yet. Based on my read of
>
> This also reminds me that we don't really have an issue yet to track the work to transition CTAs to the DAG; perhaps I'll go ahead and open that in the Architecture Upgrades milestone and we can prioritize it later?

> I agree that this is the correct direction to go, but it's going to require a ton of work and probably be much harder than we think. Many of the scripts in

> Thanks for that, I added the CTA ➡️ DAG issue here! #101
```diff
+  transformations are still written in R, we will treat their output as if it
+  were raw data and reference it using `source()`. Any transformations that
+  can't be easily rewritten in SQL will continue to be defined this way in the
```
> It might be useful long-term to rewrite some of them as Python, especially if we expect dbt to add better Python support in the near future. Certainly we can write any new transform scripts that don't work as SQL using Python.

> Definitely aligned with this! Should I add it as a note to the doc, or are we fine with keeping this as tacit knowledge?

> Let's keep it as tacit knowledge for the time being.
```diff
+  long term.
+* Intermediate or final transformations that require CPU- or memory-intensive
+  operations like running machine learning models will be defined in Python,
+  run as AWS Glue jobs, and defined as [ephemeral
+  models](https://docs.getdbt.com/docs/build/materializations#ephemeral) in the
+  dbt DAG. This will be true even in cases where the Glue jobs depend on
+  models produced by the dbt DAG, e.g. the tables produced by
+  [`model-sales-val`](https://github.com/ccao-data/model-sales-val). A bullet
+  below will explain how we will manage circular dependencies between these
+  services.
+* Glue jobs will be kept under version control and deployed to AWS using
+  [Terraform](https://www.terraform.io/) run in GitHub Actions on
+  commits to their repo's main branch. We will write a reusable [composite
+  action](https://docs.github.com/en/actions/creating-actions/creating-a-composite-action)
+  that performs the following operations:
+  1. Runs `terraform apply` to recreate the Glue job definition in AWS
+     1. In doing so, inserts the current Git SHA as a command argument in the
+        Glue job definition so that the job script can read the SHA and use it
+        for versioning.
```
*jeancochrane marked this conversation as resolved.*
```diff
+     2. Supports creating staging jobs that we can use for testing during CI.
+  2. Uploads the newest version of the script to the proper bucket in S3
```
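The composite action might look something like the sketch below. The input name, Terraform variables, script path, and bucket are all hypothetical, and the credential/setup steps a real action would need are elided:

```yaml
# action.yml -- rough sketch of the reusable composite action
name: deploy-glue-job
description: Deploy a Glue job via Terraform and upload its script to S3
inputs:
  environment:
    description: Target environment (staging for CI, prod for main)
    default: staging
runs:
  using: composite
  steps:
    - name: Recreate the Glue job definition, baking in the current SHA
      shell: bash
      run: |
        terraform apply -auto-approve \
          -var "commit_sha=${{ github.sha }}" \
          -var "environment=${{ inputs.environment }}"
    - name: Upload the newest version of the job script to S3
      shell: bash
      run: aws s3 cp glue/job.py "s3://example-bucket/scripts/${{ github.sha }}/"
```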
```diff
+* There will be three expected ways in which we handle dependencies between
+  dbt and Glue, depending on the direction of the dependency graph:
+  * In cases where dbt depends on the output of a Glue job (Glue -> dbt), we
+    will treat the Glue job output as an ephemeral model or
+    a [source](https://docs.getdbt.com/docs/build/sources) in the DAG and
+    schedule the job as necessary to maintain freshness.
+    * If we would like to rebuild the dbt models every time the Glue
+      source data updates, we can schedule the job via GitHub Actions
+      instead of the Glue job scheduler and configure GitHub Actions to
+      rerun dbt in case of a successful Glue job run.
```
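A sketch of that scheduling pattern as a single GitHub Actions workflow; the job name, cron cadence, and failure handling are illustrative, and AWS credential setup is elided:

```yaml
# .github/workflows/run-glue-then-dbt.yml -- hypothetical sketch
name: run-glue-then-dbt
on:
  schedule:
    - cron: "0 6 * * *"  # daily cadence, for illustration
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run the Glue job and wait for it to succeed
        shell: bash
        run: |
          run_id=$(aws glue start-job-run --job-name example-glue-job \
            --query JobRunId --output text)
          # Simplified polling loop: a real workflow should also handle
          # STOPPED/TIMEOUT states and give up after some deadline
          while true; do
            state=$(aws glue get-job-run --job-name example-glue-job \
              --run-id "$run_id" --query JobRun.JobRunState --output text)
            if [ "$state" = "SUCCEEDED" ]; then break; fi
            if [ "$state" = "FAILED" ]; then exit 1; fi
            sleep 30
          done
      - name: Rerun dbt against the fresh Glue output
        run: dbt build
```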
```diff
+  * In cases where a Glue job depends on the output of dbt (dbt -> Glue),
+    we will document the Glue job as an
+    [exposure](https://docs.getdbt.com/docs/build/exposures) in the DAG.
```
*jeancochrane marked this conversation as resolved.*
```diff
+    Exposures should make use of the `depends_on` config attribute in order
+    to properly document the lineage of the data created by Glue.
+    * If we would like to ensure that we run the Glue job every time the
+      dbt source data updates, we can schedule the Glue job using a GitHub
+      Actions workflow and configure the workflow to check the dbt state
+      to see if it needs to be rerun.
```
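For reference, an exposure documenting a Glue job might look like this sketch (all names are hypothetical; `depends_on` accepts `ref()` and `source()` entries):

```yaml
# models/exposures.yml
version: 2

exposures:
  - name: sales_val_flagging
    type: ml  # allowed types: dashboard, notebook, analysis, ml, application
    owner:
      name: Data Department
      email: data@example.com  # hypothetical contact
    description: Glue job that flags sales with suspicious values
    depends_on:
      - ref('vw_pin_sale')  # hypothetical dbt model the Glue job reads
```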
```diff
+  * In case of a circular dependency between dbt and Glue (dbt -> Glue ->
+    dbt), we will document the Glue job as an [ephemeral
+    model](https://docs.getdbt.com/docs/build/materializations#ephemeral) in
```
**Comment on lines +413 to +414:**

> question (blocking): Will this work with ephemeral models? We will soon need to include
>
> Maybe the sales transformation doesn't apply here since the

> You actually can query from ephemeral models in the context of a view! As an example, I tested with two dummy models, one of which is ephemeral and one of which is a view:
>
> ```sql
> -- models/iasworld/valclass_ephemeral.sql
> {{ config(materialized='ephemeral') }}
>
> select * from {{ source('iasworld', 'valclass') }}
> ```
>
> ```sql
> -- models/iasworld/valclass_view.sql
> {{ config(materialized='view') }}
>
> select * from {{ ref('valclass_ephemeral') }}
> ```
>
> You can see the definition of the view in Athena under
>
> ```sql
> CREATE OR REPLACE VIEW "valclass_view" AS
> WITH
>   __dbt__cte__valclass_ephemeral AS (
>     SELECT *
>     FROM "awsdatacatalog"."iasworld"."valclass"
>   )
> SELECT *
> FROM __dbt__cte__valclass_ephemeral
> ```
>
> ```sql
> select * from "dev_jecochr_iasworld"."valclass_view" limit 10
> ```
>
> I'll delete this dummy view from Athena when I pull in this PR, but in the meantime it's up there in case you'd like to confirm with your own testing.

> Got it, so it just ends up as a queryable CTE. That works for me!
```diff
+    dbt so that we can specify its dependencies using [the `depends_on`
+    attribute](https://docs.getdbt.com/reference/dbt-jinja-functions/ref#forcing-dependencies).
+    * If we would like to be able to build the entire DAG from scratch,
+      including running the Glue jobs and transforming their output using
+      dbt, we can separate the dbt config into two targets, use the second
+      bullet approach above (dbt -> Glue) to trigger the Glue job once the
+      first target has completed, and update the dbt wrapper script to
+      initiate the second dbt target build once the Glue job has completed.
```
**Comment on lines +417 to +422:**

> I think we should only do this in cases where we absolutely need to be able to build the full DAG from scratch, since it would introduce a lot of complexity into the build process. But I wanted to sketch it out here to make it clear that we should be able to tie together basically any Glue job with our dbt catalog given enough engineering effort.

> IMO we should avoid doing this if at all possible. I'll try to think more about some clever ways to get around this.

> That's reasonable; should we strike this section of the doc until we make a decision on that, then?

> Fine to keep it for now! Let's wait until one of us actually (hopefully) finds something clever.
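To make the `depends_on` approach concrete: dbt parses a special `-- depends_on:` comment to force an edge in the DAG, so the ephemeral model standing in for a Glue job's output might look like this sketch (the source and model names are hypothetical):

```sql
-- models/sale_flags_glue.sql
-- Ephemeral stand-in for a table written by a Glue job. The comment below
-- forces a DAG edge on the dbt model that the Glue job reads.
-- depends_on: {{ ref('vw_pin_sale') }}

{{ config(materialized='ephemeral') }}

select * from {{ source('sale', 'flag') }}
```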
```diff
+      * Any such wrapper script should also provide the caller with the option
+        to skip running the Glue job if the AWS CLI can determine that
+        the output of the Glue job already exists.
+    * The opposite circular dependency (Glue -> dbt -> Glue) should not
+      require a special solution since it is just a combination of the
+      first and second bullets above (i.e. one Glue job acting as a source
+      and another acting as an exposure).
```
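A rough sketch of the two-target wrapper described in the circular-dependency bullet above, using boto3; the target, job, and output names are hypothetical and error handling is minimal:

```python
# run_dag.py -- hypothetical wrapper for a dbt -> Glue -> dbt build
import subprocess
import time

import boto3

GLUE_JOB = "example-glue-job"  # hypothetical job name
glue = boto3.client("glue")


def wait_for_job(job_name: str, run_id: str) -> None:
    """Poll a Glue job run until it finishes, raising on failure."""
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state == "SUCCEEDED":
            return
        if state in ("FAILED", "STOPPED", "TIMEOUT"):
            raise RuntimeError(f"Glue job {job_name} ended in state {state}")
        time.sleep(30)


# 1. Build the first dbt target (everything upstream of the Glue job)
subprocess.run(["dbt", "build", "--target", "pre_glue"], check=True)

# 2. Trigger the Glue job and wait for it to finish
run_id = glue.start_job_run(JobName=GLUE_JOB)["JobRunId"]
wait_for_job(GLUE_JOB, run_id)

# 3. Build the second dbt target (everything downstream of the Glue job)
subprocess.run(["dbt", "build", "--target", "post_glue"], check=True)
```

A skip flag could guard step 2 with an existence check on the job's output path (e.g. via `aws s3 ls`) to implement the skip option described above.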
```diff
 
 ### Pros
 
 * Designed around software engineering best practices (version control, reproducibility, testing, etc.)
```
```diff
@@ -348,7 +437,7 @@ and validating our data using dbt:
 
 ### Cons
 
-* No native support for R scripting as a means of building models, so we would have to either rewrite our raw data extraction scripts or use some kind of hack like running our R scripts from a Python function
+* No native support for Python or R scripting as a means of building models, so we can't incorporate our raw data extraction scripts
 * We would need to use a [community plugin](https://dbt-athena.github.io/) for Athena support; this plugin is not supported on dbt Cloud, if we ever decided to move to that
 * Requires a separate orchestrator for automation, monitoring, and alerting
 * Tests currently do not support the same rich documentation descriptions that other entities do (see [this GitHub issue](https://github.com/dbt-labs/dbt-core/issues/2578))
```
> Unrelated to this PR, but this looks really neat and I'm extremely curious what is actually stored and how.

> Yeah, I think this will be a useful feature when we start engineering a pipeline to report test output to different stakeholders! I've played around with it a bit; at a high level, each test gets its own view with a set of rows that failed for the test. I think it would take some engineering effort to transform the raw set of rows into a format that highlights the error at hand and helps a person make a decision about how to fix it, but it's nice to know we have this option as we start to think about the notification stream for our tests.