Update data catalog design doc to explain how non-SQL scripting fits into the catalog #81
Conversation
* Where possible, the intermediate transformations defined in the
`aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be rewritten
in SQL and moved into the dbt DAG. During the transition period, while some
Transitioning our R transformation scripts to SQL is a big new step that we haven't talked about yet. Based on my read of `aws-s3/scripts-ccao-data-warehouse-us-east-1`, it seems like everything can either be transitioned to SQL or comfortably considered a "source" instead of a data transformation; does that match your understanding? It will be a lift to move all this stuff, but I think our future selves will thank us if we can move as much stuff as possible into the DAG.

This also reminds me that we don't really have an issue yet to track the work to transition CTAs to the DAG; perhaps I'll go ahead and open that in the Architecture Upgrades milestone and we can prioritize it later?
I agree that this is the correct direction to go, but it's going to require a ton of work and probably be much harder than we think. Many of the scripts in `aws-s3/scripts-ccao-data-warehouse-us-east-1` are doing complex transformations, using APIs, or doing ML of some sort, all of which would be nearly impossible to replicate in SQL. There's also a mix of scripts that are loading data from outside sources (via our own raw S3 bucket). I made a preliminary issue for tracking the warehouse script work here: #99, will leave the CTA --> DAG issue creation to you!
Thanks for that, I added the CTA ➡️ DAG issue here! #101
If we would like to be able to build the entire DAG from scratch,
including running the Glue jobs and transforming their output using
dbt, we can separate the dbt config into two targets, use the second
bullet approach above (dbt -> Glue) to trigger the Glue job once the first
target has completed, and update the dbt wrapper script to initiate
the second dbt target build once the Glue job has completed.
I think we should only do this in cases where we absolutely need to be able to build the full DAG from scratch, since it would introduce a lot of complexity into the build process. But I wanted to sketch it out here to make it clear that we should be able to tie together basically any Glue job with our dbt catalog given enough engineering effort.
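To make that sketch a bit more concrete, here is roughly what such a wrapper could look like. This is purely illustrative: the profile targets, Glue job name, and polling interval are placeholders rather than anything that exists in the repo, and the dbt invocations gloss over model selection details.

```python
# Hypothetical sketch of the two-target approach described above. Target
# names, the Glue job name, and the polling interval are all placeholders.
import subprocess
import time

import boto3

GLUE_JOB_NAME = "hypothetical-ccao-glue-job"  # placeholder

glue = boto3.client("glue")

# 1. Build the dbt models upstream of the Glue job
subprocess.run(["dbt", "build", "--target", "pre_glue"], check=True)

# 2. Kick off the Glue job and poll until it reaches a terminal state
run_id = glue.start_job_run(JobName=GLUE_JOB_NAME)["JobRunId"]
while True:
    state = glue.get_job_run(JobName=GLUE_JOB_NAME, RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(30)
if state != "SUCCEEDED":
    raise RuntimeError(f"Glue job ended in state {state}")

# 3. Build the dbt models that consume the Glue job's output
subprocess.run(["dbt", "build", "--target", "post_glue"], check=True)
```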
IMO we should avoid doing this if at all possible. I'll try to think more about some clever ways to get around this.
That's reasonable; should we strike this section of the doc until we make a decision on that, then?
Fine to keep it for now! Let's wait until one of us actually (hopefully) finds something clever.
Great work @jeancochrane. I generally agree with everything outlined here and will continue to think about any clever solutions we can use to simplify things a bit.
See my blocking comments for changes.
@@ -339,6 +354,79 @@ and validating our data using dbt:
failures](https://docs.getdbt.com/reference/resource-configs/store_failures)
Unrelated to this PR, but this looks really neat and I'm extremely curious what is actually stored and how.
Yeah, I think this will be a useful feature when we start engineering a pipeline to report test output to different stakeholders! I've played around with it a bit; at a high level, each test gets its own view with a set of rows that failed for the test. I think it would take some engineering effort to transform the raw set of rows into a format that highlights the error at hand and helps a person make a decision about how to fix it, but it's nice to know we have this option as we start to think about the notification stream for our tests.
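To sketch what gets stored (a hedged example, not our actual setup): dbt writes the failing rows for each test to its own relation in an audit schema, so one way to poke at a failure is to pull that relation into pandas. The schema and table names below are placeholders, and awswrangler is just one way to query Athena.

```python
# Illustrative only: inspect the rows dbt stored for one failing test.
# By default dbt writes failures to a schema suffixed "dbt_test__audit",
# one relation per test; the names below are placeholders.
import awswrangler as wr

failures = wr.athena.read_sql_query(
    sql='SELECT * FROM "some_schema_dbt_test__audit"."some_failing_test"',
    database="some_schema_dbt_test__audit",
)
print(failures.head())
```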
* Where possible, the intermediate transformations defined in the
`aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be rewritten
in SQL and moved into the dbt DAG. During the transition period, while some
transformations are still written in R, we will treat their output as if it
were raw data and reference it using `source()`. Any transformations that
can't be easily rewritten in SQL will continue to be defined this way in the
It might be useful long-term to rewrite some of them in Python, especially if we expect dbt to add better Python support in the near future. Certainly we can use Python for any new transformation scripts that don't work as SQL.
Definitely aligned with this! Should I add it as a note to the doc, or are we fine with keeping this as tacit knowledge?
Let's keep it as tacit knowledge for the time being.
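For reference on what that Python support looks like elsewhere: on adapters that already offer it (e.g. Snowflake, Databricks, BigQuery; not Athena at the time of writing), a dbt Python model takes roughly the shape below. The source, table, and column names are invented for illustration, and the dataframe API varies by adapter.

```python
# Hypothetical dbt Python model (not currently usable on Athena).
# Source/table/column names are placeholders.
def model(dbt, session):
    dbt.config(materialized="table")

    # dbt.source()/dbt.ref() return an adapter-specific dataframe
    raw = dbt.source("some_source", "some_table")
    df = raw.to_pandas()  # exact conversion method depends on the adapter

    # Arbitrary Python transformation logic can live here
    df["is_positive"] = df["some_value"] > 0

    return df
```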
If we would like to ensure that we run the Glue job every time the
dbt source data updates, we can write a wrapper script around `dbt run`
that uses the Glue `StartJobRun` API
([docs](https://docs.aws.amazon.com/glue/latest/webapi/API_StartJobRun.html))
to trigger a job run once the dbt build completes successfully.
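For reference, a minimal sketch of the wrapper described in this passage, assuming boto3 and a placeholder Glue job name; error handling is intentionally omitted.

```python
# Minimal sketch: run dbt, and only trigger the Glue job via StartJobRun
# if the build succeeds. The job name is a placeholder.
import subprocess

import boto3

result = subprocess.run(["dbt", "run"])
if result.returncode == 0:
    boto3.client("glue").start_job_run(JobName="hypothetical-ccao-glue-job")
```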
suggestion: I don't love this. Maybe we can instead integrate it with Actions and query the dbt state to trigger the job? Unsure what's possible here or how often this will really come up.
I think that makes sense! It seems like it just depends on whether we want dbt to be coupled to the Glue job, or the Glue job to be coupled to dbt; given the dependency order here I agree that it probably makes sense to couple Glue to dbt, since dbt doesn't need to care about how its downstream consumers consume it. I adjusted this note in da2d0ac, but I'm up for tweaking it further if we come up with a better solution.
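To make the "couple Glue to dbt" direction concrete, here is a hedged sketch of a step (for example, in a GitHub Actions workflow) that inspects dbt's `run_results.json` artifact and only starts the Glue job if the upstream model built successfully. The model `unique_id` and job name are placeholders.

```python
# Hedged sketch: trigger the Glue job based on dbt state rather than
# wrapping dbt itself. Reads dbt's run_results.json artifact and starts
# the job only if a hypothetical upstream model succeeded.
import json

import boto3

with open("target/run_results.json") as f:
    results = json.load(f)["results"]

# Placeholder unique_id for the model the Glue job depends on
upstream = "model.ccao_catalog.some_upstream_model"

if any(r["unique_id"] == upstream and r["status"] == "success" for r in results):
    boto3.client("glue").start_job_run(JobName="hypothetical-ccao-glue-job")
```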
dbt), we will document the Glue job as an [ephemeral
model](https://docs.getdbt.com/docs/build/materializations#ephemeral) in
question (blocking): Will this work with ephemeral models? We will soon need to include `sale.flag` in `default.vw_pin_sale` (see #97), but ephemeral models can't be queried directly if I'm reading the docs correctly. Maybe the sales transformation doesn't apply here since the `sale.flag` output is reincorporated into the DAG automatically via inclusion in a view?
You actually can query from ephemeral models in the context of a view! As an example, I tested with two dummy models, one of which is ephemeral and one of which is a view:
```sql
-- models/iasworld/valclass_ephemeral.sql
{{ config(materialized='ephemeral') }}

select * from {{ source('iasworld', 'valclass') }}

-- models/iasworld/valclass_view.sql
{{ config(materialized='view') }}

select * from {{ ref('valclass_ephemeral') }}
```
You can see the definition of the view in Athena under `dev_jecochr_iasworld.valclass_view` and test a query against it:
```sql
-- Compiled view definition in Athena: the ephemeral model becomes a CTE
CREATE OR REPLACE VIEW "valclass_view" AS
WITH
  __dbt__cte__valclass_ephemeral AS (
    SELECT *
    FROM "awsdatacatalog"."iasworld"."valclass"
  )
SELECT *
FROM __dbt__cte__valclass_ephemeral;

-- Test query against the view
select * from "dev_jecochr_iasworld"."valclass_view" limit 10;
```
I'll delete this dummy view from Athena when I pull in this PR, but in the meantime it's up there in case you'd like to confirm with your own testing.
Got it, so it just ends up as a queryable CTE. That works for me!
Ready to go @jeancochrane! I'm still slightly uneasy about the dbt -> Glue relationship but I think this looks great for now. Will continue to think about better ways to manage all the Glue stuff.
This PR updates the data catalog design doc with my thoughts on a proposed design for integrating our Glue jobs and our R extraction/transformation scripts into our dbt catalog. While Glue and R do not have native integrations with dbt that would allow us to manage our entire data pipeline with one tool, I think it will only take a relatively small amount of engineering to set up dev workflows where these three tools can interact well and understand how to use one another's output.
Closes #83.