
Update data catalog design doc to explain how non-SQL scripting fits into the catalog #81

Merged: 6 commits into master on Aug 25, 2023

Conversation

@jeancochrane (Contributor) commented on Aug 18, 2023:

This PR updates the data catalog design doc with my thoughts on a proposed design for integrating our Glue jobs and our R extraction/transformation scripts into our dbt catalog. While Glue and R do not have native dbt integrations that would allow us to manage our entire data pipeline with one tool, I think it will take only a relatively small amount of engineering to set up dev workflows where these three tools can interact well and understand how to use one another's output.

Closes #83.

@jeancochrane force-pushed the jeancochrane/document-glue-devops-design branch from 1342737 to ef5afdd on August 18, 2023 18:44
Comment on lines +366 to +368
* Where possible, the intermediate transformations defined in the
`aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be rewritten
in SQL and moved into the dbt DAG. During the transition period, while some
@jeancochrane (Contributor, Author):

Transitioning our R transformation scripts to SQL is a big new step that we haven't talked about yet. Based on my read of aws-s3/scripts-ccao-data-warehouse-us-east-1, it seems like everything can either be transitioned to SQL or comfortably considered a "source" instead of a data transformation; does that match your understanding? It will be a lift to move all of this, but I think our future selves will thank us if we can move as much as possible into the DAG.

This also reminds me that we don't really have an issue yet to track the work to transition CTAs to the DAG; perhaps I'll go ahead and open that in the Architecture Upgrades milestone and we can prioritize it later?

@dfsnow (Member):

I agree that this is the correct direction to go, but it's going to require a ton of work and probably be much harder than we think. Many of the scripts in aws-s3/scripts-ccao-data-warehouse-us-east-1 are doing complex transformations, using APIs, or doing ML of some sort, all of which would be nearly impossible to replicate in SQL. There's also a mix of scripts that are loading data from outside sources (via our own raw S3 bucket). I made a preliminary issue for tracking the warehouse script work here: #99, will leave the CTA --> DAG issue creation to you!

@jeancochrane (Contributor, Author):

Thanks for that, I added the CTA ➡️ DAG issue here! #101

Comment on lines +416 to +421
If we would like to be able to build the entire DAG from scratch,
including running the Glue jobs and transforming their output using
dbt, we can separate the dbt config into two targets, use the second
bullet approach above (dbt -> Glue) to trigger the Glue job once the first
target has completed, and update the dbt wrapper script to initiate
the second dbt target build once the Glue job has completed.
@jeancochrane (Contributor, Author):

I think we should only do this in cases where we absolutely need to be able to build the full DAG from scratch, since it would introduce a lot of complexity into the build process. But I wanted to sketch it out here to make it clear that we should be able to tie together basically any Glue job with our dbt catalog given enough engineering effort.
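For concreteness, here's a rough sketch of what that wrapper could look like; the target names and the Glue job name below are placeholders for illustration, not decisions:

# Hypothetical two-target wrapper: build everything upstream of the Glue
# job, run the job to completion, then build everything downstream.
import subprocess
import time

import boto3

glue = boto3.client("glue")

# 1. Build the first dbt target (models upstream of the Glue job)
subprocess.run(["dbt", "build", "--target", "pre_glue"], check=True)

# 2. Trigger the Glue job and poll until it reaches a terminal state
run_id = glue.start_job_run(JobName="sales_val_flagging")["JobRunId"]
while True:
    run = glue.get_job_run(JobName="sales_val_flagging", RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(30)
if state != "SUCCEEDED":
    raise RuntimeError(f"Glue job finished in state {state}")

# 3. Build the second dbt target (models downstream of the Glue job)
subprocess.run(["dbt", "build", "--target", "post_glue"], check=True)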

@dfsnow (Member):

IMO we should avoid doing this if at all possible. I'll try to think more about some clever ways to get around this.

@jeancochrane (Contributor, Author):

That's reasonable; should we strike this section of the doc until we make a decision on that, then?

@dfsnow (Member):

Fine to keep it for now! Let's wait until one of us actually (hopefully) finds something clever.

@jeancochrane marked this pull request as ready for review on August 22, 2023 19:09
@jeancochrane requested a review from a team as a code owner on August 22, 2023 19:09
@dfsnow (Member) left a comment:

Great work @jeancochrane. I generally agree with everything outlined here and will continue to think about any clever solutions we can use to simplify things a bit.

See my blocking comments for changes.

@@ -339,6 +354,79 @@ and validating our data using dbt:
failures](https://docs.getdbt.com/reference/resource-configs/store_failures)
@dfsnow (Member):

Unrelated to this PR, but this looks really neat and I'm extremely curious what is actually stored and how.

@jeancochrane (Contributor, Author):

Yeah, I think this will be a useful feature when we start engineering a pipeline to report test output to different stakeholders! I've played around with it a bit; at a high level, each test gets its own view with a set of rows that failed for the test. I think it would take some engineering effort to transform the raw set of rows into a format that highlights the error at hand and helps a person make a decision about how to fix it, but it's nice to know we have this option as we start to think about the notification stream for our tests.
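As a concrete illustration: once tests run with store_failures enabled (e.g. dbt test --store-failures), each failing test gets an audit table we can query directly. The schema and table names below just follow dbt's default naming conventions and are hypothetical, not anything we've actually built:

-- Hypothetical example: inspect the rows that failed a uniqueness test
select *
from "dev_jecochr_dbt_test__audit"."unique_vw_pin_sale_doc_no"
limit 10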


in SQL and moved into the dbt DAG. During the transition period, while some
transformations are still written in R, we will treat their output as if it
were raw data and reference it using `source()`. Any transformations that
can't be easily rewritten in SQL will continue to be defined this way in the
@dfsnow (Member):

It might be useful long-term to rewrite some of them in Python, especially if we expect dbt to add better Python support in the near future. We can certainly use Python for any new transform scripts that don't work as SQL.

@jeancochrane (Contributor, Author):

Definitely aligned with this! Should I add it as a note to the doc, or are we fine with keeping this as tacit knowledge?

@dfsnow (Member):

Let's keep it as tacit knowledge for the time being.
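(For illustration, treating a still-in-R transformation's output as raw data would look something like the sketch below; the source and table names are invented:)

-- models/sources.yml registers the R script's S3/Athena output as a
-- dbt source, e.g.:
--   sources:
--     - name: spatial
--       tables:
--         - name: neighborhood
--
-- Downstream models then reference it with source() rather than ref(),
-- so dbt treats it as raw data at the edge of the DAG:
select * from {{ source('spatial', 'neighborhood') }}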

Comment on lines 406 to 410
If we would like to ensure that we run the Glue job every time the
dbt source data updates, we can write a wrapper script around `dbt run`
that uses the Glue `StartJobRun` API
([docs](https://docs.aws.amazon.com/glue/latest/webapi/API_StartJobRun.html))
to trigger a job run once the dbt build completes successfully.
@dfsnow (Member):

suggestion: I don't love this. Maybe we can instead integrate it with Actions and query the dbt state to trigger the job? Unsure what's possible here or how often this will really come up.

@jeancochrane (Contributor, Author):

I think that makes sense! It seems like it just depends on whether we want dbt to be coupled to the Glue job, or the Glue job to be coupled to dbt; given the dependency order here I agree that it probably makes sense to couple Glue to dbt, since dbt doesn't need to care about how its downstream consumers consume it. I adjusted this note in da2d0ac, but I'm up for tweaking it further if we come up with a better solution.
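For posterity, the minimal version of that coupling is only a few lines; this sketch assumes a hypothetical job name:

# Hypothetical wrapper: run the dbt build, then trigger the Glue job only
# if the build exited successfully.
import subprocess

import boto3

result = subprocess.run(["dbt", "run"])
if result.returncode == 0:
    boto3.client("glue").start_job_run(JobName="sales_val_flagging")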


Comment on lines +412 to +413
dbt), we will document the Glue job as an [ephemeral
model](https://docs.getdbt.com/docs/build/materializations#ephemeral) in
@dfsnow (Member):

question (blocking): Will this work with ephemeral models? We will soon need to include sale.flag in default.vw_pin_sale (see #97), but ephemeral models can't be queried directly if I'm reading the docs correctly.

Maybe the sales transformation doesn't apply here since the sale.flag output is reincorporated into the DAG automatically via inclusion in a view?

@jeancochrane (Contributor, Author):

You actually can query from ephemeral models in the context of a view! As an example, I tested with two dummy models, one of which is ephemeral and one of which is a view:

-- models/iasworld/valclass_ephemeral.sql
{{ config(materialized='ephemeral') }}

select * from {{ source('iasworld', 'valclass') }}

-- models/iasworld/valclass_view.sql
{{ config(materialized='view') }}

select * from {{ ref('valclass_ephemeral') }}

You can see the definition of the view in Athena under dev_jecochr_iasworld.valclass_view and test a query against it:

CREATE OR REPLACE VIEW "valclass_view" AS
WITH
  __dbt__cte__valclass_ephemeral AS (
   SELECT *
   FROM
     "awsdatacatalog"."iasworld"."valclass"
)
SELECT *
FROM
  __dbt__cte__valclass_ephemeral

-- Test query:
select * from "dev_jecochr_iasworld"."valclass_view" limit 10

I'll delete this dummy view from Athena when I pull in this PR, but in the meantime it's up there in case you'd like to confirm with your own testing.

@dfsnow (Member):

Got it, so it just ends up as a queryable CTE. That works for me!

@dfsnow self-requested a review on August 25, 2023 20:12
@dfsnow (Member) left a comment:

Ready to go @jeancochrane! I'm still slightly uneasy about the dbt -> Glue relationship but I think this looks great for now. Will continue to think about better ways to manage all the Glue stuff.

@jeancochrane merged commit 5603057 into master on Aug 25, 2023
3 checks passed
@jeancochrane deleted the jeancochrane/document-glue-devops-design branch on August 25, 2023 20:53
Linked issue (may be closed by this merge): Look into solutions for incorporating ratio reporting and sales val flagging into data catalog