
Update data catalog design doc to explain how non-SQL scripting fits into the catalog #81

Merged: 6 commits into master on Aug 25, 2023

Conversation

@jeancochrane (Contributor) commented on Aug 18, 2023:

This PR updates the data catalog design doc with my thoughts on a proposed design for integrating our Glue jobs and our R extraction/transformation scripts into our dbt catalog. While Glue and R do not have native dbt integrations that would allow us to manage our entire data pipeline with one tool, I think it will take only a relatively small amount of engineering to set up dev workflows where these three tools can interact well and understand how to use one another's output.

Closes #83.

@jeancochrane force-pushed the jeancochrane/document-glue-devops-design branch from 1342737 to ef5afdd on August 18, 2023 18:44
Comment on lines +366 to +368
* Where possible, the intermediate transformations defined in the
`aws-s3/scripts-ccao-data-warehouse-us-east-1` subdirectory will be rewritten
in SQL and moved into the dbt DAG. During the transition period, while some
@jeancochrane (Contributor, Author):

Transitioning our R transformation scripts to SQL is a big new step that we haven't talked about yet. Based on my read of aws-s3/scripts-ccao-data-warehouse-us-east-1, it seems like everything can either be transitioned to SQL or comfortably considered a "source" instead of a data transformation; does that match your understanding? It will be a lift to move all of this, but I think our future selves will thank us if we can move as much as possible into the DAG.

This also reminds me that we don't really have an issue yet to track the work to transition CTAs to the DAG; perhaps I'll go ahead and open that in the Architecture Upgrades milestone and we can prioritize it later?

@dfsnow (Member):

I agree that this is the correct direction to go, but it's going to require a ton of work and probably be much harder than we think. Many of the scripts in aws-s3/scripts-ccao-data-warehouse-us-east-1 are doing complex transformations, using APIs, or doing ML of some sort, all of which would be nearly impossible to replicate in SQL. There's also a mix of scripts that are loading data from outside sources (via our own raw S3 bucket). I made a preliminary issue for tracking the warehouse script work here: #99, will leave the CTA --> DAG issue creation to you!

@jeancochrane (Contributor, Author):

Thanks for that, I added the CTA ➡️ DAG issue here! #101

Comment on lines +416 to +421
If we would like to be able to build the entire DAG from scratch,
including running the Glue jobs and transforming their output using
dbt, we can separate the dbt config into two targets, use the second
bullet approach above (dbt -> Glue) to trigger the Glue job once the first
target has completed, and update the dbt wrapper script to initiate
the second dbt target build once the Glue job has completed.
@jeancochrane (Contributor, Author):

I think we should only do this in cases where we absolutely need to be able to build the full DAG from scratch, since it would introduce a lot of complexity into the build process. But I wanted to sketch it out here to make it clear that we should be able to tie together basically any Glue job with our dbt catalog given enough engineering effort.
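For concreteness, here's a rough sketch of what that wrapper could look like; the target names and the Glue job name below are placeholders for illustration, not decisions:

# Hypothetical two-target wrapper: build everything upstream of the Glue
# job, run the job to completion, then build everything downstream.
import subprocess
import time

import boto3

glue = boto3.client("glue")

# 1. Build the first dbt target (models upstream of the Glue job)
subprocess.run(["dbt", "build", "--target", "pre_glue"], check=True)

# 2. Trigger the Glue job and poll until it reaches a terminal state
run_id = glue.start_job_run(JobName="sales_val_flagging")["JobRunId"]
while True:
    run = glue.get_job_run(JobName="sales_val_flagging", RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"):
        break
    time.sleep(30)
if state != "SUCCEEDED":
    raise RuntimeError(f"Glue job finished in state {state}")

# 3. Build the second dbt target (models downstream of the Glue job)
subprocess.run(["dbt", "build", "--target", "post_glue"], check=True)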

@dfsnow (Member):

IMO we should avoid doing this if at all possible. I'll try to think more about some clever ways to get around this.

@jeancochrane (Contributor, Author):

That's reasonable; should we strike this section of the doc until we make a decision on that, then?

@dfsnow (Member):

Fine to keep it for now! Let's wait until one of us actually (hopefully) finds something clever.

@jeancochrane marked this pull request as ready for review on August 22, 2023 19:09
@jeancochrane requested a review from a team as a code owner on August 22, 2023 19:09
@dfsnow (Member) left a comment:

Great work @jeancochrane. I generally agree with everything outlined here and will continue to think about any clever solutions we can use to simplify things a bit.

See my blocking comments for changes.

@@ -339,6 +354,79 @@ and validating our data using dbt:
failures](https://docs.getdbt.com/reference/resource-configs/store_failures)
@dfsnow (Member):

Unrelated to this PR, but this looks really neat and I'm extremely curious what is actually stored and how.

@jeancochrane (Contributor, Author):

Yeah, I think this will be a useful feature when we start engineering a pipeline to report test output to different stakeholders! I've played around with it a bit; at a high level, each test gets its own view with a set of rows that failed for the test. I think it would take some engineering effort to transform the raw set of rows into a format that highlights the error at hand and helps a person make a decision about how to fix it, but it's nice to know we have this option as we start to think about the notification stream for our tests.
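As a concrete illustration: once tests run with store_failures enabled (e.g. dbt test --store-failures), each failing test gets an audit table we can query directly. The schema and table names below just follow dbt's default naming conventions and are hypothetical, not anything we've actually built:

-- Hypothetical example: inspect the rows that failed a uniqueness test
select *
from "dev_jecochr_dbt_test__audit"."unique_vw_pin_sale_doc_no"
limit 10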


in SQL and moved into the dbt DAG. During the transition period, while some
transformations are still written in R, we will treat their output as if it
were raw data and reference it using `source()`. Any transformations that
can't be easily rewritten in SQL will continue to be defined this way in the
@dfsnow (Member):

It might be useful long-term to rewrite some of them in Python, especially if we expect dbt to add better Python support in the near future. We can certainly use Python for any new transform scripts that don't work as SQL.

@jeancochrane (Contributor, Author):

Definitely aligned with this! Should I add it as a note to the doc, or are we fine with keeping this as tacit knowledge?

@dfsnow (Member):

Let's keep it as tacit knowledge for the time being.
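(For illustration, treating a still-in-R transformation's output as raw data would look something like the sketch below; the source and table names are invented:)

-- models/sources.yml registers the R script's S3/Athena output as a
-- dbt source, e.g.:
--   sources:
--     - name: spatial
--       tables:
--         - name: neighborhood
--
-- Downstream models then reference it with source() rather than ref(),
-- so dbt treats it as raw data at the edge of the DAG:
select * from {{ source('spatial', 'neighborhood') }}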

Comment on lines 406 to 410
If we would like to ensure that we run the Glue job every time the
dbt source data updates, we can write a wrapper script around `dbt run`
that uses the Glue `StartJobRun` API
([docs](https://docs.aws.amazon.com/glue/latest/webapi/API_StartJobRun.html))
to trigger a job run once the dbt build completes successfully.
@dfsnow (Member):

suggestion: I don't love this. Maybe we can instead integrate it with Actions and query the dbt state to trigger the job? Unsure what's possible here or how often this will really come up.

@jeancochrane (Contributor, Author):

I think that makes sense! It seems like it just depends on whether we want dbt to be coupled to the Glue job, or the Glue job to be coupled to dbt; given the dependency order here I agree that it probably makes sense to couple Glue to dbt, since dbt doesn't need to care about how its downstream consumers consume it. I adjusted this note in da2d0ac, but I'm up for tweaking it further if we come up with a better solution.
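For posterity, the minimal version of that coupling is only a few lines; this sketch assumes a hypothetical job name:

# Hypothetical wrapper: run the dbt build, then trigger the Glue job only
# if the build exited successfully.
import subprocess

import boto3

result = subprocess.run(["dbt", "run"])
if result.returncode == 0:
    boto3.client("glue").start_job_run(JobName="sales_val_flagging")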


Comment on lines +412 to +413
dbt), we will document the Glue job as an [ephemeral
model](https://docs.getdbt.com/docs/build/materializations#ephemeral) in
@dfsnow (Member):

question (blocking): Will this work with ephemeral models? We will soon need to include sale.flag in default.vw_pin_sale (see #97), but ephemeral models can't be queried directly if I'm reading the docs correctly.

Maybe the sales transformation doesn't apply here since the sale.flag output is reincorporated into the DAG automatically via inclusion in a view?

@jeancochrane (Contributor, Author):

You actually can query from ephemeral models in the context of a view! As an example, I tested with two dummy models, one of which is ephemeral and one of which is a view:

-- models/iasworld/valclass_ephemeral.sql
{{ config(materialized='ephemeral') }}

select * from {{ source('iasworld', 'valclass') }}

-- models/iasworld/valclass_view.sql
{{ config(materialized='view') }}

select * from {{ ref('valclass_ephemeral') }}

You can see the definition of the view in Athena under dev_jecochr_iasworld.valclass_view and test a query against it:

CREATE OR REPLACE VIEW "valclass_view" AS
WITH
  __dbt__cte__valclass_ephemeral AS (
   SELECT *
   FROM
     "awsdatacatalog"."iasworld"."valclass"
)
SELECT *
FROM
  __dbt__cte__valclass_ephemeral

-- Test query:
select * from "dev_jecochr_iasworld"."valclass_view" limit 10

I'll delete this dummy view from Athena when I pull in this PR, but in the meantime it's up there in case you'd like to confirm with your own testing.

@dfsnow (Member):

Got it, so it just ends up as a queryable CTE. That works for me!

@dfsnow self-requested a review on August 25, 2023 20:12
@dfsnow (Member) left a comment:

Ready to go @jeancochrane! I'm still slightly uneasy about the dbt -> Glue relationship but I think this looks great for now. Will continue to think about better ways to manage all the Glue stuff.

@jeancochrane merged commit 5603057 into master on Aug 25, 2023
3 checks passed
@jeancochrane deleted the jeancochrane/document-glue-devops-design branch on August 25, 2023 20:53
Linked issue (may be closed by this merge): Look into solutions for incorporating ratio reporting and sales val flagging into data catalog