
Refactor Athena table definitions to move them into the dbt DAG #111

Merged

Conversation


@jeancochrane jeancochrane commented Aug 31, 2023

Overview

This PR adds symlinks and dbt model definitions for the Athena tables defined in the aws-athena/ctas/ directory. It also tweaks those table definitions to fit them into the DAG and properly define their lineage.
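
To give a sense of what "fit them into the DAG" involves, here is a rough sketch of the typical lineage tweak (the column and upstream table are borrowed from the census.acs5 example later in this thread, and the source definition is assumed, so this is illustrative rather than the actual contents of the PR): a hardcoded table reference in the standalone CTAS becomes a dbt source() or ref() call, which is how dbt learns the model's dependencies.

-- Standalone Athena CTAS: the upstream table is hardcoded, so dbt cannot see
-- the dependency
SELECT b23025_007m, year
FROM census.acs5

-- Equivalent dbt model: the dependency is declared through the source() macro,
-- which dbt resolves to census.acs5 and records as an edge in the DAG
SELECT b23025_007m, year
FROM {{ source('census', 'acs5') }}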

Additional complexities

Note that this change introduces two new areas of complexity to our dbt environments:

1. More resources to clean up

Since Athena CREATE TABLE AS statements store their output in S3, and since we expect to occasionally rebuild these tables as part of CI runs, we need an additional layer on top of the cleanup_dbt_resources.sh script that runs on CI to clean up any S3 resources created during model building. To handle this, I've added a new lifecycle rule to the ccao-dbt-athena-ci-us-east-1 bucket that deletes all old data under the data/ prefix after 30 days, which should be more than enough time to close any PRs that make use of those data.

Note that I did not add a corresponding lifecycle rule to the ccao-dbt-athena-dev-us-east-1 bucket; my reasoning is that it's more likely that developers will want to persist their development table state over the long term.

2. More complex state selection for local development

These tables are more complicated to keep in the DAG than the views defined in aws-athena/views/, since the processing they perform is CPU-intensive and they need to store their table output in S3. We shouldn't need to worry about this on CI, since we use deferral and state selection to avoid rebuilding models unless they are new or have changed, but we will need an additional layer of safeguards for local development to ensure that developers who are less experienced with dbt don't accidentally waste time and resources rebuilding these tables unnecessarily.

To this end, we plan to introduce a wrapper script around dbt that beginning dbt developers can use to abstract away the complexity of remote state management.

Closes #101.

.github/actions/setup_dbt/action.yaml
.github/workflows/build_and_test_dbt.yaml
aws-athena/ctas/location-access.sql
aws-athena/ctas/location-census_acs5.sql
aws-athena/ctas/proximity-dist_pin_to_pin.sql
dbt/dbt_project.yml
dbt/models/default/schema.yml
dbt/profiles.yml
s3_data_dir: s3://ccao-dbt-athena-test-us-east-1/data/
s3_staging_dir: s3://ccao-dbt-athena-dev-us-east-1/results/
s3_data_dir: s3://ccao-dbt-athena-dev-us-east-1/data/
s3_data_naming: schema_table
Contributor Author

See the docs for a detailed explanation of the S3 data naming options that dbt-athena exposes to us. schema_table corresponds to {s3_data_dir}/{schema}/{table}/, which matches how production data has historically been stored, so I figured we should move to it over the default schema_table_unique. This config renders the external_location attribute of the CTAS unnecessary.
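
As a minimal sketch of what a model header could look like under this scheme (the config keys follow my reading of the dbt-athena table materialization, and the query just reuses the census.acs5 example from the experiment later in this thread, so treat it as illustrative rather than an actual model in this PR):

{{
    config(
        materialized='table',
        format='parquet',
        write_compression='snappy',
        partitioned_by=['year']
    )
}}

SELECT b23025_007m, year
FROM {{ source('census', 'acs5') }}

Note that nothing here points at S3 directly: with s3_data_naming set to schema_table, dbt-athena derives the output path from {s3_data_dir} plus the schema and table names.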

Member

Nice! This is a great simplification. Big fan.

question (blocking): What is the materialization strategy used here? As in, if someone is working on a branch and creates a new materialized table, then edits their branch and re-materializes it, will the old table be entirely overwritten? Will it delete all the partitions created by the previous materialization?

Contributor Author

Under schema_table, new materialized tables will indeed write over the old ones! My thinking was that this was desirable in order to save us on S3 space, but we could also use schema_table_unique in development if we want tables to be persisted between versions.

Member

@jeancochrane Just to call out some potentially unexpected behavior, some quick testing shows that dbt doesn't necessarily remove the entire old materialization:

Here we create a new test table partitioned by year and stored in S3 as parquet:

CREATE TABLE location.test_table
WITH (
    format = 'Parquet',
    write_compression = 'SNAPPY',
    external_location = 's3://ccao-dbt-athena-test-us-east-1/test_table',
    partitioned_by = ARRAY['year']
) AS
SELECT b23025_007m, year
FROM census.acs5

Dropping the table removes the table metadata defining the location and partitions, but not the data itself. All the data still exists in S3.

DROP TABLE location.test_table;

Re-instantiating the table using a subset of the original partitions correctly creates new parquet files for each year after 2017.

CREATE TABLE location.test_table
WITH (
    format = 'Parquet',
    write_compression = 'SNAPPY',
    external_location = 's3://ccao-dbt-athena-test-us-east-1/test_table',
    partitioned_by = ARRAY['year']
) AS
SELECT b23025_007m, year
FROM census.acs5
WHERE year >= '2018'

SELECT year, COUNT(*) AS cnt
FROM location.test_table
GROUP BY year
ORDER BY year
row year cnt
1 2018 17521
2 2019 17521
3 2020 17619
4 2021 17619

HOWEVER, the original pre-2018 partitions/parquet files remain in S3, since they aren't overwritten by the new CREATE TABLE nor actually dropped by the DROP TABLE. This doesn't matter too much since those partitions aren't included in the Athena table definition, but we do use a lot of crawlers (which would definitely add those stale partitions back to the table).

I'm not sure this is actually a problem, or that there's anything to do about it. I just felt the need to call it out, since it makes me uneasy and isn't how a typical relational DB would act.
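
One way to make the crawler concern concrete (a hypothetical follow-up to the experiment above, assuming the same table and bucket): since the pre-2018 parquet files still sit under the external location in Hive-style year= prefixes, anything that re-discovers partitions from storage (a Glue crawler, or a manual metadata repair) would register them on the table again:

MSCK REPAIR TABLE location.test_table;

SELECT DISTINCT year
FROM location.test_table
ORDER BY year;
-- Under that assumption, pre-2018 years would reappear alongside 2018-2021,
-- even though the most recent CTAS only selected year >= '2018'.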

Contributor Author

Hm, interesting! Thanks for the experiment @dfsnow, I'm going to play around a little bit to see if any of the other materialization options perform more closely to expectation here. I'm guessing you ran the DROP TABLE statement directly in Athena; did you build location.test_table using dbt or using an Athena query? If the former, do you happen to have copies of the dbt models that generated those parquet files?

Member

I didn't build directly with dbt, no, but I'm happy to test further! Though I'd be really surprised if dbt-athena had some sort of logic to drop the extra partitions. Either way, let's not consider this blocking, just something to keep in mind.

@jeancochrane jeancochrane marked this pull request as ready for review September 6, 2023 17:35
@jeancochrane jeancochrane requested a review from a team as a code owner September 6, 2023 17:35
@dfsnow dfsnow left a comment

@jeancochrane See my blocking questions. Mainly concerned about overwriting the current state of the production data + complexity in the dist_pin_to_pin CTAS.

I also have another question unrelated to a specific line: How should we update/re-run these CTAS when the underlying source data is updated? Using a single example, we will soon gather new spatial.park data for 2024. When we want to update the state of dist_pin_to_park using the new data, do we just run dbt locally, targeting prod, and then upload the resulting state manifest back to the master cache location? I'm guessing this falls under the complexity mentioned in point 2 of the OP.


Comment on lines +27 to +28
s3_data_dir: s3://ccao-athena-ctas-us-east-1/
s3_data_naming: schema_table
Member

question (blocking): Will merging this branch overwrite the existing production tables? If so, we may want to be cautious about a merge until the dist_pin_to_pin issue is resolved.

Contributor Author

It will! See my note in this thread above for a suggestion of how we can handle this.

@jeancochrane jeancochrane force-pushed the jeancochrane/101-data-catalog-move-ctas-into-dbt-dag branch from f843fd6 to f2afd7d on September 11, 2023 21:40
@jeancochrane
Contributor Author

I also have another question unrelated to a specific line: How should we update/re-run these CTAS when the underlying source data is updated? Using a single example, we will soon gather new spatial.park data for 2024. When we want to update the state of dist_pin_to_park using the new data, do we just run dbt locally, targeting prod, and then upload the resulting state manifest back to the master cache location? I'm guessing this falls under the complexity mentioned in point 2 of the OP.

This is a great question @dfsnow! I think there are a few different ways we could approach this, each of which incurs a tradeoff between convenience and engineering effort:

Approach | Convenience | Engineering effort
Use source freshness and the source_status selector in a dispatchable GitHub Actions workflow to rebuild models whose sources have been updated | High | High
Write a dispatchable GitHub Actions workflow that accepts a list of models to rebuild as an input variable | Medium | Medium
Add docs to the README instructing devs to download remote state, figure out which models are downstream of the updated sources, rebuild downstream models using --select, and upload remote state | Low | Low

The "medium" approach seems like the best option to me, since it's not substantially more effort than testing and documenting the commands locally and I could see it being useful outside of this particular use case. What do you think? If that sounds good to you I can open up an issue for it!

@dfsnow
Member

dfsnow commented Sep 13, 2023

The "medium" approach seems like the best option to me, since it's not substantially more effort than testing and documenting the commands locally and I could see it being useful outside of this particular use case. What do you think? If that sounds good to you I can open up an issue for it!

@jeancochrane Let's do both the medium and low approaches, since as you said the medium approach isn't substantially more effort than the low, and I think it would be useful to have the commands documented somewhere.

@jeancochrane
Contributor Author

Makes sense to me @dfsnow, I opened a follow-up issue here! #128

@jeancochrane jeancochrane merged commit 8d9b168 into master Sep 14, 2023
3 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/101-data-catalog-move-ctas-into-dbt-dag branch September 14, 2023 22:12