
Refactor Athena table definitions to move them into the dbt DAG #111

Merged

Conversation


@jeancochrane jeancochrane commented Aug 31, 2023

Overview

This PR adds symlinks and dbt model definitions for the Athena tables defined in the aws-athena/ctas/ directory. It also tweaks those table definitions to fit them into the DAG and properly define their lineage.
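
To give a sense of what "fit them into the DAG" involves, here is a rough sketch of the typical lineage tweak (the column and upstream table are borrowed from the census.acs5 example later in this thread, and the source definition is assumed, so this is illustrative rather than the actual contents of the PR): a hardcoded table reference in the standalone CTAS becomes a dbt source() or ref() call, which is how dbt learns the model's dependencies.

-- Standalone Athena CTAS: the upstream table is hardcoded, so dbt cannot see
-- the dependency
SELECT b23025_007m, year
FROM census.acs5

-- Equivalent dbt model: the dependency is declared through the source() macro,
-- which dbt resolves to census.acs5 and records as an edge in the DAG
SELECT b23025_007m, year
FROM {{ source('census', 'acs5') }}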

Additional complexities

Note that this change introduces two new areas of complexity to our dbt environments:

1. More resources to clean up

Since Athena CREATE TABLE AS statements store their output in S3, and since we expect to occasionally rebuild these tables as part of CI runs, we need an additional layer on top of the cleanup_dbt_resources.sh script that runs on CI to clean up any S3 resources created during model building. To handle this, I've added a new lifecycle rule to the ccao-dbt-athena-ci-us-east-1 bucket that deletes all old data under the data/ prefix after 30 days, which should be more than enough time to close any PRs that make use of those data.

Note that I did not add a corresponding lifecycle rule to the ccao-dbt-athena-dev-us-east-1 bucket; my reasoning is that it's more likely that developers will want to persist their development table state over the long term.

2. More complex state selection for local development

These tables are more complicated to keep in the DAG than the views defined in aws-athena/views/, since the processing they perform is CPU-intensive and they need to store their table output in S3. We shouldn't need to worry about this on CI, since we use deferral and state selection to avoid rebuilding models unless they are new or have changed, but we will need an additional layer of safeguards for local development to ensure that developers who are less experienced with dbt don't accidentally waste time and resources rebuilding these tables unnecessarily.

To this end, we plan to introduce a wrapper script around dbt that beginning dbt developers can use to abstract away the complexity of remote state management.

Closes #101.

.github/actions/setup_dbt/action.yaml
.github/workflows/build_and_test_dbt.yaml
aws-athena/ctas/location-access.sql
aws-athena/ctas/location-census_acs5.sql
aws-athena/ctas/proximity-dist_pin_to_pin.sql
dbt/dbt_project.yml
dbt/models/default/schema.yml
dbt/profiles.yml
s3_data_dir: s3://ccao-dbt-athena-test-us-east-1/data/
s3_staging_dir: s3://ccao-dbt-athena-dev-us-east-1/results/
s3_data_dir: s3://ccao-dbt-athena-dev-us-east-1/data/
s3_data_naming: schema_table
Contributor Author

See the docs for a detailed explanation of the S3 data naming options that dbt-athena exposes to us. schema_table corresponds to {s3_data_dir}/{schema}/{table}/, which matches how production data has historically been stored, so I figured we should move to it over the default schema_table_unique. This config renders the external_location attribute of the CTAS unnecessary.
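
As a minimal sketch of what a model header could look like under this scheme (the config keys follow my reading of the dbt-athena table materialization, and the query just reuses the census.acs5 example from the experiment later in this thread, so treat it as illustrative rather than an actual model in this PR):

{{
    config(
        materialized='table',
        format='parquet',
        write_compression='snappy',
        partitioned_by=['year']
    )
}}

SELECT b23025_007m, year
FROM {{ source('census', 'acs5') }}

Note that nothing here points at S3 directly: with s3_data_naming set to schema_table, dbt-athena derives the output path from {s3_data_dir} plus the schema and table names.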

Member

Nice! This is a great simplification. Big fan.

question (blocking): What is the materialization strategy used here? As in, if someone is working on a branch and creates a new materialized table, then edits their branch and re-materializes it, will the old table be entirely overwritten? Will it delete all the partitions created by the previous materialization?

Contributor Author

Under schema_table, new materialized tables will indeed write over the old ones! My thinking was that this was desirable in order to save us on S3 space, but we could also use schema_table_unique in development if we want tables to be persisted between versions.

Member

@jeancochrane Just to call out some potentially unexpected behavior, some quick testing shows that dbt doesn't necessarily remove the entire old materialization:

Here we create a new test table partitioned by year and stored in S3 as parquet:

CREATE TABLE location.test_table
WITH (
    format = 'Parquet',
    write_compression = 'SNAPPY',
    external_location = 's3://ccao-dbt-athena-test-us-east-1/test_table',
    partitioned_by = ARRAY['year']
) AS
SELECT b23025_007m, year
FROM census.acs5

Dropping the table removes the table metadata defining the location and partitions, but not the data itself. All the data still exists in S3.

DROP TABLE location.test_table;

Re-instantiating the table using a subset of the original partitions correctly creates new parquet files for each year after 2017.

CREATE TABLE location.test_table
WITH (
    format = 'Parquet',
    write_compression = 'SNAPPY',
    external_location = 's3://ccao-dbt-athena-test-us-east-1/test_table',
    partitioned_by = ARRAY['year']
) AS
SELECT b23025_007m, year
FROM census.acs5
WHERE year >= '2018'

SELECT year, COUNT(*) AS cnt
FROM location.test_table
GROUP BY year
ORDER BY year
row year cnt
1 2018 17521
2 2019 17521
3 2020 17619
4 2021 17619

HOWEVER, the original pre-2018 partitions/parquet files remain in S3, since they aren't overwritten by the new CREATE TABLE nor actually dropped by the DROP TABLE. This doesn't matter too much since those partitions aren't included in the Athena table definition, but we do use a lot of crawlers (which would definitely add those stale partitions back to the table).

I'm not sure this is actually a problem, or that there's anything to do about it. I just felt the need to call it out, since it makes me uneasy and isn't how a typical relational DB would act.
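
One way to make the crawler concern concrete (a hypothetical follow-up to the experiment above, assuming the same table and bucket): since the pre-2018 parquet files still sit under the external location in Hive-style year= prefixes, anything that re-discovers partitions from storage (a Glue crawler, or a manual metadata repair) would register them on the table again:

MSCK REPAIR TABLE location.test_table;

SELECT DISTINCT year
FROM location.test_table
ORDER BY year;
-- Under that assumption, pre-2018 years would reappear alongside 2018-2021,
-- even though the most recent CTAS only selected year >= '2018'.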

Contributor Author

Hm, interesting! Thanks for the experiment @dfsnow, I'm going to play around a little bit to see if any of the other materialization options perform more closely to expectation here. I'm guessing you ran the DROP TABLE statement directly in Athena; did you build location.test_table using dbt or using an Athena query? If the former, do you happen to have copies of the dbt models that generated those parquet files?

Member

I didn't build directly with dbt, no, but I'm happy to test further! Though I'd be really surprised if dbt-athena had some sort of logic to drop the extra partitions. Either way, let's not consider this blocking, just something to keep in mind.

@jeancochrane jeancochrane marked this pull request as ready for review September 6, 2023 17:35
@jeancochrane jeancochrane requested a review from a team as a code owner September 6, 2023 17:35
@dfsnow dfsnow left a comment

@jeancochrane See my blocking questions. Mainly concerned about overwriting the current state of the production data + complexity in the dist_pin_to_pin CTAS.

I also have another question unrelated to a specific line: How should we update/re-run these CTAS when the underlying source data is updated? Using a single example, we will soon gather new spatial.park data for 2024. When we want to update the state of dist_pin_to_park using the new data, do we just run dbt locally, targeting prod, and then upload the resulting state manifest back to the master cache location? I'm guessing this falls under the complexity mentioned in point 2 of the OP.


Comment on lines +27 to +28
s3_data_dir: s3://ccao-athena-ctas-us-east-1/
s3_data_naming: schema_table
Member

question (blocking): Will merging this branch overwrite the existing production tables? If so, we may want to be cautious about a merge until the dist_pin_to_pin issue is resolved.

Contributor Author

It will! See my note in this thread above for a suggestion of how we can handle this.

@jeancochrane jeancochrane force-pushed the jeancochrane/101-data-catalog-move-ctas-into-dbt-dag branch from f843fd6 to f2afd7d on September 11, 2023 21:40
@jeancochrane
Contributor Author

I also have another question unrelated to a specific line: How should we update/re-run these CTAS when the underlying source data is updated? Using a single example, we will soon gather new spatial.park data for 2024. When we want to update the state of dist_pin_to_park using the new data, do we just run dbt locally, targeting prod, and then upload the resulting state manifest back to the master cache location? I'm guessing this falls under the complexity mentioned in point 2 of the OP.

This is a great question @dfsnow! I think there are a few different ways we could approach this, each of which incurs a tradeoff between convenience and engineering effort:

Approach | Convenience | Engineering effort
Use source freshness and the source_status selector in a dispatchable GitHub Actions workflow to rebuild models whose sources have been updated | High | High
Write a dispatchable GitHub Actions workflow that accepts a list of models to rebuild as an input variable | Medium | Medium
Add docs to the README instructing devs to download remote state, figure out which models are downstream of the updated sources, rebuild downstream models using --select, and upload remote state | Low | Low

The "medium" approach seems like the best option to me, since it's not substantially more effort than testing and documenting the commands locally and I could see it being useful outside of this particular use case. What do you think? If that sounds good to you I can open up an issue for it!

@dfsnow
Member

dfsnow commented Sep 13, 2023

The "medium" approach seems like the best option to me, since it's not substantially more effort than testing and documenting the commands locally and I could see it being useful outside of this particular use case. What do you think? If that sounds good to you I can open up an issue for it!

@jeancochrane Let's do both the medium and low approaches, since as you said the medium approach isn't substantially more effort than the low, and I think it would be useful to have the commands documented somewhere.

@jeancochrane
Contributor Author

Makes sense to me @dfsnow, I opened a follow-up issue here! #128

@jeancochrane jeancochrane merged commit 8d9b168 into master Sep 14, 2023
3 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/101-data-catalog-move-ctas-into-dbt-dag branch September 14, 2023 22:12