Refactor Athena table definitions to move them into the dbt DAG #111
Conversation
s3_data_dir: s3://ccao-dbt-athena-test-us-east-1/data/
s3_staging_dir: s3://ccao-dbt-athena-dev-us-east-1/results/
s3_data_dir: s3://ccao-dbt-athena-dev-us-east-1/data/
s3_data_naming: schema_table
See the docs for a detailed explanation of the S3 data naming options that dbt-athena exposes to us. schema_table corresponds to {s3_data_dir}/{schema}/{table}/, which matches how production data has historically been stored, so I figured we should move to it over the default schema_table_unique. This config renders the external_location attribute of the CTAS unnecessary.
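To make that concrete, the CTAS rendered for a model should end up looking roughly like the sketch below (the table name, columns, and exact options are illustrative, not dbt-athena's literal output), with the location derived from s3_data_dir plus the schema and table name rather than hard-coded in our SQL:

-- Sketch of a rendered CTAS, not dbt-athena's literal output.
-- The location below is derived from s3_data_dir + schema_table naming,
-- so none of our model files need to specify external_location themselves.
CREATE TABLE location.test_table
WITH (
    format = 'Parquet',
    write_compression = 'SNAPPY',
    external_location = 's3://ccao-dbt-athena-dev-us-east-1/data/location/test_table',
    partitioned_by = ARRAY['year']
) AS
SELECT b23025_007m, year
FROM census.acs5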
Nice! This is a great simplification. Big fan.
question (blocking): What is the materialization strategy used here? As in, if someone is working on a branch and creates a new materialized table, then edits their branch and re-materializes it, will the old table be entirely overwritten? Will it delete all the partitions created by the previous materialization?
Under schema_table, new materialized tables will indeed write over the old ones! My thinking was that this was desirable in order to save us on S3 space, but we could also use schema_table_unique in development if we want tables to be persisted between versions.
@jeancochrane Just to call out some potentially unexpected behavior, some quick testing shows that dbt doesn't necessarily remove the entire old materialization:
Here we create a new test table partitioned by year and stored in S3 as parquet:
CREATE TABLE location.test_table
WITH (
format = 'Parquet',
write_compression = 'SNAPPY',
external_location = 's3://ccao-dbt-athena-test-us-east-1/test_table',
partitioned_by = ARRAY['year']
) AS
SELECT b23025_007m, year
FROM census.acs5
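As a sanity check on where those files actually live, Athena's "$path" pseudo-column shows the backing S3 object for each row (illustrative query; any partition works):

-- Shows a few of the S3 objects backing the new table; handy later for
-- confirming that the files survive a DROP TABLE.
SELECT DISTINCT "$path"
FROM location.test_table
WHERE year = '2016'
LIMIT 5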
Dropping the table removes the table metadata defining the location and partitions, but not the data itself. All the data still exists in S3.
DROP TABLE location.test_table;
Re-instantiating the table using a subset of the original partitions correctly creates new parquet files for each year after 2017.
CREATE TABLE location.test_table
WITH (
format = 'Parquet',
write_compression = 'SNAPPY',
external_location = 's3://ccao-dbt-athena-test-us-east-1/test_table',
partitioned_by = ARRAY['year']
) AS
SELECT b23025_007m, year
FROM census.acs5
WHERE year >= '2018'
SELECT year, COUNT(*) AS cnt
FROM location.test_table
GROUP BY year
ORDER BY year
row | year | cnt
----|------|------
1   | 2018 | 17521
2   | 2019 | 17521
3   | 2020 | 17619
4   | 2021 | 17619
HOWEVER, the original pre-2018 partitions/parquet files remain in S3, since they aren't overwritten by the new CREATE TABLE nor actually dropped by the DROP TABLE. This doesn't matter too much since those partitions aren't included in the Athena table definition, but we do use a lot of crawlers (which would definitely add those stale partitions back to the table).
I'm not sure this is actually a problem or that there's anything to do about it; I just felt the need to call it out since it makes me uneasy and isn't how a typical relational DB would behave.
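For instance (just a sketch, and it assumes the leftover files sit in Hive-style year=... prefixes, which is how Athena CTAS writes partitioned output), re-registering partitions from S3 would quietly pull the stale pre-2018 data back in, which is effectively what a crawler run would do:

-- Re-discovers every Hive-style partition directory under the table's
-- external_location, including the orphaned pre-2018 ones.
MSCK REPAIR TABLE location.test_table;

-- The pre-2018 years would now show up again alongside 2018-2021.
SELECT year, COUNT(*) AS cnt
FROM location.test_table
GROUP BY year
ORDER BY year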
Hm, interesting! Thanks for the experiment @dfsnow, I'm going to play around a little bit to see if any of the other materialization options perform more closely to expectation here. I'm guessing you ran the DROP TABLE statement directly in Athena; did you build location.test_table using dbt or using an Athena query? If the former, do you happen to have copies of the dbt models that generated those parquet files?
I didn't build directly with dbt, no, but happy to test further! Though I'd be really surprised if dbt-athena had some sort of logic to drop the extra partitions. Either way, let's not consider this blocking, just something to keep in mind.
@jeancochrane See my blocking questions. Mainly concerned about overwriting the current state of the production data + complexity in the dist_pin_to_pin CTAS.

I also have another question unrelated to a specific line: How should we update/re-run these CTAS when the underlying source data is updated? Using a single example, we will soon gather new spatial.park data for 2024. When we want to update the state of dist_pin_to_park using the new data, do we just run dbt locally, targeting prod, and then upload the resulting state manifest back to the master cache location? I'm guessing this falls under the complexity mentioned in point 2 of the OP.
s3_data_dir: s3://ccao-athena-ctas-us-east-1/
s3_data_naming: schema_table
question (blocking): Will merging this branch overwrite the existing production tables? If so, we may want to be cautious about a merge until the dist_pin_to_pin issue is resolved.
It will! See my note in this thread above for a suggestion of how we can handle this.
This is a great question @dfsnow! I think there are a few different ways we could approach this, each of which incurs a tradeoff between convenience and engineering effort:
The "medium" approach seems like the best option to me, since it's not substantially more effort than testing and documenting the commands locally, and I could see it being useful outside of this particular use case. What do you think? If that sounds good to you I can open up an issue for it!
@jeancochrane Let's do both the medium and low approaches, since as you said the medium approach isn't substantially more effort than the low, and I think it would be useful to have the commands documented somewhere.
Overview
This PR adds symlinks and dbt model definitions for the Athena tables defined in the aws-athena/ctas/ directory. It also tweaks those table definitions to fit them into the DAG and properly define their lineage.

Additional complexities

Note that this change introduces two new areas of complexity to our dbt environments:

1. More resources to clean up

Since Athena CREATE TABLE AS statements store their output in S3, and since we expect to occasionally rebuild these tables as part of CI runs, we need to add an additional layer on top of the cleanup_dbt_resources.sh script that is run on CI in order to clean up S3 resources that may be created on CI as part of the model building process. In order to handle this, I've added a new lifecycle rule to the ccao-dbt-athena-ci-us-east-1 bucket to delete all old data with the data/ prefix after 30 days, which should be more than enough time to close any PRs that may make use of those data.

Note that I did not add a corresponding lifecycle rule to the ccao-dbt-athena-dev-us-east-1 bucket; my reasoning is that it's more likely that developers will want to persist their development table state over the long term.

2. More complex state selection for local development

These tables are more complicated to keep in the DAG than the views defined in aws-athena/views/, since the processing they perform is CPU intensive and they need to store their table output in S3. We shouldn't need to worry about this on CI, since we use deferral and state selection to avoid rebuilding models unless they are new or have changed, but we will need an additional layer of safeguards for local development to ensure that developers who are less experienced with dbt don't accidentally waste time and resources rebuilding these tables unnecessarily.

To this end, we plan to introduce a wrapper script around dbt that beginning dbt developers can use to abstract away the complexity of remote state management.
Closes #101.