
Move dbt cache to remote S3 storage #90

Merged: 10 commits into master, Aug 25, 2023

Conversation

@jeancochrane (Contributor) commented Aug 22, 2023

This PR adjusts the way that we store our dbt state caches to get around the fact that GitHub Actions caches cannot be updated once they've been created. By moving our state caches to S3, we can ensure that the cache always gets updated at the end of every successful run of the build-and-test-dbt workflow, which will allow subsequent runs to only build and test resources that have been modified since the last successful run.

See this run for evidence of the remote cache being used successfully.

Closes #89.
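The overall workflow shape can be sketched as follows (the action paths and key names here are hypothetical; the PR's actual workflow defines its own):

```yaml
# Hypothetical sketch of the build-and-test-dbt workflow: restore the dbt
# state cache from S3 up front, then save it back after a successful run so
# the cache is refreshed every time (unlike GitHub Actions caches, which
# are immutable once created).
jobs:
  build-and-test-dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Restore dbt cache from S3
        uses: ./.github/actions/restore_dbt_cache
        with:
          key: dbt-cache-${{ github.head_ref }}
          restore-key: dbt-cache-master
      # ... build and test modified dbt resources ...
      - name: Save dbt cache to S3
        if: success()
        uses: ./.github/actions/save_dbt_cache
        with:
          key: dbt-cache-${{ github.head_ref }}
```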

@jeancochrane jeancochrane linked an issue Aug 22, 2023 that may be closed by this pull request
@jeancochrane jeancochrane force-pushed the jeancochrane/89-fix-broken-cache-restoration branch from e89d955 to 635fcb3 Compare August 22, 2023 19:31
@jeancochrane jeancochrane force-pushed the jeancochrane/89-fix-broken-cache-restoration branch from 1c792cf to 62179f1 Compare August 24, 2023 19:52
@jeancochrane jeancochrane changed the title Fix broken dbt cache restoration by saving cache to the same path Move dbt cache to remote S3 storage Aug 24, 2023
key:
description: The cache key to query for.
required: true
restore-key:
@jeancochrane (Contributor, Author):

The GitHub Action that this action is based on accepts restore-keys as a newline-delimited list, but for our purposes I only ever expect we'll need to use one fallback (for the master cache), so we gain some simplicity by switching to accepting a string value.
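Under that simplification, the action's inputs might read as follows (the restore-key description is hypothetical):

```yaml
# Sketch of the composite action's inputs: restore-key is a single string
# fallback rather than actions/cache's newline-delimited restore-keys list.
inputs:
  key:
    description: The cache key to query for.
    required: true
  restore-key:
    description: A single fallback key (e.g. the master cache) to try on a miss.
    required: false
```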

@dfsnow (Member) replied:
Makes sense, I'm always in favor of more simplicity.

bucket:
description: The S3 bucket that should store the cache.
required: false
default: ccao-dbt-cache-us-east-1
@jeancochrane (Contributor, Author) commented Aug 24, 2023:

I don't think it should be a problem to have this bucket name be public; the bucket itself is not public, and there shouldn't be anything sensitive in the dbt manifest file even if someone did manage to force their way in.

@dfsnow (Member) replied:
thought: Agreed. No problem here and it tracks with how we've treated other buckets/bucket names.

- name: Check for a cache key match
id: query-cache-keys
run: |
if aws s3api head-object --bucket "$BUCKET" --key "$KEY/manifest.json"; then
@jeancochrane (Contributor, Author):

Note that this simple if ... then conditional treats any nonzero exit status from aws s3api head-object as a cache miss, not just the expected 404. I think this is acceptable, since we want to continue without the cache in the case of an unexpected error (e.g. a 403 Forbidden due to bad credentials), and testing confirms that the aws s3api head-object call prints the error to the console in these cases, so we should be able to debug them:

An error occurred (403) when calling the HeadObject operation: Forbidden
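This check can be sketched as a small shell function (the function name is hypothetical; BUCKET, KEY, and GITHUB_OUTPUT are assumed to be provided by the workflow environment):

```shell
# Sketch of the cache-key check: any head-object failure (404 missing key,
# 403 bad credentials, network error) falls through to cache-hit=false, so
# the workflow simply proceeds without the cache.
check_cache() {
  if aws s3api head-object --bucket "$BUCKET" --key "$KEY/manifest.json"; then
    echo "cache-hit=true" >> "$GITHUB_OUTPUT"
  else
    echo "cache-hit=false" >> "$GITHUB_OUTPUT"
  fi
}
```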

@dfsnow (Member) replied:

praise: Nice, this is perfect. Super nice simplifications all around.

# In case of a cache hit, copy the contents of the cache to the path
- if: steps.query-cache-keys.outputs.cache-hit == 'true'
name: Copy cache to path
run: aws s3 cp "s3://$BUCKET/$KEY/manifest.json" "$CACHE_PATH/manifest.json"
@jeancochrane (Contributor, Author):

In order to support listing, pushing, and pulling objects from this bucket, I adjusted the policy for the IAM role that the GitHub Actions runner assumes to give it limited read/write access to the ccao-dbt-cache-us-east-1 bucket.
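The PR doesn't reproduce the policy itself; a minimal sketch of what limited read/write access scoped to this one bucket might look like (the specific actions are assumptions, only the bucket name comes from the discussion):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::ccao-dbt-cache-us-east-1"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::ccao-dbt-cache-us-east-1/*"
    }
  ]
}
```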

using: composite
steps:
- name: Save dbt cache
run: aws s3 cp "$CACHE_PATH/manifest.json" "s3://$BUCKET/$KEY/manifest.json"
@jeancochrane (Contributor, Author):

This remote storage naming mostly works well, but in pull requests where the branch name includes a slash, it results in some potentially confusing nested folder behavior:

upload: dbt/target/manifest.json to s3://ccao-dbt-cache-us-east-1/dbt-cache-jeancochrane/89-fix-broken-cache-restoration/manifest.json

My guess is that this is probably fine, since the cache bucket will mostly be used by bots and it might even be nice to have higher-level organization (e.g. anything in the dbt-cache-jeancochrane subfolder likely corresponds to a cache for a branch that I authored).
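The nesting happens because S3 displays "/" in object keys as a folder hierarchy; a quick illustration using the branch from this PR:

```shell
# A branch name containing a slash produces a nested "folder" in S3,
# since "/" in an object key is rendered as a hierarchy separator.
BRANCH="jeancochrane/89-fix-broken-cache-restoration"
KEY="dbt-cache-${BRANCH}"
echo "s3://ccao-dbt-cache-us-east-1/${KEY}/manifest.json"
# → s3://ccao-dbt-cache-us-east-1/dbt-cache-jeancochrane/89-fix-broken-cache-restoration/manifest.json
```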

I also wonder if we want to add a lifecycle configuration to this bucket? It'll need a bit of research, since we'll want the dbt-cache-master subfolder to persist indefinitely while everything else can be deleted after a period of inactivity. Curious what you think about all this!

@dfsnow (Member) replied:

IMO this is totally fine, since none of us are likely to use this bucket manually.

suggestion (non-blocking): Let's definitely set some lifecycle rules though just to keep things tidy (likewise on the dbt Athena results buckets if there aren't any already 👀). IMO 60 days to mark all non-master keys for deletion with another 30 days to delete is sufficient. Let me know if you want to set it up or if I should grab it.

@jeancochrane (Contributor, Author) commented Aug 25, 2023:

Cool! I had to adjust the cache keys to create a consistent prefix for PR caches that we could use in the lifecycle policy (pr-caches/), but I did that in 74031b6 and added a policy. Does this look right to you? The difference between expiration and deletion is a bit over my head, so perhaps it needs a little tweaking:

[Screenshot: lifecycle rule configuration]

Plus, some proof that the policy works for PR runs:

"Expiration": "expiry-date=\"Mon, 25 Sep 2023 00:00:00 GMT\", rule-id=\"delete-old-pr-caches\"",
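The configured rule isn't reproduced in the thread, but a lifecycle configuration consistent with the pr-caches/ prefix and the delete-old-pr-caches rule ID in that output might look roughly like this (the 30-day expiration window is inferred from the expiry date above and is an assumption):

```json
{
  "Rules": [
    {
      "ID": "delete-old-pr-caches",
      "Status": "Enabled",
      "Filter": { "Prefix": "pr-caches/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```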

Luckily, dbt-athena creates the query results bucket with an existing lifecycle policy that deletes all query results after 3 days -- nice!

[Screenshot: query results bucket lifecycle policy, Aug 25, 2023]

@dfsnow (Member) commented Aug 25, 2023:

@jeancochrane That policy looks good to me!

suggestion (non-blocking): ccao-athena-results-us-east-1 is the bucket I set up for general Athena output; I also set up that lifecycle rule. I was talking about ccao-dbt-athena-ci-us-east-1 and ccao-dbt-athena-test-us-east-1, neither of which has lifecycle rules. IMO we should probably set up lifecycle rules for those as well. I think we're good to merge this though!

@jeancochrane (Contributor, Author):

Oops, my bad @dfsnow, I totally misread the bucket name 🤦🏻‍♀️ I went ahead and copied that policy to ccao-dbt-athena-ci-us-east-1 and ccao-dbt-athena-test-us-east-1!

steps.cache.outputs.cache-matched-key == format(
'{0}-master', env.CACHE_NAME
)
- if: steps.cache.outputs.cache-hit == 'true'
@jeancochrane (Contributor, Author):

One nice benefit of this change is that we can implement a definition of cache-hit that makes more sense to me (i.e. it's still a hit even if we restore from the fallback cache)!
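A sketch of how the composite action could surface this broader definition (the step and output names are assumptions based on the snippets in this PR):

```yaml
# Hypothetical outputs block for the restore action: cache-hit is true on
# either a primary-key or restore-key match, and cache-matched-key reports
# which key was actually restored so workflows can tell the two cases apart.
outputs:
  cache-hit:
    description: "'true' if either the key or the restore-key matched."
    value: ${{ steps.query-cache-keys.outputs.cache-hit }}
  cache-matched-key:
    description: The key that actually matched, if any.
    value: ${{ steps.query-cache-keys.outputs.cache-matched-key }}
```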

@jeancochrane jeancochrane marked this pull request as ready for review August 24, 2023 22:35
@jeancochrane jeancochrane requested a review from a team as a code owner August 24, 2023 22:35
@jeancochrane jeancochrane requested review from dfsnow and wagnerlmichael and removed request for wagnerlmichael August 24, 2023 22:35
@dfsnow (Member) left a review:

Nice work @jeancochrane! Love that this actually simplifies things in many ways and that we can now more easily fetch the manifest for local testing if needed. Merge at your leisure!


@jeancochrane jeancochrane force-pushed the jeancochrane/89-fix-broken-cache-restoration branch from 5beba2e to 74031b6 Compare August 25, 2023 18:50
@jeancochrane jeancochrane merged commit c3e8922 into master Aug 25, 2023
3 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/89-fix-broken-cache-restoration branch August 25, 2023 20:57
Linked issue that this pull request may close: Fix broken cache restoration (#89). 2 participants.