Move dbt cache to remote S3 storage #90
key:
  description: The cache key to query for.
  required: true
restore-key:
The GitHub Action that this action is based on accepts `restore-keys` as a newline-delimited list, but for our purposes I only ever expect we'll need one fallback (for the master cache), so we gain some simplicity by switching to accepting a single string value.
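As a rough sketch of the resulting input shape (the description and required flag on `restore-key` here are assumptions; only the input names appear in the diff above):

```yaml
inputs:
  key:
    description: The cache key to query for.
    required: true
  restore-key:
    # A single fallback key (e.g. the master cache) instead of the
    # newline-delimited restore-keys list that actions/cache accepts.
    description: Optional fallback key to try if the primary key misses.
    required: false
```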
Makes sense; I'm always in favor of more simplicity.
bucket:
  description: The S3 bucket that should store the cache.
  required: false
  default: ccao-dbt-cache-us-east-1
I don't think it should be a problem to have this bucket name be public; the bucket itself is not public, and there shouldn't be anything sensitive in the dbt manifest file even if someone did manage to force their way in.
thought: Agreed. No problem here and it tracks with how we've treated other buckets/bucket names.
- name: Check for a cache key match
  id: query-cache-keys
  run: |
    if aws s3api head-object --bucket "$BUCKET" --key "$KEY/manifest.json"; then
Note that this simple `if ... then` conditional will fail on any nonzero status from `aws s3api head-object`, not just the expected 404. I think this is acceptable, since we want to continue without the cache in case of an unexpected error (e.g. a 403 Forbidden due to bad credentials), and testing confirms that the `aws s3api head-object` call prints the error to the console in these cases, so we should be able to debug them:

An error occurred (403) when calling the HeadObject operation: Forbidden
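A sketch of how such a check step could set its output (the step id and the `cache-hit` output name come from the diff; the fallback handling and the RESTORE_KEY variable are assumptions):

```yaml
- name: Check for a cache key match
  id: query-cache-keys
  shell: bash
  run: |
    # head-object exits 0 only if the object exists and is readable; any
    # other status (404, 403, network failure) falls through and the
    # workflow proceeds without a cache.
    if aws s3api head-object --bucket "$BUCKET" --key "$KEY/manifest.json"; then
      echo "cache-hit=true" >> "$GITHUB_OUTPUT"
    elif aws s3api head-object --bucket "$BUCKET" --key "$RESTORE_KEY/manifest.json"; then
      # Hypothetical fallback to the restore key (e.g. the master cache).
      echo "cache-hit=true" >> "$GITHUB_OUTPUT"
      echo "cache-matched-key=$RESTORE_KEY" >> "$GITHUB_OUTPUT"
    else
      echo "cache-hit=false" >> "$GITHUB_OUTPUT"
    fi
```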
praise: Nice, this is perfect. Super nice simplifications all around.
# In case of a cache hit, copy the contents of the cache to the path
- if: steps.query-cache-keys.outputs.cache-hit == 'true'
  name: Copy cache to path
  run: aws s3 cp "s3://$BUCKET/$KEY/manifest.json" "$CACHE_PATH/manifest.json"
In order to support listing, pushing, and pulling objects from this bucket, I adjusted the policy for the IAM role that the GitHub Actions runner assumes to give it limited read/write access to the ccao-dbt-cache-us-east-1 bucket.
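The statements involved would look roughly like the following, written here in YAML for readability; the exact actions and resources in the deployed policy may differ:

```yaml
# Hypothetical shape of the added IAM permissions (not the deployed policy).
Statement:
  - Sid: ListDbtCacheBucket
    Effect: Allow
    Action:
      - s3:ListBucket
    Resource: arn:aws:s3:::ccao-dbt-cache-us-east-1
  - Sid: ReadWriteDbtCacheObjects
    Effect: Allow
    Action:
      - s3:GetObject
      - s3:PutObject
    Resource: arn:aws:s3:::ccao-dbt-cache-us-east-1/*
```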
using: composite
steps:
  - name: Save dbt cache
    run: aws s3 cp "$CACHE_PATH/manifest.json" "s3://$BUCKET/$KEY/manifest.json"
This remote storage naming mostly works well, but in pull requests where the branch name includes a slash, it results in some potentially confusing nested folder behavior:

upload: dbt/target/manifest.json to s3://ccao-dbt-cache-us-east-1/dbt-cache-jeancochrane/89-fix-broken-cache-restoration/manifest.json

My guess is that this is probably fine, since the cache bucket will mostly be used by bots, and it might even be nice to have higher-level organization (e.g. anything in the `dbt-cache-jeancochrane` subfolder likely corresponds to a cache for a branch that I authored).

I also wonder if we want to add a lifecycle configuration to this bucket? It'll need a bit of research, since we'll want the `dbt-cache-master` subfolder to persist indefinitely while everything else can be deleted after a period of inactivity. Curious what you think about all this!
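(For context on where the nesting comes from: if the key is built from the branch name, e.g. something like the sketch below, a branch such as jeancochrane/89-fix-broken-cache-restoration naturally expands into a nested S3 prefix. The exact key construction in the workflow is an assumption here.)

```yaml
env:
  # A branch name containing a slash turns the key into a nested path:
  #   s3://ccao-dbt-cache-us-east-1/dbt-cache-jeancochrane/89-fix-broken-cache-restoration/manifest.json
  KEY: dbt-cache-${{ github.head_ref }}
```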
IMO this is totally fine, since none of us are likely to use this bucket manually.
suggestion (non-blocking): Let's definitely set some lifecycle rules, though, just to keep things tidy (likewise on the dbt Athena results buckets if there aren't any already 👀). IMO 60 days to mark all non-master keys for deletion, with another 30 days to actually delete, is sufficient. Let me know if you want to set it up or if I should grab it.
Cool! I had to adjust the cache keys to create a consistent prefix for PR caches that we could use in the lifecycle policy (`pr-caches/`), but I did that in 74031b6 and added a policy. Does this look right to you? The difference between expiration and deletion is a bit over my head, so perhaps it needs a little tweaking:
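Roughly, a rule in that spirit looks like the sketch below (not a copy of the policy from 74031b6; the 60/30-day values mirror the suggestion above). In a versioned bucket, Expiration only adds a delete marker to the current version, and NoncurrentVersionExpiration is what permanently deletes old versions later; in an unversioned bucket, Expiration removes the objects outright.

```yaml
Rules:
  - ID: delete-old-pr-caches
    Status: Enabled
    Filter:
      Prefix: pr-caches/
    # Expire (mark for deletion) PR caches 60 days after creation...
    Expiration:
      Days: 60
    # ...and, if versioning is enabled, permanently delete noncurrent
    # versions 30 days after they are superseded.
    NoncurrentVersionExpiration:
      NoncurrentDays: 30
```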
Plus, some proof that the policy works for PR runs:
"Expiration": "expiry-date=\"Mon, 25 Sep 2023 00:00:00 GMT\", rule-id=\"delete-old-pr-caches\"",
Luckily, dbt-athena creates the query results bucket with an existing lifecycle policy that deletes all query results after 3 days -- nice!
@jeancochrane That policy looks good to me!

suggestion (non-blocking): ccao-athena-results-us-east-1 is the bucket I set up for general Athena output, and I also set up that lifecycle rule. I was talking about ccao-dbt-athena-ci-us-east-1 and ccao-dbt-athena-test-us-east-1, neither of which has lifecycle rules. IMO we should probably set up lifecycle rules for those as well. I think we're good to merge this though!
Oops, my bad @dfsnow, I totally misread the bucket name 🤦🏻‍♀️ I went ahead and copied that policy to ccao-dbt-athena-ci-us-east-1 and ccao-dbt-athena-test-us-east-1!
steps.cache.outputs.cache-matched-key == format(
  '{0}-master', env.CACHE_NAME
)
- if: steps.cache.outputs.cache-hit == 'true'
One nice benefit of this change is that we can implement a definition of `cache-hit` that makes more sense to me (i.e. it's still a hit even if we restore from the fallback cache)!
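A sketch of what that definition could look like (the step id `cache`, the `cache-hit` and `cache-matched-key` outputs, and the CACHE_NAME variable appear in the diff; the surrounding step is hypothetical):

```yaml
# Count a restore from the fallback master cache as a hit too.
- name: Determine whether the dbt cache was hit
  id: cache-check
  shell: bash
  run: |
    if [[ "${{ steps.cache.outputs.cache-hit }}" == "true" || \
          "${{ steps.cache.outputs.cache-matched-key }}" == "${{ env.CACHE_NAME }}-master" ]]; then
      echo "cache-hit=true" >> "$GITHUB_OUTPUT"
    else
      echo "cache-hit=false" >> "$GITHUB_OUTPUT"
    fi
```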
Nice work @jeancochrane! Love that this actually simplifies things in many ways and that we can now more easily fetch the manifest for local testing if needed. Merge at your leisure!
This PR adjusts the way that we store our dbt state caches to get around the fact that GitHub Actions caches cannot be updated once they've been created. By moving our state caches to S3, we can ensure that the cache always gets updated at the end of every successful run of the `build-and-test-dbt` workflow, which will allow subsequent runs to only build and test resources that have been modified since the last successful run. See this run for evidence of the remote cache being used successfully.
Closes #89.
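For readers outside the repo, the overall flow described above looks roughly like this; the action paths, input names, and key format are illustrative (and after 74031b6 the PR cache keys live under a `pr-caches/` prefix), not the repo's exact layout:

```yaml
# Illustrative wiring of the build-and-test-dbt workflow (assumptions, not
# the actual workflow file).
on: pull_request

env:
  CACHE_NAME: dbt-cache

jobs:
  build-and-test-dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Restore the branch's manifest from S3, falling back to the master cache.
      - name: Restore dbt cache
        id: cache
        uses: ./.github/actions/restore_dbt_cache
        with:
          key: ${{ env.CACHE_NAME }}-${{ github.head_ref }}
          restore-key: ${{ env.CACHE_NAME }}-master

      # ... build and test only the resources that changed since the cached state ...

      # Unlike a GitHub Actions cache, the S3 object can be overwritten, so the
      # cache is refreshed at the end of every successful run.
      - name: Save dbt cache
        if: success()
        uses: ./.github/actions/save_dbt_cache
        with:
          key: ${{ env.CACHE_NAME }}-${{ github.head_ref }}
```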