Move dbt cache to remote S3 storage #90

@@ -0,0 +1,74 @@
name: Restore dbt cache
description: Attempts to restore dbt cache from S3 storage.
inputs:
  path:
    description: The local path to restore the cache to in case of a hit.
    required: true
  key:
    description: The cache key to query for.
    required: true
  restore-key:
    description: An additional key to use as a fallback for the cache.
    required: true
  bucket:
    description: The S3 bucket that should store the cache.
    required: false
    default: ccao-dbt-cache-us-east-1

Review comment: I don't think it should be a problem to have this bucket name be public; the bucket itself is not public, and there shouldn't be anything sensitive in the dbt manifest file even if someone did manage to force their way in.

Reply: thought: Agreed. No problem here and it tracks with how we've treated other buckets/bucket names.

outputs:
  cache-hit:
    description: >-
      Boolean indicating whether a match was found for the cache key.
    value: ${{ steps.query-cache-keys.outputs.cache-hit }}
  exact-match:
    description: >-
      Boolean indicating whether a cache hit was an exact match. Always
      false if cache-hit is false.
    value: ${{ steps.query-cache-keys.outputs.exact-match }}
  cache-matched-key:
    description: The cache key that matched, if any. Empty if cache-hit is false.
    value: ${{ steps.query-cache-keys.outputs.cache-matched-key }}
runs:
  using: composite
  steps:
    - name: Check for a cache key match
      id: query-cache-keys
      run: |
        if aws s3api head-object --bucket "$BUCKET" --key "$KEY/manifest.json"; then

Review comment: Note that this simple …

Reply: praise: Nice, this is perfect. Super nice simplifications all around.

          echo "Cache hit: Found exact match"
          {
            echo "cache-hit=true";
            echo "exact-match=true";
            echo "cache-matched-key=$KEY"
          } >> $GITHUB_OUTPUT
        else
          echo "Did not find exact match for cache key, checking fallback"
          if aws s3api head-object --bucket "$BUCKET" --key "$RESTORE_KEY/manifest.json"; then
            echo "Cache hit: Found fallback match"
            {
              echo "cache-hit=true";
              echo "exact-match=false";
              echo "cache-matched-key=$RESTORE_KEY"
            } >> $GITHUB_OUTPUT
          else
            echo "Cache miss: Did not find fallback match for cache key"
            {
              echo "cache-hit=false";
              echo "exact-match=false";
              echo "cache-matched-key=''";
            } >> $GITHUB_OUTPUT
          fi
        fi
      shell: bash
      env:
        KEY: ${{ inputs.key }}
        RESTORE_KEY: ${{ inputs.restore-key }}
        BUCKET: ${{ inputs.bucket }}

    - if: steps.query-cache-keys.outputs.cache-hit == 'true'
      name: Copy cache to path
      run: aws s3 cp "s3://$BUCKET/$KEY/manifest.json" "$CACHE_PATH/manifest.json"

Review comment: In order to support listing, pushing, and pulling objects from this bucket, I adjusted the policy for the IAM role that the GitHub Actions runner assumes to give it limited read/write access to the cache bucket. (See the policy sketch after this diff.)

      shell: bash
      env:
        KEY: ${{ steps.query-cache-keys.outputs.cache-matched-key }}
        CACHE_PATH: ${{ inputs.path }}
        BUCKET: ${{ inputs.bucket }}
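
Referring to the IAM comment above: a minimal sketch of what that limited read/write policy might look like, expressed as an inline role policy attached with the AWS CLI. The role name and policy name are placeholders, and the exact statements applied to the real role may differ.

```sh
# Sketch only: grants list access on the cache bucket plus get/put on its
# objects, roughly the "limited read/write access" described in the comment.
cat > dbt-cache-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListCacheBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::ccao-dbt-cache-us-east-1"
    },
    {
      "Sid": "ReadWriteCacheObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::ccao-dbt-cache-us-east-1/*"
    }
  ]
}
EOF

# "github-actions-runner" and "dbt-cache-read-write" are hypothetical names.
aws iam put-role-policy \
  --role-name github-actions-runner \
  --policy-name dbt-cache-read-write \
  --policy-document file://dbt-cache-policy.json
```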

@@ -0,0 +1,23 @@
name: Save dbt cache
description: Updates dbt cache using S3 storage.
inputs:
  path:
    description: The local path to the state dir to upload as the new cache.
    required: true
  key:
    description: The key to use for the cache.
    required: true
  bucket:
    description: The S3 bucket that should store the cache.
    required: false
    default: ccao-dbt-cache-us-east-1
runs:
  using: composite
  steps:
    - name: Save dbt cache
      run: aws s3 cp "$CACHE_PATH/manifest.json" "s3://$BUCKET/$KEY/manifest.json"

Review comment: This remote storage naming mostly works well, but in pull requests where the branch name includes a slash, it results in some potentially confusing nested folder behavior. My guess is that this is probably fine, since the cache bucket will mostly be used by bots and it might even be nice to have higher-level organization (e.g. anything in the …). I also wonder if we want to add a lifecycle configuration to this bucket? It'll need a bit of research, since we'll want the …

Reply: IMO this is totally fine, since none of us are likely to use this bucket manually. suggestion (non-blocking): Let's definitely set some lifecycle rules though just to keep things tidy (likewise on the dbt Athena results buckets if there aren't any already 👀). IMO 60 days to mark all non-master keys for deletion with another 30 days to delete is sufficient. Let me know if you want to set it up or if I should grab it. (See the lifecycle sketch after this diff.)

Reply: Cool! I had to adjust the cache keys to create a consistent prefix for PR caches that we could use in the lifecycle policy (…). Plus, some proof that the policy works for PR runs: … Luckily, dbt-athena creates the query results bucket with an existing lifecycle policy that deletes all query results after 3 days -- nice!

Reply: @jeancochrane That policy looks good to me! suggestion (non-blocking): …

Reply: Oops, my bad @dfsnow, I totally misread the bucket name 🤦🏻‍♀️ I went ahead and copied that policy to …

      shell: bash
      env:
        KEY: ${{ inputs.key }}
        CACHE_PATH: ${{ inputs.path }}
        BUCKET: ${{ inputs.bucket }}
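
Following up on the lifecycle discussion above: a rough sketch of what such a rule could look like when applied with the AWS CLI. The PR-cache prefix, rule ID, and exact windows are assumptions (the thread mentions roughly 60 days to expire non-master caches plus another 30 days to fully delete); the policy actually applied to the bucket may differ.

```sh
# Sketch only: expires objects under a hypothetical PR-cache prefix 60 days
# after creation, and (if the bucket is versioned) removes noncurrent versions
# 30 days after they become noncurrent.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-pr-caches",
      "Status": "Enabled",
      "Filter": {"Prefix": "pr/"},
      "Expiration": {"Days": 60},
      "NoncurrentVersionExpiration": {"NoncurrentDays": 30}
    }
  ]
}
EOF

aws s3api put-bucket-lifecycle-configuration \
  --bucket ccao-dbt-cache-us-east-1 \
  --lifecycle-configuration file://lifecycle.json
```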

@@ -1,4 +1,3 @@
-CACHE_NAME=dbt-cache
 PROJECT_DIR=dbt
 STATE_DIR=state
 TARGET_DIR=target

@@ -23,35 +23,20 @@ jobs:
         with:
           role-to-assume: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}

-      # We have to use the separate `restore`/`save` actions instead of the
-      # unified `cache` action because only `restore` provides access to the
-      # `cache-matched-key` and `cache-primary-key` outputs as of v3
       - name: Restore dbt state cache
         id: cache
-        uses: actions/cache/restore@v3
+        uses: ./.github/actions/restore_dbt_cache
         with:
           path: ${{ env.PROJECT_DIR }}/${{ env.STATE_DIR }}
-          key: ${{ env.CACHE_NAME }}-${{ env.CACHE_KEY }}
-          restore-keys: |
-            ${{ env.CACHE_NAME }}-master
+          key: ${{ env.CACHE_KEY }}
+          restore-key: ${{ env.CACHE_RESTORE_KEY }}

-      # If we restore the cache from the `restore-keys` key, the `cache-hit`
-      # output will be 'false' but the `cache-matched-key` output will be
-      # the name of the `restore-keys` key; we want to count this case as a hit
-      - if: |
-          steps.cache.outputs.cache-hit == 'true' ||
-          steps.cache.outputs.cache-matched-key == format(
-            '{0}-master', env.CACHE_NAME
-          )
+      - if: steps.cache.outputs.cache-hit == 'true'

Review comment: One nice benefit of this change is that we can implement a definition of …

         name: Set command args to build/test modified resources
         run: echo "MODIFIED_RESOURCES_ONLY=true" >> "$GITHUB_ENV"
         shell: bash

-      - if: |
-          steps.cache.outputs.cache-hit != 'true' &&
-          steps.cache.outputs.cache-matched-key != format(
-            '{0}-master', env.CACHE_NAME
-          )
+      - if: steps.cache.outputs.cache-hit != 'true'
         name: Set command args to build/test all resources
         run: echo "MODIFIED_RESOURCES_ONLY=false" >> "$GITHUB_ENV"
         shell: bash

@@ -86,7 +71,7 @@ jobs:
         shell: bash

       - name: Update dbt state cache
-        uses: actions/cache/save@v3
+        uses: ./.github/actions/save_dbt_cache
         with:
           path: ${{ env.PROJECT_DIR }}/${{ env.TARGET_DIR }}
-          key: ${{ env.CACHE_NAME }}-${{ env.CACHE_KEY }}
+          key: ${{ env.CACHE_KEY }}

Review comment: The GitHub Action that this action is based on accepts restore-keys as a newline-delimited list, but for our purposes I only ever expect we'll need to use one fallback (for the master cache), so we gain some simplicity by switching to accepting a string value.

Reply: Makes sense, I'm always in favor of more simplicity.
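
To tie the pieces together, here's a minimal sketch of how a workflow might wire the two composite actions end to end. The workflow name, job layout, env values, and paths are placeholders (the real workflow derives CACHE_KEY and CACHE_RESTORE_KEY elsewhere, and configures AWS credentials before the restore step); only the action inputs and outputs mirror the definitions above.

```yaml
# Sketch only: placeholder workflow showing the restore -> build -> save flow.
# AWS credential setup (aws-actions/configure-aws-credentials) is omitted here.
name: dbt cache sketch
on: pull_request

jobs:
  build_dbt:
    runs-on: ubuntu-latest
    env:
      CACHE_KEY: feature-branch   # placeholder; derived from the branch in the real workflow
      CACHE_RESTORE_KEY: master   # the single fallback key discussed above
    steps:
      - uses: actions/checkout@v4

      - name: Restore dbt state cache
        id: cache
        uses: ./.github/actions/restore_dbt_cache
        with:
          path: dbt/state
          key: ${{ env.CACHE_KEY }}
          restore-key: ${{ env.CACHE_RESTORE_KEY }}

      # cache-hit covers both exact and fallback matches, so a single check
      # replaces the old format()-based comparison against the master key
      - if: steps.cache.outputs.cache-hit == 'true'
        run: echo "MODIFIED_RESOURCES_ONLY=true" >> "$GITHUB_ENV"
        shell: bash

      - name: Update dbt state cache
        uses: ./.github/actions/save_dbt_cache
        with:
          path: dbt/target
          key: ${{ env.CACHE_KEY }}
```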