Move dbt cache to remote S3 storage #90

Merged: 10 commits, merged Aug 25, 2023
6 changes: 4 additions & 2 deletions .github/actions/configure_dbt_environment/action.yaml
@@ -11,14 +11,16 @@ runs:
         echo "On pull request branch, setting dbt env to CI"
         {
           echo "TARGET=ci";
-          echo "CACHE_KEY=$GITHUB_HEAD_REF";
+          echo "CACHE_KEY=pr-caches/$GITHUB_HEAD_REF";
+          echo "CACHE_RESTORE_KEY=master-cache"
           echo "HEAD_REF=$GITHUB_HEAD_REF"
         } >> "$GITHUB_ENV"
       elif [[ $GITHUB_REF_NAME == 'master' ]]; then
         echo "On master branch, setting dbt env to prod"
         {
           echo "TARGET=prod";
-          echo "CACHE_KEY=master";
+          echo "CACHE_KEY=master-cache";
+          echo "CACHE_RESTORE_KEY=master-cache"
         } >> "$GITHUB_ENV"
       else
         echo "CI context did not match any of the expected environments"
74 changes: 74 additions & 0 deletions .github/actions/restore_dbt_cache/action.yaml
@@ -0,0 +1,74 @@
name: Restore dbt cache
description: Attempts to restore dbt cache from S3 storage.
inputs:
  path:
    description: The local path to restore the cache to in case of a hit.
    required: true
  key:
    description: The cache key to query for.
    required: true
  restore-key:

Contributor Author commented:

The GitHub Action that this action is based on accepts restore-keys as a newline-delimited list, but for our purposes I only ever expect we'll need to use one fallback (for the master cache), so we gain some simplicity by switching to accepting a string value.

Member replied:

Makes sense, I'm always in favor of more simplicity.

    description: An additional key to use as a fallback for the cache.
    required: true
  bucket:
    description: The S3 bucket that should store the cache.
    required: false
    default: ccao-dbt-cache-us-east-1

Contributor Author (@jeancochrane, Aug 24, 2023) commented:

I don't think it should be a problem to have this bucket name be public; the bucket itself is not public, and there shouldn't be anything sensitive in the dbt manifest file even if someone did manage to force their way in.

Member replied:

thought: Agreed. No problem here and it tracks with how we've treated other buckets/bucket names.

outputs:
  cache-hit:
    description: >-
      Boolean indicating whether a match was found for the cache key.
    value: ${{ steps.query-cache-keys.outputs.cache-hit }}
  exact-match:
    description: >-
      Boolean indicating whether a cache hit was an exact match. Always
      false if cache-hit is false.
    value: ${{ steps.query-cache-keys.outputs.exact-match }}
  cache-matched-key:
    description: The cache key that matched, if any. Empty if cache-hit is false.
    value: ${{ steps.query-cache-keys.outputs.cache-matched-key }}
runs:
  using: composite
  steps:
    - name: Check for a cache key match
      id: query-cache-keys
      run: |
        if aws s3api head-object --bucket "$BUCKET" --key "$KEY/manifest.json"; then

Contributor Author commented:

Note that this simple if ... then conditional treats any nonzero status from aws s3api head-object as a miss, not just the expected 404. I think this is acceptable, since we want to continue without the cache in case of an unexpected error (e.g. a 403 Forbidden due to bad credentials), and testing confirms that aws s3api head-object prints the error to the console in these cases, so we should be able to debug them:

An error occurred (403) when calling the HeadObject operation: Forbidden
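
A minimal sketch (hypothetical, not part of this PR) of how the expected 404 could be separated from other failures, should stricter handling ever be needed:

```bash
# Capture stderr so the expected 404 can be told apart from other errors.
if output=$(aws s3api head-object --bucket "$BUCKET" --key "$KEY/manifest.json" 2>&1); then
  echo "cache hit"
elif [[ "$output" == *"(404)"* ]]; then
  echo "cache miss"
else
  # Unexpected error (e.g. 403 Forbidden): surface it rather than
  # silently treating it as a cache miss.
  echo "$output" >&2
  exit 1
fi
```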

Member replied:

praise: Nice, this is perfect. Super nice simplifications all around.

echo "Cache hit: Found exact match"
{
echo "cache-hit=true";
echo "exact-match=true";
echo "cache-matched-key=$KEY"
} >> $GITHUB_OUTPUT
else
echo "Did not find exact match for cache key, checking fallback"
if aws s3api head-object --bucket "$BUCKET" --key "$RESTORE_KEY/manifest.json"; then
echo "Cache hit: Found fallback match"
{
echo "cache-hit=true";
echo "exact-match=false";
echo "cache-matched-key=$RESTORE_KEY"
} >> $GITHUB_OUTPUT
else
echo "Cache miss: Did not find fallback match for cache key"
{
echo "cache-hit=false";
echo "exact-match=false";
echo "cache-matched-key=''";
} >> $GITHUB_OUTPUT
fi
fi
shell: bash
env:
KEY: ${{ inputs.key }}
RESTORE_KEY: ${{ inputs.restore-key }}
BUCKET: ${{ inputs.bucket }}

    - if: steps.query-cache-keys.outputs.cache-hit == 'true'
      name: Copy cache to path
      run: aws s3 cp "s3://$BUCKET/$KEY/manifest.json" "$CACHE_PATH/manifest.json"

Contributor Author commented:

In order to support listing, pushing, and pulling objects from this bucket, I adjusted the policy for the IAM role that the GitHub Actions runner assumes to give it limited read/write access to the ccao-dbt-cache-us-east-1 bucket.
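
As a rough illustration (hypothetical role and policy names; the actual policy attached may differ), a scoped statement of this shape could be applied via the AWS CLI:

```bash
# Sketch only: grant list access on the bucket and get/put on its objects.
aws iam put-role-policy \
  --role-name github-actions-runner \
  --policy-name dbt-cache-access \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": "arn:aws:s3:::ccao-dbt-cache-us-east-1"
      },
      {
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": "arn:aws:s3:::ccao-dbt-cache-us-east-1/*"
      }
    ]
  }'
```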

      shell: bash
      env:
        KEY: ${{ steps.query-cache-keys.outputs.cache-matched-key }}
        CACHE_PATH: ${{ inputs.path }}
        BUCKET: ${{ inputs.bucket }}
23 changes: 23 additions & 0 deletions .github/actions/save_dbt_cache/action.yaml
@@ -0,0 +1,23 @@
name: Save dbt cache
description: Updates dbt cache using S3 storage.
inputs:
  path:
    description: The local path to the state dir to upload as the new cache.
    required: true
  key:
    description: The key to use for the cache.
    required: true
  bucket:
    description: The S3 bucket that should store the cache.
    required: false
    default: ccao-dbt-cache-us-east-1
runs:
  using: composite
  steps:
    - name: Save dbt cache
      run: aws s3 cp "$CACHE_PATH/manifest.json" "s3://$BUCKET/$KEY/manifest.json"

Contributor Author commented:

This remote storage naming mostly works well, but in pull requests where the branch name includes a slash, it results in some potentially confusing nested folder behavior:

upload: dbt/target/manifest.json to s3://ccao-dbt-cache-us-east-1/dbt-cache-jeancochrane/89-fix-broken-cache-restoration/manifest.json

My guess is that this is probably fine, since the cache bucket will mostly be used by bots and it might even be nice to have higher-level organization (e.g. anything in the dbt-cache-jeancochrane subfolder likely corresponds to a cache for a branch that I authored).
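
To make the nesting concrete (illustrative values only, shown with the later pr-caches/ prefix):

```bash
# A branch name containing a slash yields nested "folders" under the prefix.
GITHUB_HEAD_REF="jeancochrane/89-fix-broken-cache-restoration"
CACHE_KEY="pr-caches/$GITHUB_HEAD_REF"
echo "s3://ccao-dbt-cache-us-east-1/$CACHE_KEY/manifest.json"
# -> s3://ccao-dbt-cache-us-east-1/pr-caches/jeancochrane/89-fix-broken-cache-restoration/manifest.json
```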

I also wonder if we want to add a lifecycle configuration to this bucket? It'll need a bit of research, since we'll want the dbt-cache-master subfolder to persist indefinitely while everything else can be deleted after a period of inactivity. Curious what you think about all this!

Member replied:

IMO this is totally fine, since none of us are likely to use this bucket manually.

suggestion (non-blocking): Let's definitely set some lifecycle rules though just to keep things tidy (likewise on the dbt Athena results buckets if there aren't any already 👀). IMO 60 days to mark all non-master keys for deletion with another 30 days to delete is sufficient. Let me know if you want to set it up or if I should grab it.

Contributor Author (@jeancochrane, Aug 25, 2023) replied:

Cool! I had to adjust the cache keys to create a consistent prefix for PR caches that we could use in the lifecycle policy (pr-caches/), but I did that in 74031b6 and added a policy. Does this look right to you? The difference between expiration and deletion is a bit over my head, so perhaps it needs a little tweaking:

[Screenshot: lifecycle rule configuration for the cache bucket]

Plus, some proof that the policy works for PR runs:

"Expiration": "expiry-date=\"Mon, 25 Sep 2023 00:00:00 GMT\", rule-id=\"delete-old-pr-caches\"",

Luckily, dbt-athena creates the query results bucket with an existing lifecycle policy that deletes all query results after 3 days -- nice!

[Screenshot: existing lifecycle policy on the query results bucket]
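
A sketch of what that rule might look like applied via the CLI (the rule ID matches the output above; the 30-day window is inferred from the Aug 25 run expiring Sep 25, and the exact configuration is an assumption):

```bash
# Hypothetical reconstruction of the delete-old-pr-caches lifecycle rule.
aws s3api put-bucket-lifecycle-configuration \
  --bucket ccao-dbt-cache-us-east-1 \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "delete-old-pr-caches",
        "Filter": {"Prefix": "pr-caches/"},
        "Status": "Enabled",
        "Expiration": {"Days": 30}
      }
    ]
  }'
```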

Member (@dfsnow, Aug 25, 2023) replied:

@jeancochrane That policy looks good to me!

suggestion (non-blocking): ccao-athena-results-us-east-1 is the bucket I set up for general Athena output. I also set up that lifecycle rule. I was talking about ccao-dbt-athena-ci-us-east-1 and ccao-dbt-athena-test-us-east-1, neither of which has lifecycle rules. IMO we should probably set up lifecycle rules for those as well. I think we're good to merge this though!

Contributor Author replied:

Oops, my bad @dfsnow, I totally misread the bucket name 🤦🏻‍♀️ I went ahead and copied that policy to ccao-dbt-athena-ci-us-east-1 and ccao-dbt-athena-test-us-east-1!

      shell: bash
      env:
        KEY: ${{ inputs.key }}
        CACHE_PATH: ${{ inputs.path }}
        BUCKET: ${{ inputs.bucket }}
1 change: 0 additions & 1 deletion .github/variables/dbt.env
@@ -1,4 +1,3 @@
-CACHE_NAME=dbt-cache
 PROJECT_DIR=dbt
 STATE_DIR=state
 TARGET_DIR=target
29 changes: 7 additions & 22 deletions .github/workflows/build_and_test_dbt.yaml
@@ -23,35 +23,20 @@ jobs:
       with:
         role-to-assume: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}

-      # We have to use the separate `restore`/`save` actions instead of the
-      # unified `cache` action because only `restore` provides access to the
-      # `cache-matched-key` and `cache-primary-key` outputs as of v3
       - name: Restore dbt state cache
         id: cache
-        uses: actions/cache/restore@v3
+        uses: ./.github/actions/restore_dbt_cache
         with:
           path: ${{ env.PROJECT_DIR }}/${{ env.STATE_DIR }}
-          key: ${{ env.CACHE_NAME }}-${{ env.CACHE_KEY }}
-          restore-keys: |
-            ${{ env.CACHE_NAME }}-master
+          key: ${{ env.CACHE_KEY }}
+          restore-key: ${{ env.CACHE_RESTORE_KEY }}

-      # If we restore the cache from the `restore-keys` key, the `cache-hit`
-      # output will be 'false' but the `cache-matched-key` output will be
-      # the name of the `restore-keys` key; we want to count this case as a hit
-      - if: |
-          steps.cache.outputs.cache-hit == 'true' ||
-          steps.cache.outputs.cache-matched-key == format(
-            '{0}-master', env.CACHE_NAME
-          )
+      - if: steps.cache.outputs.cache-hit == 'true'

Contributor Author commented:

One nice benefit of this change is that we can implement a definition of cache-hit that makes more sense to me (i.e. it's still a hit even if we restore from the fallback cache)!

         name: Set command args to build/test modified resources
         run: echo "MODIFIED_RESOURCES_ONLY=true" >> "$GITHUB_ENV"
         shell: bash

-      - if: |
-          steps.cache.outputs.cache-hit != 'true' &&
-          steps.cache.outputs.cache-matched-key != format(
-            '{0}-master', env.CACHE_NAME
-          )
+      - if: steps.cache.outputs.cache-hit != 'true'
         name: Set command args to build/test all resources
         run: echo "MODIFIED_RESOURCES_ONLY=false" >> "$GITHUB_ENV"
         shell: bash
@@ -86,7 +71,7 @@ jobs:
         shell: bash

       - name: Update dbt state cache
-        uses: actions/cache/save@v3
+        uses: ./.github/actions/save_dbt_cache
         with:
           path: ${{ env.PROJECT_DIR }}/${{ env.TARGET_DIR }}
-          key: ${{ env.CACHE_NAME }}-${{ env.CACHE_KEY }}
+          key: ${{ env.CACHE_KEY }}