
Move dbt cache to remote S3 storage #90

Merged: 10 commits into master, Aug 25, 2023

Conversation

@jeancochrane (Contributor) commented Aug 22, 2023

This PR adjusts the way that we store our dbt state caches to get around the fact that GitHub Actions caches cannot be updated once they've been created. By moving our state caches to S3, we can ensure that the cache always gets updated at the end of every successful run of the build-and-test-dbt workflow, which will allow subsequent runs to only build and test resources that have been modified since the last successful run.

See this run for evidence of the remote cache being used successfully.

Closes #89.
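The overall workflow shape can be sketched as follows (the action paths and key names here are hypothetical; the PR's actual workflow defines its own):

```yaml
# Hypothetical sketch of the build-and-test-dbt workflow: restore the dbt
# state cache from S3 up front, then save it back after a successful run so
# the cache is refreshed every time (unlike GitHub Actions caches, which
# are immutable once created).
jobs:
  build-and-test-dbt:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Restore dbt cache from S3
        uses: ./.github/actions/restore_dbt_cache
        with:
          key: dbt-cache-${{ github.head_ref }}
          restore-key: dbt-cache-master
      # ... build and test modified dbt resources ...
      - name: Save dbt cache to S3
        if: success()
        uses: ./.github/actions/save_dbt_cache
        with:
          key: dbt-cache-${{ github.head_ref }}
```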

@jeancochrane jeancochrane linked an issue Aug 22, 2023 that may be closed by this pull request
@jeancochrane jeancochrane force-pushed the jeancochrane/89-fix-broken-cache-restoration branch from e89d955 to 635fcb3 Compare August 22, 2023 19:31
@jeancochrane jeancochrane force-pushed the jeancochrane/89-fix-broken-cache-restoration branch from 1c792cf to 62179f1 Compare August 24, 2023 19:52
@jeancochrane jeancochrane changed the title Fix broken dbt cache restoration by saving cache to the same path Move dbt cache to remote S3 storage Aug 24, 2023
key:
description: The cache key to query for.
required: true
restore-key:
@jeancochrane (Contributor, Author):

The GitHub Action that this action is based on accepts restore-keys as a newline-delimited list, but for our purposes I only ever expect we'll need to use one fallback (for the master cache), so we gain some simplicity by switching to accepting a string value.
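Under that simplification, the action's inputs might read as follows (the restore-key description is hypothetical):

```yaml
# Sketch of the composite action's inputs: restore-key is a single string
# fallback rather than actions/cache's newline-delimited restore-keys list.
inputs:
  key:
    description: The cache key to query for.
    required: true
  restore-key:
    description: A single fallback key (e.g. the master cache) to try on a miss.
    required: false
```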

@dfsnow (Member) replied:
Makes sense, I'm always in favor of more simplicity.

bucket:
description: The S3 bucket that should store the cache.
required: false
default: ccao-dbt-cache-us-east-1
@jeancochrane (Contributor, Author) commented Aug 24, 2023:

I don't think it should be a problem to have this bucket name be public; the bucket itself is not public, and there shouldn't be anything sensitive in the dbt manifest file even if someone did manage to force their way in.

@dfsnow (Member) replied:
thought: Agreed. No problem here and it tracks with how we've treated other buckets/bucket names.

- name: Check for a cache key match
id: query-cache-keys
run: |
if aws s3api head-object --bucket "$BUCKET" --key "$KEY/manifest.json"; then
@jeancochrane (Contributor, Author):

Note that this simple if ... then conditional treats any nonzero exit status from aws s3api head-object as a cache miss, not just the expected 404. I think this is acceptable, since we want to continue without the cache in the case of an unexpected error (e.g. a 403 Forbidden due to bad credentials), and testing confirms that the aws s3api head-object call prints the error to the console in these cases, so we should be able to debug them:

An error occurred (403) when calling the HeadObject operation: Forbidden
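This check can be sketched as a small shell function (the function name is hypothetical; BUCKET, KEY, and GITHUB_OUTPUT are assumed to be provided by the workflow environment):

```shell
# Sketch of the cache-key check: any head-object failure (404 missing key,
# 403 bad credentials, network error) falls through to cache-hit=false, so
# the workflow simply proceeds without the cache.
check_cache() {
  if aws s3api head-object --bucket "$BUCKET" --key "$KEY/manifest.json"; then
    echo "cache-hit=true" >> "$GITHUB_OUTPUT"
  else
    echo "cache-hit=false" >> "$GITHUB_OUTPUT"
  fi
}
```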

@dfsnow (Member) replied:

praise: Nice, this is perfect. Super nice simplifications all around.

# In case of a cache hit, copy the contents of the cache to the path
- if: steps.query-cache-keys.outputs.cache-hit == 'true'
name: Copy cache to path
run: aws s3 cp "s3://$BUCKET/$KEY/manifest.json" "$CACHE_PATH/manifest.json"
@jeancochrane (Contributor, Author):

In order to support listing, pushing, and pulling objects from this bucket, I adjusted the policy for the IAM role that the GitHub Actions runner assumes to give it limited read/write access to the ccao-dbt-cache-us-east-1 bucket.
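The PR doesn't reproduce the policy itself; a minimal sketch of what limited read/write access scoped to this one bucket might look like (the specific actions are assumptions, only the bucket name comes from the discussion):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::ccao-dbt-cache-us-east-1"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::ccao-dbt-cache-us-east-1/*"
    }
  ]
}
```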

using: composite
steps:
- name: Save dbt cache
run: aws s3 cp "$CACHE_PATH/manifest.json" "s3://$BUCKET/$KEY/manifest.json"
@jeancochrane (Contributor, Author):

This remote storage naming mostly works well, but in pull requests where the branch name includes a slash, it results in some potentially confusing nested folder behavior:

upload: dbt/target/manifest.json to s3://ccao-dbt-cache-us-east-1/dbt-cache-jeancochrane/89-fix-broken-cache-restoration/manifest.json

My guess is that this is probably fine, since the cache bucket will mostly be used by bots and it might even be nice to have higher-level organization (e.g. anything in the dbt-cache-jeancochrane subfolder likely corresponds to a cache for a branch that I authored).
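The nesting happens because S3 displays "/" in object keys as a folder hierarchy; a quick illustration using the branch from this PR:

```shell
# A branch name containing a slash produces a nested "folder" in S3,
# since "/" in an object key is rendered as a hierarchy separator.
BRANCH="jeancochrane/89-fix-broken-cache-restoration"
KEY="dbt-cache-${BRANCH}"
echo "s3://ccao-dbt-cache-us-east-1/${KEY}/manifest.json"
# → s3://ccao-dbt-cache-us-east-1/dbt-cache-jeancochrane/89-fix-broken-cache-restoration/manifest.json
```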

I also wonder if we want to add a lifecycle configuration to this bucket? It'll need a bit of research, since we'll want the dbt-cache-master subfolder to persist indefinitely while everything else can be deleted after a period of inactivity. Curious what you think about all this!

@dfsnow (Member) replied:

IMO this is totally fine, since none of us are likely to use this bucket manually.

suggestion (non-blocking): Let's definitely set some lifecycle rules though just to keep things tidy (likewise on the dbt Athena results buckets if there aren't any already 👀). IMO 60 days to mark all non-master keys for deletion with another 30 days to delete is sufficient. Let me know if you want to set it up or if I should grab it.

@jeancochrane (Contributor, Author) commented Aug 25, 2023:

Cool! I had to adjust the cache keys to create a consistent prefix for PR caches that we could use in the lifecycle policy (pr-caches/), but I did that in 74031b6 and added a policy. Does this look right to you? The difference between expiration and deletion is a bit over my head, so perhaps it needs a little tweaking:

[Screenshot: lifecycle rule configuration]

Plus, some proof that the policy works for PR runs:

"Expiration": "expiry-date=\"Mon, 25 Sep 2023 00:00:00 GMT\", rule-id=\"delete-old-pr-caches\"",
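The configured rule isn't reproduced in the thread, but a lifecycle configuration consistent with the pr-caches/ prefix and the delete-old-pr-caches rule ID in that output might look roughly like this (the 30-day expiration window is inferred from the expiry date above and is an assumption):

```json
{
  "Rules": [
    {
      "ID": "delete-old-pr-caches",
      "Status": "Enabled",
      "Filter": { "Prefix": "pr-caches/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```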

Luckily, dbt-athena creates the query results bucket with an existing lifecycle policy that deletes all query results after 3 days -- nice!

[Screenshot: query results bucket lifecycle policy, Aug 25, 2023]

@dfsnow (Member) commented Aug 25, 2023:

@jeancochrane That policy looks good to me!

suggestion (non-blocking): ccao-athena-results-us-east-1 is the bucket I set up for general Athena output; I also set up that lifecycle rule. I was talking about ccao-dbt-athena-ci-us-east-1 and ccao-dbt-athena-test-us-east-1, neither of which has lifecycle rules. IMO we should probably set up lifecycle rules for those as well. I think we're good to merge this though!

@jeancochrane (Contributor, Author):

Oops, my bad @dfsnow, I totally misread the bucket name 🤦🏻‍♀️ I went ahead and copied that policy to ccao-dbt-athena-ci-us-east-1 and ccao-dbt-athena-test-us-east-1!

steps.cache.outputs.cache-matched-key == format(
'{0}-master', env.CACHE_NAME
)
- if: steps.cache.outputs.cache-hit == 'true'
@jeancochrane (Contributor, Author):

One nice benefit of this change is that we can implement a definition of cache-hit that makes more sense to me (i.e. it's still a hit even if we restore from the fallback cache)!
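A sketch of how the composite action could surface this broader definition (the step and output names are assumptions based on the snippets in this PR):

```yaml
# Hypothetical outputs block for the restore action: cache-hit is true on
# either a primary-key or restore-key match, and cache-matched-key reports
# which key was actually restored so workflows can tell the two cases apart.
outputs:
  cache-hit:
    description: "'true' if either the key or the restore-key matched."
    value: ${{ steps.query-cache-keys.outputs.cache-hit }}
  cache-matched-key:
    description: The key that actually matched, if any.
    value: ${{ steps.query-cache-keys.outputs.cache-matched-key }}
```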

@jeancochrane jeancochrane marked this pull request as ready for review August 24, 2023 22:35
@jeancochrane jeancochrane requested a review from a team as a code owner August 24, 2023 22:35
@jeancochrane jeancochrane requested review from dfsnow and wagnerlmichael and removed request for wagnerlmichael August 24, 2023 22:35
@dfsnow (Member) left a review:

Nice work @jeancochrane! Love that this actually simplifies things in many ways and that we can now more easily fetch the manifest for local testing if needed. Merge at your leisure!


@jeancochrane jeancochrane force-pushed the jeancochrane/89-fix-broken-cache-restoration branch from 5beba2e to 74031b6 Compare August 25, 2023 18:50
@jeancochrane jeancochrane merged commit c3e8922 into master Aug 25, 2023
3 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/89-fix-broken-cache-restoration branch August 25, 2023 20:57
Linked issue that this pull request may close: Fix broken cache restoration (#89). 2 participants.