Rename tag for iasWorld data tests and refactor script to run tests (#…

…571) * Update README.md to clarify role of dbt generics * Make some updates to dbt/README.md to sketch out QC doc structure * Add link to QC tests and reports section to root README.md * Continue fleshing out dbt test docs in dbt/README.md * Clean up data testing section of dbt/README.md * Update dbt/README.md to add docs for running QC reports * Finish fleshing out docs on QC reports in dbt/README.md * Move transform_dbt_test_results.py script to dbt/scripts/ subdirectory * Fix small issues in dbt/README.md * Clean up some trailing whitespace in dbt/README.md * Run pip install step in dbt/ subdir in test_dbt_models workflow * Clarify bug described in unit tests section of dbt/README.md * Clarify --rebuild docs in dbt/README.md * Small clarification in dbt/README.md * Apply a few small nitpicks from Dan's code review to dbt/README.md Co-authored-by: Dan Snow <31494343+dfsnow@users.noreply.github.com> * Standardize on "Valuations staff" in dbt/README.md * Use relative paths in dbt/README.md * Refactor meta.export_template option to not expect file extensions for export_models script * Refactor docs to use clearer terminology for iasWorld data tests * Rename tag for iasWorld data tests and refactor script to run tests * Small tweak to dbt/README.md * Better error handling for dbt invocation in run_iasworld_data_tests.py * Rename `data_test_iasworld` selector -> `select_data_test_iasworld` * Add --target flag to run_iasworld_data_tests script --------- Co-authored-by: Dan Snow <31494343+dfsnow@users.noreply.github.com>
ccao-data · Aug 19, 2024 · 348d7ec · 348d7ec
1 parent bd0c7bc
commit 348d7ec
Show file tree

Hide file tree

Showing 49 changed files with 1,799 additions and 296 deletions.
diff --git a/.github/workflows/build_and_test_dbt.yaml b/.github/workflows/build_and_test_dbt.yaml
@@ -84,14 +84,14 @@ jobs:
           if [[ $MODIFIED_RESOURCES_ONLY == 'true' ]]; then
             if [[ $MANUALLY_DISPATCHED == 'true' ]]; then
               echo "Running tests on manually selected resources"
-              dbt test -t "$TARGET" -s ${{ inputs.models }} --exclude "tag:test_qc*" --defer --state "$STATE_DIR"
+              dbt test -t "$TARGET" -s ${{ inputs.models }} --exclude "tag:data_test_iasworld" --defer --state "$STATE_DIR"
             else
               echo "Running tests on modified/new resources only"
-              dbt test -t "$TARGET" -s state:modified state:new --exclude "tag:test_qc*" --defer --state "$STATE_DIR"
+              dbt test -t "$TARGET" -s state:modified state:new --exclude "tag:data_test_iasworld" --defer --state "$STATE_DIR"
             fi
           else
             echo "Running tests on all resources"
-            dbt test -t "$TARGET" --exclude "tag:test_qc*"
+            dbt test -t "$TARGET" --exclude "tag:data_test_iasworld"
           fi
         working-directory: ${{ env.PROJECT_DIR }}
         shell: bash

diff --git a/.github/workflows/test_dbt_models.yaml b/.github/workflows/test_dbt_models.yaml
@@ -23,37 +23,14 @@ jobs:
           role-to-assume: ${{ secrets.AWS_IAM_ROLE_TO_ASSUME_ARN }}
 
       - name: Install Python requirements
-        run: pip install -r scripts/requirements.transform_dbt_test_results.txt
+        run: pip install -r scripts/requirements.run_iasworld_data_tests.txt
         working-directory: ${{ env.PROJECT_DIR }}
         shell: bash
 
       - name: Run tests
         run: |
-          # dbt doesn't differentiate between test failures and errors, but we
-          # need to since we expect failures. Do this by capturing the output
-          # and checking its contents to look for errors
-          if output=$(dbt test --target "$TARGET" --selector qc_tests --store-failures); then
-            echo "$output"
-          else
-            status_code="$?"
-            if [[ "$output" =~ "Runtime Error" || "$output" =~ "Compilation Error" ]]; then
-              # The presence of an error string indicates that these tests
-              # failed due to an error, so print the output and fail the
-              # pipeline
-              echo "$output"
-              exit "$status_code"
-            else
-              # The tests must have failed rather than errored out, so
-              # print the output but don't raise an error
-              echo "$output"
-            fi
-          fi
-        working-directory: ${{ env.PROJECT_DIR }}
-        shell: bash
-
-      - name: Extract test results
-        run: |
-          python3 scripts/transform_dbt_test_results.py \
+          python3 scripts/run_iasworld_data_tests.py \
+            --target "$TARGET" \
             --output-dir ./qc_test_results/
         working-directory: ${{ env.PROJECT_DIR }}
         shell: bash

diff --git a/dbt/README.md b/dbt/README.md
@@ -541,14 +541,9 @@ built for iasWorld data tests.
 
 #### Running iasWorld data tests
 
-The iasWorld data test suite can be run using the [`dbt test`
-command](https://docs.getdbt.com/reference/commands/test) with a dedicated
-[selector](https://docs.getdbt.com/reference/node-selection/yaml-selectors)
-and the [`--store-failures`
-flag](https://docs.getdbt.com/reference/resource-configs/store_failures),
-and its output can be transformed for review and analysis using the
-[`transform_dbt_test_results` script](./scripts/transform_dbt_test_results.py).
-This script reads the metadata for the most recent `dbt test` run and outputs a number of
+The iasWorld data test suite can be run using the [`run_iasworld_data_tests`
+script](./scripts/run_iasworld_data_tests.py).
+This script runs the tests and reads the metadata for the run to output a number of
 different artifacts with information about the tests:
 
 * An Excel workbook with detailed information on each failure to aid in resolving
@@ -572,25 +567,19 @@ Typically, Valuations staff will ask for test output for a specific township. We
 [township code](https://github.com/ccao-data/wiki/blob/master/Data/Townships.md) for this township
 using the bash variable `$TOWNSHIP_CODE`.
 
-First, run the tests locally using dbt and the [iasWorld data test
-selector](./selectors.yml):
+Run the tests locally using the [`run_iasworld_data_tests`
+script](./scripts/run_iasworld_data_tests.yml):
 
 ```bash
 # Make sure you're in the dbt subdirectory with the virtualenv activated
 cd dbt
 source venv/bin/activate
 
-# Run the tests and store failures in Athena
-dbt test --selector qc_tests --store-failures
+# Run the script
+python3 scripts/run_iasworld_data_tests.py --township $TOWNSHIP_CODE
 ```
 
-Next, transform the results for the township that Valuations staff requested:
-
-```bash
-python3 scripts/transform_dbt_test_results.py --township $TOWNSHIP_CODE
-```
-
-Finally, spot check the Excel workbook that the script produced to make sure it's formatted
+Then, check the Excel workbook that the script produced to make sure it's formatted
 correctly, and send it to Valuations staff for review.
 
 #### Adding iasWorld data tests
@@ -601,14 +590,14 @@ by the script:
 
 * One of either the test or the model that the test is defined on must be
 [tagged](https://docs.getdbt.com/reference/resource-configs/tags) with
-the tag `test_qc_iasworld`
+the tag `data_test_iasworld`
   * Prefer tagging the model, and fall back to tagging the test if for
     some reason the model cannot be tagged (e.g. if it has some non-QC
     tests defined on it)
   * If you would like to disable a data test but you don't want to remove it
-    altogether, you can tag it or its model with `test_qc_exclude_from_workbook`,
+    altogether, you can tag it or its model with `data_test_iasworld_exclude_from_workbook`,
     which will prevent the test (or all of the model's tests, if you tagged
-    the model) from running as part of the `qc_tests` selector
+    the model) from running as part of the `select_data_test_iasworld` selector
 * The test definition must supply a few specific parameters:
   * `name` must be set and follow the pattern
     `iasworld_<table_name>_<test_description>`
@@ -622,14 +611,14 @@ the tag `test_qc_iasworld`
       test you're using
   * `config.where` should typically set to provide a filter expression
     that restricts tests to unique rows and to rows matching a date range
-    set by the `test_qc_year_start` and `test_qc_year_end`
+    set by the `data_test_iasworld_year_start` and `data_test_iasworld_year_end`
     [project variables](https://docs.getdbt.com/docs/build/project-variables)
   * `meta` should be set with a few specific string attributes:
     * `description` (required): A short human-readable description of the test
     * `category` (optional): A workbook category for the test, required if
       a category is not defined for the test's generic in the `TEST_CATEGORIES`
-      constant in the [`transform_dbt_test_results`
-      script](./scripts/transform_dbt_test_results.py)
+      constant in the [`run_iasworld_data_tests`
+      script](./scripts/run_iasworld_data_tests.py)
     * `table_name` (optional): The name of the table to report in the output
       workbook, if the workbook should report a different table name than the
       name of the model that the test is defined on
@@ -674,8 +663,8 @@ do so, you have two options:
    and other tests. You'll also need to follow a few extra steps that are specific
    to our environment:
      1. Add a default category for your generic test in
-        the `TEST_CATEGORIES` constant in the [`transform_dbt_test_results`
-        script](./scripts/transform_dbt_test_results.py)
+        the `TEST_CATEGORIES` constant in the [`run_iasworld_data_tests`
+        script](./scripts/run_iasworld_data_tests.py)
      2. Make sure that your generic test supports the `additional_select_columns`
         parameter that most of our generic tests support, making use
         of the `format_additional_select_columns` macro to format the
@@ -767,8 +756,7 @@ model during export:
   reports at the same time, so we tag each model with the `qc_report_town_close` tag such that
   we can select them all at once when running the `export_models` script using
   `--select tag:qc_report_town_close`. For consistency, prefer tags that start with the `qc_report_*`
-  prefix, but beware not to use the `test_qc_*` prefix, which is instead used for [QC
-  tests](#adding-qc-tests).
+  prefix.
 * **Filtering**: Since the `export_models` script can filter your model using the `--where`
   option, you should define your model such that it selects any fields that you want to use
   for filtering in the `SELECT` clause. It's common to filter reports by `taxyr` and

diff --git a/dbt/dbt_project.yml b/dbt/dbt_project.yml
@@ -22,19 +22,18 @@ clean-targets:
   - "dbt_packages"
 
 # Variables that can change with invocation of dbt commands. Note that we parse
-# the test_* variables by name in scripts, so if you change their names, make
-# sure you check all of our scripts and adjust them accordingly
+# the data_test_iasworld_* variables by name in scripts, so if you change their
+# names, make sure you check all of our scripts and adjust them accordingly
 vars:
-  # Start year for testing data using test_qc_* tagged dbt tests. Typically
-  # this should be the current year, since errors in past data cannot usually
-  # be amended once records are closed. Set as an integer for compatibility
-  # with comparison operators and SQL's BETWEEN
-  test_qc_year_start: 2024
+  # Start year for iasWorld data tests. Typically this should be the current
+  # year, since errors in past data cannot usually be amended once records are
+  # closed. Set as an integer for compatibility with comparison operators and
+  # SQL's BETWEEN
+  data_test_iasworld_year_start: 2024
 
-  # End year for testing data using test_qc_* tagged dbt tests. Typically set
-  # to a date in the future, but can also be use to select specific time
-  # frames for testing
-  test_qc_year_end: 2030
+  # End year for iasWorld data tests. Typically set to a date in the future,
+  # but can also be use to select specific time frames for testing
+  data_test_iasworld_year_end: 2030
 
 # Configuring models
 # Full documentation: https://docs.getdbt.com/docs/configuring-models

diff --git a/dbt/models/default/schema/default.vw_pin_address.yml b/dbt/models/default/schema/default.vw_pin_address.yml
@@ -64,7 +64,7 @@ models:
             - mail_address_zipcode_1
             - mail_address_zipcode_2
           config:
-            where: CAST(year AS int) BETWEEN {{ var('test_qc_year_start') }} AND {{ var('test_qc_year_end') }}
+            where: CAST(year AS int) BETWEEN {{ var('data_test_iasworld_year_start') }} AND {{ var('data_test_iasworld_year_end') }}
             error_if: ">900000"
       - unique_combination_of_columns:
           name: default_vw_pin_address_unique_by_14_digit_pin_and_year

diff --git a/dbt/models/default/schema/default.vw_pin_appeal.yml b/dbt/models/default/schema/default.vw_pin_appeal.yml
@@ -91,7 +91,7 @@ models:
             - mailed_tot
             - certified_tot
           config:
-            where: CAST(year AS int) BETWEEN {{ var('test_qc_year_start') }} AND {{ var('test_qc_year_end') }}
+            where: CAST(year AS int) BETWEEN {{ var('data_test_iasworld_year_start') }} AND {{ var('data_test_iasworld_year_end') }}
             error_if: ">266719"
       # `change` should be an enum
       - expression_is_true:
@@ -103,7 +103,7 @@ models:
             - case_no
             - change
           config:
-            where: CAST(year AS int) BETWEEN {{ var('test_qc_year_start') }} AND {{ var('test_qc_year_end') }}
+            where: CAST(year AS int) BETWEEN {{ var('data_test_iasworld_year_start') }} AND {{ var('data_test_iasworld_year_end') }}
       - row_count:
           name: default_vw_pin_appeal_row_count
           above: 8407667 # as of 2023-11-22

diff --git a/dbt/models/default/schema/default.vw_pin_value.yml b/dbt/models/default/schema/default.vw_pin_value.yml
@@ -80,59 +80,59 @@ models:
           name: default_vw_pin_value_mailed_class_not_null
           column_name: mailed_class
           config:
-            where: CAST(year AS int) < {{ var('test_qc_year_start') }}
+            where: CAST(year AS int) < {{ var('data_test_iasworld_year_start') }}
             error_if: ">289"
       - not_null:
           name: default_vw_pin_value_certified_class_not_null
           column_name: certified_class
           config:
-            where: CAST(year AS int) < {{ var('test_qc_year_start') }}
+            where: CAST(year AS int) < {{ var('data_test_iasworld_year_start') }}
             error_if: ">15"
       - not_null:
           name: default_vw_pin_value_board_class_not_null
           column_name: board_class
           config:
-            where: CAST(year AS int) < {{ var('test_qc_year_start') }} - 1
+            where: CAST(year AS int) < {{ var('data_test_iasworld_year_start') }} - 1
             error_if: ">1260"
       - not_null:
           name: default_vw_pin_value_mailed_tot_not_null
           column_name: mailed_tot
           config:
-            where: CAST(year AS int) < {{ var('test_qc_year_start') }}
+            where: CAST(year AS int) < {{ var('data_test_iasworld_year_start') }}
             error_if: ">310"
       - not_null:
           name: default_vw_pin_value_certified_tot_not_null
           column_name: certified_tot
           config:
-            where: CAST(year AS int) < {{ var('test_qc_year_start') }}
+            where: CAST(year AS int) < {{ var('data_test_iasworld_year_start') }}
             error_if: ">15"
       - not_null:
           name: default_vw_pin_value_board_tot_not_null
           column_name: board_tot
           config:
-            where: CAST(year AS int) < {{ var('test_qc_year_start') }} - 1
+            where: CAST(year AS int) < {{ var('data_test_iasworld_year_start') }} - 1
             error_if: ">1260"
       - not_null:
           name: default_vw_pin_value_mailed_tot_mv_not_null
           column_name: mailed_tot_mv
           config:
             where: |
-              CAST(year AS int) < {{ var('test_qc_year_start') }} AND
+              CAST(year AS int) < {{ var('data_test_iasworld_year_start') }} AND
               year >= '2021'
             error_if: ">310"
       - not_null:
           name: default_vw_pin_value_certified_tot_mv_not_null
           column_name: certified_tot_mv
           config:
             where: |
-              CAST(year AS int) < {{ var('test_qc_year_start') }} AND
+              CAST(year AS int) < {{ var('data_test_iasworld_year_start') }} AND
               year >= '2021'
             error_if: ">15"
       - not_null:
           name: default_vw_pin_value_board_tot_mv_not_null
           column_name: board_tot_mv
           config:
             where: |
-              CAST(year AS int) < {{ var('test_qc_year_start') }} - 1 AND
+              CAST(year AS int) < {{ var('data_test_iasworld_year_start') }} - 1 AND
               year >= '2020'
             error_if: ">1260"
diff --git a/dbt/models/iasworld/schema/iasworld.aasysjur.yml b/dbt/models/iasworld/schema/iasworld.aasysjur.yml
@@ -3,7 +3,7 @@ sources:
     loaded_at_field: date_parse(loaded_at, '%Y-%m-%d %H:%i:%S.%f')
     tags:
       - load_auto
-      - test_qc_iasworld
+      - data_test_iasworld
 
     tables:
       - name: aasysjur

diff --git a/dbt/models/iasworld/schema/iasworld.addn.yml b/dbt/models/iasworld/schema/iasworld.addn.yml
@@ -3,7 +3,7 @@ sources:
     loaded_at_field: date_parse(loaded_at, '%Y-%m-%d %H:%i:%S.%f')
     tags:
       - load_auto
-      - test_qc_iasworld
+      - data_test_iasworld
       - type_res
 
     tables:
@@ -30,7 +30,7 @@ sources:
                     - wen
                   config: &unique-conditions
                     where: |
-                      CAST(taxyr AS int) BETWEEN {{ var('test_qc_year_start') }} AND {{ var('test_qc_year_end') }}
+                      CAST(taxyr AS int) BETWEEN {{ var('data_test_iasworld_year_start') }} AND {{ var('data_test_iasworld_year_end') }}
                       AND cur = 'Y'
                       AND deactivat IS NULL
                   meta:
@@ -75,7 +75,7 @@ sources:
                   additional_select_columns: *select-columns
                   config:
                     where: |
-                      CAST(taxyr AS int) BETWEEN {{ var('test_qc_year_start') }} AND {{ var('test_qc_year_end') }}
+                      CAST(taxyr AS int) BETWEEN {{ var('data_test_iasworld_year_start') }} AND {{ var('data_test_iasworld_year_end') }}
                       AND class != 'EX'
                       AND cur = 'Y'
                       AND deactivat IS NULL
@@ -93,7 +93,7 @@ sources:
                   additional_select_columns: *select-columns
                   config:
                     where: |
-                      CAST(taxyr AS int) BETWEEN {{ var('test_qc_year_start') }} AND {{ var('test_qc_year_end') }}
+                      CAST(taxyr AS int) BETWEEN {{ var('data_test_iasworld_year_start') }} AND {{ var('data_test_iasworld_year_end') }}
                   meta:
                     description: cur should be 'Y' or 'D'
           - name: deactivat

diff --git a/dbt/models/iasworld/schema/iasworld.addrindx.yml b/dbt/models/iasworld/schema/iasworld.addrindx.yml
@@ -3,7 +3,7 @@ sources:
     loaded_at_field: date_parse(loaded_at, '%Y-%m-%d %H:%i:%S.%f')
     tags:
       - load_auto
-      - test_qc_iasworld
+      - data_test_iasworld
 
     tables:
       - name: addrindx

diff --git a/dbt/models/iasworld/schema/iasworld.aprval.yml b/dbt/models/iasworld/schema/iasworld.aprval.yml
@@ -3,7 +3,7 @@ sources:
     loaded_at_field: date_parse(loaded_at, '%Y-%m-%d %H:%i:%S.%f')
     tags:
       - load_auto
-      - test_qc_iasworld
+      - data_test_iasworld
 
     tables:
       - name: aprval
@@ -34,7 +34,7 @@ sources:
                     - wen
                   config: &unique-conditions
                     where: |
-                      CAST(taxyr AS int) BETWEEN {{ var('test_qc_year_start') }} AND {{ var('test_qc_year_end') }}
+                      CAST(taxyr AS int) BETWEEN {{ var('data_test_iasworld_year_start') }} AND {{ var('data_test_iasworld_year_end') }}
                       AND cur = 'Y'
                       AND deactivat IS NULL
                   meta:
@@ -121,7 +121,7 @@ sources:
                   additional_select_columns: *select-columns
                   config:
                     where: |
-                      CAST(taxyr AS int) BETWEEN {{ var('test_qc_year_start') }} AND {{ var('test_qc_year_end') }}
+                      CAST(taxyr AS int) BETWEEN {{ var('data_test_iasworld_year_start') }} AND {{ var('data_test_iasworld_year_end') }}
                   meta:
                     description: cur should be 'Y' or 'D'
           - name: deactivat