Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conditional uniqueness tests for iasWorld tables #98

Merged
Show file tree
Hide file tree
Changes from 12 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 35 additions & 3 deletions dbt/models/iasworld/schema.yml
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,20 @@ sources:
filter: &latest_taxyr taxyr >= date_format(current_date - interval '1' year, '%Y')
warn_after: &24_hours {count: 24, period: hour}
error_after: &48_hours {count: 48, period: hour}
tests:
- unique_combination_of_columns:
name: asmt_all_unique_by_parid_procname_and_taxyr
combination_of_columns:
- parid
- procname
- taxyr
where: >-
cur = 'Y' and
deactivat is null and
procname in ('CCAOVALUE', 'CCAOFINAL', 'BORVALUE') and
valclass is null
config:
error_if: ">125"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought (dreadful): It's going to be so much fun to figure out what all these errors are...

- name: asmt_hist
- name: cname
- name: comdat
Expand Down Expand Up @@ -73,6 +87,17 @@ sources:
filter: *latest_taxyr
warn_after: *24_hours
error_after: *48_hours
tests:
- unique_combination_of_columns:
name: htpar_unique_by_parid_caseno_taxyr_subkey
combination_of_columns:
- parid
- caseno
- taxyr
- subkey
where: cur = 'Y' and deactivat is null
config:
error_if: ">2"
- name: land
description: '{{ doc("land") }}'
tests:
Expand All @@ -89,7 +114,7 @@ sources:
name: legdat_unique_by_parid_taxyr
combination_of_columns:
- parid
- taxyr
- taxyr
- name: lpmod
- name: lpnbhd
- name: oby
Expand All @@ -109,21 +134,28 @@ sources:
name: owndat_unique_by_parid_taxyr
combination_of_columns:
- parid
- taxyr
- taxyr
- name: pardat
description: '{{ doc("pardat") }}'
tests:
- unique_combination_of_columns:
name: pardat_unique_by_parid_taxyr
combination_of_columns:
- parid
- taxyr
- taxyr
- name: permit
freshness:
filter: date_format(date_parse(permdt, '%Y-%m-%d %H:%i:%s.0'), '%Y') >= date_format(current_date - interval '1' year, '%Y')
warn_after: *48_hours
error_after: &72_hours {count: 72, period: hour}
- name: rcoby
- name: sales
tests:
- unique_combination_of_columns:
name: sales_unique_by_parid_instruno
combination_of_columns:
- parid
- instruno
where: substr(saledt, 1, 4) >= '2023'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (non-blocking): This conditional hopefully shouldn't be necessary to make the test pass (sales should be unique by document number). If it is, we may need to borrow some of the more complex logic of the sales view, i.e:

WHERE sales.instruno IS NOT NULL
    AND sales.deactivat IS NULL
    AND tc.township_code IS NOT NULL

@wrridgeway Any thoughts here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the data context here, but I wanted to confirm that the test doesn't pass if you remove the where clause:

18:07:39  Failure in test sales_unique_by_parid_instruno (models/iasworld/schema.yml)
18:07:39    Got 2133988 results, configured to fail if != 0      

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still curious what @wrridgeway thinks, but I'm going to go ahead and merge in the meantime to unblock test development.

Copy link
Member

@wrridgeway wrridgeway Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sales, yeah, we can just do unique by doc_no, but we have to include is_multisale = FALSE as well. we should get two rows that fail. sorry it took so long to respond @jeancochrane , my laptop was dead all day.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wrridgeway Does that mean we should remove this test and replace it with one like this?

- name: sales
  tests:
    - unique_combination_of_columns:
      name: sales_unique_by_doc_no
      combination_of_columns:
        - doc_no
      config:
        where: is_multisale = false
        error_if: ">2

Copy link
Member

@dfsnow dfsnow Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wrridgeway You're thinking of vw_pin_sale. This is for iasworld.sales.

Copy link
Member

@wrridgeway wrridgeway Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is true, but in that case, we do not expect iasworld.sales to be unique by doc_no. the test i provided jean and which she implemented is the best i can do currently for uniqueness in that table. it is encouraging that for 2023 things are unique by parid and instruno - that's a HUGE improvement. it's also worth noting that i don't think we should ever expect the sales table to be unique only by instruno since multi-pin sales are legitimate.

- name: splcom
- name: valclass
14 changes: 9 additions & 5 deletions dbt/tests/generic/test_unique_combination_of_columns.sql
Original file line number Diff line number Diff line change
@@ -1,22 +1,26 @@
-- Test that a given set of columns are unique, with an optional
-- threshold indicating an allowable number of duplicates.
-- threshold indicating an allowable number of duplicates and an optional
-- filter clause to restrict the set of rows that should be unique.
--
-- For example, test that a given PIN has been sold no more than
-- twice in one year.
-- twice in one year, for only active rows.
--
-- The duplicate threshold defaults to 1, in which case this is a standard
-- uniqueness test.
-- uniqueness test. The where clause defaults to null, which indicates
-- that the full set of rows should be unique on the given columns.
--
-- Adapted from dbt_utils.unique_combination_of_columns, and adjusted to add the
-- optional duplicate threshold and to only report one row for each dupe.
-- optional duplicate threshold, to add the optional where clause, and to only
-- report one row for each dupe.
{% test unique_combination_of_columns(
model, combination_of_columns, allowed_duplicates=0
model, combination_of_columns, allowed_duplicates=0, where=null
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is technically a reserved keyword in SQL, but it doesn't seem to cause problems when we use it as a jinja variable, and I like the simplicity of the interface that it allows for!

) %}

{%- set columns_csv = combination_of_columns | join(", ") %}

select {{ columns_csv }}, count(*) as num_duplicates
from {{ model }}
{% if where %} where ({{ where }}) {% endif %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is really funny if you don't know what's going on.

question (non-blocking): Am I correct in my understanding that dbt/jinja interpret any non-null value as truthy here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true! I would be on board for changing it to if where is not null, but I ended up switching to using the builtin config.where filter in 8e72252 so I think the question is moot.

group by {{ columns_csv }}
having count(*) > {{ allowed_duplicates }} + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: We may run into lots of cases where this dupe test triggers due to many null values, while the actual values are unique. This conditional will be especially useful in those cases.


Expand Down
Loading