-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add conditional uniqueness tests for iasWorld tables #98
Add conditional uniqueness tests for iasWorld tables #98
Conversation
…s for testing" This reverts commit 913db16.
…for-iasworld-tables
@@ -10,7 +10,7 @@ | |||
-- Adapted from dbt_utils.unique_combination_of_columns, and adjusted to add the | |||
-- optional duplicate threshold and to only report one row for each dupe. | |||
{% test unique_combination_of_columns( | |||
model, combination_of_columns, allowed_duplicates=0 | |||
model, combination_of_columns, allowed_duplicates=0, where=null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where
is technically a reserved keyword in SQL, but it doesn't seem to cause problems when we use it as a jinja variable, and I like the simplicity of the interface that it allows for!
Oops, converting back to a draft since I think I found a syntax error! |
Alright, we should be good now 👍🏻 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @jeancochrane! This will be super useful very quickly as we start building out more complex tests.
question (non-blocking): My only question is how does this relate to the existing where test macro. I assume we can't/shouldn't use that functionality in this case?
) %} | ||
|
||
{%- set columns_csv = combination_of_columns | join(", ") %} | ||
|
||
select {{ columns_csv }}, count(*) as num_duplicates | ||
from {{ model }} | ||
{% if where %} where ({{ where }}) {% endif %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is really funny if you don't know what's going on.
question (non-blocking): Am I correct in my understanding that dbt/jinja interpret any non-null value as truthy here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true! I would be on board for changing it to if where is not null
, but I ended up switching to using the builtin config.where
filter in 8e72252 so I think the question is moot.
dbt/models/iasworld/schema.yml
Outdated
combination_of_columns: | ||
- parid | ||
- instruno | ||
where: substr(saledt, 1, 4) >= '2023' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (non-blocking): This conditional hopefully shouldn't be necessary to make the test pass (sales should be unique by document number). If it is, we may need to borrow some of the more complex logic of the sales view, i.e:
WHERE sales.instruno IS NOT NULL
AND sales.deactivat IS NULL
AND tc.township_code IS NOT NULL
@wrridgeway Any thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand the data context here, but I wanted to confirm that the test doesn't pass if you remove the where
clause:
18:07:39 Failure in test sales_unique_by_parid_instruno (models/iasworld/schema.yml)
18:07:39 Got 2133988 results, configured to fail if != 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still curious what @wrridgeway thinks, but I'm going to go ahead and merge in the meantime to unblock test development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for sales, yeah, we can just do unique by doc_no
, but we have to include is_multisale = FALSE
as well. we should get two rows that fail. sorry it took so long to respond @jeancochrane , my laptop was dead all day.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wrridgeway Does that mean we should remove this test and replace it with one like this?
- name: sales
tests:
- unique_combination_of_columns:
name: sales_unique_by_doc_no
combination_of_columns:
- doc_no
config:
where: is_multisale = false
error_if: ">2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wrridgeway You're thinking of vw_pin_sale
. This is for iasworld.sales
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is true, but in that case, we do not expect iasworld.sales
to be unique by doc_no
. the test i provided jean and which she implemented is the best i can do currently for uniqueness in that table. it is encouraging that for 2023 things are unique by parid
and instruno
- that's a HUGE improvement. it's also worth noting that i don't think we should ever expect the sales table to be unique only by instruno
since multi-pin sales are legitimate.
) %} | ||
|
||
{%- set columns_csv = combination_of_columns | join(", ") %} | ||
|
||
select {{ columns_csv }}, count(*) as num_duplicates | ||
from {{ model }} | ||
{% if where %} where ({{ where }}) {% endif %} | ||
group by {{ columns_csv }} | ||
having count(*) > {{ allowed_duplicates }} + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought: We may run into lots of cases where this dupe test triggers due to many null
values, while the actual values are unique. This conditional will be especially useful in those cases.
procname in ('CCAOVALUE', 'CCAOFINAL', 'BORVALUE') and | ||
valclass is null | ||
config: | ||
error_if: ">125" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thought (dreadful): It's going to be so much fun to figure out what all these errors are...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dbt/models/iasworld/schema.yml
Outdated
combination_of_columns: | ||
- parid | ||
- instruno | ||
where: substr(saledt, 1, 4) >= '2023' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand the data context here, but I wanted to confirm that the test doesn't pass if you remove the where
clause:
18:07:39 Failure in test sales_unique_by_parid_instruno (models/iasworld/schema.yml)
18:07:39 Got 2133988 results, configured to fail if != 0
) %} | ||
|
||
{%- set columns_csv = combination_of_columns | join(", ") %} | ||
|
||
select {{ columns_csv }}, count(*) as num_duplicates | ||
from {{ model }} | ||
{% if where %} where ({{ where }}) {% endif %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true! I would be on board for changing it to if where is not null
, but I ended up switching to using the builtin config.where
filter in 8e72252 so I think the question is moot.
This PR adds a new set of uniqueness tests for iasWorld tables, following up on #93. The tests in this PR are only unique for rows that meet a certain set of conditions, which means that we need to use the
config.where
filter attribute to define the set of rows that should be unique.Closes #87.