Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add conditional uniqueness tests for iasWorld tables #98

Conversation

jeancochrane
Copy link
Contributor

@jeancochrane jeancochrane commented Aug 24, 2023

This PR adds a new set of uniqueness tests for iasWorld tables, following up on #93. The tests in this PR are only unique for rows that meet a certain set of conditions, which means that we need to use the config.where filter attribute to define the set of rows that should be unique.

Closes #87.

@jeancochrane jeancochrane marked this pull request as ready for review August 24, 2023 22:35
@jeancochrane jeancochrane requested a review from a team as a code owner August 24, 2023 22:35
@@ -10,7 +10,7 @@
-- Adapted from dbt_utils.unique_combination_of_columns, and adjusted to add the
-- optional duplicate threshold and to only report one row for each dupe.
{% test unique_combination_of_columns(
model, combination_of_columns, allowed_duplicates=0
model, combination_of_columns, allowed_duplicates=0, where=null
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is technically a reserved keyword in SQL, but it doesn't seem to cause problems when we use it as a jinja variable, and I like the simplicity of the interface that it allows for!

@jeancochrane jeancochrane requested review from wagnerlmichael and dfsnow and removed request for wagnerlmichael and dfsnow August 24, 2023 22:36
@jeancochrane jeancochrane marked this pull request as draft August 24, 2023 22:37
@jeancochrane
Copy link
Contributor Author

Oops, converting back to a draft since I think I found a syntax error!

@jeancochrane jeancochrane marked this pull request as ready for review August 24, 2023 22:55
@jeancochrane
Copy link
Contributor Author

Alright, we should be good now 👍🏻

Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @jeancochrane! This will be super useful very quickly as we start building out more complex tests.

question (non-blocking): My only question is how does this relate to the existing where test macro. I assume we can't/shouldn't use that functionality in this case?

) %}

{%- set columns_csv = combination_of_columns | join(", ") %}

select {{ columns_csv }}, count(*) as num_duplicates
from {{ model }}
{% if where %} where ({{ where }}) {% endif %}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is really funny if you don't know what's going on.

question (non-blocking): Am I correct in my understanding that dbt/jinja interpret any non-null value as truthy here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true! I would be on board for changing it to if where is not null, but I ended up switching to using the builtin config.where filter in 8e72252 so I think the question is moot.

combination_of_columns:
- parid
- instruno
where: substr(saledt, 1, 4) >= '2023'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (non-blocking): This conditional hopefully shouldn't be necessary to make the test pass (sales should be unique by document number). If it is, we may need to borrow some of the more complex logic of the sales view, i.e:

WHERE sales.instruno IS NOT NULL
    AND sales.deactivat IS NULL
    AND tc.township_code IS NOT NULL

@wrridgeway Any thoughts here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the data context here, but I wanted to confirm that the test doesn't pass if you remove the where clause:

18:07:39  Failure in test sales_unique_by_parid_instruno (models/iasworld/schema.yml)
18:07:39    Got 2133988 results, configured to fail if != 0      

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still curious what @wrridgeway thinks, but I'm going to go ahead and merge in the meantime to unblock test development.

Copy link
Member

@wrridgeway wrridgeway Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for sales, yeah, we can just do unique by doc_no, but we have to include is_multisale = FALSE as well. we should get two rows that fail. sorry it took so long to respond @jeancochrane , my laptop was dead all day.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wrridgeway Does that mean we should remove this test and replace it with one like this?

- name: sales
  tests:
    - unique_combination_of_columns:
      name: sales_unique_by_doc_no
      combination_of_columns:
        - doc_no
      config:
        where: is_multisale = false
        error_if: ">2

Copy link
Member

@dfsnow dfsnow Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wrridgeway You're thinking of vw_pin_sale. This is for iasworld.sales.

Copy link
Member

@wrridgeway wrridgeway Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is true, but in that case, we do not expect iasworld.sales to be unique by doc_no. the test i provided jean and which she implemented is the best i can do currently for uniqueness in that table. it is encouraging that for 2023 things are unique by parid and instruno - that's a HUGE improvement. it's also worth noting that i don't think we should ever expect the sales table to be unique only by instruno since multi-pin sales are legitimate.

) %}

{%- set columns_csv = combination_of_columns | join(", ") %}

select {{ columns_csv }}, count(*) as num_duplicates
from {{ model }}
{% if where %} where ({{ where }}) {% endif %}
group by {{ columns_csv }}
having count(*) > {{ allowed_duplicates }} + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: We may run into lots of cases where this dupe test triggers due to many null values, while the actual values are unique. This conditional will be especially useful in those cases.

procname in ('CCAOVALUE', 'CCAOFINAL', 'BORVALUE') and
valclass is null
config:
error_if: ">125"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought (dreadful): It's going to be so much fun to figure out what all these errors are...

Copy link
Contributor Author

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job finding the builtin config.where attribute @dfsnow, I'm kicking myself for missing that! Turns out it produces identical output, so I switched to using it in 8e72252 and undid my changes to unique_combination_of_columns.

combination_of_columns:
- parid
- instruno
where: substr(saledt, 1, 4) >= '2023'
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the data context here, but I wanted to confirm that the test doesn't pass if you remove the where clause:

18:07:39  Failure in test sales_unique_by_parid_instruno (models/iasworld/schema.yml)
18:07:39    Got 2133988 results, configured to fail if != 0      

) %}

{%- set columns_csv = combination_of_columns | join(", ") %}

select {{ columns_csv }}, count(*) as num_duplicates
from {{ model }}
{% if where %} where ({{ where }}) {% endif %}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true! I would be on board for changing it to if where is not null, but I ended up switching to using the builtin config.where filter in 8e72252 so I think the question is moot.

@jeancochrane jeancochrane merged commit 1a4424a into master Aug 25, 2023
3 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/87-add-conditional-uniqueness-tests-for-iasworld-tables branch August 25, 2023 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add conditional uniqueness tests for iasWorld tables
3 participants