Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add proximity.dist_pin_to_pin to the dbt DAG using intermediate transforms #138

Conversation

jeancochrane
Copy link
Contributor

@jeancochrane jeancochrane commented Sep 18, 2023

This PR changes proximity.dist_pin_to_pin from a source to a ref so that it can be tracked in the dbt DAG. In order to build the table in a reasonable amount of time, we define a series of intermediate transformations, documented in dbt/models/proximity/docs.md, that expand the PIN-to-PIN buffer gradually so that we only compute the 10km buffer for PINs that do not have neighbors within 1km.

As evidenced by this successful GitHub workflow run, this new design takes about 14 minutes to build dist_pin_to_pin.

Closes #122.

@jeancochrane jeancochrane changed the title Add proximity.dist_pin_to_pin to the DAG using intermediate transforms Add proximity.dist_pin_to_pin to the DAG using intermediate transforms Sep 18, 2023
@jeancochrane jeancochrane changed the title Add proximity.dist_pin_to_pin to the DAG using intermediate transforms Add proximity.dist_pin_to_pin to the dbt DAG using intermediate transforms Sep 18, 2023
.sqlfluff Show resolved Hide resolved
-- * year
-- * x_3435
-- * y_3435
{% macro nearest_pin_neighbors(source_model, num_neighbors, radius_km) %}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This macro is nearly identical to the pre-existing aws-athena/ctas/proximity-dist_pin_to_pin.sql query, with three main changes:

  1. Parameterization of source_model, num_neigbors, and radius_km
  2. Addition of pin10 to the distinct_pins subquery per our missing data discussion on Teams (diff here)
  3. Reformatting to match dbt styles using sqlfmt

Everything else should be identical!

dbt/models/proximity/docs.md Outdated Show resolved Hide resolved
@jeancochrane jeancochrane marked this pull request as ready for review September 20, 2023 18:59
@jeancochrane jeancochrane requested a review from a team as a code owner September 20, 2023 18:59
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeancochrane Great work! When we visit #130 it may be worth it to see what else we can abstract into macros. See my one blocking issue, but otherwise looks good.

.sqlfluff Show resolved Hide resolved
DROP TABLE IF EXISTS proximity.dist_pin_to_pin_temp
DROP TABLE IF EXISTS proximity.dist_pin_to_pin_temp2
-- View that finds the 3 nearest neighbor PINs for every PIN for every year
SELECT * FROM {{ ref('proximity.dist_pin_to_pin_10km') }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (blocking): If dist_pin_to_pin_10km.sql only contains the values missing from dist_pin_to_pin_1km.sql, then shouldn't this be the union of dist_pin_to_pin_1km.sql and dist_pin_to_pin_10km.sql?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, excellent catch @dfsnow! Embarrassing mistake on my part 🙈 This is fixed in a0f7423, and I updated the table docs to reflect it 👍🏻

dbt/macros/nearest_pin_neighbors.sql Show resolved Hide resolved
dbt/models/proximity/docs.md Outdated Show resolved Hide resolved
dbt/models/proximity/docs.md Outdated Show resolved Hide resolved
@jeancochrane
Copy link
Contributor Author

@dfsnow Looks like the new table doesn't have quite the same contents as the existing one:

select count(*)
from "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin"

Results: 32370578
select count(*)
from "proximity"."dist_pin_to_pin"

Results: 32420636

That's a difference of about 50k rows; I would be fine with this if the new version had more rows, but it seems concerning that there are fewer.

Further queries confirm that the new version of the table is a subset of the old version:

select count(*)
from "proximity"."dist_pin_to_pin" as prod
full outer join "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin" as ci
on prod.pin10 = ci.pin10 and prod.year = ci.year
where prod.pin10 is null

Results: 0
select count(*)
from "proximity"."dist_pin_to_pin" as prod
full outer join "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin" as ci
on prod.pin10 = ci.pin10 and prod.year = ci.year
where ci.pin10 is null

Results: 53610

Here's a version of the query grouped by year:

select prod.year, count(*)
from "proximity"."dist_pin_to_pin" as prod
full outer join "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin" as ci
on prod.pin10 = ci.pin10 and prod.year = ci.year
where ci.pin10 is null
group by prod.year
order by prod.year desc
year count of missing PINs
2022 2931
2021 2369
2020 2885
2019 2870
2018 2677
2017 2629
2016 2488
2015 2632
2014 2715
2013 2760
2012 2862
2011 2759
2010 2760
2009 2791
2008 2304
2007 2143
2006 2038
2005 1836
2004 1630
2003 1508
2002 1423
2001 1331
2000 1269

What do you think @dfsnow? Is this worth investigating before we merge and overwrite the existing table?

@dfsnow
Copy link
Member

dfsnow commented Sep 26, 2023

What do you think @dfsnow? Is this worth investigating before we merge and overwrite the existing table?

@jeancochrane Do a quick anti-join on pin and let's see if it's easy to diagnose. If it takes more than 30 minutes I'm comfortable just merging it as is, since the 3 neighbors are only used for filling.

@jeancochrane
Copy link
Contributor Author

@dfsnow Not much closer to diagnosing it, but here's some interesting information for you -- only about 5k PINs are missing data for multiple years:

with missing_pins as (
    select prod.pin10 as pin10, prod.year as year
    from "proximity"."dist_pin_to_pin" as prod
    full outer join "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin" as ci
    on prod.pin10 = ci.pin10 and prod.year = ci.year
    where ci.pin10 is null
),

missing_pins_grouped_by_year as (
    select pin10, array_agg(year) as missing_years
    from missing_pins
    group by pin10
    having cardinality(array_agg(year)) > 1
)

select count(*)
from missing_pins_grouped_by_year

Results: 4940

And only 153 PINs are missing data for all years:

select count(*)
from missing_pins_grouped_by_year
where cardinality(missing_years) = 23

Results: 153

This at least makes me feel better that we're not just dropping an entire group of PINs! If you feel fine with this, I'll go ahead and merge.

@dfsnow
Copy link
Member

dfsnow commented Sep 26, 2023

@jeancochrane Likely has something to do with PIN changes across years or fake "999" PINs. Let's just go ahead and merge it.

@jeancochrane
Copy link
Contributor Author

Sounds good, thanks @dfsnow! Mind hitting the Approve button so I can merge?

@dfsnow dfsnow self-requested a review September 26, 2023 19:16
@jeancochrane jeancochrane merged commit fcf69f8 into master Sep 26, 2023
8 checks passed
@jeancochrane jeancochrane deleted the jeancochrane/122-data-catalog-move-proximitydist_pin_to_pin-ctas-query-into-the-dbt-dag branch September 26, 2023 19:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Data catalog] Move proximity.dist_pin_to_pin CTAS query into the dbt DAG
2 participants