-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add proximity.dist_pin_to_pin
to the dbt DAG using intermediate transforms
#138
Add proximity.dist_pin_to_pin
to the dbt DAG using intermediate transforms
#138
Conversation
proximity.dist_pin_to_pin
to the DAG using intermediate transforms
proximity.dist_pin_to_pin
to the DAG using intermediate transformsproximity.dist_pin_to_pin
to the dbt DAG using intermediate transforms
…solve missing data
-- * year | ||
-- * x_3435 | ||
-- * y_3435 | ||
{% macro nearest_pin_neighbors(source_model, num_neighbors, radius_km) %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This macro is nearly identical to the pre-existing aws-athena/ctas/proximity-dist_pin_to_pin.sql
query, with three main changes:
- Parameterization of
source_model
,num_neigbors
, andradius_km
- Addition of
pin10
to thedistinct_pins
subquery per our missing data discussion on Teams (diff here) - Reformatting to match dbt styles using
sqlfmt
Everything else should be identical!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeancochrane Great work! When we visit #130 it may be worth it to see what else we can abstract into macros. See my one blocking issue, but otherwise looks good.
DROP TABLE IF EXISTS proximity.dist_pin_to_pin_temp | ||
DROP TABLE IF EXISTS proximity.dist_pin_to_pin_temp2 | ||
-- View that finds the 3 nearest neighbor PINs for every PIN for every year | ||
SELECT * FROM {{ ref('proximity.dist_pin_to_pin_10km') }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue (blocking): If dist_pin_to_pin_10km.sql
only contains the values missing from dist_pin_to_pin_1km.sql
, then shouldn't this be the union of dist_pin_to_pin_1km.sql
and dist_pin_to_pin_10km.sql
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dfsnow Looks like the new table doesn't have quite the same contents as the existing one: select count(*)
from "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin"
Results: 32370578 select count(*)
from "proximity"."dist_pin_to_pin"
Results: 32420636 That's a difference of about 50k rows; I would be fine with this if the new version had more rows, but it seems concerning that there are fewer. Further queries confirm that the new version of the table is a subset of the old version: select count(*)
from "proximity"."dist_pin_to_pin" as prod
full outer join "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin" as ci
on prod.pin10 = ci.pin10 and prod.year = ci.year
where prod.pin10 is null
Results: 0 select count(*)
from "proximity"."dist_pin_to_pin" as prod
full outer join "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin" as ci
on prod.pin10 = ci.pin10 and prod.year = ci.year
where ci.pin10 is null
Results: 53610 Here's a version of the query grouped by year: select prod.year, count(*)
from "proximity"."dist_pin_to_pin" as prod
full outer join "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin" as ci
on prod.pin10 = ci.pin10 and prod.year = ci.year
where ci.pin10 is null
group by prod.year
order by prod.year desc
What do you think @dfsnow? Is this worth investigating before we merge and overwrite the existing table? |
@jeancochrane Do a quick anti-join on pin and let's see if it's easy to diagnose. If it takes more than 30 minutes I'm comfortable just merging it as is, since the 3 neighbors are only used for filling. |
@dfsnow Not much closer to diagnosing it, but here's some interesting information for you -- only about 5k PINs are missing data for multiple years: with missing_pins as (
select prod.pin10 as pin10, prod.year as year
from "proximity"."dist_pin_to_pin" as prod
full outer join "ci_jeancochrane-122-data-catalog-move-proximitydist-pin-to-pin-ctas-query-into-the-dbt-dag_proximity"."dist_pin_to_pin" as ci
on prod.pin10 = ci.pin10 and prod.year = ci.year
where ci.pin10 is null
),
missing_pins_grouped_by_year as (
select pin10, array_agg(year) as missing_years
from missing_pins
group by pin10
having cardinality(array_agg(year)) > 1
)
select count(*)
from missing_pins_grouped_by_year
Results: 4940 And only 153 PINs are missing data for all years: select count(*)
from missing_pins_grouped_by_year
where cardinality(missing_years) = 23
Results: 153 This at least makes me feel better that we're not just dropping an entire group of PINs! If you feel fine with this, I'll go ahead and merge. |
@jeancochrane Likely has something to do with PIN changes across years or fake "999" PINs. Let's just go ahead and merge it. |
Sounds good, thanks @dfsnow! Mind hitting the Approve button so I can merge? |
This PR changes
proximity.dist_pin_to_pin
from a source to a ref so that it can be tracked in the dbt DAG. In order to build the table in a reasonable amount of time, we define a series of intermediate transformations, documented indbt/models/proximity/docs.md
, that expand the PIN-to-PIN buffer gradually so that we only compute the 10km buffer for PINs that do not have neighbors within 1km.As evidenced by this successful GitHub workflow run, this new design takes about 14 minutes to build
dist_pin_to_pin
.Closes #122.