-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Update pipeline to use trip_id
as unique key
#160
Conversation
"direction_id", | ||
"start_time", | ||
"vehicle_id", | ||
"trip_id", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion(non-blocking): not sure it matters for you all, but a route is an attribute of a trip, so that's already unique: you don't need both items.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have examples in our test set where the same trip_id
continues across more than one Green Line route_id
for the same start_date
.
I'm not sure how common this is, but it does occur, and I believe the unique trip designations we are looking for should include route_id
as a guarantee.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment has been added to the function call to describe this behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have details about when you've seen that, it would be helpful: I don't think that should be happening (at least not within a single start_date).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our flat file test data is a random sample from May 8th of this year. Two trip ids exhibit this behavior (ADDED-1581518542, ADDED-1581518549).
CSV file with the GTFS-RT Vehicle Position data is attached.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, something like this makes sense for ADDED trips (although probably that reflects a bug in RTR). I'm still not sure why route_id
needs to be included in the unique constraint though (given that it's possible for some of the other items in there to be non-unique as well), but I'll defer to the Lamp team.
TempEventCompare.direction_id, | ||
TempEventCompare.start_time, | ||
TempEventCompare.trip_id, | ||
TempEventCompare.stop_sequence.desc(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: is this backwards? stop sequences start with low numbers and go up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this query is to collect additional information fields ( vehicle_label
, vehicle_consist
) for unique trips that will be used for UPDATE/INSERT operations into our vehicle_trips
table.
For this first stop-event in a trip, this information is frequently carried over from the last trip for the vehicle, so this query is collecting the latest stop-event values from these information fields to UPDATE/INSERT into our vehicle_trips
table.
That is why the desc()
ORDER is used, it would probably be helpful to add a comment to this effect above these calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a few comments in here, some questions some suggestions.
seems like our csv stuff isn't lining up tho.
"direction_id", | ||
"start_time", | ||
"vehicle_id", | ||
"trip_id", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to keep route_id
in here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have examples in our test set where the same trip_id continues across more than one Green Line route_id for the same start_date.
I'm not sure how common this is, but it does occur, and I believe the unique trip designations we are looking for should include route_id as a guarantee.
TempEventCompare.direction_id, | ||
TempEventCompare.start_time, | ||
TempEventCompare.trip_id, | ||
TempEventCompare.stop_sequence.desc(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why add this one into the ordering?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The purpose of this query is to collect additional information fields ( vehicle_label, vehicle_consist) for unique trips that will be used for UPDATE/INSERT operations into our vehicle_trips table.
For this first stop-event in a trip, this information is frequently carried over from the last trip for the vehicle, so this query is collecting the latest stop-event values from these information fields to UPDATE/INSERT into our vehicle_trips table.
That is why the desc() ORDER is used, it would probably be helpful to add a comment to this effect above these calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment has been added to the function call to describe this behavior
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why the changes here? they seem to be in places i wouldn't think would effect anything?
order_by=sa.func.coalesce( | ||
rt_trips_sub.c.vp_move_timestamp, | ||
rt_trips_sub.c.vp_stop_timestamp, | ||
rt_trips_sub.c.tu_stop_timestamp, | ||
), | ||
order_by=rt_trips_sub.c.vp_move_timestamp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was this change necessary as well if we're running the VACUUM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change wasn't necessary, but looking at the field that's being calculated, and how it's used. I don't believe the coalesce is needed.
I think this business logic was a little bit of a hold over from before Ops Analytics decided they wanted all headways as departure to departure calculations.
LGTM 🍰 |
This change modifies the data processing pipeline to use
trip_id
in the unique identifier for trip-stops.This is required to process GTFS-RT Vehicle Postion files that do no report
start_date
andstart_time
values.The
pipeline_flat_out.csv
test file required 2 updates because of this pipeline change:55713710
vehicle label field switch from3868-3696
to3696-3819-3662
(The trip reports both labels evenly for the duration of the trip`Trip ID1683547929
parent stationplace-pktrm
trunk headway changed to 410 seconds to match branch headway durationAsana Task: https://app.asana.com/0/1204931901750665/1205259099679735