feat: Update pipeline to use trip_id as unique key #160

Merged
merged 4 commits into from
Aug 28, 2023

@rymarczy (Collaborator) commented Aug 22, 2023

This change modifies the data processing pipeline to use trip_id in the unique identifier for trip-stops.

This is required to process GTFS-RT Vehicle Position files that do not report start_date and start_time values.

The pipeline_flat_out.csv test file required 2 updates because of this pipeline change:

  • Trip ID 55713710: the vehicle label field switched from 3868-3696 to 3696-3819-3662 (the trip reports both labels evenly for the duration of the trip).
  • Trip ID 1683547929: the trunk headway at parent station place-pktrm changed to 410 seconds to match the branch headway duration.

Asana Task: https://app.asana.com/0/1204931901750665/1205259099679735
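
For readers following along, here is a minimal sketch of what keying trip-stops on trip_id rather than start_time can look like. It is not the pipeline's actual code: the `KEY_FIELDS` column list, the helper names, and the sample rows are all assumptions made for illustration.

```python
# Hypothetical key columns; the real pipeline's column list may differ.
KEY_FIELDS = ("service_date", "route_id", "direction_id", "trip_id", "stop_id")


def trip_stop_key(event):
    """Build a trip-stop identifier that does not depend on start_time."""
    return tuple(event[field] for field in KEY_FIELDS)


def dedupe_trip_stops(events):
    """Keep the first event seen for each trip-stop key."""
    seen = set()
    unique = []
    for event in events:
        key = trip_stop_key(event)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique


# Made-up events; start_time is None, as in feeds that omit it.
events = [
    {"service_date": 20230101, "route_id": "R1", "direction_id": 0,
     "trip_id": "T1", "stop_id": "S1", "start_time": None},
    {"service_date": 20230101, "route_id": "R1", "direction_id": 0,
     "trip_id": "T1", "stop_id": "S1", "start_time": None},
    {"service_date": 20230101, "route_id": "R1", "direction_id": 0,
     "trip_id": "T2", "stop_id": "S1", "start_time": None},
]
unique_events = dedupe_trip_stops(events)
```

Because start_time is not part of the key, events from feeds that omit it can still be grouped into distinct trip-stops.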

"direction_id",
"start_time",
"vehicle_id",
"trip_id",
Member

suggestion(non-blocking): not sure it matters for you all, but a route is an attribute of a trip, so that's already unique: you don't need both items.

Collaborator Author

We have examples in our test set where the same trip_id continues across more than one Green Line route_id for the same start_date.

I'm not sure how common this is, but it does occur, and I believe the unique trip designations we are looking for should include route_id as a guarantee.

Collaborator Author

Comment has been added to the function call to describe this behavior

Member

If you have details about when you've seen that, it would be helpful: I don't think that should be happening (at least not within a single start_date).

Collaborator Author

@rymarczy commented Aug 24, 2023

Our flat file test data is a random sample from May 8th of this year. Two trip ids exhibit this behavior (ADDED-1581518542, ADDED-1581518549).

CSV file with the GTFS-RT Vehicle Position data is attached.

non_unique_trip_id.csv
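
A check along these lines could surface such trips in a Vehicle Positions extract. This is only a sketch: the function name and the sample rows are made up for illustration and are not taken from the attached CSV.

```python
from collections import defaultdict


def trips_spanning_routes(rows):
    """Return {(start_date, trip_id): sorted route_ids} for trips seen on more than one route."""
    routes_by_trip = defaultdict(set)
    for row in rows:
        routes_by_trip[(row["start_date"], row["trip_id"])].add(row["route_id"])
    return {
        key: sorted(routes)
        for key, routes in routes_by_trip.items()
        if len(routes) > 1
    }


# Made-up rows for illustration only.
rows = [
    {"start_date": "20230101", "trip_id": "ADDED-X", "route_id": "Green-B"},
    {"start_date": "20230101", "trip_id": "ADDED-X", "route_id": "Green-C"},
    {"start_date": "20230101", "trip_id": "T-100", "route_id": "Green-D"},
]
multi_route = trips_spanning_routes(rows)
```

Any key returned here is a trip_id that would collapse two routes into one trip if route_id were dropped from the unique constraint.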

Member

Ah, something like this makes sense for ADDED trips (although probably that reflects a bug in RTR). I'm still not sure why route_id needs to be included in the unique constraint though (given that it's possible for some of the other items in there to be non-unique as well), but I'll defer to the Lamp team.

TempEventCompare.direction_id,
TempEventCompare.start_time,
TempEventCompare.trip_id,
TempEventCompare.stop_sequence.desc(),
Member

question: is this backwards? stop sequences start with low numbers and go up.

Collaborator Author

The purpose of this query is to collect additional information fields (vehicle_label, vehicle_consist) for unique trips, which are then used for UPDATE/INSERT operations on our vehicle_trips table.

For the first stop-event in a trip, this information is frequently carried over from the vehicle's previous trip, so this query collects the latest stop-event values of these fields to UPDATE/INSERT into our vehicle_trips table.

That is why the desc() order is used. It would probably be helpful to add a comment to this effect above these calls.
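
The effect of the descending order can be sketched in plain Python. This is an illustration only, not the SQLAlchemy query under review: the function name and sample events are hypothetical, and trip_id alone stands in for the full set of trip key columns.

```python
def latest_vehicle_info(events):
    """For each trip, keep the vehicle fields from the highest stop_sequence."""
    # Sort so that, within a trip, the last stop-event comes first,
    # mirroring ORDER BY trip_id, stop_sequence DESC.
    ordered = sorted(events, key=lambda e: (e["trip_id"], -e["stop_sequence"]))
    latest = {}
    for event in ordered:
        # Only the first row seen per trip (the latest stop-event) is kept.
        if event["trip_id"] not in latest:
            latest[event["trip_id"]] = {
                "vehicle_label": event["vehicle_label"],
                "vehicle_consist": event["vehicle_consist"],
            }
    return latest


# Made-up events: the first stop still carries the previous trip's label.
events = [
    {"trip_id": "T1", "stop_sequence": 1,
     "vehicle_label": "OLD", "vehicle_consist": "OLD"},
    {"trip_id": "T1", "stop_sequence": 2,
     "vehicle_label": "NEW", "vehicle_consist": "NEW|NEW2"},
    {"trip_id": "T1", "stop_sequence": 3,
     "vehicle_label": "NEW", "vehicle_consist": "NEW|NEW2"},
]
vehicle_info = latest_vehicle_info(events)
```

With an ascending order, the stale values from stop_sequence 1 would be selected instead.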

Contributor

@mzappitello left a comment

a few comments in here, some questions some suggestions.

seems like our csv stuff isn't lining up tho.

"direction_id",
"start_time",
"vehicle_id",
"trip_id",
Contributor

do we need to keep route_id in here?

Collaborator Author

We have examples in our test set where the same trip_id continues across more than one Green Line route_id for the same start_date.

I'm not sure how common this is, but it does occur, and I believe the unique trip designations we are looking for should include route_id as a guarantee.

TempEventCompare.direction_id,
TempEventCompare.start_time,
TempEventCompare.trip_id,
TempEventCompare.stop_sequence.desc(),
Contributor

why add this one into the ordering?

Collaborator Author

The purpose of this query is to collect additional information fields (vehicle_label, vehicle_consist) for unique trips, which are then used for UPDATE/INSERT operations on our vehicle_trips table.

For the first stop-event in a trip, this information is frequently carried over from the vehicle's previous trip, so this query collects the latest stop-event values of these fields to UPDATE/INSERT into our vehicle_trips table.

That is why the desc() order is used. It would probably be helpful to add a comment to this effect above these calls.

Collaborator Author

Comment has been added to the function call to describe this behavior

Contributor

why the changes here? they seem to be in places i wouldn't think would affect anything?

- order_by=sa.func.coalesce(
-     rt_trips_sub.c.vp_move_timestamp,
-     rt_trips_sub.c.vp_stop_timestamp,
-     rt_trips_sub.c.tu_stop_timestamp,
- ),
+ order_by=rt_trips_sub.c.vp_move_timestamp,
Contributor

was this change necessary as well if we're running the VACUUM?

Collaborator Author

This change wasn't necessary, but looking at the field that's being calculated and how it's used, I don't believe the coalesce is needed.

I think this business logic was a bit of a holdover from before Ops Analytics decided they wanted all headways as departure-to-departure calculations.
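
As a rough illustration of a departure-to-departure headway, the sketch below orders trips at one stop by a single departure-like timestamp and takes the difference from the previous trip. The function name, the `move_timestamp` field, and the sample values are assumptions; treating vp_move_timestamp as the departure timestamp is also an assumption made here, not something confirmed in this thread.

```python
def departure_headways(departures):
    """Seconds between consecutive departures at one stop (None for the first trip)."""
    # Order by the single departure timestamp, with no fallback to other fields.
    ordered = sorted(departures, key=lambda d: d["move_timestamp"])
    headways = {}
    previous = None
    for dep in ordered:
        ts = dep["move_timestamp"]
        headways[dep["trip_id"]] = None if previous is None else ts - previous
        previous = ts
    return headways


# Made-up departures at a single stop, as epoch seconds.
departures = [
    {"trip_id": "T2", "move_timestamp": 1_000_300},
    {"trip_id": "T1", "move_timestamp": 1_000_000},
    {"trip_id": "T3", "move_timestamp": 1_000_720},
]
headways = departure_headways(departures)
```

Ordering on one timestamp keeps every headway a like-for-like comparison, whereas a coalesce could mix departure and arrival times in the same ordering.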

@mzappitello
Contributor

LGTM 🍰

@rymarczy rymarczy merged commit 511fbc8 into main Aug 28, 2023
6 checks passed

3 participants