feat: CSV Flatfile Pipeline Testing #154
c195dc4 to da80368
two small change requests, but it looks good to me.
parquet_file = os.path.join(
    springboard_dir, parquet_folder, "flat_file.parquet"
)
os.makedirs(os.path.join(springboard_dir, parquet_folder), exist_ok=True)
we should use the pytest temp directory fixture instead of os mkdir.
Incorporated with the pytest tmp_path fixture.
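A minimal sketch of the suggestion above, assuming the test only needs a scratch location for flat_file.parquet. pytest's built-in tmp_path fixture supplies a unique pathlib.Path per test, so nothing is created inside the repository's own test-file tree. The helper name flat_file_parquet_path and the folder name passed to it are hypothetical, not taken from the PR.

```python
from pathlib import Path


def flat_file_parquet_path(base_dir: Path, parquet_folder: str) -> Path:
    """Return the flat_file.parquet path under base_dir, creating its folder."""
    target_dir = base_dir / parquet_folder
    target_dir.mkdir(parents=True, exist_ok=True)
    return target_dir / "flat_file.parquet"


def test_flat_file_path(tmp_path: Path) -> None:
    # pytest injects tmp_path: a fresh temporary directory for this test only.
    # "RT_VEHICLE_POSITIONS/example" is a made-up folder name for illustration.
    parquet_file = flat_file_parquet_path(tmp_path, "RT_VEHICLE_POSITIONS/example")
    assert parquet_file.parent.is_dir()
    assert parquet_file.name == "flat_file.parquet"
```

pytest manages and eventually cleans up these temporary directories itself, so the test needs no explicit teardown.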
)

compare_result = db_result_df.compare(csv_result_df, align_axis=1)
print(compare_result, flush=True)
only print this if the assert fails?
Incorporated.
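One way to show the comparison only on failure is to put it in the assertion message, which pytest reports only when the assert fails. This is a sketch, not necessarily the change made in the PR; the helper name assert_frames_match is hypothetical, while db_result_df and csv_result_df follow the snippet under review.

```python
import pandas as pd


def assert_frames_match(db_result_df: pd.DataFrame, csv_result_df: pd.DataFrame) -> None:
    """Fail with a readable diff if the two result frames differ."""
    # DataFrame.compare requires identically labeled frames (same index and
    # columns); it raises ValueError otherwise. Only differing cells are kept.
    compare_result = db_result_df.compare(csv_result_df, align_axis=1)
    # The message is built and shown only when the assertion fails.
    assert compare_result.empty, (
        "Database results differ from expected CSV results:\n"
        f"{compare_result.to_string()}"
    )
```

With this shape, a passing run produces no output, and a failing run shows the differing cells for both frames side by side.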
LGTM 🍰
This PR provides a new testing method, test_whole_table. This method takes a CSV representation of Vehicle Position records, converts it to a Parquet file on the fly, and runs it through process_gtfs_rt_files to populate the GTFS-RT RDS tables.
After processing, a query selects records from the RDS and compares them against a CSV results file to confirm that there have been no fundamental changes to the Performance Manager processing pipelines.
The input CSV file consists mostly of whole-route trips, three in each direction for each MBTA rail line.
This PR required updating the GTFS and GTFS-RT test files in the SPRINGBOARD test_files folder of the repository to match the date of the records in the flat file (May 8th, 2023). To reduce the repository size, all non-rail data has also been stripped from the SPRINGBOARD parquet files of RT_TRIP_UPDATES and RT_VEHICLE_POSITIONS.
This branch was developed from a commit that predates the PR that removed hash columns from the application. This was done to provide a check on the validity of the changes introduced by that PR.
Asana Task: https://app.asana.com/0/1204931901750655/1205084207879142