feat: Column-based storage for GTFS entities #1747
base: master
✅ Rule acceptance tests passed.
Impressive work.
GtfsContinuousPickupDropOff continuousPickup();

@DefaultValue("1")
@UnusedValue
GtfsContinuousPickupDropOff continuousDropOff();

@NonNegative
There was a problem with one of the datasets I used for testing (https://storage.googleapis.com/storage/v1/b/mdb-latest/o/ar-cordoba-mar-chiquita-srl-gtfs-1146.zip?alt=media).
In this dataset, shape_dist_traveled was empty except for one row that contained 0.
Because of this, a column was created with all values set to 0.
Maybe we can leverage the DefaultValue annotation and create a column only if a value in the file differs from the default?
The existence of the column also produced a batch of decreasing_or_equal_stop_time_distance notices, because the distance never increased: every value was 0. This would not have happened if hasShapeDistTraveled() had returned false, but it returned true.
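To make the suggestion concrete, here is a minimal sketch of "create a column only if some value differs from the default". It is not the validator's loader code: the class DefaultAwareColumnSketch and the method buildColumnOrNull are hypothetical names, empty cells are modelled as null, and a default of 0 for shape_dist_traveled is assumed purely for illustration.

```java
import java.util.List;

/** Hypothetical sketch; not the validator's actual loader code. */
public class DefaultAwareColumnSketch {

  /**
   * Builds a dense column from parsed cells, where null stands for an empty cell.
   * Returns null when every cell is empty or equal to defaultValue, so no column
   * is allocated and a hasX()-style check can report false.
   */
  static double[] buildColumnOrNull(List<Double> cells, double defaultValue) {
    boolean worthStoring = false;
    for (Double cell : cells) {
      if (cell != null && cell.doubleValue() != defaultValue) {
        worthStoring = true;
        break;
      }
    }
    if (!worthStoring) {
      return null; // nothing but empties and defaults: skip the column entirely
    }
    double[] column = new double[cells.size()];
    for (int i = 0; i < column.length; i++) {
      Double cell = cells.get(i);
      // Empty cells fall back to the default value.
      column[i] = (cell == null) ? defaultValue : cell.doubleValue();
    }
    return column;
  }
}
```

With the cells from the dataset above (all empty except a single 0), buildColumnOrNull returns null, so no all-zero column would exist. One trade-off to keep in mind: a file that explicitly sets every row to the default would then look the same as a file that omits the values.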
implementation 'javax.inject:javax.inject:1'
implementation 'com.google.guava:guava:31.0.1-jre'
implementation 'com.google.code.findbugs:jsr305:3.0.2'
testImplementation 'org.junit.jupiter:junit-jupiter-api:5.8.1'
This is outside the scope of this PR, but we should align all JUnit versions in the project for consistency.
  return new AllEntitiesListImpl();
}

private class AllEntitiesListImpl extends AbstractList<T> implements HasFactory<T> {
Currently, AnyTableLoader doesn't add entities with parsing errors, so the row index no longer matches after the first parsing error. As a result, all rows after the first error are not loaded and are ignored by the single-file and multi-file validators. One potential fix is to add all entities, including those with parsing errors. This goes against the idea of saving space on unused data, and validators would then need to be aware of the "unparsable" rows. Another possible fix is to use createSomeEntitiesList.
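The first fix could look roughly like the sketch below: every row gets a slot, and a bitmap marks the rows that failed to parse, so the column index always equals the row index. This is an illustration under assumptions, not code from this PR; RowAlignmentSketch, IntColumn, and load are hypothetical names, and a single integer column stands in for a full entity.

```java
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch; not the validator's actual loader code. */
public class RowAlignmentSketch {

  /** A column of ints plus a bitmap of the rows that failed to parse. */
  static final class IntColumn {
    final int[] values;
    final BitSet unparsable;

    IntColumn(int[] values, BitSet unparsable) {
      this.values = values;
      this.unparsable = unparsable;
    }
  }

  /** Loads every cell into its own slot so that index == row number. */
  static IntColumn load(List<String> cells) {
    int[] values = new int[cells.size()];
    BitSet unparsable = new BitSet(cells.size());
    for (int row = 0; row < cells.size(); row++) {
      try {
        values[row] = Integer.parseInt(cells.get(row).trim());
      } catch (NumberFormatException e) {
        // Keep the slot (value stays 0) and remember that this row is invalid.
        unparsable.set(row);
      }
    }
    return new IntColumn(values, unparsable);
  }
}
```

The bitmap costs about one bit per row, which keeps the overhead small, but validators would have to consult it before trusting a value.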
(useColumnBasedStorage
    ? columnDescriptor.columnBasedEntityBuilderSetter()
    : columnDescriptor.entityBuilderSetter());
if (useColumnBasedStorage && columnDescriptor.unusedValue()) {
[question]: To support future extensions that use currently unused fields, how can we dynamically override this behavior and load the unused columns?
Per discussion in #1358 and GTFS Validator - Memory Reduction, this PR implements support for column-based storage of GTFS entities. This technique reduces the validator's memory footprint by not allocating memory for unused columns.
This PR is not yet ready for review but is meant to show what the implementation might look like.
See the implementation report for details on memory savings and performance.
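For readers new to the idea, the sketch below illustrates column-based storage in general terms. It is an assumption-laden toy, not the implementation in this PR: ColumnStoreSketch, StopColumns, and StopView are hypothetical names, and the three fields are chosen only as examples. Each field lives in one array per table, an absent column is simply a null reference, and an entity is a lightweight view holding a row index.

```java
import java.util.AbstractList;
import java.util.List;

/** Hypothetical sketch of column-based storage; not this PR's implementation. */
public class ColumnStoreSketch {

  /** One array per field; a null array means the column is absent or unused. */
  static final class StopColumns {
    final String[] stopId;
    final double[] stopLat;
    final String[] stopDesc; // may be null: no per-row cost when unused

    StopColumns(String[] stopId, double[] stopLat, String[] stopDesc) {
      this.stopId = stopId;
      this.stopLat = stopLat;
      this.stopDesc = stopDesc;
    }
  }

  /** Lightweight entity view: a reference to the columns plus a row index. */
  static final class StopView {
    private final StopColumns columns;
    private final int row;

    StopView(StopColumns columns, int row) {
      this.columns = columns;
      this.row = row;
    }

    String stopId() { return columns.stopId[row]; }
    double stopLat() { return columns.stopLat[row]; }
    boolean hasStopDesc() { return columns.stopDesc != null; }
    String stopDesc() { return hasStopDesc() ? columns.stopDesc[row] : ""; }
  }

  /** Exposes the columns as a read-only list of views created on demand. */
  static List<StopView> asList(final StopColumns columns) {
    return new AbstractList<StopView>() {
      @Override
      public StopView get(int index) {
        if (index < 0 || index >= size()) {
          throw new IndexOutOfBoundsException("row " + index);
        }
        return new StopView(columns, index);
      }

      @Override
      public int size() {
        return columns.stopId.length;
      }
    };
  }
}
```

In a row-based layout every entity object carries a slot for every field, whether or not the feed uses it. Here an unused field costs one null reference per table instead of one slot per row.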
Please make sure these boxes are checked before submitting your pull request - thanks!
Run gradle test to make sure you didn't break anything