feat: Column-based storage for GTFS entities #1747

bdferris-v2 · 2024-05-06T17:00:21Z

Per discussion in #1358 and GTFS Validator - Memory Reduction, this PR implements support for column-based storage of GTFS entities. This technique reduces the validator's memory footprint by not allocating memory for columns that are unused.
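
To illustrate the general idea, here is a minimal sketch of column-based storage. It is not the code in this PR: `StopTimeColumns` and its methods are hypothetical names, and it assumes a table with one always-present column and one optional column that takes no memory until a value is written.

```java
/** Hypothetical column store for stop_times-like rows; not the PR's actual classes. */
final class StopTimeColumns {
  private final int rowCount;
  // Always-present column: one slot per row.
  private final int[] stopSequence;
  // Optional column: stays null (no per-row memory) until a value is written.
  private double[] shapeDistTraveled;

  StopTimeColumns(int rowCount) {
    this.rowCount = rowCount;
    this.stopSequence = new int[rowCount];
  }

  void setStopSequence(int row, int value) {
    stopSequence[row] = value;
  }

  void setShapeDistTraveled(int row, double value) {
    if (shapeDistTraveled == null) {
      // Allocate the column lazily, on the first write.
      shapeDistTraveled = new double[rowCount];
    }
    shapeDistTraveled[row] = value;
  }

  int stopSequence(int row) {
    return stopSequence[row];
  }

  boolean hasShapeDistTraveled() {
    return shapeDistTraveled != null;
  }

  double shapeDistTraveled(int row) {
    // Fall back to 0.0 when the column was never allocated.
    return shapeDistTraveled == null ? 0.0 : shapeDistTraveled[row];
  }
}
```

In a row-based layout every entity object carries a field for every column, whether or not the feed provides it. With per-column arrays, a column that never appears in the file costs a single null reference for the whole table.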

This PR is not yet ready for review but is meant to show what the implementation might look like.

See the implementation report for details on memory savings and performance.

Please make sure these boxes are checked before submitting your pull request - thanks!

Run the unit tests with gradle test to make sure you didn't break anything
Add or update any needed documentation to the repo
Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
Linked all relevant issues
Include screenshot(s) showing how this pull request works and fixes the issue(s)


CLAassistant · 2024-05-06T17:00:29Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

github-actions · 2024-05-06T17:48:49Z

✅ Rule acceptance tests passed.
New Errors: 1 out of 1520 datasets (~0%) are invalid due to code change, which is less than the provided threshold of 1%.
Dropped Errors: 2 out of 1520 datasets (~0%) are invalid due to code change, which is less than the provided threshold of 1%.
New Warnings: 1 out of 1520 datasets (~0%) are invalid due to code change, which is less than the provided threshold of 1%.
Dropped Warnings: 0 out of 1520 datasets (~0%) are invalid due to code change, which is less than the provided threshold of 1%.
0 out of 1520 sources (~0 %) are corrupted.
Commit: 337aa15
Download the full acceptance test report here (report will disappear after 90 days).

jcpitre · 2024-06-03T16:15:04Z

Impressive work.
In general I am a bit concerned with the added complexity vs memory savings.

jcpitre · 2024-06-03T16:26:09Z

main/src/main/java/org/mobilitydata/gtfsvalidator/table/GtfsStopTimeSchema.java

  GtfsContinuousPickupDropOff continuousPickup();

  @DefaultValue("1")
+  @UnusedValue
  GtfsContinuousPickupDropOff continuousDropOff();

  @NonNegative


There was a problem with one of the datasets I used for testing (https://storage.googleapis.com/storage/v1/b/mdb-latest/o/ar-cordoba-mar-chiquita-srl-gtfs-1146.zip?alt=media).
In this dataset, shape_dist_traveled was empty except for one row that contained a 0.
Because of this, a column was created with all values set to 0.
Maybe we can leverage the DefaultValue annotation and create a column only if a value in the file differs from the default?
The existence of the column also produced a bunch of decreasing_or_equal_stop_time_distance notices, because the distance never increased: every value was 0. This would not have happened if hasShapeDistTraveled() had returned false, but it returned true.
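
A rough sketch of that suggestion, assuming an int-valued column with a known default. `LazyIntColumn` is a hypothetical name, not a class in this PR, and reading the default from the DefaultValue annotation is left out.

```java
import java.util.Arrays;

/** Hypothetical int column that is materialized only when a non-default value appears. */
final class LazyIntColumn {
  private final int rowCount;
  private final int defaultValue;
  private int[] values; // null until a non-default value is stored

  LazyIntColumn(int rowCount, int defaultValue) {
    this.rowCount = rowCount;
    this.defaultValue = defaultValue;
  }

  void set(int row, int value) {
    if (values == null) {
      if (value == defaultValue) {
        return; // writing the default to an unmaterialized column is a no-op
      }
      values = new int[rowCount];
      Arrays.fill(values, defaultValue);
    }
    values[row] = value;
  }

  int get(int row) {
    return values == null ? defaultValue : values[row];
  }

  /** Plays the role of a hasXxx() check in this sketch. */
  boolean isMaterialized() {
    return values != null;
  }
}
```

One caveat with this approach: a value that is explicitly present in the file but equal to the default becomes indistinguishable from a missing value, which may or may not be acceptable for every field.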

davidgamez · 2024-05-28T20:03:00Z

extensions/build.gradle

+    implementation 'javax.inject:javax.inject:1'
+    implementation 'com.google.guava:guava:31.0.1-jre'
+    implementation 'com.google.code.findbugs:jsr305:3.0.2'
+    testImplementation 'org.junit.jupiter:junit-jupiter-api:5.8.1'


Not in the scope of this PR, but we should align all JUnit versions across the project for consistency.

davidgamez · 2024-05-28T22:03:42Z

core/src/main/java/org/mobilitydata/gtfsvalidator/columns/GtfsColumnBasedCollectionFactory.java

+    return new AllEntitiesListImpl();
+  }
+
+  private class AllEntitiesListImpl extends AbstractList<T> implements HasFactory<T> {


Currently, AnyTableLoader doesn't add entities with parsing errors, so the row index no longer matches after the first parsing error. As a result, all rows after the first error are not loaded and are ignored by the single- and multiple-file validators. A potential fix is to add all entities, even the ones with unparsable values. This goes against the idea of saving space on unused data, and validators would then need to be aware of the "unparsable" rows. Another possible fix is to use createSomeEntitiesList.
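
To make the index problem concrete, here is a minimal sketch of a list view that keeps an explicit mapping from list position to row index, so skipped rows do not shift everything after them. `ParsedRowsView` is a hypothetical class written for illustration; it is not the PR's createSomeEntitiesList, whose actual behavior is not shown here.

```java
import java.util.AbstractList;
import java.util.List;

/** Hypothetical view over one column that exposes only rows that parsed successfully. */
final class ParsedRowsView extends AbstractList<String> {
  private final String[] column;  // one slot per CSV row, including unparsable rows
  private final int[] parsedRows; // row indices of successfully parsed rows, in order

  ParsedRowsView(String[] column, List<Integer> parsedRowIndices) {
    this.column = column;
    this.parsedRows = parsedRowIndices.stream().mapToInt(Integer::intValue).toArray();
  }

  @Override
  public String get(int i) {
    // Translate the list position into the underlying row index.
    return column[parsedRows[i]];
  }

  @Override
  public int size() {
    return parsedRows.length;
  }
}
```

The cost is one extra int per parsed row, which is small compared with storing a full entity for every row, parsable or not.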

davidgamez · 2024-06-03T16:11:10Z

core/src/main/java/org/mobilitydata/gtfsvalidator/table/AnyTableLoader.java

+            (useColumnBasedStorage
+                ? columnDescriptor.columnBasedEntityBuilderSetter()
+                : columnDescriptor.entityBuilderSetter());
+    if (useColumnBasedStorage && columnDescriptor.unusedValue()) {


[question]: To support future extensions that use unused fields, how can we dynamically override this behavior and load unused columns?
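
One possible shape for such an override, as a sketch only: a set of column names that must always be loaded, consulted before a column is skipped. `ColumnLoadPolicy`, `shouldSkip`, and the idea of passing in column names are all assumptions for illustration; none of them exist in the PR.

```java
import java.util.Set;

/** Hypothetical policy deciding whether a column marked as unused may be skipped. */
final class ColumnLoadPolicy {
  private final Set<String> forceLoaded;

  ColumnLoadPolicy(Set<String> forceLoaded) {
    this.forceLoaded = Set.copyOf(forceLoaded);
  }

  /**
   * Mirrors the condition in the hunk above (column-based storage enabled and the
   * column marked unused), with an extra escape hatch for force-loaded columns.
   */
  boolean shouldSkip(String columnName, boolean useColumnBasedStorage, boolean markedUnused) {
    return useColumnBasedStorage && markedUnused && !forceLoaded.contains(columnName);
  }
}
```

An extension could then register the columns it needs, for example `new ColumnLoadPolicy(Set.of("continuous_drop_off"))`, without touching the schema annotations.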

bdferris added 3 commits May 5, 2024 22:40

Column-based data representation.

92d2fcd

Column-based data representation.

9c32bab

Merge branch 'issue/1358/memory' of https://github.com/MobilityData/g…

2349000

…tfs-validator into issue/1358/memory

emmambd mentioned this pull request May 23, 2024

Investigating issues with parsing Flex feeds #1767

Closed

jcpitre requested review from jcpitre and davidgamez May 27, 2024 16:05

jcpitre reviewed Jun 3, 2024


davidgamez reviewed Jun 3, 2024


jcpitre mentioned this pull request Sep 12, 2024

Optimisation: Do not run validators on columns that are not present #1839

Closed
