Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Column-based storage for GTFS entities #1747

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

bdferris-v2
Copy link
Collaborator

Per discussion in #1358 and GTFS Validator - Memory Reduction, this PR implements support for column-based storage of GTFS entities. This technique supports reduction in the validators memory footprint by avoiding the memory usage of unused columns.

This PR is not yet ready for review but is meant to show what the implementation might look like.

See the implementation report for details on memory savings and performance.

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with gradle test to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification(https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@CLAassistant
Copy link

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Copy link
Contributor

github-actions bot commented May 6, 2024

✅ Rule acceptance tests passed.
New Errors: 1 out of 1520 datasets (~0%) are invalid due to code change, which is less than the provided threshold of 1%.
Dropped Errors: 2 out of 1520 datasets (~0%) are invalid due to code change, which is less than the provided threshold of 1%.
New Warnings: 1 out of 1520 datasets (~0%) are invalid due to code change, which is less than the provided threshold of 1%.
Dropped Warnings: 0 out of 1520 datasets (~0%) are invalid due to code change, which is less than the provided threshold of 1%.
0 out of 1520 sources (~0 %) are corrupted.
Commit: 337aa15
Download the full acceptance test report here (report will disappear after 90 days).
✅ Rule acceptance tests passed.

@jcpitre
Copy link
Contributor

jcpitre commented Jun 3, 2024

Impressive work.
In general I am a bit concerned with the added complexity vs memory savings.

GtfsContinuousPickupDropOff continuousPickup();

@DefaultValue("1")
@UnusedValue
GtfsContinuousPickupDropOff continuousDropOff();

@NonNegative
Copy link
Contributor

@jcpitre jcpitre Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a problem with one of the datasets I used for testing (https://storage.googleapis.com/storage/v1/b/mdb-latest/o/ar-cordoba-mar-chiquita-srl-gtfs-1146.zip?alt=media).
In this data set, the shape_dist_traveled was empty except for one row that had a 0 in it.
Because of this a column was created with all values 0.
Maybe we can leverage the DefaultValue annotation and create a column only if a value in the file is different from the default?
Also the fact that the column existed created a bunch of decreasing_or_equal_stop_time_distance notices, because there was no increase of the distance. They were all 0. This would not have happened if hasShapeDistTraveled() returned false, but it returned true.

implementation 'javax.inject:javax.inject:1'
implementation 'com.google.guava:guava:31.0.1-jre'
implementation 'com.google.code.findbugs:jsr305:3.0.2'
testImplementation 'org.junit.jupiter:junit-jupiter-api:5.8.1'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not in the scope of this PR, we should align all junit version in the project for consistency

return new AllEntitiesListImpl();
}

private class AllEntitiesListImpl extends AbstractList<T> implements HasFactory<T> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, AnyTableLoader doesn't add entities with parsing errors, meaning that the row index won't match after the first parsing errors see. This will cause all rows after the first error to be not loaded and ignored by the single and multiple file validators. A potential fix can be adding all entities, even the ones with unparsable errors. This goes against the idea of saving space on unused data; in this case, validators need to be aware of the "unparsable" row. Another possible fix is to use createSomeEntitiesList.

(useColumnBasedStorage
? columnDescriptor.columnBasedEntityBuilderSetter()
: columnDescriptor.entityBuilderSetter());
if (useColumnBasedStorage && columnDescriptor.unusedValue()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[question]: To support future extensions that uses unused fields, how can we have a dynamic override of this behavior and load unused columns?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants