Investigating issues with parsing Flex feeds #1767

Closed
emmambd opened this issue May 21, 2024 · 6 comments
Labels
flex Rules and rule changes related to GTFS-Flex.

@emmambd
Contributor

emmambd commented May 21, 2024

What's the problem?

Of the 4 Flex feeds that we have for testing #1721, 3 hit parsing issues when run through the validator.

To better understand this problem, I looked at 51 Flex v2 feeds, including ones that don't yet conform to the official spec. 50% fail to fully parse, and all but 1 of the failing feeds have an issue parsing stop_times.txt.

Outstanding questions

  • How common are stop_times.txt parsing failures with the validator now, just looking at the GTFS Schedule feeds currently in the Mobility Database?
  • How big are the stop_times.txt files for the feeds that fail?
    1KB, 12KB, 2.4MB
  • What changes would we need to make to ensure that Flex feeds can be validated successfully? Are there incremental changes that are feasible, or do we need a major infrastructure change, e.g., the one suggested in "feat: Column-based storage for GTFS entities" #1747?
    No major infra change needed. We need to stop errors like UNKNOWN_COLUMN from resulting in UNPARSABLE_ROWS. However, this might not be necessary because we are adding the Flex rules. See "Explore running validation on feeds with unknown_column notices" #1770.

This is a critical set of questions to answer before we pursue more work on #1721

@emmambd emmambd added the flex Rules and rule changes related to GTFS-Flex. label May 21, 2024
@qcdyx
Contributor

qcdyx commented May 22, 2024

In the code, we hit UNPARSABLE_ROWS because of validation errors raised while processing the rows of GTFS files.
For example, stop_times.txt had errors such as unknown_column and missing_required_field.
For agency.txt, there were invalid_timezone and invalid_url errors.

@emmambd emmambd added this to the Flex: modifying pre-existing rules milestone May 23, 2024
@qcdyx
Contributor

qcdyx commented May 27, 2024

@qcdyx
Contributor

qcdyx commented May 29, 2024

Based on the investigation in #1770, it's the missing_required_field, invalid_url, and invalid_timezone notices that cause validation errors and make a GTFS file unparsable.
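
To make that finding concrete, here is a minimal Python sketch of the decision as described in this thread. It is not the validator's actual Java code: `TableStatus`, `PARSABLE`, `BLOCKING_CODES`, and `table_status` are hypothetical names, and the set of blocking notice codes is simply the three named above.

```python
from enum import Enum


class TableStatus(Enum):
    # Hypothetical statuses; only UNPARSABLE_ROWS is named in this thread.
    PARSABLE = "PARSABLE"
    UNPARSABLE_ROWS = "UNPARSABLE_ROWS"


# Assumption: per the findings above, these notice codes make rows unparsable.
# unknown_column is deliberately absent from the set.
BLOCKING_CODES = {"missing_required_field", "invalid_url", "invalid_timezone"}


def table_status(row_notice_codes):
    """Return UNPARSABLE_ROWS if any row-level notice code is blocking."""
    if any(code in BLOCKING_CODES for code in row_notice_codes):
        return TableStatus.UNPARSABLE_ROWS
    return TableStatus.PARSABLE


# A file with only unknown_column notices stays parsable in this model.
print(table_status(["unknown_column"]))
# Adding missing_required_field flips the status.
print(table_status(["unknown_column", "missing_required_field"]))
```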

@emmambd
Contributor Author

emmambd commented May 29, 2024

Moving @qcdyx findings from #1770 here:

It's the missing_required_field notice for 'stop_id' that leads to a validation error, which gives stop_times.txt a status of UNPARSABLE_ROWS.
Added a column that triggers UNKNOWN_COLUMN to stop_times.txt of the browncounty-mn-us--flex-v2 dataset and ran the GTFS validator: no UNPARSABLE_ROWS for stop_times.txt.

We're only planning to modify the logic of missing_required_field for Flex feeds, not invalid_url or invalid_timezone. I think we'd proceed by continuing the work in #1721 and see how often these feeds fail to parse files by completing #1775 cc @davidgamez @qcdyx
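
As a rough illustration of what relaxing missing_required_field for Flex could look like, the sketch below checks stop_times.txt rows for a missing stop_id. It assumes the relaxed rule accepts a row that carries location_id or location_group_id instead of stop_id; the thread does not spell out the exact change, and the function name and sample data are made up for the example.

```python
import csv
import io


def rows_missing_stop_reference(stop_times_csv, flex_aware):
    """Return the CSV line numbers of rows that would count as missing stop_id.

    With flex_aware=False every row needs stop_id. With flex_aware=True a row
    is also accepted when location_id or location_group_id is filled in
    (an assumption about the relaxed rule, not the validator's actual code).
    """
    reader = csv.DictReader(io.StringIO(stop_times_csv))
    missing = []
    for line_no, row in enumerate(reader, start=2):  # the header is line 1
        has_stop = bool((row.get("stop_id") or "").strip())
        has_flex = any(
            (row.get(col) or "").strip()
            for col in ("location_id", "location_group_id")
        )
        if has_stop or (flex_aware and has_flex):
            continue
        missing.append(line_no)
    return missing


# Hypothetical sample data, not taken from any real feed.
SAMPLE = (
    "trip_id,stop_id,location_id,stop_sequence\n"
    "t1,s1,,1\n"
    "t1,,zone_a,2\n"
    "t1,,,3\n"
)

print(rows_missing_stop_reference(SAMPLE, flex_aware=False))  # [3, 4]
print(rows_missing_stop_reference(SAMPLE, flex_aware=True))   # [4]
```

Under the strict rule the Flex row on line 3 is flagged; under the relaxed rule only the row with no stop reference at all remains.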

@davidgamez
Member

Moving @qcdyx findings from #1770 here:

It's the missing_required_field notice for 'stop_id' that leads to a validation error, which gives stop_times.txt a status of UNPARSABLE_ROWS.
Added a column that triggers UNKNOWN_COLUMN to stop_times.txt of the browncounty-mn-us--flex-v2 dataset and ran the GTFS validator: no UNPARSABLE_ROWS for stop_times.txt.

We're only planning to modify the logic of missing_required_field for Flex feeds, not invalid_url or invalid_timezone. I think we'd proceed by continuing the work in #1721 and see how often these feeds fail to parse by completing #1775 cc @davidgamez @qcdyx

To clarify: when an unparsable error is triggered, it only affects the single-file validators for the file in question. In this case, only the agency.txt validators are affected.
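
A small Python sketch of that scoping, modelling only the behaviour described in this comment and not the validator's real Java implementation. The function and the validator names are hypothetical.

```python
def validators_to_run(single_file_validators, multi_file_validators, unparsable_files):
    """Pick the validators that still run when some files are unparsable.

    single_file_validators: list of (validator_name, target_file) pairs.
    multi_file_validators: list of validator names, left untouched here,
    following the statement that only single-file validators are affected.
    """
    runnable = [
        name
        for name, target in single_file_validators
        if target not in unparsable_files
    ]
    return runnable + list(multi_file_validators)


# Hypothetical validator names, for illustration only.
single = [
    ("AgencyExampleValidator", "agency.txt"),
    ("StopTimeExampleValidator", "stop_times.txt"),
]
multi = ["CrossFileExampleValidator"]

# agency.txt is unparsable, so only its single-file validator is skipped.
print(validators_to_run(single, multi, {"agency.txt"}))
```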

@emmambd emmambd modified the milestones: Flex: modifying pre-existing rules, 6.0 Validator Release May 30, 2024
@emmambd
Contributor Author

emmambd commented Sep 10, 2024

Based on the findings from #1749, it looks like this is no longer an issue now that missing_required_field has been modified. cc @jcpitre

@emmambd emmambd closed this as completed Sep 10, 2024