In some cases, it's desirable to handle bulk STAC to GeoParquet conversion in a foolproof way with minimal user input. In #49, I presented an option to handle partial schemas, where the user defines only which STAC extensions they use, but that is still not totally foolproof. In particular, it can fail when a single collection mixes multiple extension versions and/or multiple core STAC versions. Additionally, that approach requires more ongoing maintenance to keep the partial schemas up to date as extensions release new versions.
It may still be desirable to finish and merge #49, but I think it makes sense to at least include this "exhaustive" inference as an option, because it's the most foolproof approach (though time-consuming). In this PR, we implement a full scan over the input data, which infers a single unified Arrow schema before converting any data to Parquet.
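Conceptually, the exhaustive approach amounts to letting Arrow infer a schema per chunk of Items and then unifying the results. A minimal sketch, assuming pyarrow (the function name here is hypothetical, not this PR's code):

```python
import json
import pyarrow as pa

def infer_unified_schema(path: str, chunk_size: int = 1000) -> pa.Schema:
    """Scan an entire ndjson file, inferring and unifying per-chunk schemas."""
    schemas = []
    chunk: list[dict] = []
    with open(path) as f:
        for line in f:
            chunk.append(json.loads(line))
            if len(chunk) >= chunk_size:
                # Let Arrow infer a schema for this chunk of STAC Items
                schemas.append(pa.Table.from_pylist(chunk).schema)
                chunk = []
    if chunk:
        schemas.append(pa.Table.from_pylist(chunk).schema)
    # Merge the per-chunk schemas; note that unify_schemas raises if two
    # chunks inferred incompatible types for the same field
    return pa.unify_schemas(schemas)
```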
I tested this with the horrible data I got stuck with last fall at the STAC Sprint: 10,000 STAC Items fetched from AWS's Sentinel 2 STAC collection, with a variety of STAC versions and different assets in each item. Even with this horrid input, which produces a pretty insane schema (in this case with 52 separate asset keys 😱), it works without any user input! (The pyarrow text repr of this schema is 90,000 characters, which means it's too big to paste into a comment 😂)
Change list
- New `InferredSchema` class. This is intended to be used to iteratively build up a schema while scanning input JSON data.
- Allow `InferredSchema` input to the `schema` argument.
- Refactor `parse_stac_ndjson_to_arrow` to return a Generator of arrow RecordBatches.
- Allow `parse_stac_ndjson_to_arrow` to take in one or more paths to newline-delimited JSON files.
- New `parse_stac_ndjson_to_parquet`, which wraps `parse_stac_ndjson_to_arrow` to construct a single GeoParquet file from input ndjson data (see the usage sketch after this list). This streams data, and so does not hold all input data in memory at once. In the case where the user does not pass in a schema, it first scans the entire input to infer a unified schema, then creates the record batch generator.
- Previously only the top-level geometry was converted to WKB, but the projection extension's `proj:geometry` key is also GeoJSON, and should also be WKB for the same reasons. This fixes that for both conversion to arrow and from arrow back to JSON (a minimal conversion sketch follows below).
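To give a sense of how these pieces fit together, here is a hypothetical usage sketch. The import path, the file names, and the idea of pairing the generator with a plain ParquetWriter are assumptions for illustration, not confirmed API:

```python
import pyarrow.parquet as pq

# Import path is an assumption for illustration
from stac_geoparquet import parse_stac_ndjson_to_arrow, parse_stac_ndjson_to_parquet

# One-shot: with no schema passed, this first scans the entire input to infer
# a unified schema, then streams record batches into a single GeoParquet file
parse_stac_ndjson_to_parquet("sentinel-2-items.ndjson", "sentinel-2-items.parquet")

# Or consume the RecordBatch generator directly with your own writer (note: a
# plain ParquetWriter won't write GeoParquet metadata; this only shows the
# streaming shape)
batches = parse_stac_ndjson_to_arrow(["part-1.ndjson", "part-2.ndjson"])
first = next(batches)
with pq.ParquetWriter("out.parquet", first.schema) as writer:
    writer.write_batch(first)
    for batch in batches:
        writer.write_batch(batch)
```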
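And for the last item, the `proj:geometry` fix boils down to applying the same GeoJSON-to-WKB encoding already used for the top-level geometry. A minimal sketch assuming shapely (the helper name is hypothetical):

```python
from shapely.geometry import shape

# Hypothetical helper: encode a GeoJSON geometry dict as WKB bytes, the same
# treatment the top-level geometry gets
def geojson_to_wkb(geom: dict) -> bytes:
    return shape(geom).wkb
```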