
Use iterables / and record batches in to_arrow #53

Merged
merged 12 commits into stac-utils:main
May 22, 2024

Conversation

@bitner (Contributor) commented May 15, 2024

  • Use orjson rather than json for speed.
  • Allow items to be passed in as an iterable
  • Use RecordBatches for processing
    • Since we are processing on record batches, don't use chunked arrays
  • Add a function for reading JSON that can read either ndjson or a JSON file whose items are either in a root list or a FeatureCollection (a rough sketch follows this list)
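For illustration only, here is a minimal sketch of the kind of reader described in the last bullet; the name read_json_chunked, the chunk_size default, and the error handling are assumptions, not this PR's actual code:

from typing import Any, Dict, Iterator, List

import orjson


def read_json_chunked(path: str, chunk_size: int = 8192) -> Iterator[List[Dict[str, Any]]]:
    # Read the whole file once; real code might stream instead.
    with open(path, "rb") as f:
        raw = f.read()

    try:
        parsed = orjson.loads(raw)
    except orjson.JSONDecodeError:
        # Not a single JSON document, so treat it as newline-delimited JSON.
        parsed = None

    if isinstance(parsed, list):
        items = parsed                      # root list of items
    elif isinstance(parsed, dict) and "features" in parsed:
        items = parsed["features"]          # FeatureCollection
    elif parsed is None:
        items = [orjson.loads(line) for line in raw.splitlines() if line.strip()]
    else:
        items = [parsed]                    # a single item object

    # Yield fixed-size chunks; each chunk can become one pyarrow RecordBatch.
    for i in range(0, len(items), chunk_size):
        yield items[i : i + chunk_size]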

@kylebarron (Collaborator) left a comment


This will have some merge conflicts with #50 but overall this looks great 🙏

@bitner (Contributor, Author) commented May 15, 2024

@kylebarron OK, merged in the changes from #50 and converted some more things to using batches. I did change up the full schema sniffing away from using the InferredSchema class. I think with using the parsers that return batches it is a bit more straightforward to just get the batches and check the schemas there, but am happy to revert back to using the class. Tests pass, but I haven't done a round to make sure there aren't any refactored functions that need to be removed or anything like that.
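For context, a rough sketch of the "check the schemas on the batches" approach described here, using assumed names rather than the PR's actual code: unify the schema across the record batches that the parsers return, instead of tracking it in a separate class.

import pyarrow as pa


def unify_batch_schemas(batches):
    # Fold each batch's schema into a single unified schema.
    schema = None
    for batch in batches:
        if schema is None:
            schema = batch.schema
        else:
            schema = pa.unify_schemas(
                [schema, batch.schema], promote_options="permissive"
            )
    return schema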

@bitner marked this pull request as ready for review on May 15, 2024 at 18:51
@bitner requested a review from kylebarron on May 15, 2024 at 18:51
@kylebarron (Collaborator) left a comment


A few comments around schemas, downcasting, and reading JSON

stac_geoparquet/arrow/_to_arrow.py (two resolved review comments; outdated)
    *,
    chunk_size: int = 8192,
    schema: Optional[Union[pa.Schema, InferredSchema]] = None,
@kylebarron (Collaborator) commented:

Why remove the InferredSchema support here? I had intended InferredSchema to act the same as a pa.Schema, but with stronger typing (e.g. InferredSchema is just a newtype around a schema)

stac_geoparquet/json_reader.py (resolved review comment; outdated)
@kylebarron (Collaborator) commented:

Quoting @bitner: "I did change up the full schema sniffing away from using the InferredSchema class"

Oh I see now you removed it entirely. I was thinking it would be useful to have some class where we can assign methods onto it. E.g. InferredSchema.to_stac_arrow_schema(), or some name like that, which would apply all the STAC-GeoParquet schema transformations, and manage the transition between the low-level schema applied to input and the Parquet schema used for saving output
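As an illustration of both InferredSchema comments above (a newtype around a schema, with methods hung onto it), here is a hedged sketch of what such a wrapper could look like; the method names and bodies are assumptions, not the library's actual implementation:

import pyarrow as pa


class InferredSchema:
    """A thin 'newtype' around pa.Schema that gives us a place to attach methods."""

    def __init__(self, inner: pa.Schema) -> None:
        self.inner = inner

    def update(self, batch: pa.RecordBatch) -> "InferredSchema":
        # Fold another batch's schema into the running inferred schema.
        self.inner = pa.unify_schemas(
            [self.inner, batch.schema], promote_options="permissive"
        )
        return self

    def to_stac_arrow_schema(self) -> pa.Schema:
        # Placeholder: this is where the STAC-GeoParquet schema transformations
        # described above (e.g. attaching GeoParquet metadata) would be applied.
        return self.inner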

@kylebarron (Collaborator) commented:
Add a limited scan of first N JSON items to infer schema

Comment on lines 38 to 54
if schema is None:
    unified_batches = []
    for batch in batches:
        if schema is None:
            schema = batch.schema
        else:
            schema = pa.unify_schemas(
                [schema, batch.schema], promote_options="permissive"
            )
        unified_batches.append(update_batch_schema(batch, schema))
    batches = unified_batches

assert schema is not None
schema = schema.with_metadata(_create_geoparquet_metadata())

with pq.ParquetWriter(output_path, schema, **kwargs) as writer:
    writer.write_batch(first_batch)
    for batch in batches_iter:
    for batch in batches:
@kylebarron (Collaborator) commented May 21, 2024:

Just for the record this doesn't work. You can't change the schema you pass to ParquetWriter without also changing the physical schema of the arrow data (I don't know how you do that, maybe with pyarrow.compute.cast or table.cast)

import pyarrow.parquet as pq
import pyarrow as pa

table = pa.table({"a": [1, 2, 3, 4]})
assert pa.types.is_int64(table.schema.field("a").type)

with pq.ParquetWriter("test.parquet", pa.schema([pa.field("a", pa.int32())])) as writer:
    writer.write_table(table)

gives


ValueError: Table schema does not match schema used to create file: 
table:
a: int64 vs. 
file:
a: int32

Separately, this code was already exhausting the iterator, so it wouldn't have had batches to actually write to parquet
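A small sketch along the lines this comment suggests (assuming Table.cast is acceptable here): cast the Arrow data itself so its physical schema matches the schema handed to ParquetWriter, rather than only swapping the schema object.

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2, 3, 4]})            # physical type is int64
target = pa.schema([pa.field("a", pa.int32())])  # schema we want in the file

# Downcast the data to match the target schema before writing.
casted = table.cast(target)

with pq.ParquetWriter("test.parquet", target) as writer:
    writer.write_table(casted)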

@kylebarron (Collaborator) commented:
Add a limited scan of first N JSON items to infer schema

The last commit added a limit parameter to parse_stac_ndjson_to_arrow
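For illustration, a hedged sketch of the limited-scan idea; the helper below is an assumption for this transcript, not the actual parse_stac_ndjson_to_arrow API:

from itertools import islice
from typing import Any, Dict, Iterable

import pyarrow as pa


def infer_schema_with_limit(items: Iterable[Dict[str, Any]], limit: int = 1000) -> pa.Schema:
    # Look at only the first `limit` items instead of scanning the whole input.
    sample = list(islice(items, limit))
    return pa.RecordBatch.from_pylist(sample).schema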

@kylebarron merged commit 22abad6 into stac-utils:main on May 22, 2024
1 check passed
@kylebarron changed the title from "User iterables / and record batches in to_arrow" to "Use iterables / and record batches in to_arrow" on Jun 25, 2024