
Write to Delta Lake #58

Merged — kylebarron merged 17 commits into main from kyle/delta-lake-interop on Jun 5, 2024
Conversation

kylebarron (Collaborator):
This PR adds a new function parse_stac_ndjson_to_delta_lake to convert a newline-delimited JSON source to a Delta Lake table. It is based on #57, so only look at the most recent commits; that PR should be merged first.

There's a complication here: Delta Lake refuses to write any column inferred with data type null, with:

    _internal.SchemaMismatchError: Invalid data type for Delta Lake: Null

This is a problem because if a JSON key is null for all items in a STAC Collection, it gets inferred as an Arrow null type. For example, the 3dep-lidar-copc collection in the tests has start_datetime and end_datetime fields, so per the spec its datetime is always null. This means we cannot write this collection to Delta Lake with automatic schema inference alone.
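As a minimal sketch of the failure (the table contents here are illustrative, not the actual test fixtures):

    import pyarrow as pa
    from deltalake import write_deltalake

    # A column that is null for every item is inferred as Arrow type "null"
    table = pa.table({
        "id": pa.array(["item-1", "item-2"]),
        "datetime": pa.array([None, None], type=pa.null()),
    })

    # Raises _internal.SchemaMismatchError: Invalid data type for Delta Lake: Null
    write_deltalake("./3dep-lidar-copc.delta", table)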

In the latest commit I started to implement some manual schema modifications for datetime and proj:epsg, which fixed the error for 3dep-lidar-copc. But 3dep-lidar-dsm has more fields that are inferred as null. In particular the schema paths:

    properties.raster:bands.pdal_pipeline.[].filename
    properties.raster:bands.pdal_pipeline.[].resolution

are both null. It's not ideal to hard-code manual overrides for every extension, so we should discuss how to handle this.

Possible options:

  • Remove null fields from the JSON before reading. This would make the data easier to fit into an Arrow schema, but it would lose meaning: e.g. proj:epsg set to null has a specific semantic meaning that we don't want to discard.
  • Hard-code manual schema corrections for well-known fields like datetime and proj:epsg (a sketch of this follows the list).
  • Let the write to Delta Lake fail with an error and require the user to pass in a schema explicitly.
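As a sketch of the second option, one could patch null-inferred fields before writing. This is a minimal version assuming pyarrow, with an illustrative (not exhaustive) mapping of well-known fields:

    import pyarrow as pa

    # Illustrative overrides for well-known STAC fields that are often all-null
    KNOWN_TYPES = {
        "datetime": pa.timestamp("us", tz="UTC"),
        "proj:epsg": pa.int64(),
    }

    def patch_null_fields(schema: pa.Schema) -> pa.Schema:
        """Replace top-level null-typed fields with a known type where we have one.

        Nested paths like properties.raster:bands.pdal_pipeline.[].filename would
        need a recursive walk over struct and list types.
        """
        return pa.schema(
            pa.field(f.name, KNOWN_TYPES[f.name])
            if pa.types.is_null(f.type) and f.name in KNOWN_TYPES
            else f
            for f in schema
        )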

    from deltalake import DeltaTable


    def parse_stac_ndjson_to_delta_lake(
Contributor (review comment on the code above):

Should we make a generic arrow_batches_to_delta_lake that takes in batches, and then make parse_stac_ndjson_to_delta_lake a wrapper around that?

kylebarron (Collaborator, Author):

Well, that generic arrow_batches_to_delta_lake function is literally just write_deltalake: you can pass an Iterable[pa.RecordBatch] directly to it. You just also need to know the schema separately. Are you suggesting a helper that takes the first batch and passes its schema to write_deltalake?
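A sketch of that helper, under the assumption that write_deltalake accepts an iterable of record batches together with an explicit schema (the function name here just mirrors the suggestion above):

    from itertools import chain
    from typing import Iterable

    import pyarrow as pa
    from deltalake import write_deltalake

    def arrow_batches_to_delta_lake(
        table_uri: str, batches: Iterable[pa.RecordBatch]
    ) -> None:
        # Peek at the first batch to learn the schema, then stream everything
        # (first batch included) to write_deltalake.
        it = iter(batches)
        first = next(it)  # raises StopIteration if the iterable is empty
        write_deltalake(table_uri, chain([first], it), schema=first.schema)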

kylebarron (Collaborator, Author):
This PR should be ready to go. It doesn't yet solve the null type issue; for now, those cases require the user to handle schema resolution manually.

In a follow-up PR we may want to consider defaulting null types to string, but that may complicate schema evolution if later data has non-null values for those STAC keys.
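A minimal sketch of that string fallback, assuming pyarrow (top-level fields only; nested fields would need recursion):

    import pyarrow as pa

    def default_nulls_to_string(schema: pa.Schema) -> pa.Schema:
        # Fall back to string for any field inferred as null; later non-null
        # data would then have to be cast (or evolve the schema) to match.
        return pa.schema(
            pa.field(f.name, pa.string()) if pa.types.is_null(f.type) else f
            for f in schema
        )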

kylebarron merged commit e43398b into main on Jun 5, 2024 (1 check passed).
TomAugspurger deleted the kyle/delta-lake-interop branch on June 7, 2024.