
Write to Delta Lake #58

Merged — kylebarron merged 17 commits into main from kyle/delta-lake-interop on Jun 5, 2024
Conversation

kylebarron (Collaborator):
This PR adds a new function parse_stac_ndjson_to_delta_lake to convert a newline-delimited JSON source to a Delta Lake table. It is based on #57, so only look at the most recent commits; that PR should be merged first.

There's a complication here: Delta Lake refuses to write any column inferred with data type null, with:

    _internal.SchemaMismatchError: Invalid data type for Delta Lake: Null

This is a problem because if a JSON key is null for all items in a STAC Collection, it gets inferred as an Arrow null type. For example, the 3dep-lidar-copc collection in the tests has start_datetime and end_datetime fields, so per the spec its datetime is always null. This means we cannot write this collection to Delta Lake with automatic schema inference alone.
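As a minimal sketch of the failure (the table contents here are illustrative, not the actual test fixtures):

    import pyarrow as pa
    from deltalake import write_deltalake

    # A column that is null for every item is inferred as Arrow type "null"
    table = pa.table({
        "id": pa.array(["item-1", "item-2"]),
        "datetime": pa.array([None, None], type=pa.null()),
    })

    # Raises _internal.SchemaMismatchError: Invalid data type for Delta Lake: Null
    write_deltalake("./3dep-lidar-copc.delta", table)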

In the latest commit I started to implement some manual schema modifications for datetime and proj:epsg, which fixed the error for 3dep-lidar-copc. But 3dep-lidar-dsm has more fields that are inferred as null. In particular the schema paths:

    properties.raster:bands.pdal_pipeline.[].filename
    properties.raster:bands.pdal_pipeline.[].resolution

are both null. It's not ideal to hard-code manual overrides for every extension, so we should discuss how to handle this.

Possible options:

  • Remove null fields from the JSON before reading. This would make the data easier to fit into an Arrow schema, but it would lose meaning: e.g. proj:epsg set to null has a specific semantic meaning that we don't want to discard.
  • Hard-code manual schema corrections for well-known fields like datetime and proj:epsg (a sketch of this follows the list).
  • Let the write to Delta Lake fail with an error and require the user to pass in a schema explicitly.
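As a sketch of the second option, one could patch null-inferred fields before writing. This is a minimal version assuming pyarrow, with an illustrative (not exhaustive) mapping of well-known fields:

    import pyarrow as pa

    # Illustrative overrides for well-known STAC fields that are often all-null
    KNOWN_TYPES = {
        "datetime": pa.timestamp("us", tz="UTC"),
        "proj:epsg": pa.int64(),
    }

    def patch_null_fields(schema: pa.Schema) -> pa.Schema:
        """Replace top-level null-typed fields with a known type where we have one.

        Nested paths like properties.raster:bands.pdal_pipeline.[].filename would
        need a recursive walk over struct and list types.
        """
        return pa.schema(
            pa.field(f.name, KNOWN_TYPES[f.name])
            if pa.types.is_null(f.type) and f.name in KNOWN_TYPES
            else f
            for f in schema
        )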

    from deltalake import DeltaTable


    def parse_stac_ndjson_to_delta_lake(
Contributor (review comment on the code above):

Should we make a generic arrow_batches_to_delta_lake that takes in batches, and then make parse_stac_ndjson_to_delta_lake a wrapper around that?

kylebarron (Collaborator, Author):

Well, that generic arrow_batches_to_delta_lake function is literally just write_deltalake: you can pass an Iterable[pa.RecordBatch] directly to it. You just also need to know the schema separately. Are you suggesting a helper that takes the first batch and passes its schema to write_deltalake?
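A sketch of that helper, under the assumption that write_deltalake accepts an iterable of record batches together with an explicit schema (the function name here just mirrors the suggestion above):

    from itertools import chain
    from typing import Iterable

    import pyarrow as pa
    from deltalake import write_deltalake

    def arrow_batches_to_delta_lake(
        table_uri: str, batches: Iterable[pa.RecordBatch]
    ) -> None:
        # Peek at the first batch to learn the schema, then stream everything
        # (first batch included) to write_deltalake.
        it = iter(batches)
        first = next(it)  # raises StopIteration if the iterable is empty
        write_deltalake(table_uri, chain([first], it), schema=first.schema)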

kylebarron (Collaborator, Author):
This PR should be ready to go. It doesn't yet solve the null type issue; for now, those cases require the user to handle schema resolution manually.

In a follow-up PR we may want to consider defaulting null types to string, but that may complicate schema evolution if later data has non-null values for those STAC keys.
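A minimal sketch of that string fallback, assuming pyarrow (top-level fields only; nested fields would need recursion):

    import pyarrow as pa

    def default_nulls_to_string(schema: pa.Schema) -> pa.Schema:
        # Fall back to string for any field inferred as null; later non-null
        # data would then have to be cast (or evolve the schema) to match.
        return pa.schema(
            pa.field(f.name, pa.string()) if pa.types.is_null(f.type) else f
            for f in schema
        )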

kylebarron merged commit e43398b into main on Jun 5, 2024 (1 check passed).
TomAugspurger deleted the kyle/delta-lake-interop branch on June 7, 2024.