Write to Delta Lake #58
from deltalake import DeltaTable


def parse_stac_ndjson_to_delta_lake(
Should we make a generic `arrow_batches_to_delta_lake` that takes in batches, and then make `parse_stac_ndjson_to_delta_lake` a wrapper around it?
Well, that generic `arrow_batches_to_delta_lake` function is literally just `write_deltalake`. You can pass an `Iterable[pa.RecordBatch]` directly to it. (You just also need to know the schema separately. Are you suggesting a helper that takes the first batch and passes its schema to `write_deltalake`?)
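For reference, the "take the first batch and pass its schema" helper mentioned above could look roughly like the sketch below. The name `peek_schema` is hypothetical and not part of this PR; the function only assumes each batch exposes a `.schema` attribute, as `pa.RecordBatch` does. The commented-out `write_deltalake` call assumes a `deltalake` version that accepts an iterable of record batches together with `schema=`, as described in the comment above.

```python
from itertools import chain
from typing import Any, Iterable, Iterator, Tuple


def peek_schema(batches: Iterable[Any]) -> Tuple[Any, Iterator[Any]]:
    """Return the first batch's schema and an iterator that still yields every batch.

    Hypothetical helper: works with any object exposing `.schema`,
    for example `pyarrow.RecordBatch`.
    """
    iterator = iter(batches)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("cannot infer a schema from an empty stream of batches")
    # Put the consumed batch back in front of the remaining ones.
    return first.schema, chain([first], iterator)


# Intended usage (assumption: this deltalake version accepts `schema=` here):
#   schema, batches = peek_schema(batches)
#   write_deltalake(table_uri, batches, schema=schema)
```

Note that this only works if every later batch really has the same schema as the first one, which is exactly where the null-inference problem discussed in this PR can bite.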
This PR should be ready to go, where we don't yet solve the … In a follow-up PR we may want to consider defaulting null types to string, but that may complicate schema evolution if later data has non-null values for STAC keys.
This PR adds a new function `parse_stac_ndjson_to_delta_lake` to convert a JSON source to a Delta Lake table. It is based on #57, so only look at the most recent commits, and that PR should be merged first.

There's a complication here: Delta Lake refuses to write any column inferred with data type `null`, with: …

This is a problem because if all items in a STAC Collection have a `null` JSON key, it gets inferred as an Arrow `null` type. For example, the `3dep-lidar-copc` collection in the tests has `start_datetime` and `end_datetime` fields, and so according to the spec, `datetime` is always `null`. This means we cannot write this collection to Delta Lake solely with automatic schema inference.

In the latest commit I started to implement some manual schema modifications for `datetime` and `proj:epsg`, which fixed the error for `3dep-lidar-copc`. But `3dep-lidar-dsm` has more fields that are inferred as null. In particular the schema paths: … are both `null`. It's not ideal to hard-code manual overrides for every extension, so we should discuss how to handle this.

Possible options:

- … `null` fields from the JSON before reading. This would be easier to fit into an Arrow schema, but would lose meaning. E.g. `proj:epsg` set to null has specific semantic meaning that we don't want to lose.
- … `datetime`, `proj:epsg`, etc.
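To see which keys would hit the `null` inference problem for a given collection before attempting a write, a small diagnostic along these lines might help. This is only a sketch and not part of the PR: `always_null_paths` is a hypothetical name, it works on already-parsed JSON items rather than on the Arrow schema, and it does not descend into lists.

```python
from typing import Any, Dict, Iterable, Set


def _walk(obj: Dict[str, Any], prefix: str, null: Set[str], non_null: Set[str]) -> None:
    """Record every dotted key path in `obj` as null or non-null."""
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if value is None:
            null.add(path)
        else:
            non_null.add(path)
            if isinstance(value, dict):
                # Recurse into nested objects such as `properties` or `assets`.
                _walk(value, path, null, non_null)


def always_null_paths(items: Iterable[Dict[str, Any]]) -> Set[str]:
    """Return the key paths that are null in every item where they appear."""
    null: Set[str] = set()
    non_null: Set[str] = set()
    for item in items:
        _walk(item, "", null, non_null)
    return null - non_null


# Made-up items for illustration only.
items = [
    {"properties": {"datetime": None, "start_datetime": "2020-01-01T00:00:00Z", "proj:epsg": None}},
    {"properties": {"datetime": None, "start_datetime": "2020-02-01T00:00:00Z", "proj:epsg": 4326}},
]
print(always_null_paths(items))
```

In this made-up input only `properties.datetime` is null everywhere, so it is the only path reported; `proj:epsg` has a non-null value in the second item and would be inferred with a concrete type.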