Add usage file
kylebarron committed Jun 24, 2024
1 parent cf20e98 commit 15b1e31
Showing 4 changed files with 78 additions and 29 deletions.
27 changes: 2 additions & 25 deletions README.md
@@ -12,35 +12,12 @@ For analytic questions like "find the items in the Sentinel-2 collection in June

See the [STAC-GeoParquet specification](./spec/stac-geoparquet-spec.md) for details on the exact schema of the written Parquet files.

## Usage

Use `stac_geoparquet.to_arrow.stac_items_to_arrow` and
`stac_geoparquet.from_arrow.stac_table_to_items` to convert between STAC items
and Arrow tables. Arrow tables of STAC items can be written to Parquet with
`stac_geoparquet.to_parquet.to_parquet`.

Note that `stac_geoparquet` lifts the keys in the item `properties` up to the top level of the DataFrame, similar to `geopandas.GeoDataFrame.from_features`.

```python
>>> import requests
>>> import stac_geoparquet.arrow
>>> import pyarrow.parquet
>>> import pyarrow as pa

>>> items = requests.get(
... "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items"
... ).json()["features"]
>>> table = pa.Table.from_batches(stac_geoparquet.arrow.parse_stac_items_to_arrow(items))
>>> stac_geoparquet.arrow.to_parquet(table, "items.parquet")
>>> table2 = pyarrow.parquet.read_table("items.parquet")
>>> items2 = list(stac_geoparquet.arrow.stac_table_to_items(table2))
```


<!--
## pgstac integration
`stac_geoparquet.pgstac_reader` has some helpers for working with items coming from a `pgstac.items` table. It takes care of
- Rehydrating the dehydrated items
- Partitioning by time
- Injecting dynamic links and assets from a STAC API
- Injecting dynamic links and assets from a STAC API -->
17 changes: 17 additions & 0 deletions docs/api/legacy.md
@@ -2,6 +2,23 @@

The API listed here was the initial non-Arrow-based STAC-GeoParquet implementation, converting between JSON and GeoPandas directly. For large collections of STAC items, using the new Arrow-based functionality (under the `stac_geoparquet.arrow` namespace) will be more performant.

Note that `stac_geoparquet` lifts the keys in the item `properties` up to the top level of the DataFrame, similar to `geopandas.GeoDataFrame.from_features`.

```python
>>> import requests
>>> import stac_geoparquet.arrow
>>> import pyarrow.parquet
>>> import pyarrow as pa

>>> items = requests.get(
... "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items"
... ).json()["features"]
>>> table = pa.Table.from_batches(stac_geoparquet.arrow.parse_stac_items_to_arrow(items))
>>> stac_geoparquet.arrow.to_parquet(table, "items.parquet")
>>> table2 = pyarrow.parquet.read_table("items.parquet")
>>> items2 = list(stac_geoparquet.arrow.stac_table_to_items(table2))
```
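
For reference, a sketch of the legacy GeoPandas-based path documented on this page. It assumes, as in earlier releases, that `to_geodataframe` accepts a list of STAC item dicts and `to_item_collection` accepts the resulting GeoDataFrame:

```python
import requests
import stac_geoparquet

items = requests.get(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items"
).json()["features"]

# Legacy path: STAC item dicts -> geopandas.GeoDataFrame
df = stac_geoparquet.to_geodataframe(items)

# ...and back to a pystac.ItemCollection
item_collection = stac_geoparquet.to_item_collection(df)
```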

::: stac_geoparquet.to_geodataframe
::: stac_geoparquet.to_item_collection
::: stac_geoparquet.to_dict
14 changes: 10 additions & 4 deletions docs/schema.md
@@ -1,10 +1,12 @@
# Schema considerations

A _schema_ is a set of named fields that describe what data types exist for each column in a tabular dataset.
A STAC Item is a JSON object that describes an external geospatial dataset. The STAC specification defines a common core plus a variety of extensions, and STAC Items may additionally include custom extensions beyond the common ones. Crucially, most fields in the core spec and extensions are optional, so the keys that are actually present often differ across STAC collections and may even differ across items within a single collection.

JSON and Arrow/Parquet have opposite expectations around schemas. JSON is fully schemaless; every object in a JSON array may have a completely different schema. Meanwhile Arrow and Parquet require a strict schema upfront, and it is impossible for a row in an Arrow table or Parquet file to not comply with the upfront schema. Trying to write STAC data with mixed schemas to Parquet can easily produce exceptions.
STAC's flexibility is a blessing and a curse. Schemaless JSON makes writing very easy: each object can be dumped to JSON on its own, every item is allowed to have a different schema, and newer items are free to have a different schema than older items in the same collection. But this write-time flexibility makes reading harder, because there are no guarantees (beyond STAC's few required fields) about which fields exist.

Because of these opposite expectations, schema inference is the most difficult part of converting STAC from JSON to GeoParquet.
Parquet is the complete opposite of JSON: it has a strict schema that must be known before writing can start. This shifts the burden of work from the reader onto the writer. Reading Parquet is very efficient because the file's metadata defines the exact schema of every record, which also enables use cases like reading only specific columns that would not be possible without a strict schema.

This conversion from schemaless JSON to a strict schema is the difficult part of converting STAC to GeoParquet, especially for input datasets that are often larger than memory.

## Full scan over input data

@@ -16,7 +18,11 @@ This is time consuming as it requires two full passes over the input data: once

Alternatively, the user can pass in an Arrow schema themselves using the `schema` parameter of [`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow]. This `schema` must match the on-disk schema of the STAC JSON data.

## Merging data with schema mismatch
## Multiple schemas per collection

It is also possible to write multiple Parquet files with STAC data where each Parquet file may have a different schema. This simplifies the conversion and writing process but makes reading and using the Parquet data harder.

### Merging data with schema mismatch

If you've created STAC GeoParquet data whose schema has changed over time, you can use [`pyarrow.concat_tables`][pyarrow.concat_tables] with `promote_options="permissive"` to combine multiple STAC GeoParquet files.
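
For example, a sketch of merging two such files; the file names are hypothetical, and the final write reuses `stac_geoparquet.arrow.to_parquet` so the GeoParquet metadata is rewritten:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import stac_geoparquet.arrow

# Two GeoParquet files written at different times, possibly with different schemas
tables = [pq.read_table(p) for p in ["items_2023.parquet", "items_2024.parquet"]]

# promote_options="permissive" fills nulls for columns missing from one input
# and unifies compatible types across all inputs
combined = pa.concat_tables(tables, promote_options="permissive")

# Re-write with GeoParquet metadata attached
stac_geoparquet.arrow.to_parquet(combined, "items_combined.parquet")
```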

49 changes: 49 additions & 0 deletions docs/usage.md
@@ -1 +1,50 @@
# Usage

Except for the [legacy API](api/legacy.md), [Apache Arrow](https://arrow.apache.org/) is used as the in-memory interchange format between all formats. While some end-to-end helper functions are provided, the user can go through Arrow objects for maximal flexibility in the conversion process.

All functionality that goes through Arrow is currently exported via the `stac_geoparquet.arrow` namespace.

## `dict`/JSON - Arrow conversion

### Convert `dict`s to Arrow

Use [`parse_stac_items_to_arrow`][stac_geoparquet.arrow.parse_stac_items_to_arrow] to convert STAC items either in memory or on disk to a stream of Arrow record batches. This accepts either an iterable of Python `dict`s or an iterable of [`pystac.Item`][pystac.Item] objects.
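
A minimal example, adapted from the README, fetching one page of items from the Planetary Computer STAC API:

```python
import requests
import pyarrow as pa
import stac_geoparquet.arrow

# Fetch one page of STAC items as plain Python dicts
items = requests.get(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items"
).json()["features"]

# parse_stac_items_to_arrow yields Arrow record batches; collect them into a Table
table = pa.Table.from_batches(stac_geoparquet.arrow.parse_stac_items_to_arrow(items))
```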

### Convert JSON to Arrow

[`parse_stac_ndjson_to_arrow`][stac_geoparquet.arrow.parse_stac_ndjson_to_arrow] is a helper function to take one or more JSON or newline-delimited JSON files on disk, infer the schema from all of them, and convert the data to a stream of Arrow record batches.
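
A sketch, assuming a single local newline-delimited JSON file path is accepted (the file name `items.ndjson` is hypothetical):

```python
import pyarrow as pa
import stac_geoparquet.arrow

# Schema inference runs over the input before record batches are produced
batches = stac_geoparquet.arrow.parse_stac_ndjson_to_arrow("items.ndjson")
table = pa.Table.from_batches(batches)
```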

### Convert Arrow to `dict`s

Use [`stac_table_to_items`][stac_geoparquet.arrow.stac_table_to_items] to convert a table or stream of Arrow record batches of STAC data to a generator of Python `dict`s. This accepts either a `pyarrow.Table` or a `pyarrow.RecordBatchReader`, which allows conversions of larger-than-memory files in a streaming manner.
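
For example, converting a GeoParquet file (here the `items.parquet` written in the README example) back into item dicts:

```python
import pyarrow.parquet as pq
import stac_geoparquet.arrow

table = pq.read_table("items.parquet")

# Each yielded value is a plain STAC item dict
for item in stac_geoparquet.arrow.stac_table_to_items(table):
    print(item["id"])
```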

### Convert Arrow to JSON

Use [`stac_table_to_ndjson`][stac_geoparquet.arrow.stac_table_to_ndjson] to write a table or stream of Arrow record batches of STAC data to a newline-delimited JSON file. This accepts either a `pyarrow.Table` or a `pyarrow.RecordBatchReader`, which allows conversions of larger-than-memory files in a streaming manner.
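
A sketch; the `(table, output path)` call shape is an assumption, and the file names are hypothetical:

```python
import pyarrow.parquet as pq
import stac_geoparquet.arrow

table = pq.read_table("items.parquet")

# Write one STAC item JSON object per line
stac_geoparquet.arrow.stac_table_to_ndjson(table, "items.ndjson")
```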

## Parquet

Use [`to_parquet`][stac_geoparquet.arrow.to_parquet] to write STAC data held in Arrow memory to a Parquet file. Use this function rather than a generic Parquet writer to ensure that [GeoParquet](https://geoparquet.org/) 1.0 or 1.1 metadata is written to the output file.
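
For example, mirroring the README example:

```python
import requests
import pyarrow as pa
import stac_geoparquet.arrow

items = requests.get(
    "https://planetarycomputer.microsoft.com/api/stac/v1/collections/sentinel-2-l2a/items"
).json()["features"]
table = pa.Table.from_batches(stac_geoparquet.arrow.parse_stac_items_to_arrow(items))

# Unlike a plain pyarrow.parquet.write_table call, this attaches GeoParquet metadata
stac_geoparquet.arrow.to_parquet(table, "items.parquet")
```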

[`parse_stac_ndjson_to_parquet`][stac_geoparquet.arrow.parse_stac_ndjson_to_parquet] is a helper that connects reading (newline-delimited) JSON on disk to writing out to a Parquet file.
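
A sketch; the positional `(input path, output path)` arguments are assumptions, and the file names are hypothetical:

```python
import stac_geoparquet.arrow

# Read newline-delimited STAC JSON, infer a unified schema, and write GeoParquet
stac_geoparquet.arrow.parse_stac_ndjson_to_parquet("items.ndjson", "items.parquet")
```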

No special API is required for reading a STAC GeoParquet file back into Arrow. You can use [`pyarrow.parquet.read_table`][pyarrow.parquet.read_table] or [`pyarrow.parquet.ParquetFile`][pyarrow.parquet.ParquetFile] directly to read the STAC GeoParquet data back into Arrow.

## Delta Lake

Use [`parse_stac_ndjson_to_delta_lake`][stac_geoparquet.arrow.parse_stac_ndjson_to_delta_lake] to read (newline-delimited) JSON on disk and write out to a Delta Lake table.

No special API is required for reading a STAC Delta Lake table back into Arrow. You can use the [`DeltaTable`][deltalake.DeltaTable] class directly to read the data back into Arrow.
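
A sketch; the positional `(input path, table path)` arguments to `parse_stac_ndjson_to_delta_lake` are assumptions, while the read-back uses the standard `deltalake` API:

```python
import stac_geoparquet.arrow
from deltalake import DeltaTable

# Write: newline-delimited STAC JSON in, Delta Lake table out
stac_geoparquet.arrow.parse_stac_ndjson_to_delta_lake("items.ndjson", "items_delta")

# Read back into Arrow with the standard deltalake API
table = DeltaTable("items_delta").to_pyarrow_table()
```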

!!! important
    Arrow has a `null` data type, where every value in the column is always null, but Delta Lake has no equivalent. Any column inferred to have a `null` data type will therefore fail to write to Delta Lake with

    ```
    _internal.SchemaMismatchError: Invalid data type for Delta Lake: Null
    ```

    This matters because if a JSON key is `null` for every item in a STAC Collection, it gets inferred as an Arrow `null` type. For example, the `3dep-lidar-copc` collection used in the tests has `start_datetime` and `end_datetime` fields, so per the STAC spec its `datetime` is always `null`. That column would need to be cast to a timestamp type before being written to Delta Lake.

    This means we cannot write this collection to Delta Lake **solely with automatic schema inference**.

    In such cases, users may need to manually update the inferred schema to cast any `null` type to another Delta Lake-compatible type.
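
For instance, a sketch of patching an inferred schema, assuming it contains a null-typed `datetime` field; the helper name is hypothetical, and the patched schema could then be used in place of the automatically inferred one:

```python
import pyarrow as pa

def replace_null_datetime(schema: pa.Schema) -> pa.Schema:
    """Swap a null-typed `datetime` field for a timestamp type."""
    idx = schema.get_field_index("datetime")
    if idx != -1 and pa.types.is_null(schema.field(idx).type):
        schema = schema.set(idx, pa.field("datetime", pa.timestamp("us", tz="UTC")))
    return schema
```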
