Fixed base item merging logic for assets #2

Open
TomAugspurger opened this issue Sep 23, 2022 · 2 comments

@TomAugspurger
Collaborator

In this snippet, there's a record with None for an asset:

import planetary_computer
import adlfs
import pystac

collection = pystac.read_file("https://planetarycomputer.microsoft.com/api/stac/v1/collections/aster-l1t")
asset = planetary_computer.sign(collection.assets["geoparquet-items"])

import dask_geopandas

ddf = dask_geopandas.read_parquet(asset.href, storage_options=asset.extra_fields["table:storage_options"])
df = ddf.head()

df.assets.iloc[0]["qa-txt"]  # None

The qa-txt asset is defined on the base item in pgstac, but this particular item doesn't have it. The item was rehydrated incorrectly.
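To make the failure mode concrete, here is a rough sketch of what rehydration has to do. It is not pgstac's actual implementation: the `hydrate` function and the `DO_NOT_MERGE` sentinel are hypothetical names made up for illustration, and the asset values are invented. The idea is that a key present on the base item but absent from a given item needs an explicit marker, otherwise the merge pulls it back in.

```python
import copy

# Hypothetical sentinel (an assumption for this sketch): marks a base-item
# key that this particular item does not have.
DO_NOT_MERGE = "__do_not_merge__"


def hydrate(base, item):
    """Merge base-item defaults into a dehydrated item (illustrative only)."""
    result = copy.deepcopy(item)
    for key, base_value in base.items():
        if key not in result:
            # Key was stripped during dehydration: restore it from the base.
            result[key] = copy.deepcopy(base_value)
        elif result[key] == DO_NOT_MERGE:
            # Key exists on the base item but not on this item: drop it.
            del result[key]
        elif isinstance(base_value, dict) and isinstance(result[key], dict):
            # Both sides are mappings: merge them recursively.
            result[key] = hydrate(base_value, result[key])
    return result


# Made-up example data, loosely shaped like STAC assets.
base = {"assets": {"qa-txt": {"type": "text/plain"}, "data": {"type": "image/tiff"}}}
item = {"assets": {"qa-txt": DO_NOT_MERGE, "data": {"href": "https://example.com/a.tif"}}}

hydrated = hydrate(base, item)
```

If the sentinel branch is skipped or mishandled for nested keys such as assets, the item ends up with an asset entry it never had, which would be consistent with the None seen above.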

@jsignell
Member

Do you think this is the same issue?

import geopandas
import pystac
import stac_geoparquet

URL = "https://www.planet.com/data/stac/disasters/hurricane-harvey/catalog.json"
catalog = pystac.read_file(URL)
dicts = [item.to_dict() for item in catalog.get_items(recursive=True)]
df = stac_geoparquet.to_geodataframe(dicts)
assert "full-jpg" not in df.loc[0].assets

df.to_parquet(f"{catalog.id}.parq")
df = geopandas.read_parquet(f"{catalog.id}.parq")
assert "full-jpg" not in df.loc[0].assets

It feels like serializing to parquet is adding None values to the dicts for keys they never had.
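One likely explanation, as far as I understand Arrow's behavior: a column of dicts is stored as a struct whose fields are the union of the keys across all rows, so any row missing a key gets a null for it. The sketch below imitates that in pure Python. `unify_as_struct` is a hypothetical helper written for this illustration, not a pyarrow function, and the asset values are invented.

```python
def unify_as_struct(rows):
    """Imitate struct-typed storage: every row gets every field.

    Illustrative stand-in for what a columnar struct type does, not real
    pyarrow code. Missing keys are filled with None.
    """
    fields = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    return [{key: row.get(key) for key in fields} for row in rows]


# Made-up assets column: only the second item has a "full-jpg" asset.
assets_column = [
    {"thumbnail": {"href": "a.png"}},
    {"thumbnail": {"href": "b.png"}, "full-jpg": {"href": "b.jpg"}},
]

round_tripped = unify_as_struct(assets_column)
```

After the imitation round trip, the first row contains a "full-jpg" key with value None, which is the same symptom as the failing second assertion above.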

@jsignell
Member

It looks like maybe the thing to do is to serialize arbitrary JSON blobs to strings before storing them?

import json
import geopandas
import pystac
import stac_geoparquet

URL = "https://www.planet.com/data/stac/disasters/hurricane-harvey/catalog.json"
catalog = pystac.read_file(URL)
dicts = [item.to_dict() for item in catalog.get_items(recursive=True)]
df = stac_geoparquet.to_geodataframe(dicts)
df.assets = df.assets.apply(json.dumps)
assert "full-jpg" not in df.loc[0].assets

df.to_parquet(f"{catalog.id}.parq")
df = geopandas.read_parquet(f"{catalog.id}.parq")
df.assets = df.assets.apply(json.loads)
assert "full-jpg" not in df.loc[0].assets
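The same idea can be checked without any of the geospatial dependencies. This is a minimal stdlib-only sketch: `encode_assets` and `decode_assets` are hypothetical helper names, and the sample data is invented. Because each row becomes an independent string, there is no shared struct schema to fill in missing keys.

```python
import json


def encode_assets(assets_column):
    """Serialize each row's assets dict to its own JSON string."""
    return [json.dumps(assets) for assets in assets_column]


def decode_assets(encoded_column):
    """Parse each JSON string back into a dict."""
    return [json.loads(text) for text in encoded_column]


# Made-up assets column: only the second item has a "full-jpg" asset.
original = [
    {"thumbnail": {"href": "a.png"}},
    {"thumbnail": {"href": "b.png"}, "full-jpg": {"href": "b.jpg"}},
]

encoded = encode_assets(original)   # what would be written to the file
restored = decode_assets(encoded)   # what comes back after reading
```

One thing to keep in mind about the snippet above: the first assertion runs while the column still holds JSON strings, so it is a substring check rather than a key lookup. The tradeoff of this approach is that the assets column is no longer queryable as structured data by parquet readers.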
