Compatibility for zarr-python 3.x #9552

Open
wants to merge 74 commits into
base: main
from

TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Sep 27, 2024

This is a WIP for compatibility with zarr-python 3.x. It's intended to be run against zarr-python v3 + the open PRs referenced in #9515.

All of the zarr test cases should be parameterized by zarr_format=[2, 3] with zarr-python 3.x to exercise reading and writing both formats.

This is currently passing with zarr-python==2.18.3. zarr-python 3.x has about 61 failures, all of which are related to data types that aren't yet implemented in zarr-python 3.x.

I'll also note that #5475 is going to become a larger issue once people start writing Zarr-V3 datasets.

  • Closes Zarr Python 3 tracking issue #9515
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

@TomAugspurger TomAugspurger force-pushed the fix/zarr-v3 branch 2 times, most recently from 1ed4ef1 to bb2bb6c Compare September 30, 2024 14:04
Contributor Author

@TomAugspurger TomAugspurger left a comment

This set of changes should be backwards compatible and work with zarr-python 2.x (so reading and writing zarr v2 data).

I'll work through zarr-python 3.x now. I think we might want to parametrize most of these tests by zarr_version=[2, 3] to confirm that we can read / write zarr v2 data with zarr-python 3.x.

xarray/backends/zarr.py
@@ -75,8 +89,10 @@ def __init__(self, zarr_array):
         self.shape = self._array.shape

         # preserve vlen string object dtype (GH 7328)
-        if self._array.filters is not None and any(
-            [filt.codec_id == "vlen-utf8" for filt in self._array.filters]
+        if (
Contributor Author

zarr-developers/zarr-python#2036 is probably relevant here.

Contributor Author

Confirm whether we need any logic on the branch where we do have Zarr V3.


if _zarr_v3() and zarr_array.metadata.zarr_format == 3:
    encoding["codec_pipeline"] = [
        x.to_dict() for x in zarr_array.metadata.codecs
Contributor Author

Maybe this instead?

Suggested change
-        x.to_dict() for x in zarr_array.metadata.codecs
+        zarr_array.metadata.to_dict()["codecs"]

A bit wasteful since everything has to be serialized, but presumably zarr knows better how to serialize the codec pipeline than we do here?

Member

@jhamman jhamman left a comment

Great progress here @TomAugspurger. I'm impressed by how little you've changed in the backend itself and I'm noting the pain around testing (I felt some of that w/ dask as well).

 if consolidated is None:
     try:
         zarr_group = zarr.open_consolidated(store, **open_kwargs)
-    except KeyError:
+    except (ValueError, KeyError):
Member

On the Zarr side, it may be nice to raise a custom exception when consolidated metadata is not found. Something like:

class ConsolidatedMetadataNotFound(FileNotFoundError):
    pass
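
To illustrate how a dedicated exception could tighten the caller's fallback logic, here is a minimal sketch. Everything in it is hypothetical: the exception does not exist in zarr-python, `open_consolidated` and `open_group` are toy stand-ins (not zarr's functions) that operate on a plain dict, and the `.zmetadata` key is only used as a marker for "consolidated metadata is present".

```python
import warnings


class ConsolidatedMetadataNotFound(FileNotFoundError):
    """Hypothetical exception; not part of zarr-python."""


def open_consolidated(store):
    # Stand-in for zarr.open_consolidated. A "store" is a plain dict here,
    # and the ".zmetadata" key marks the presence of consolidated metadata.
    if ".zmetadata" not in store:
        raise ConsolidatedMetadataNotFound("no consolidated metadata in store")
    return ("consolidated", store)


def open_group(store):
    # Stand-in for zarr.open_group.
    return ("unconsolidated", store)


def open_with_fallback(store, consolidated=None):
    """Try consolidated metadata first when the caller expressed no preference."""
    if consolidated is None:
        try:
            return open_consolidated(store)
        except ConsolidatedMetadataNotFound:
            # Only the "metadata is missing" case triggers the fallback;
            # unrelated ValueError/KeyError failures are no longer swallowed.
            warnings.warn(
                "Consolidated metadata not found; falling back to open_group.",
                RuntimeWarning,
                stacklevel=2,
            )
            return open_group(store)
    if consolidated:
        return open_consolidated(store)
    return open_group(store)
```

Because the hypothetical exception subclasses `FileNotFoundError`, callers that already catch `FileNotFoundError` or `OSError` would keep working.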

Contributor Author

xarray/backends/zarr.py
@jhamman
Member

jhamman commented Oct 13, 2024

I'm going to spend some time today working on the last few failures here.

Some things I'm noticing that will need some attention:

  1. In the datatree tests, we need to be careful to strip leading slashes from keys before they make it to the store. Otherwise we risk errors like Permission denied: '/Group1'.
  2. We likely need to adjust group.__getitem__ in zarr to handle nested group lookups when using consolidated metadata.
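
A small sketch of the first point, under the assumption that the error arises when a key with a leading slash is joined onto a local store root and is then treated as an absolute path. `normalize_store_key` and the `/tmp/store.zarr` root are hypothetical names for illustration, not xarray or zarr API.

```python
import posixpath


def normalize_store_key(path: str) -> str:
    """Return a store-relative key with any leading slashes removed.

    Hypothetical helper; not an xarray or zarr function.
    """
    return path.lstrip("/")


root = "/tmp/store.zarr"  # illustrative store location

# Joining an absolute component discards the root, so the write would
# target the filesystem root instead of the store directory.
unsafe = posixpath.join(root, "/Group1")

# Stripping the leading slash first keeps the key inside the store.
safe = posixpath.join(root, normalize_store_key("/Group1"))
```

Here `unsafe` is `/Group1` while `safe` is `/tmp/store.zarr/Group1`, which matches the kind of path reported in the error above.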

@jhamman
Member

jhamman commented Oct 14, 2024

@TomAugspurger - a thought that may help limit the scope of this PR. We may consider punting on full datatree support for zarr-v3 in this PR and fix that up in follow-on work. What do you think about adding pytest.mark.skip decorators to the failing datatree tests?

@TomAugspurger
Contributor Author

I haven't looked at the datatree side of things yet, so that sounds good to me :)

@rabernat
Contributor

Another idea to simplify things would be to disallow (or at least discourage) consolidated metadata with V3 data, given the uncertain status of consolidated metadata in the V3 spec. Maybe default to consolidated=False with V3 and issue a warning if True?
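
For concreteness, the proposed default could look roughly like the sketch below. `resolve_consolidated` is a hypothetical helper, not existing xarray code, and it assumes the Zarr format is already known at the point where the `consolidated` argument is resolved.

```python
import warnings


def resolve_consolidated(zarr_format, consolidated=None):
    """Pick the effective `consolidated` setting (hypothetical helper).

    Assumes `zarr_format` (2 or 3) is known before the store is opened.
    """
    if zarr_format == 3:
        if consolidated:
            # Explicit opt-in is honored, but flagged to the user.
            warnings.warn(
                "Consolidated metadata has an uncertain status in the "
                "Zarr V3 spec.",
                UserWarning,
                stacklevel=2,
            )
            return True
        # None (no preference) and False both resolve to False for V3.
        return False
    # V2 keeps the existing semantics: None means "try, then fall back".
    return consolidated
```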

@TomNicholas
Member

We may consider punting on full datatree support for zarr-v3 in this PR and fix that up in follow-on work.

What exactly is the scenario/version where this doesn't work? I would like to release a version of Xarray where you can still use DataTree to open a Zarr V2 store. Then I wouldn't mind us fixing other cases later.

@jhamman
Member

jhamman commented Oct 14, 2024

What exactly is the scenario/version where this doesn't work? I would like to release a version of Xarray where you can still use DataTree to open a Zarr V2 store. Then I wouldn't mind us fixing other cases later.

@TomNicholas - my proposal would maintain existing datatree functionality for Zarr-Python 2 but would postpone doing the integration work for Zarr-Python 3 for another PR.

The specific issues are mostly upstream and may take a few days to sort out.

@TomNicholas
Member

That sounds fine to me!

@TomAugspurger
Contributor Author

Another idea to simplify things would be to disallow (or at least discourage) consolidated metadata with V3 data, given the uncertain status of consolidated metadata in the V3 spec.

This might be a bit tricky to implement. The current default behavior is to try consolidated metadata and emit a warning and fall back to non-consolidated metadata. However, we might not know whether we have V2 or V3 data until after we've read the data, so we couldn't warn until after we've fallen back to non-consolidated and discovered what we have.

I'm not sure about the write side.

IMO, the downsides of lacking consolidated metadata, and my confidence that something like consolidated metadata will end up in v3, push me to try to support it with the current API. If we do need to adjust anything to comply with the spec, I think we'll be able to paper over it in code and not have to change the user-facing API.

@TomAugspurger
Contributor Author

I'll have a fix for the failing TestInstrumentedStore tests soon.

@TomAugspurger
Contributor Author

xarray/tests/test_backends.py::test_zarr_storage_options is tricky to test. We should be able to just pass through storage_options to the call to zarr.open_group.

Currently, the only zarr store that supports storage options is RemoteStore with an async filesystem. The current test uses the memory filesystem, which doesn't support async operations. zarr-python's own tests set up a moto server to use the S3 filesystem instead.

@jhamman
Member

jhamman commented Oct 14, 2024

xarray/tests/test_backends.py::test_zarr_storage_options is tricky to test. We should be able to just pass through storage_options to the call to zarr.open_group.

Let's skip this test with v3.

Labels
run-upstream Run upstream CI topic-zarr Related to zarr storage library

5 participants