dtype of zarr array unexpectedly changes when fill_value is specified #7292

Open
4 tasks done
benjeffery opened this issue Nov 16, 2022 · 2 comments
Labels
bug, topic-backends, topic-zarr

Comments

@benjeffery

What happened?

Opening a zarr group which contains an array of integer dtype with a fill_value results in an xarray dataset in which the array has floating-point dtype.

What did you expect to happen?

An xarray dataset in which the array has the original integer dtype.

Minimal Complete Verifiable Example

import zarr
import xarray

# Create a zarr group containing an integer array with a fill_value
grp = zarr.open_group("test.zarr")
arr = grp.create(shape=(10,), name="array", dtype="int8", fill_value=-1)
arr.attrs["_ARRAY_DIMENSIONS"] = ["dim1"]

# Open in xarray: the dtype is now float32 rather than int8
ds = xarray.open_zarr("test.zarr", consolidated=False)
print(ds["array"].dtype)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.

Relevant log output

No response

Anything else we need to know?

This is a result of #5475 where xarray's _FillValue has a different meaning to zarr's fill_value.

The change of dtype happens at

dtype, decoded_fill_value = dtypes.maybe_promote(data.dtype)
where xarray is trying to find a dtype in which fill_value can represent "missing" data, whereas in zarr fill_value can be any data value, since its intent is to fill in missing chunks, not to represent missing data.
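To make the two meanings concrete, here is a minimal NumPy-only sketch of why masking forces a float dtype. It is an illustration, not xarray's actual code: promote_for_masking is a hypothetical stand-in for dtypes.maybe_promote, and the float32/float64 cut-off is an assumption chosen to match the float32 result reported above for int8.

```python
import numpy as np


def promote_for_masking(dtype):
    """Hypothetical helper: pick a dtype that can hold NaN as a missing marker."""
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.floating):
        # Floats can already represent NaN, so no promotion is needed.
        return dtype
    if np.issubdtype(dtype, np.integer):
        # Assumption: 1- and 2-byte integers go to float32, wider ones to float64.
        return np.dtype(np.float32) if dtype.itemsize <= 2 else np.dtype(np.float64)
    # Anything else falls back to object in this sketch.
    return np.dtype(object)


# An int8 array that was never written, so every element reads back as the fill value.
fill_value = -1
raw = np.full(10, fill_value, dtype="int8")

# Masking replaces the fill value with NaN, which an integer dtype cannot store.
promoted = promote_for_masking(raw.dtype)
decoded = raw.astype(promoted)
decoded[raw == fill_value] = np.nan

print(raw.dtype)      # int8
print(decoded.dtype)  # float32
```

In zarr's reading, the raw array is simply ten valid int8 values of -1. In the masking reading, all ten become NaN and the integer dtype is lost.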

I'm not sure how best to fix this. Maybe if the zarr fill value is clearly a non-missing value for the dtype, then xarray should act as if it doesn't have a fill value? Happy to work on a PR if that seems to be a valid approach, although others may have thoughts on whether that would be a breaking change for some folks.
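One possible reading of that suggestion, sketched as a standalone predicate. Everything here is an assumption for discussion: fill_value_marks_missing is a hypothetical name, not an xarray or zarr function, and the rule (only a NaN fill on a floating-point dtype counts as "missing") is just one way to define "clearly a non-missing value".

```python
import math


def fill_value_marks_missing(dtype_kind, fill_value):
    """Hypothetical rule: treat fill_value as a missing-data marker only when
    it cannot be ordinary data for the dtype.

    dtype_kind is a NumPy-style kind code such as "i", "u" or "f".
    """
    if fill_value is None:
        return False
    if dtype_kind == "f":
        # NaN is never ordinary data, so it can safely mean "missing".
        return math.isnan(fill_value)
    # For integers (and everything else in this sketch) every representable
    # value could be real data, so the fill value is not treated as a mask.
    return False


print(fill_value_marks_missing("i", -1))            # False
print(fill_value_marks_missing("f", float("nan")))  # True
print(fill_value_marks_missing("f", 0.0))           # False
```

Under this rule the int8 example would keep its dtype. The flip side is that anyone relying on an integer sentinel being masked on read would see different behaviour.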

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 5.15.0-47-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: ('en_GB', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.11.0
pandas: 1.3.5
numpy: 1.21.6
scipy: 1.9.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.13.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.01.0
distributed: 2022.01.0
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 59.6.0
pip: 22.0.2
conda: None
pytest: None
IPython: None
sphinx: None

benjeffery added the bug and needs triage labels on Nov 16, 2022
dcherian added the topic-zarr and topic-backends labels and removed the needs triage label on Nov 16, 2022
@tomwhite
Contributor

You can work around this by specifying either decode_cf=False or mask_and_scale=False. For example:

ds = xarray.open_zarr("test.zarr", consolidated=False, decode_cf=False)
ds['array'].dtype

prints dtype('int8').

Does that help?

@benjeffery
Author

Yes, but to solve my use case, sgkit would need to open the zarr in this way. Would that cause problems elsewhere?
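One general consequence of opening without masking is that any sentinel values stay in the data as ordinary integers, so downstream code has to exclude them itself. The NumPy-only sketch below illustrates that with made-up values; whether it matters for sgkit is not something this thread establishes.

```python
import numpy as np

fill_value = -1

# Made-up values standing in for what an unmasked read might return:
# the dtype stays int8 and the sentinel -1 is still present.
raw = np.array([3, -1, 5, -1], dtype="int8")

# A naive reduction counts the sentinels as data.
naive_mean = raw.mean()

# Masking explicitly keeps the integer dtype and skips the sentinels.
masked = np.ma.masked_equal(raw, fill_value)
masked_mean = masked.mean()

print(raw.dtype, masked.dtype)  # int8 int8
print(naive_mean)               # 1.5
print(masked_mean)              # 4.0
```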
