Is _FillValue really the same as zarr's fill_value? #5475

Open
d70-t opened this issue Jun 16, 2021 · 24 comments
Labels
topic-CF conventions topic-zarr Related to zarr storage library

Comments

@d70-t
Contributor

d70-t commented Jun 16, 2021

The zarr backend uses the fill_value of zarr's .zarray key as if it were the _FillValue according to the CF conventions:

attributes["_FillValue"] = zarr_array.fill_value

I think this interpretation of the fill_value is wrong and creates problems. Here's why:

The zarr v2 spec is still a little vague, but states that fill_value is

A scalar value providing the default value to use for uninitialized portions of the array, or null if no fill_value is to be used.

Accordingly, all areas of a variable that are not backed by a stored chunk should be filled with this value. This is different from what the CF conventions state (emphasis mine):

The scalar attribute with the name _FillValue and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable. This value is considered to be a special value that indicates undefined or missing data, and is returned when reading values that were not written.

The difference between the two is that fill_value is only a background value, one that simply isn't stored as a chunk, whereas _FillValue is (possibly) a background value and is interpreted as not being valid data. In my opinion, this conflation of _FillValue and missing_value could be considered a defect in the CF conventions, but it's probably far too late to change that, as many depend on it.

Thinking of an example: when storing a density field (i.e. water droplets forming clouds) in a zarr dataset, it might be perfectly valid to set the fill_value to 0 and then store chunks only in regions of the atmosphere where clouds are actually present. In that case, 0 (i.e. no drops) would be a perfectly valid value which just isn't stored. As most parts of the atmosphere are indeed cloud-free, this may save quite a lot of storage. Other formats (e.g. OpenVDB) commonly use this trick.
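The background-value idea can be sketched with plain NumPy (a hypothetical illustration, not zarr internals): only chunks containing non-background data are stored, and reads materialize the rest from the fill value, which remains perfectly valid data.

```python
import numpy as np

# Hypothetical sketch of the "background value" idea: only chunks that
# contain non-background data are stored; everything else is implied to
# equal the fill value, which is still perfectly valid data.
fill_value = 0.0          # e.g. droplet density in cloud-free air
chunk_size = 4
n_chunks = 4

# Only chunk 2 contains a cloud, so only that chunk is "stored".
stored_chunks = {2: np.array([0.1, 0.5, 0.3, 0.0])}

# Reading materializes unstored chunks from the fill value.
dense = np.concatenate([
    stored_chunks.get(i, np.full(chunk_size, fill_value))
    for i in range(n_chunks)
])
print(dense)
```

Note the trailing 0.0 inside the stored chunk: it is real data that happens to equal the background value, which is exactly why masking on equality with fill_value would be wrong here.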


The issue gets worse when looking into the upcoming zarr v3 spec where fill_value is described as:

Provides an element value to use for uninitialised portions of the Zarr array.

If the data type of the Zarr array is Boolean then the value must be the literal false or true. If the data type is one of the integer data types defined in this specification, then the value must be a number with no fraction or exponent part and must be within the range of the data type.

For any data type, if the fill_value is the literal null then the fill value is undefined and the implementation may use any arbitrary value that is consistent with the data type as the fill value.

[...]

Thus for boolean arrays, if the fill_value would be interpreted as a missing value indicator, only (missing, True) or (False, missing) arrays could be represented. A (False, True) array would not be possible. The issue applies similarly for integer types as well.
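A few lines of NumPy illustrate the representability problem (an illustration of the decoding semantics, not actual xarray code): treating fill_value=False as a missing-value marker forces a float dtype and destroys every genuine False.

```python
import numpy as np

# Illustration (not xarray code): what happens if a boolean array's
# fill_value of False is treated as a missing-value marker on decode.
data = np.array([False, True])   # a perfectly ordinary bool array
fill_value = False

# Masking "fill_value" forces a float dtype so NaN can be represented,
# and every genuine False is lost in the process.
decoded = np.where(data == fill_value, np.nan, data.astype(float))
print(decoded)
```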

@rly

rly commented Mar 28, 2024

I encountered this issue today and totally agree. In Zarr, fill_value=X is used to say all uninitialized parts of the array have value X (useful for sparse arrays), NOT the meaning of _FillValue which is that all values in the array with value X are missing/undefined and so Xarray should convert those values to NaN. This is a minor issue for Zarr v2 and will be a bigger issue in Zarr v3, particularly for certain types of dtypes as @d70-t mentions. To update @d70-t 's comment on Zarr v3, it looks like in v3, fill_value: null is not allowed to be recorded in the metadata.

The default fill_value set by Zarr-Python for an array is 0 for int, 0.0 for float, and False for bool. This means that by default, when Xarray loads a Zarr array with one of those values, that value is converted to NaN.

Here is an MWE with ints:

import zarr
import xarray as xr

store = zarr.DirectoryStore("zarr-xarray-example.zarr")
root = zarr.group(store=store, overwrite=True)
arr = root.create_dataset(name="arr", data=[0, 1, 2, 3, 4], dtype="i4")
arr.attrs["_ARRAY_DIMENSIONS"] = ["phony_dim_0"]
zarr.consolidate_metadata(store)

root = zarr.open("zarr-xarray-example.zarr")
print(root["arr"].fill_value)

dset = xr.open_dataset("zarr-xarray-example.zarr", engine="zarr")
dset.arr.data

Returns:

0
array([nan,  1.,  2.,  3.,  4.])

This was very confusing at first. Why did 0 become NaN? Why did the ints become floats? Upon digging around, I learned that this automatic conversion/masking can be turned off by passing the argument mask_and_scale=False (default is True) or decode_cf=False (default is None) to open_dataset. However, because the same argument controls both masking and scaling, I cannot have Xarray use the scale_factor and add_offset attributes of a sparse zarr array that has fill_value=X, without all X values in the array (which represent value X, not invalid/missing data) being converted to NaN. That's frustrating. This conversion does not happen if the zarr array has fill_value=None, which has to be explicitly set in Zarr-Python, and will be invalid in v3 (I believe).

Now, this is a very specific and probably uncommon use case. But still, Zarr's fill_value and xarray/netcdf's _FillValue mean different things, and I think automatically setting _FillValue to the value of the zarr fill_value on read is incorrect and makes it hard to work with non-Xarray/non-CF Zarr data in Xarray.

To resolve this, I suggest that on write, Xarray writes the dataset _FillValue property to a _FILL_VALUE attribute on the Zarr array (similar to _ARRAY_DIMENSIONS) instead of to the fill_value property, and then on read, Xarray sets the dataset _FillValue property based on that new attribute, NOT the zarr array fill_value property. That would preserve roundtrip behavior of Xarray to/from Zarr, while not messing with the loading of non-Xarray/non-CF Zarr arrays with fill_value into Xarray. This would also help with eventual Zarr v3 support. Naming the attribute something else like _MASK_VALUE or _MISSING_VALUE might be less confusing.

However, this approach would break the default masking behavior when loading existing Zarr data that lacks this new _FILL_VALUE attribute, particularly those Zarr files written by Xarray which used fill_value to mean _FillValue. Maybe if a user wants to set the dataset _FillValue property to the zarr array fill_value property, they could provide a new argument use_fill_value in open_zarr. It could have default True and then switch to default False after some time and deprecation warnings? I'm not sure how best to handle that.

@d70-t
Contributor Author

d70-t commented Apr 2, 2024

Thanks @rly for coming back on this. I agree, this issue still persists and might become worse in Zarr v3. There's another aspect which might be problematic:

From CF-Conventions §2.5.1:

Missing data is not allowed in coordinate variables.

So if fill_value of Zarr v3 always has to be defined (i.e. be non-null) for all variables, the current behaviour would be directly in conflict with CF-Conventions.

@rly, why would you suggest to invent a new attribute name? CF-Convention says (same paragraph):

If only one missing value is needed for a variable then we recommend that this value be specified using the _FillValue attribute.

So I would "just" call the attribute _FillValue in zarr attributes.

I'd probably suggest something like this as a strategy to fix this:

  • xarray could start writing the _FillValue attribute now (e.g. in addition to fill_value), without much of a change in behaviour. This would prepare newly written datasets for the upcoming changes.
  • at some point, xarray (maybe even now) could prefer the _FillValue attribute from the zarr attributes over fill_value if it exists. This could come with a warning if the two differ in order to educate people, but eventually this warning should just go away, because we'd expect a difference between _FillValue and fill_value to be valid (as detailed in above comments).
  • somewhere in the future, using fill_value to set _FillValue could probably be deprecated (maybe with an option to re-activate it manually upon read).
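The second bullet could look roughly like this (a sketch; resolve_missing_value and its signature are made up for illustration, not xarray API):

```python
import warnings

def resolve_missing_value(attrs: dict, fill_value):
    """Hypothetical decode-time logic: prefer an explicit _FillValue
    attribute over the zarr fill_value, warning when the two disagree.
    This is a sketch, not xarray's implementation."""
    if "_FillValue" in attrs:
        missing = attrs["_FillValue"]
        if fill_value is not None and missing != fill_value:
            warnings.warn(
                f"_FillValue ({missing!r}) differs from zarr fill_value "
                f"({fill_value!r}); using _FillValue for masking."
            )
        return missing
    # Legacy behaviour: fall back to the zarr fill_value.
    return fill_value

# A dataset written with both keys: masking uses _FillValue.
print(resolve_missing_value({"_FillValue": -9999}, 0))  # -9999, with a warning
```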

@Chiil

Chiil commented Jun 11, 2024

I ran into the same problem as well. This simple code fails because xarray interprets the default fill value of zero as a missing value. It would be great if this gets fixed, because it is easy to overlook.

import zarr
import xarray as xr
import numpy as np

with zarr.open('dump.zarr', mode='w') as z:
    z_time = z.zeros('time', shape=(0), chunks=(1), dtype='f4')
    z_time.attrs['_ARRAY_DIMENSIONS'] = ['time']

for i in range(0, 10):
    with zarr.open('dump.zarr', mode='r+') as z:
        ii = np.array([i], dtype='f4')
        z['time'].append(ii)

with zarr.open('dump.zarr', mode='r') as z:
    print(z['time'].info)
    print(z['time'][:])

with xr.open_dataset('dump.zarr') as z:
    print(z)
    print(z['time'])

The latter print gives a nan for the value time equals 0.0.

@dcherian
Contributor

In the meantime, you can set mask_and_scale=False in open_dataset and it should avoid this problem

@TomAugspurger
Contributor

@jhamman I'm looking into getting xarray ready for Zarr v3 and bumped into this pretty quickly. AFAICT, zarr v3 requires that fill_value be in the array metadata, and so xarray's "overloading" fill value to mean "invalid data to be masked out" causes issues.

I'm getting up to speed on xarray's Zarr reader now and the implications of the change, but for now it seems like just not using fill_value in the zarr reader is maybe the best option. If you or Deepak have a better suggestion then LMK.

@dcherian
Contributor

dcherian commented Sep 9, 2024

What is the issue? Are you seeing a roundtripping problem?

@TomAugspurger
Contributor

Yes, here's the smallest reproducer I have (this won't run for you though. I have some other changes to zarr-python fixing some compatibility issues):

import zarr
import xarray as xr

store = zarr.store.MemoryStore(mode="w-")
ds = xr.Dataset({"x": xr.DataArray([0, 2, 3], name="x")})
ds.to_zarr(store)

result = xr.open_dataset(store, engine="zarr", mode="w-")
xr.testing.assert_identical(ds, result)

which fails with

Differing data variables:
L   x        (dim_0) int64 24B 0 2 3
R   x        (dim_0) float64 24B nan 2.0 3.0

The (valid) 0 at position 0 in the array has been masked out and replaced by NaN, I believe in https://github.com/pydata/xarray/blob/main/xarray/coding/variables.py#L569-L597, thanks to the _FillValue read from the zarr.json at

# _FillValue needs to be in attributes, not encoding, so it will get
# picked up by decode_cf
if getattr(zarr_array, "fill_value") is not None:
    attributes["_FillValue"] = zarr_array.fill_value

From Zarr v2 to v3, the difference is that v3 explicitly writes the fill_value into the array metadata.


As far as I can tell, the risk with ignoring fill_value in xarray is low... With zarr-python 2.x:

>>> store = {}
>>> ds = xr.Dataset({"x": xr.DataArray([0, np.nan, 3], name="x", attrs={"_FillValue": 0.0})})
>>> ds.to_zarr(store)
>>> result = xr.open_dataset(store, engine="zarr", mode="r")
>>> print(result.x.load())
<xarray.DataArray 'x' (dim_0: 3)> Size: 24B
array([nan, nan,  3.])
Dimensions without coordinates: dim_0

I can't really tell if that's expected according to CF-conventions or not, but it seems surprising. That said, that seems to be a fundamental limitation of encoding missing value "in-band". After encoding the data (by setting NaN to 0) we can't tell whether a 0 means 0 or NaN. This does match the behavior of Dataset.to_netcdf when _FillValue is set in the attrs, so I guess the behavior is correct.
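The in-band limitation can be shown in a few lines of NumPy (illustration only): once NaN has been encoded as the fill value, a 0 meaning 0 and a 0 meaning NaN are the same bytes on disk.

```python
import numpy as np

# Encode: NaN is written as the fill value; decode: every fill value
# becomes NaN again. The valid leading 0 is indistinguishable from the
# encoded NaN, so it is lost.
fill_value = 0.0
original = np.array([0.0, np.nan, 3.0])
encoded = np.where(np.isnan(original), fill_value, original)
decoded = np.where(encoded == fill_value, np.nan, encoded)
print(decoded)
```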


I think it comes down to the decision on whether or not Zarr's fill_value should be automatically interpreted as the _FillValue when reading and writing. IMO, they seem different.

@dcherian
Contributor

but it seems surprising.
so I guess the behavior is correct.

The behaviour is correct from Xarray's POV but this example isn't well defined from a data POV. For one, in netCDF land _FillValue is supposed to be outside the range of valid data (reference).

And yeah, Xarray's fill value decoding is already lossy (it doesn't distinguish between _FillValue and missing_value) so the roundtrip test should only work with decode_cf=False.

Zarr's fill_value should be automatically interpreted as the _FillValue when reading and writing. IMO, they seem different.

It's confusing and the meaning has become overloaded over time, but the original intent seems the same to me (treating zarr's write_empty_chunks as an implementation detail) (reference):
(1) "The _FillValue attribute specifies the fill value used to pre-fill disk space allocated to the variable."
(2) "The fill value is returned when reading values that were never written."

@TomAugspurger
Contributor

TomAugspurger commented Sep 10, 2024

What are our options (spoiler: after writing this up I think 4 is most promising)?

  1. Change the Zarr v3 spec to make fill_value optional?

This would effectively restore the Zarr v2 behavior: the "surprising" behavior of valid values potentially being masked as invalid would only affect users where _FillValue / fill_value happens to be set explicitly.

I don't really like this option. Ideally zarr datasets are portable across implementations, and having some implementations interpret the same dataset differently because they happen to have different default fill values makes me uneasy.

  2. Use different default fill_values for zarr-python v3?

The fact that zarr-python v3 uses 0 as the default fill value when creating an array in memory makes this more problematic in practice. Using other default fill values (NaN for float, NaT / INT_MIN for datetime, INT_MIN for int?) might mean users run into this less often. But it doesn't help for bool at all. uint would need to choose 0 to be consistent with the others, or INT_MAX. Meh.

  3. Change the default in xarray to not mask these values?

I don't know what the tolerance is here.

  4. Make the masking function smarter to only mask uninitialized regions when decoding

IIUC, the main purpose of this _FillValue is "reading values that were never written." What if we only applied the masking in CFMaskCoder at

def _apply_mask(
    data: np.ndarray,
    encoded_fill_values: list,
    decoded_fill_value: Any,
    dtype: np.typing.DTypeLike,
) -> np.ndarray:
    """Mask all matching values in a NumPy arrays."""
    data = np.asarray(data, dtype=dtype)
    condition = False
    for fv in encoded_fill_values:
        condition |= data == fv
    return np.where(condition, decoded_fill_value, data)

only to chunks that weren't initialized? I believe that zarr-python knows which chunks of an Array are initialized.

Then we avoid the situation where you have some valid values cast to NaN (or whatever) since they happened to be equal to fill value. You wouldn't have any valid values in the chunk since it was never initialized.

This might be better done in Zarr-Python, and IIRC there's some desire to move some of the CF related logic into Zarr Python as new codecs.
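Option 4 could be sketched like this, assuming per-chunk initialization info is available from zarr-python (the boolean initialized array below stands in for that; it is not a real zarr API):

```python
import numpy as np

# Sketch: values equal to the fill value inside *written* chunks
# survive; only positions in never-written chunks are masked.
chunk_size = 3
data = np.array([0.0, 1.0, 2.0, 0.0, 0.0, 0.0])   # two chunks of three
initialized = np.array([True, False])              # chunk 1 was never written
fill_value = 0.0

mask = np.repeat(~initialized, chunk_size)         # expand per-chunk flag
decoded = np.where(mask & (data == fill_value), np.nan, data)
print(decoded)
```

The valid 0.0 at position 0 survives because its chunk was written, while the unwritten second chunk decodes entirely to NaN.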

@dcherian
Contributor

dcherian commented Sep 12, 2024

TLDR: perhaps we change the tests instead?


I'm not sure there is a good overarching solution.

fill_value can either be semantically valid (e.g. the sparse array case or "background value"; this issue) or semantically invalid (_FillValue or missing_value in CF/netCDF case). We can't decide which meaning the user has assigned to Zarr's fill_value without extra information (NUG/CF provides valid_min, valid_max, valid_range, and missing_value for example).

Xarray's mask_and_scale decoder assumes _FillValue is to be interpreted as semantically invalid always and that it is best represented in-memory by a NaN. IMO this is the core problem. This can be prevented by explicitly setting mask_and_scale=False and manually masking if you need to.


Responding to your suggestions
(1) Change the Zarr v3 spec to make fill_value optional? +1 I don't understand why fill_value must be required. IIRC even netcdf has a no-fill mode.
(2) Use different default fill_values for zarr-python v3? implicit default fill values aren't great especially since the meaning is already not-explicit. netCDF4 does have these but Xarray does not use them (#2742)
(3) Change the default in xarray to not mask these values? We should allow more sophisticated control over the decoding pipeline (there are already requests for this). For example, only mask missing_value when present, etc. It seems too late to change this default.
(4) Make the masking function smarter to only mask uninitialized regions when decoding? This ignores the semantic issues. An uninitialized region could just be a sparse array chunk with no non-fill-value data.


In general, I think we should distinguish between what is recorded on disk, and what is interpreted as an in-memory data model. To me, it seems this issue is about two things:

  1. Imperfect recording of semantic meaning on disk
  2. Xarray's imperfect decoding of the recorded information (no support for masked arrays, no handling of valid_* and blindly assuming _FillValue is semantically invalid or missing data).

This implies that we need to have two subtly different roundtripping tests:
(1) being exactly correct roundtripping data from the backend with decode_cf=False and potentially a new encode_cf=False. This way we know we receive and send unmodified values to the backend.
(2) being exactly correct when writing any Xarray object to disk and reading it back. So we ensure that Xarray is internally consistent, and Xarray users can roundtrip their data perfectly.

@TomAugspurger
Contributor

TomAugspurger commented Sep 17, 2024

Thanks, I've been thinking this over since yesterday and am still not sure where I sit. I just keep coming back to this being a really bad behavior

In [1]: import xarray as xr

In [2]: ds = xr.Dataset({"x": xr.DataArray([0, 1, 2], name="x")})

In [3]: store = {}

In [4]: ds.to_zarr(store)
Out[4]: <xarray.backends.zarr.ZarrStore at 0x10914d640>

In [5]: xr.open_dataset(store, engine="zarr").load()
Out[5]:
<xarray.Dataset> Size: 24B
Dimensions:  (dim_0: 3)
Dimensions without coordinates: dim_0
Data variables:
    x        (dim_0) float64 24B nan 1.0 2.0

So I guess we agree that xarray interpreting values equal to fill_value as NaN (at least unconditionally) is the core problem.

With that said, would xarray be open to me implementing #8359, which AFAICT proposes to only do the masking for values that happen to fall outside of valid range defined by those attributes, and so are "semantically invalid"?

Under that proposal, my snippet would correctly round trip. To restore the old behavior, users would need to specify the valid_min / max / range in the attrs (or encoding?), such that the fill_value (which zarr defaults to 0, but could be overridden) falls outside the valid range.
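A sketch of that valid-range-based masking (the decode_with_valid_range helper is hypothetical, not the #8359 implementation): only values outside the declared valid range are treated as semantically invalid.

```python
import numpy as np

def decode_with_valid_range(data, valid_min=None, valid_max=None):
    """Hypothetical decoder: mask only values outside the declared
    valid range, instead of everything equal to fill_value."""
    data = np.asarray(data, dtype=float)
    invalid = np.zeros(data.shape, dtype=bool)
    if valid_min is not None:
        invalid |= data < valid_min
    if valid_max is not None:
        invalid |= data > valid_max
    return np.where(invalid, np.nan, data)

# Without valid_* attrs nothing is masked, so a default fill_value of 0
# roundtrips; with valid_min=1 the out-of-range 0 is masked as missing.
print(decode_with_valid_range([0, 1, 2]))
print(decode_with_valid_range([0, 1, 2], valid_min=1))
```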


zarr-developers/zarr-specs#133 has the discussion on making fill_value required in the Zarr v3 spec. I'm pretty convinced by them: it seems... problematic for a simple operation like "open an array and read a value" to give different answers depending on what default fill value your zarr implementation happens to use. But that's a discussion for the zarr devs, and I think xarray should probably handle both cases well (since there will always be some datasets specifying the fill value).

@d70-t
Contributor Author

d70-t commented Sep 17, 2024

Yes, I'd agree the core issue is: xarray treats zarr's fill_value as NaN, unconditionally. In many cases, that is a really bad behavior.

With the recent discussion, I guess I see that the original idea of netCDF's _FillValue is similar to zarr's fill_value, so it seems a sensible choice to use them interchangeably. I guess the deeper issue is that, throughout netCDF's history, _FillValue has been confused with missing_value (one fills empty space, one indicates no data, and the two may or may not be the same). This confusion isn't ideal in the netCDF world either, but people are now somewhat used to it, and as it's optional, it's not too bad.

So if xarray's zarr ties fill_value to _FillValue, it's also tied to missing_value and we inherit all the confusion. But to make things work (think especially of bool type data), fill_value must be disconnected from missing_value. This either can be done

  • by disconnecting fill_value from _FillValue or
  • by disconnecting _FillValue from missing_value

I'd guess the former would be simpler, the latter would be more correct / elegant.

@rabernat
Contributor

I am the one who originally implemented the current logic (equating Zarr's fill_value with CF _FillValue), and I am quite comfortable agreeing with what everyone has said above: the current logic is wrong and makes no sense. 😆 When I wrote this code, I just naively saw the name "fill value" and assumed that it was the same thing, without really thinking through the semantics in detail, as has been done above.

I believe we can and should fix this by changing Xarray's behavior, with minimal impact on users.

A useful concept for me is to realize that the Zarr / Xarray stack actually has two distinct layers of encoding (Zarr's own "filters" plus Xarray's CF encoding), as opposed to the NetCDF / Xarray stack, which only has one (implemented by Xarray). Using CF conventions with Zarr creates many ambiguities, as it is not always clear where in the stack the encoding / decoding should be applied.

Converting _FillValue to NaN is an example of a NetCDF-specific decoding that Xarray applies when reading both NetCDF and Zarr. It makes sense for NetCDF. But it doesn't make sense to do the same with Zarr's fill_value. It may be the case that Zarr data with values equal to fill_value is perfectly valid, not to be interpreted as "missing data". The Zarr fill_value is Zarr specific and should not receive special treatment in Xarray, other than keeping track of it in encoding.

To recreate the behavior we have today, we could set the Zarr array attribute missing_value to be equal to the array fill_value.

I would be inclined not to worry too much about backwards compatibility with existing behavior. This needs to be fixed and the sooner we do so, the better off everyone will be.

@rabernat
Contributor

I just spent the past hour re-reading all of the NetCDF and CF documentation on this, and reached the same conclusion that @dcherian did a few weeks ago (#5475 (comment)): this is a mess. It's truly ambiguous how to interpret _FillValue.

From the CF Conventions:

The scalar attribute with the name _FillValue and of the same type as its variable is recognized by the netCDF library as the value used to pre-fill disk space allocated to the variable... The _FillValue should be outside the range specified by valid_range (if used) for a variable... Applications that process variables that have attributes to indicate both a transformation (via a scale and/or offset) and missing values should first check that a data value is valid, and then apply the transformation. Note that values that are identified as missing should not be transformed. Since the missing value is outside the valid range it is possible that applying a transformation to it could result in an invalid operation.

This is how we do decoding today in Xarray: first mask _FillValue, then scale

if mask_and_scale:
    for coder in [
        variables.CFMaskCoder(),
        variables.CFScaleOffsetCoder(),
    ]:
        var = coder.decode(var, name=name)

However, missing_value is treated differently, according to the NUG:

This attribute is not treated in any special way by the library or conforming generic applications, but is often useful documentation and may be used by specific applications. The missing_value attribute can be a scalar or vector containing values indicating missing data. These values should all be outside the valid range so that generic applications will treat them as missing.

When scale_factor and add_offset are used for packing, the value(s) of the missing_value attribute should be specified in the domain of the data in the file (the packed data), so that missing values can be detected before the scale_factor and add_offset are applied.

This means that we should be reversing the order of the decoding operations depending on whether we are masking a _FillValue (in raw data space) or missing_value in scaled data space. AFAICT this does not happen in Xarray. Instead, we treat the two as identical and raise an error if they differ:

if fv_exists and mv_exists and not duck_array_ops.allclose_or_equiv(fv, mv):
    raise ValueError(
        f"Variable {name!r} has conflicting _FillValue ({fv}) and missing_value ({mv}). Cannot encode data."
    )

@rabernat
Contributor

Final thought in my rant: another reason this is very hard to solve is the lack of true missing value support in Xarray. We have no good way to actually mask int data. Since int has no obvious sentinel value for missing values (unlike float, which has NaN), we have to resort to treating Zarr's fill_value as a mask. That, combined with the default fill_value of 0, is the cause of the bug described by @TomAugspurger above.

Ok, enough complaining, now time to think about how to actually solve this. 😆

A simple workaround that would solve the most obvious bug (int array with 0 as missing values) could be to just stop interpreting fill_value as missing data for int Zarr arrays. We could perhaps introduce an internal interpret_fill_value_as_missing_data flag in CFMaskCoder and toggle this on and off in different scenarios.

A more comprehensive fix would involve a deep rethinking of how we handle missing data in both Xarray and Zarr.

@keewis
Collaborator

keewis commented Sep 29, 2024

indeed, the best way to resolve this would involve having true missing value support in our in-memory arrays (and adapting skipna to do the right thing for those)

There are two options (as described in NEP 26): masked arrays (i.e. an additional mask array directly on the array) and bit-pattern based missing values on the dtypes.

For the former, there are several promising attempts, most recently marray (as I discovered earlier today, this already works but is incomplete – see #1194 (comment); to know how much, we'll have to either investigate manually or make progress on xarray-array-testing).

For the latter, we'd need to have someone dive into the custom dtype API that has already been used to define the new string dtypes. I also played around with that earlier today, but it looks like it will take me a lot more time to figure out since I don't really know the python and numpy C APIs too well. (In any case I don't want to be the bottleneck, so if someone wants to help out / take over I'd be happy to collaborate – and I'm not involved in that but I guess the same goes for marray)

Edit: though if we want to keep this focused on encoding / decoding I'll be happy to continue this discussion in #1194.

@TomAugspurger
Contributor

#9515 (comment) has a suggestion / question to set xarray's default zarr version to 2, to punt on this issue a bit.

@rabernat
Contributor

Tom, we are quite motivated to find a way to write V3, since it is the only thing that works with Icechunk.

I guess I'm failing to see why this is such a problem for V3 but not V2. Is it that V3 doesn't allow an empty fill value?

@TomAugspurger
Contributor

Ah of course I forgot about that. I was hesitant to suggest xarray not follow the library default in the first place. So let's work through the transition pains.

Is it that V3 doesn't allow an empty fill value?

Correct. So it's a problem for both v2 and v3, just more pronounced in v3 since fill_value must be specified in the array metadata.

@rabernat
Contributor

Sorry if what I'm saying below is obvious to people. Here's a summary of how it works in V2 vs V3:

Zarr 2.17.1

import zarr
a = zarr.create(shape=10, chunks=10, dtype=int)
# fill_value set to 0 by default
assert a.fill_value == 0
assert a[0] == 0
# None allowed as fill_value
b = zarr.create(shape=10, chunks=10, dtype=int, fill_value=None)
assert b.fill_value is None
# however, 0 is still being used as a fill value (no data have been written); this seems weird
# can we count on this behavior?
assert b[0] == 0 

The problem is that Xarray is using the presence of the fill_value attribute to determine whether to apply a mask, via setting the CF _FillValue attribute

if zarr_array.fill_value is not None:
attributes["_FillValue"] = zarr_array.fill_value

And, on the encoding side, setting fill_value as follows

fill_value = attrs.pop("_FillValue", None)
if v.encoding == {"_FillValue": None} and fill_value is None:
    v.encoding = {}

Here's how it works in V3 (3.0.0a7.dev4+ga3221245)

import zarr
a = zarr.create(shape=10, chunks=10, dtype=int)
# fill_value set to 0 by default, same as V2
assert a.fill_value == 0  # passes; actually it is np.int64(0)
assert a[0] == 0
b = zarr.create(shape=10, chunks=10, dtype=int, fill_value=None)
# fails! it is still np.int64(0)
assert b.fill_value is None
# passes
assert b[0] == 0

So even though the V3 array behaves the same as the V2 array, it lacks the ability to set fill_value to None. This means that V3 always triggers Xarray's masking behavior.

Having written this summary, the answer seems a bit more clear: we need another way for Zarr to tell Xarray whether to apply masking.

Here's an idea: for V3 arrays (and maybe even also V2), we can use the _FillValue attribute for this. Rather than removing it from the attrs:

fill_value = attrs.pop("_FillValue", None)

...we just do

fill_value = attrs.get("_FillValue", None)

If this attr is missing, we will never apply a mask, regardless of what the Zarr .fill_value is. Effectively, we are using _FillValue as a sentinel attr to hint that the data should be masked, decoupling this from the actual Zarr fill_value.
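The sentinel-attribute idea might look like this (a sketch with a plain dict standing in for zarr attributes; maybe_mask is made up for illustration, not xarray code):

```python
import numpy as np

def maybe_mask(data, attrs, zarr_fill_value):
    """Hypothetical decoder: mask only when the writer explicitly set a
    _FillValue attribute; the array's own fill_value (zarr_fill_value)
    never triggers masking on its own."""
    data = np.asarray(data, dtype=float)
    if "_FillValue" not in attrs:       # no sentinel attr: never mask
        return data
    return np.where(data == attrs["_FillValue"], np.nan, data)

# A v3 array always has a fill_value (here 0), but without the
# attribute no masking is applied and the valid zeros survive.
print(maybe_mask([0, 1, 2], {}, zarr_fill_value=0))
# An xarray-written dataset that set _FillValue gets the old behaviour.
print(maybe_mask([0, 1, 2], {"_FillValue": 0}, zarr_fill_value=0))
```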

Some questions that arise:

  • Would applying this retroactively to V2 data break existing datasets? I think so. So perhaps we would only want to do it for V3.
  • What if _FillValue and the array fill_value are different? Should that be allowed?

@TomAugspurger
Contributor

TomAugspurger commented Sep 30, 2024

however, 0 is still being used as a fill value (no data have been written); this seems weird
can we count on this behavior?

zarr-developers/zarr-python#2274 is a draft PR restoring that behavior on zarr-python 3.x for zarr-v2. But for zarr-v3, fill_value should always be non-None, so xarray can't rely on it in general.

Here's an idea: for V3 arrays [...]

That sounds reasonable to me. Users who want the masking behavior can explicitly set _FillValue as an attribute / encoding.

I do think it has the potential to break reading some v2 datasets, where people were relying on this behavior. FWIW, this is surprising enough to me that I'd be comfortable calling it a bug fix, but I'll leave that up to the experts.

@rabernat
Contributor

rabernat commented Oct 5, 2024

I'm planning to take a swing at this over the weekend.

@d70-t
Contributor Author

d70-t commented Oct 7, 2024

Here's an idea: for V3 arrays (and maybe even also V2), we can use the _FillValue attribute for this. Rather than removing it from the attrs

I'd vote for this (I guess that's also what I tried to outline in #5475 (comment)).

Would applying this retroactively to V2 data break existing datasets? I think so. So perhaps we would only want to do it for V3.

I think it could break datasets (but maybe that's fine).

  • If I get it right, writing _FillValue into new v2 datasets shouldn't hurt at all.
  • Reading a dataset where fill_value and _FillValue are both present, and preferring _FillValue over fill_value when setting values to NaN in case the two differ, should also be ok I guess (?). It might create problems when reading new-style datasets with old-style readers...
  • Not using a lone fill_value to mask values would probably be noticed, but I'd still agree with @TomAugspurger that it probably should be considered a bugfix.

One might want to explicitly write _FillValue as None to indicate opting in to a potential new behaviour where fill_value is no longer used to mask values.

What if _FillValue and the array fill_value are different? Should that be allowed?

In my opinion, this should definitely be allowed. The two serve different purposes. One might, however, spend a little thought on whether _FillValue should have a different name (e.g. missing_value)...

@rabernat
Contributor

Over in #9552 I have implemented a fix for this issue, with the new approach to fill_value in Zarr V3. It keeps things unchanged for V2 but uses _FillValue attributes in V3.
