Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataTree.to_dict() method does not behave as expected #9611

Open
5 tasks done
veni-vidi-vici-dormivi opened this issue Oct 11, 2024 · 4 comments
Open
5 tasks done

DataTree.to_dict() method does not behave as expected #9611

veni-vidi-vici-dormivi opened this issue Oct 11, 2024 · 4 comments
Labels
bug topic-DataTree Related to the implementation of a DataTree class

Comments

@veni-vidi-vici-dormivi
Copy link

What happened?

I am working with DataTree and find it very useful! However, I think I found a bug in the .to_dict() method.
When I build a DataTree from a dict with DataTree.from_dict() and then want to get the dict again with DataTree.to_dict() the resulting dict differs from the original one.

What did you expect to happen?

I expected the two dicts to be the same. Instead, the root node receives a dict entry with an empty Dataset. I argue that empty nodes should not appear in the dict.

Minimal Complete Verifiable Example

import xarray as xr
from xarray.core.datatree import DataTree

example_dict = {"set1": xr.Dataset({"var1": xr.DataArray([1, 2, 3], dims = "time")}),
                "set2": xr.Dataset({"var1": xr.DataArray([7, 8, 9], dims = "time")})}
dt = DataTree.from_dict(example_dict)
dt.to_dict() == example_dict # False
# should be True imo

# or even just
DataTree.from_dict({}).to_dict == {} # False

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

:488: RuntimeWarning: numpy.ndarray size changed, may indicate binary incompatibility. Expected 16 from C header, got 96 from PyObject

INSTALLED VERSIONS

commit: None
python: 3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:54:21) [Clang 16.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: None
LOCALE: (None, 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2024.9.0
pandas: 2.2.0
numpy: 1.26.4
scipy: 1.12.0
netCDF4: 1.7.1
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: 1.6.3
nc_time_axis: 1.4.1
iris: None
bottleneck: None
dask: 2024.2.0
distributed: 2024.2.0
matplotlib: 3.8.3
cartopy: 0.22.0
seaborn: None
numbagg: None
fsspec: 2024.2.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 69.1.0
pip: 24.0
conda: None
pytest: 8.0.1
mypy: None
IPython: 8.21.0
sphinx: 7.2.6

@veni-vidi-vici-dormivi veni-vidi-vici-dormivi added bug needs triage Issue that has not been reviewed by xarray team member labels Oct 11, 2024
Copy link

welcome bot commented Oct 11, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas
Copy link
Member

Thanks for this thoughtful issue!

I argue that empty nodes should not appear in the dict.

In the case of the root node you're right that that extra dict entry is redundant. But in general that's not the case: imagine a tree with an empty leaf node - if we dropped that from the dictionary then called DataTree.from_dict we would completely lose that node, creating a tree with a different structure instead of round-tripping. For that reason we can't just have all empty nodes not appear in the dict.

However I suppose we could have all empty "intermediate nodes" not appear in the dict, as they get automatically reconstructed... I'm not sure that's a good idea either though - just having one dict entry per node always is a lot simpler, even though it does look weird in your case.

@TomNicholas
Copy link
Member

TomNicholas commented Oct 12, 2024

Basically in order to have the round-tripping property

assert tree == DataTree.from_dict(tree.to_dict())

the rule has to be either:
a) create one dict entry per node for all nodes, even if empty,
b) create dict entry per non-empty node, plus one for every empty leaf node.

otherwise some empty nodes will be lost.

@TomNicholas TomNicholas added topic-DataTree Related to the implementation of a DataTree class and removed needs triage Issue that has not been reviewed by xarray team member labels Oct 13, 2024
@veni-vidi-vici-dormivi
Copy link
Author

Yes, I see that removing empty leaf nodes is problematic. I also realized that two different dictionaries can lead to the same DataTree:

dict1 ={"set1": xr.Dataset({"var1": xr.DataArray([1, 2, 3], dims = "time")}),
       "set2": xr.Dataset({"var1": xr.DataArray([7, 8, 9], dims = "time")})}
dict2 = {"/": xr.Dataset(),
       "/set1": xr.Dataset({"var1": xr.DataArray([1, 2, 3], dims = "time")}),
       "/set2": xr.Dataset({"var1": xr.DataArray([7, 8, 9], dims = "time")})}
dt1 = DataTree.from_dict(dict1)
dt2 = DataTree.from_dict(dict2)

xr.testing.assert_identical(dt1, dt2)

and therefore, the roundtrip property cannot work universally...

The reason why the creation of empty datasets was a problem for me in the first place is because this makes it harder to apply my functionality that worked on dictionaries of xr.Datasets before to DataTrees. Particularly, the creation of empty datasets in to_dict but also in .subtree or .items() can lead to unexpected behavior when contents of a DataTree should lead to the output of a single Dataset or DataArray, such as maybe #9349 . However, I see now how it might be better to handle the problem there and not in the to_dict method itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug topic-DataTree Related to the implementation of a DataTree class
Projects
None yet
Development

No branches or pull requests

2 participants