to_unstacked_dataset unable to reconstruct original dimensions #9541

Open
aFarchi opened this issue Sep 24, 2024 · 1 comment
Labels
needs triage: issue that has not been reviewed by an xarray team member


aFarchi commented Sep 24, 2024

What is your issue?

Hello,

I am trying to stack and then unstack a dataset with to_stacked_array() and to_unstacked_dataset(). According to the documentation, this round trip should recover the original dataset, but that is not what I observe.

>>> import xarray as xr
>>> import numpy as np
>>> ds = xr.Dataset(
...     data_vars=dict(
...         var_a=(('sample', 'dim_a'), np.random.randn(2, 1)),
...         var_b=(('sample', 'dim_b'), np.random.randn(2, 4)),
...     ),
... )
>>> ds
<xarray.Dataset> Size: 80B
Dimensions:  (sample: 2, dim_a: 1, dim_b: 4)
Dimensions without coordinates: sample, dim_a, dim_b
Data variables:
    var_a    (sample, dim_a) float64 16B -0.5696 -0.8579
    var_b    (sample, dim_b) float64 64B 0.0585 -1.219 1.702 ... 1.244 0.7397

Stacking the dataset looks correct:

>>> stacked = ds.to_stacked_array('output_feature', sample_dims=('sample',))
>>> stacked
<xarray.DataArray 'var_a' (sample: 2, output_feature: 5)> Size: 80B
array([[-0.56958696,  0.058498  , -1.21899832,  1.70180735, -0.06674016],
       [-0.85787833,  1.86201164, -1.71474761,  1.24400992,  0.73965765]])
Coordinates:
  * output_feature  (output_feature) object 40B MultiIndex
  * variable        (output_feature) <U5 100B 'var_a' 'var_b' ... 'var_b'
  * dim_a           (output_feature) object 40B 0 nan nan nan nan
  * dim_b           (output_feature) object 40B nan 0 1 2 3
Dimensions without coordinates: sample

But unstacking seems incorrect:

>>> stacked.to_unstacked_dataset('output_feature')
<xarray.Dataset> Size: 176B
Dimensions:         (sample: 2, output_feature: 4)
Coordinates:
  * output_feature  (output_feature) object 32B MultiIndex
  * dim_a           (output_feature) object 32B nan nan nan nan
  * dim_b           (output_feature) object 32B 0 1 2 3
Dimensions without coordinates: sample
Data variables:
    var_a           (sample) float64 16B -0.5696 -0.8579
    var_b           (sample, output_feature) float64 64B 0.0585 ... 0.7397

var_a should have dimensions (sample, dim_a) and var_b should have (sample, dim_b).
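To make the mismatch easy to check on any dataset, here is a small sketch that compares each variable's dimensions before and after the round trip. It only uses the calls shown above; the `roundtrip` and `mismatches` names are mine. With xarray 2024.9.0, both variables end up in `mismatches`, consistent with the output above; other versions may behave differently.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars=dict(
        var_a=(("sample", "dim_a"), np.random.randn(2, 1)),
        var_b=(("sample", "dim_b"), np.random.randn(2, 4)),
    ),
)
stacked = ds.to_stacked_array("output_feature", sample_dims=("sample",))
roundtrip = stacked.to_unstacked_dataset("output_feature")

# Collect every variable whose dimensions changed during the round trip.
mismatches = {
    name: (ds[name].dims, roundtrip[name].dims)
    for name in ds.data_vars
    if ds[name].dims != roundtrip[name].dims
}
for name, (before, after) in mismatches.items():
    print(f"{name}: {before} -> {after}")
```

An empty `mismatches` dict would mean the round trip preserved every variable's dimensions.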

The issue seems even worse when len(dim_a)>1:

>>> import xarray as xr
>>> import numpy as np
>>> ds = xr.Dataset(
...     data_vars=dict(
...         var_a=(('sample', 'dim_a'), np.random.randn(2, 2)),
...         var_b=(('sample', 'dim_b'), np.random.randn(2, 4)),
...     ),
... )
>>> stacked = ds.to_stacked_array('output_feature', sample_dims=('sample',))
>>> stacked.to_unstacked_dataset('output_feature', level=0)
<xarray.Dataset> Size: 336B
Dimensions:         (output_feature: 6, sample: 2)
Coordinates:
  * output_feature  (output_feature) object 48B MultiIndex
  * dim_a           (output_feature) object 48B 0 1 nan nan nan nan
  * dim_b           (output_feature) object 48B nan nan 0 1 2 3
Dimensions without coordinates: sample
Data variables:
    var_a           (sample, output_feature) float64 96B 0.6215 -1.72 ... nan
    var_b           (sample, output_feature) float64 96B nan nan ... 0.2421

Could it be related to the level argument of to_unstacked_dataset()?
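Whatever the cause, the behaviour I expect can be sketched by hand with plain NumPy. This is only an illustration, not how xarray implements stacking: `stack_vars` and `unstack_vars` are hypothetical helper names, and the sketch assumes every variable is 2-D with `sample` as its first axis.

```python
import numpy as np


def stack_vars(arrays):
    """Concatenate 2-D (sample, feature) arrays along the feature axis.

    Returns the stacked array and a layout: one (name, width) pair per variable.
    """
    layout = [(name, arr.shape[1]) for name, arr in arrays.items()]
    stacked = np.concatenate(list(arrays.values()), axis=1)
    return stacked, layout


def unstack_vars(stacked, layout):
    """Split the stacked array back into per-variable arrays using the layout."""
    widths = [width for _, width in layout]
    # Column indices where one variable ends and the next one starts.
    split_points = np.cumsum(widths)[:-1]
    parts = np.split(stacked, split_points, axis=1)
    return {name: part for (name, _), part in zip(layout, parts)}


rng = np.random.default_rng(0)
arrays = {
    "var_a": rng.standard_normal((2, 1)),  # (sample, dim_a)
    "var_b": rng.standard_normal((2, 4)),  # (sample, dim_b)
}
stacked, layout = stack_vars(arrays)
restored = unstack_vars(stacked, layout)
```

Here each restored array keeps its own trailing axis (length 1 for `var_a`, length 4 for `var_b`), which is the shape I would expect to_unstacked_dataset() to give back.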

Note that I used the latest released version of xarray for this test:

>>> xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.12.6 | packaged by conda-forge | (main, Sep 22 2024, 14:07:06) [Clang 17.0.6 ]
python-bits: 64
OS: Darwin
OS-release: 23.6.0
machine: arm64
processor: arm
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2024.9.0
pandas: 2.2.3
numpy: 2.1.1
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
zarr: None
cftime: None
nc_time_axis: None
iris: None
bottleneck: None
dask: None
distributed: None
matplotlib: None
cartopy: None
seaborn: None
numbagg: None
fsspec: None
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 74.1.2
pip: 24.2
conda: None
pytest: None
mypy: None
IPython: None
sphinx: None

aFarchi added the needs triage label (issue that has not been reviewed by an xarray team member) on Sep 24, 2024

welcome bot commented Sep 24, 2024

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!
