Dataset.with_format behaves inconsistently with documentation #6950

Closed
iansheng opened this issue Jun 4, 2024 · 2 comments
Labels
documentation Improvements or additions to documentation

Comments


iansheng commented Jun 4, 2024

Describe the bug

The actual behavior of the interface Dataset.with_format is inconsistent with the documentation.
https://huggingface.co/docs/datasets/use_with_pytorch#n-dimensional-arrays
https://huggingface.co/docs/datasets/v2.19.0/en/use_with_tensorflow#n-dimensional-arrays

The documentation states:

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
In particular, a PyTorch formatted dataset outputs nested lists instead of a single tensor.
A TensorFlow formatted dataset outputs a RaggedTensor instead of a single tensor.

But I get a single tensor by default, which is inconsistent with the description.

The current behavior actually seems more reasonable to me, so the documentation should be updated to match it.

Steps to reproduce the bug

>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([[1, 2],
        [3, 4]])}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}
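
The torch half of the transcript above could also be run as a script rather than in the REPL. This is only a sketch: `summarize` and `check_with_format` are hypothetical helper names introduced here, not part of `datasets`, and the result may differ on versions other than those listed under "Environment info".

```python
def summarize(value):
    """Describe a formatted value: a single array-like object or a nested list."""
    if hasattr(value, "shape"):
        # Tensors expose .shape; report the type name and the shape as a tuple.
        return f"{type(value).__name__} with shape {tuple(value.shape)}"
    if isinstance(value, (list, tuple)):
        return f"{type(value).__name__} of {len(value)} items"
    return type(value).__name__


def check_with_format():
    """Return True if the torch-formatted row is a single tensor (hypothetical helper)."""
    # Imported lazily so summarize() stays usable without these packages installed.
    import torch
    from datasets import Dataset

    # Same data as in the issue: two rows, each a 2x2 nested list.
    data = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    ds = Dataset.from_dict({"data": data}).with_format("torch")
    value = ds[0]["data"]
    print(summarize(value))
    return isinstance(value, torch.Tensor)
```

Calling `check_with_format()` with `datasets` and `torch` installed should return `True` if the behavior matches the transcript above, and `False` if the row comes back as a nested list as the old documentation described.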

Expected behavior

>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': [tensor([1, 2]), tensor([3, 4])]}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}

Environment info

datasets==2.19.1
torch==2.1.0
tensorflow==2.13.1

Member

lhoestq commented Jun 4, 2024

Hi! It seems this paragraph of the documentation was outdated.

I fixed it here: #6956

albertvillanova added the documentation label Jun 5, 2024
@iansheng
Author

Fixed.
