`Dataset.with_format` behaves inconsistently with documentation #6950

iansheng · 2024-06-04T09:18:32Z

Describe the bug

The actual behavior of `Dataset.with_format` does not match what the documentation describes in these sections:
https://huggingface.co/docs/datasets/use_with_pytorch#n-dimensional-arrays
https://huggingface.co/docs/datasets/v2.19.0/en/use_with_tensorflow#n-dimensional-arrays

The documentation states:

If your dataset consists of N-dimensional arrays, you will see that by default they are considered as nested lists.
In particular, a PyTorch formatted dataset outputs nested lists instead of a single tensor.
A TensorFlow formatted dataset outputs a RaggedTensor instead of a single tensor.

However, I get a single tensor by default with both formats, which contradicts this description.

The current behavior actually seems more reasonable to me, so I think the documentation should be updated rather than the code.

Steps to reproduce the bug

>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': tensor([[1, 2],
        [3, 4]])}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.Tensor: shape=(2, 2), dtype=int64, numpy=
array([[1, 2],
       [3, 4]])>}

Expected behavior (as described in the documentation)

>>> from datasets import Dataset
>>> data = [[[1, 2],[3, 4]],[[5, 6],[7, 8]]]
>>> ds = Dataset.from_dict({"data": data})
>>> ds = ds.with_format("torch")
>>> ds[0]
{'data': [tensor([1, 2]), tensor([3, 4])]}
>>> ds = ds.with_format("tf")
>>> ds[0]
{'data': <tf.RaggedTensor [[1, 2], [3, 4]]>}

Environment info

datasets==2.19.1
torch==2.1.0
tensorflow==2.13.1


lhoestq · 2024-06-04T16:40:40Z

Hi! It seems this paragraph of the documentation was outdated.

I fixed it here: #6956

iansheng · 2024-06-25T08:05:49Z

Fixed.

albertvillanova added the documentation Improvements or additions to documentation label Jun 5, 2024

iansheng closed this as completed Jun 25, 2024
