from datasets.distributed import split_dataset_by_node
ids = ds.to_iterable_dataset(num_shards=512)
ids = ids.shuffle(buffer_size=10_000)  # will shuffle the shards order and use a shuffle buffer when you start iterating
ids = split_dataset_by_node(ds, world_size=8, rank=0)  # will keep only 512 / 8 = 64 shards from the shuffled lists of shards when you start iterating
dataloader = torch.utils.data.DataLoader(ids, num_workers=4)  # will assign 64 / 4 = 16 shards from this node's list of shards to each worker when you start iterating
for example in ids:
    pass
This code would - I think - raise an error due to the lack of an explicit seed:
"The dataset doesn't have a fixed random seed across nodes to shuffle and split the list of dataset shards by node. "
"Please pass e.g. `seed=42` in `.shuffle()` to make all the nodes use the same seed. "
)
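To illustrate why the message asks for the same seed on every node, here is a minimal pure-Python sketch. It does not use `datasets` at all: `shards_for_node` and `coverage` are hypothetical helpers, and the strided selection is an assumed stand-in for however the library actually splits shards. It only shows that splitting a shuffled shard list by rank covers every shard exactly once when all nodes shuffle identically.

```python
import random

NUM_SHARDS = 512  # same numbers as the docs example
WORLD_SIZE = 8


def shards_for_node(rank, seed):
    """Shuffle the shard list with `seed`, then keep every WORLD_SIZE-th shard starting at `rank`."""
    shards = list(range(NUM_SHARDS))
    random.Random(seed).shuffle(shards)
    return shards[rank::WORLD_SIZE]


def coverage(seeds):
    """Count distinct shards read when node `rank` shuffles with `seeds[rank]`."""
    seen = set()
    for rank, seed in enumerate(seeds):
        seen.update(shards_for_node(rank, seed))
    return len(seen)


# Every node uses the same seed: the per-node slices partition the shard list.
shared = coverage([42] * WORLD_SIZE)

# Every node uses its own seed: slices come from different permutations,
# so some shards are read by several nodes and others by none.
unshared = coverage(list(range(WORLD_SIZE)))

print(f"shared seed covers {shared} shards, per-node seeds cover {unshared}")
```

With a shared seed all 512 shards are covered; with per-node seeds some shards are duplicated and others are never read.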
Steps to reproduce the bug
Run example code
Expected behavior
Add explicit seeding to example code
Environment info
latest datasets
alex-hh changed the title from "Iterable dataset ddp shuffling example should set explicit seed" to "Set explicit seed in iterable dataset ddp shuffling example" on Sep 23, 2024
Describe the bug
In the examples section of the iterable dataset docs (https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDataset), the DDP example shuffles without setting a seed.
This code would - I think - raise an error due to the lack of an explicit seed:
datasets/src/datasets/iterable_dataset.py, lines 1707 to 1711 in 2eb4edb