Set explicit seed in iterable dataset ddp shuffling example #7163

Closed
alex-hh opened this issue Sep 23, 2024 · 1 comment · Fixed by #7166

Comments

Contributor

alex-hh commented Sep 23, 2024

Describe the bug

In the examples section of the IterableDataset docs (https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDataset), the DDP example shuffles without setting a seed:

from datasets.distributed import split_dataset_by_node
ids = ds.to_iterable_dataset(num_shards=512)
ids = ids.shuffle(buffer_size=10_000)  # will shuffle the shards order and use a shuffle buffer when you start iterating
ids = split_dataset_by_node(ds, world_size=8, rank=0)  # will keep only 512 / 8 = 64 shards from the shuffled lists of shards when you start iterating
dataloader = torch.utils.data.DataLoader(ids, num_workers=4)  # will assign 64 / 4 = 16 shards from this node's list of shards to each worker when you start iterating
for example in ids:
    pass

This code would - I think - raise an error due to the lack of an explicit seed:

if distributed and distributed.world_size > 1 and shuffling and shuffling._original_seed is None:
    raise RuntimeError(
        "The dataset doesn't have a fixed random seed across nodes to shuffle and split the list of dataset shards by node. "
        "Please pass e.g. `seed=42` in `.shuffle()` to make all the nodes use the same seed. "
    )
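
To illustrate why the error message asks for the same seed on every node, here is a minimal pure-Python sketch. It is not the library's implementation: the helper `shards_for_rank` and the strided slicing are assumptions made only for illustration. Each rank shuffles the list of shard ids and keeps its own slice, so the slices only partition the shards when every rank shuffles with the same seed.

```python
import random


def shards_for_rank(num_shards, world_size, rank, seed):
    """Hypothetical helper: shuffle shard ids with `seed`, keep this rank's slice."""
    order = list(range(num_shards))
    random.Random(seed).shuffle(order)
    # Strided slicing is an assumption; the real library may split differently.
    return order[rank::world_size]


num_shards, world_size = 512, 8

# Same seed on every rank: all ranks see the same permutation,
# so their slices are disjoint and together cover every shard.
shared = [shards_for_rank(num_shards, world_size, r, seed=42) for r in range(world_size)]
covered = sorted(s for part in shared for s in part)

# Different seed on each rank: every rank slices a different permutation,
# so some shards are read more than once and others are never read.
mismatched = [shards_for_rank(num_shards, world_size, r, seed=r) for r in range(world_size)]
seen = [s for part in mismatched for s in part]
duplicates = len(seen) - len(set(seen))

print("shared seed covers all shards:", covered == list(range(num_shards)))
print("duplicated shards with per-rank seeds:", duplicates)
```

With a shared seed each of the 8 ranks gets 512 / 8 = 64 distinct shards; without one, the per-rank shard lists overlap.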

Steps to reproduce the bug

Run example code

Expected behavior

Add explicit seeding to example code

Environment info

latest datasets

alex-hh changed the title from "Iterable dataset ddp shuffling example should set explicit seed" to "Set explicit seed in iterable dataset ddp shuffling example" on Sep 23, 2024

Member

lhoestq commented Sep 24, 2024

Thanks for reporting!
