Set explicit seed in iterable dataset ddp shuffling example #7163

alex-hh · 2024-09-23T11:34:06Z

Describe the bug

In the examples section of the iterable dataset docs (https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.IterableDataset),
the DDP example shuffles without setting a seed:

from datasets.distributed import split_dataset_by_node
ids = ds.to_iterable_dataset(num_shards=512)
ids = ids.shuffle(buffer_size=10_000)  # will shuffle the shards order and use a shuffle buffer when you start iterating
ids = split_dataset_by_node(ds, world_size=8, rank=0)  # will keep only 512 / 8 = 64 shards from the shuffled lists of shards when you start iterating
dataloader = torch.utils.data.DataLoader(ids, num_workers=4)  # will assign 64 / 4 = 16 shards from this node's list of shards to each worker when you start iterating
for example in ids:
    pass

This code would - I think - raise an error due to the lack of an explicit seed:

datasets/src/datasets/iterable_dataset.py, lines 1707 to 1711 in 2eb4edb:

    if distributed and distributed.world_size > 1 and shuffling and shuffling._original_seed is None:
        raise RuntimeError(
            "The dataset doesn't have a fixed random seed across nodes to shuffle and split the list of dataset shards by node. "
            "Please pass e.g. `seed=42` in `.shuffle()` to make all the nodes use the same seed. "
        )
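To illustrate why the check above asks for a seed shared across nodes, here is a rough standard-library sketch. It is a toy model of "shuffle the shard list, then split it by node", not the `datasets` implementation; the helper names `shards_for_node` and `coverage` are hypothetical.

```python
import random


def shards_for_node(num_shards, world_size, rank, seed):
    """Toy model: shuffle the shard list, then keep every world_size-th shard.

    This mirrors the idea in the error message, not the library's actual code.
    """
    shards = list(range(num_shards))
    random.Random(seed).shuffle(shards)
    return shards[rank::world_size]


def coverage(num_shards, world_size, seeds):
    """Return the set of shards read by at least one node, given one seed per node."""
    seen = set()
    for rank, seed in enumerate(seeds):
        seen.update(shards_for_node(num_shards, world_size, rank, seed))
    return seen


# Same seed on all 8 nodes: every node sees the same shuffled order,
# so the 512 shards split into 8 disjoint groups of 64 and all are covered.
shared = coverage(512, 8, seeds=[42] * 8)
print(len(shared))  # 512

# A different seed per node: each node slices a different permutation,
# so the groups can overlap and some shards may never be read.
unshared = coverage(512, 8, seeds=range(8))
print(len(unshared))
```

In this toy model, per-node seeds leave part of the data unread while other shards are read more than once, which is presumably the situation the `RuntimeError` is guarding against.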

Steps to reproduce the bug

Run example code

Expected behavior

Add explicit seeding to example code

Environment info

latest datasets

lhoestq · 2024-09-24T14:40:04Z

Thanks for reporting!

alex-hh changed the title ~~Iterable dataset ddp shuffling example should set explicit seed~~ Set explicit seed in iterable dataset ddp shuffling example Sep 23, 2024

lhoestq mentioned this issue Sep 24, 2024

fix docstring code example for distributed shuffle #7166

Merged

lhoestq closed this as completed in #7166 Sep 24, 2024
