
IterableDataset strange deadlock #7147

Closed
jonathanasdf opened this issue Sep 12, 2024 · 6 comments

@jonathanasdf

Describe the bug

import datasets
import torch.utils.data


num_shards = 1024


def gen(shards):
    # Only shards 0-24 yield an example; the remaining shards are empty.
    for shard in shards:
        if shard < 25:
            yield {"shard": shard}


def main():
    dataset = datasets.IterableDataset.from_generator(
        gen,
        gen_kwargs={"shards": list(range(num_shards))},
    )
    dataset = dataset.shuffle(buffer_size=1)
    # Interleave the dataset with itself; the second copy has probability 0
    # and is never sampled.
    dataset = datasets.interleave_datasets(
        [dataset, dataset], probabilities=[1, 0], stopping_strategy="all_exhausted"
    )
    dataset = dataset.shuffle(buffer_size=1)

    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=8,
        num_workers=8,
    )

    for i, batch in enumerate(dataloader):
        print(batch)
        if i >= 10:
            break
    print()


if __name__ == "__main__":
    for _ in range(100):
        main()

Steps to reproduce the bug

Running the script above, at some point it will freeze.

  • Changing num_shards from 1024 to 25 avoids the issue
  • Commenting out the final shuffle avoids the issue
  • Commenting out the interleave_datasets call avoids the issue

As an aside, if you comment out just the final shuffle, the output from interleave_datasets is not shuffled at all, even though there's a shuffle before it. So something about that shuffle configuration is not being propagated through interleave_datasets.

Expected behavior

The script should not freeze.

Environment info

  • datasets version: 3.0.0
  • Platform: macOS-14.6.1-arm64-arm-64bit
  • Python version: 3.12.5
  • huggingface_hub version: 0.24.7
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.6.1

I observed this with 2.21.0 initially, then upgraded to 3.0.0 and could still reproduce it.

@lhoestq
Member

lhoestq commented Sep 20, 2024

Yes, interleave_datasets seems to have an issue with shuffling. Could you open a new issue on this?

Regarding the deadlock: it comes from interleave_datasets with probabilities=[1, 0] when a worker ends up with an empty dataset in first position (it can be empty because the 1024 shards are distributed across 8 workers, so some workers may not receive a single example that satisfies the condition shard < 25). That creates an infinite loop: the sampler keeps trying to get examples from an empty dataset with probability 1.
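
A toy model of this (not the library's internals; the names and numbers are illustrative) shows how sampling from an always-empty stream with probability 1 never terminates:

import random

first = iter([])               # this worker's slice: only empty shards
second = iter([{"shard": 0}])  # the probability-0 copy: never sampled

for attempt in range(1000):
    idx = random.choices([0, 1], weights=[1, 0])[0]  # always picks index 0
    try:
        example = next(first)
        break
    except StopIteration:
        # "all_exhausted" restarts the exhausted stream and tries again,
        # but the restarted stream is just as empty as before.
        first = iter([])
else:
    print("1000 draws and no example produced -- the sampler spins forever")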

@jonathanasdf
Author

Opened #7156

Can the deadlock be fixed somehow? The point of IterableDataset is that we don't need to preload the entire dataset, which loses some of its meaning if we need to know how many examples are in the dataset in order to set the shards correctly.

@jonathanasdf
Author

jonathanasdf commented Sep 20, 2024

And it is kind of strange that "commenting out the final shuffle avoids the issue", since if the infinite loop is inside interleave_datasets you'd expect it to happen regardless of the additional shuffle call?

Edit: oh, I guess without the shuffle it's guaranteed every worker gets something, but the shuffle makes it possible for some workers to get nothing.

Edit 2: maybe the shuffle could be changed so that it initially gives one example to each worker and only starts the random shuffle after that... wait, it's not about workers not getting any shards; it's about a worker getting shards where every shard it gets is empty.

Edit 3: if it's trying to get samples from empty datasets, it should be getting back a StopIteration, and "all_exhausted" should mean it eventually discovers that all of its datasets are empty, at which point it should just raise StopIteration itself. So it seems like there is a reasonable behavior for this case?

@lhoestq
Member

lhoestq commented Sep 21, 2024

Well, the second dataset passed to interleave_datasets is never exhausted, since it's never sampled. But we could also say that the stream of examples from the second dataset is empty if it has probability 0, so I opened #7157 to fix the infinite loop by ignoring datasets with probability 0. Let me know what you think!

@jonathanasdf
Author

Thanks for taking a look!

I think you're right that this is ultimately an issue the user opts into by specifying a dataset with probability 0, because the user is basically saying "I want to force this interleave_datasets call to run forever", and yet one of the workers can end up having only empty shards to mix...

That said, it's probably not a good idea to just change the behavior of interleave_datasets with probability 0; I can't be the only one who uses it to repeat many different datasets (since there is no datasets.repeat() function). https://xkcd.com/1172/

I think just knowing that filtering out probability-0 datasets fixes the deadlock is good enough for me. I can filter them out on my side and add a restart loop around the dataloader instead, as sketched below.
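
A minimal sketch of that workaround (interleave_nonzero and loop are hypothetical helper names, not part of datasets):

import datasets

def interleave_nonzero(dsets, probabilities, **kwargs):
    # Drop datasets that can never be sampled before interleaving.
    kept = [(d, p) for d, p in zip(dsets, probabilities) if p > 0]
    kept_dsets, kept_probs = map(list, zip(*kept))
    return datasets.interleave_datasets(kept_dsets, probabilities=kept_probs, **kwargs)

def loop(dataloader):
    # The interleaved stream is now finite, so restart the dataloader
    # whenever it runs out to keep iterating indefinitely.
    while True:
        yield from dataloader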

Thanks again for investigating.

@lhoestq
Member

lhoestq commented Sep 23, 2024

Ok, I see! We can also add .repeat().
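
In the meantime, such a repeat can be emulated in user code (a hedged sketch; repeat is a hypothetical helper, and it assumes the wrapped IterableDataset can be iterated more than once):

import datasets

def repeat(ds, num_times=None):
    # Re-iterate the dataset num_times times, or forever if num_times is None.
    def gen():
        count = 0
        while num_times is None or count < num_times:
            yield from ds
            count += 1
    return datasets.IterableDataset.from_generator(gen)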
