
IterableDataset strange deadlock #7147

Closed
jonathanasdf opened this issue Sep 12, 2024 · 6 comments

@jonathanasdf

Describe the bug

import datasets
import torch.utils.data


num_shards = 1024


def gen(shards):
    # Only shards 0-24 yield an example; the remaining shards are empty.
    for shard in shards:
        if shard < 25:
            yield {"shard": shard}


def main():
    dataset = datasets.IterableDataset.from_generator(
        gen,
        gen_kwargs={"shards": list(range(num_shards))},
    )
    dataset = dataset.shuffle(buffer_size=1)
    # Interleave the dataset with itself; the second copy has probability 0
    # and is never sampled.
    dataset = datasets.interleave_datasets(
        [dataset, dataset], probabilities=[1, 0], stopping_strategy="all_exhausted"
    )
    dataset = dataset.shuffle(buffer_size=1)

    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=8,
        num_workers=8,
    )

    for i, batch in enumerate(dataloader):
        print(batch)
        if i >= 10:
            break
    print()


if __name__ == "__main__":
    for _ in range(100):
        main()

Steps to reproduce the bug

Running the script above, at some point it will freeze.

  • Changing num_shards from 1024 to 25 avoids the issue
  • Commenting out the final shuffle avoids the issue
  • Commenting out the interleave_datasets call avoids the issue

As an aside, if you comment out just the final shuffle, the output from interleave_datasets is not shuffled at all, even though there's a shuffle before it. So something about that shuffle configuration is not being propagated through interleave_datasets.

Expected behavior

The script should not freeze.

Environment info

  • datasets version: 3.0.0
  • Platform: macOS-14.6.1-arm64-arm-64bit
  • Python version: 3.12.5
  • huggingface_hub version: 0.24.7
  • PyArrow version: 17.0.0
  • Pandas version: 2.2.2
  • fsspec version: 2024.6.1

I observed this with 2.21.0 initially, then upgraded to 3.0.0 and could still reproduce it.

@lhoestq
Member

lhoestq commented Sep 20, 2024

Yes, interleave_datasets seems to have an issue with shuffling. Could you open a new issue on this?

Regarding the deadlock: it comes from interleave_datasets with probabilities=[1, 0] when a worker ends up with an empty dataset in first position (it can be empty because the 1024 shards are distributed across 8 workers, so some workers may not receive a single example that satisfies the condition shard < 25). That creates an infinite loop: the sampler keeps trying to get examples from an empty dataset with probability 1.
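
A toy model of this (not the library's internals; the names and numbers are illustrative) shows how sampling from an always-empty stream with probability 1 never terminates:

import random

first = iter([])               # this worker's slice: only empty shards
second = iter([{"shard": 0}])  # the probability-0 copy: never sampled

for attempt in range(1000):
    idx = random.choices([0, 1], weights=[1, 0])[0]  # always picks index 0
    try:
        example = next(first)
        break
    except StopIteration:
        # "all_exhausted" restarts the exhausted stream and tries again,
        # but the restarted stream is just as empty as before.
        first = iter([])
else:
    print("1000 draws and no example produced -- the sampler spins forever")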

@jonathanasdf
Author

Opened #7156

Can the deadlock be fixed somehow? The point of IterableDataset is that we don't need to preload the entire dataset, which loses some of its meaning if we need to know how many examples are in the dataset in order to set the shards correctly.

@jonathanasdf
Author

jonathanasdf commented Sep 20, 2024

And it is kind of strange that "commenting out the final shuffle avoids the issue", since if the infinite loop is inside interleave_datasets you'd expect it to happen regardless of the additional shuffle call?

Edit: oh, I guess without the shuffle it's guaranteed every worker gets something, but the shuffle makes it possible for some workers to get nothing.

Edit 2: maybe the shuffle could be changed so that it initially gives one example to each worker and only starts the random shuffle after that... wait, it's not about workers not getting any shards; it's about a worker getting shards where every shard it gets is empty.

Edit 3: if it's trying to get samples from empty datasets, it should be getting back a StopIteration, and "all_exhausted" should mean it eventually discovers that all of its datasets are empty, at which point it should just raise StopIteration itself. So it seems like there is a reasonable behavior for this case?

@lhoestq
Member

lhoestq commented Sep 21, 2024

Well, the second dataset passed to interleave_datasets is never exhausted, since it's never sampled. But we could also say that the stream of examples from the second dataset is empty if it has probability 0, so I opened #7157 to fix the infinite loop by ignoring datasets with probability 0. Let me know what you think!

@jonathanasdf
Author

Thanks for taking a look!

I think you're right that this is ultimately an issue the user opts into by specifying a dataset with probability 0, because the user is basically saying "I want to force this interleave_datasets call to run forever", and yet one of the workers can end up having only empty shards to mix...

That said, it's probably not a good idea to just change the behavior of interleave_datasets with probability 0; I can't be the only one who uses it to repeat many different datasets (since there is no datasets.repeat() function). https://xkcd.com/1172/

I think just knowing that filtering out probability-0 datasets fixes the deadlock is good enough for me. I can filter them out on my side and add a restart loop around the dataloader instead, as sketched below.
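
A minimal sketch of that workaround (interleave_nonzero and loop are hypothetical helper names, not part of datasets):

import datasets

def interleave_nonzero(dsets, probabilities, **kwargs):
    # Drop datasets that can never be sampled before interleaving.
    kept = [(d, p) for d, p in zip(dsets, probabilities) if p > 0]
    kept_dsets, kept_probs = map(list, zip(*kept))
    return datasets.interleave_datasets(kept_dsets, probabilities=kept_probs, **kwargs)

def loop(dataloader):
    # The interleaved stream is now finite, so restart the dataloader
    # whenever it runs out to keep iterating indefinitely.
    while True:
        yield from dataloader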

Thanks again for investigating.

@lhoestq
Member

lhoestq commented Sep 23, 2024

Ok, I see! We can also add .repeat().
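
In the meantime, such a repeat can be emulated in user code (a hedged sketch; repeat is a hypothetical helper, and it assumes the wrapped IterableDataset can be iterated more than once):

import datasets

def repeat(ds, num_times=None):
    # Re-iterate the dataset num_times times, or forever if num_times is None.
    def gen():
        count = 0
        while num_times is None or count < num_times:
            yield from ds
            count += 1
    return datasets.IterableDataset.from_generator(gen)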
