IterableDataset strange deadlock #7147
Yes. Then, regarding the deadlock: it has to do with interleave_datasets with probabilities=[1, 0] and workers that may end up with an empty dataset in first position (it can be empty since you distribute 1024 shards to 8 workers, so some workers may not have any example that satisfies your condition).
Opened #7156. Can the deadlock be fixed somehow? The point of IterableDataset is that we don't need to preload the entire dataset, which loses some of its meaning if we need to know how many examples are in the dataset in order to set the shards correctly.
Edit: oh, I guess without the shuffle it's guaranteed that every worker gets something, but the shuffle makes it so some workers could get nothing.
Edit3: If it's trying to get samples from empty datasets, it should be getting back a StopIteration, and "all_exhausted" should mean it eventually discovers that all of its datasets are empty, at which point it should just raise a StopIteration itself. So it seems like there is a reasonable behavior for this case?
Well, the second dataset passed to interleave_datasets is never exhausted, since it's never sampled. But we could also say that the stream of examples from the second dataset is empty if it has probability 0, so I opened #7157 to fix the infinite loop issue by ignoring datasets with probability 0. Let me know what you think!
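To make the failure mode concrete, here is a conceptual sketch of an "all_exhausted" interleaving loop (a toy model, not the library's actual implementation): with probabilities=[1, 0] the sampler only ever picks index 0, so the second iterator can never be marked exhausted and the loop spins forever once the first one runs dry.

```python
import random

def interleave_all_exhausted(iterators, probabilities, seed=0):
    # Toy model of the "all_exhausted" strategy, not datasets' real code.
    rng = random.Random(seed)
    exhausted = [False] * len(iterators)
    while not all(exhausted):
        # With probabilities=[1, 0], i is always 0: index 1 is never sampled,
        # so exhausted[1] can never become True and the loop never terminates.
        i = rng.choices(range(len(iterators)), weights=probabilities)[0]
        try:
            yield next(iterators[i])
        except StopIteration:
            exhausted[i] = True
```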
Thanks for taking a look! I think you're right that this is ultimately an issue that the user opts into by specifying a dataset with probability 0, because the user is basically saying "I want to force this dataset to never be sampled". That said, it's probably not a good idea to randomly change the behavior of interleave_datasets. I think just the knowledge that filtering out probability-0 datasets fixes the deadlock is good enough for me. I can filter them out on my side and add a restart loop around the dataloader instead. Thanks again for investigating.
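A minimal sketch of that user-side workaround, assuming you control the lists of datasets and probabilities before they reach interleave_datasets (the helper name is made up for illustration):

```python
from datasets import interleave_datasets

def interleave_nonzero(dsets, probabilities, **kwargs):
    # Drop any dataset whose sampling probability is 0 so the
    # "all_exhausted" loop can't wait forever on a never-sampled stream.
    kept = [(d, p) for d, p in zip(dsets, probabilities) if p > 0]
    if not kept:
        raise ValueError("All datasets have probability 0")
    dsets_kept, probs_kept = (list(x) for x in zip(*kept))
    return interleave_datasets(dsets_kept, probabilities=probs_kept, **kwargs)
```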
Ok, I see! We could also add .repeat() as well.
Describe the bug
Steps to reproduce the bug
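The original reproduction script isn't preserved in this excerpt; the following is a hedged reconstruction assembled from the details discussed in the thread (1024 shards, 8 DataLoader workers, a selective filter, interleave_datasets with probabilities=[1, 0], and a shuffle on each side of the interleave). The dataset contents, the filter condition, and the seeds are illustrative assumptions.

```python
import torch
from datasets import Dataset, interleave_datasets

def gen():
    for i in range(100_000):
        yield {"x": i}

# Split the stream into 1024 shards, as in the original report.
ds = Dataset.from_generator(gen).to_iterable_dataset(num_shards=1024)
ds = ds.shuffle(seed=42, buffer_size=1000)
# Selective filter: after distributing shards across 8 workers, some
# workers may end up with no example that satisfies the condition.
ds = ds.filter(lambda ex: ex["x"] % 1000 == 0)

other = Dataset.from_generator(gen).to_iterable_dataset(num_shards=1024)

mixed = interleave_datasets(
    [ds, other],
    probabilities=[1.0, 0.0],
    seed=42,
    stopping_strategy="all_exhausted",
)
mixed = mixed.shuffle(seed=42, buffer_size=1000)  # the final shuffle

loader = torch.utils.data.DataLoader(mixed, num_workers=8, batch_size=None)
for step, example in enumerate(loader):
    if step % 100 == 0:
        print(step)
```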
Run the script above; at some point it will freeze.
Changing num_shards from 1024 to 25 avoids the issue. As an aside, if you comment out just the final shuffle, the output from interleave_datasets is not shuffled at all, even though there's a shuffle before it. So something about that shuffle config is not being propagated to interleave_datasets.
Expected behavior
The script should not freeze.
Environment info
- datasets version: 3.0.0
- huggingface_hub version: 0.24.7
- fsspec version: 2024.6.1

I observed this with 2.21.0 initially, then tried upgrading to 3.0.0 and could still repro.