IterableDataset raises exception instead of retrying #6843
Comments
Thanks for reporting! I've opened a PR with a fix.
Thanks, @mariosasko! Related question (although I guess this is a feature request): could we have some kind of exponential back-off for these retries? Here's my reasoning:
An implementation of (clipped) exponential backoff actually already exists in the HuggingFace suite (here), but I don't think it is used here. The requirements are basically an initial minimum waiting time and a maximum waiting time; with each retry, the waiting time is doubled until it hits the maximum. We don't want to overload your servers with needless retries, especially when they're down 😅
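To make the suggestion concrete, here is a minimal sketch of a clipped exponential backoff loop (not the actual huggingface_hub helper); `base_wait`, `max_wait`, `max_retries`, and the exception tuple are all illustrative choices:

```python
import random
import time


def call_with_backoff(fn, max_retries=10, base_wait=1.0, max_wait=64.0,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call `fn`, retrying transient errors with a doubling, clipped wait."""
    wait = base_wait
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise
            # Add a little jitter so parallel workers don't retry in lockstep.
            time.sleep(wait + random.uniform(0, 1))
            wait = min(wait * 2, max_wait)
```

With these defaults the waits go 1, 2, 4, ... up to 64 seconds; raising `max_retries` (and `max_wait`) stretches the same pattern over hours instead of minutes.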
Oh, I've just remembered that we added retries to the [...]. I agree with the exponential backoff suggestion, so I'll open another PR.
@mariosasko The call you linked indeed points to the implementation I linked in my previous comment, yes, but it has no configurability. Arguably, you want to have this hidden backoff under the hood that catches small network disturbances on the time scale of seconds -- perhaps even with hardcoded limits as is the case currently -- but you also still want a separate backoff on top of that with the configurability suggested by @lhoestq in the comment I linked. My particular use case is that I'm streaming a dataset while training on a university cluster with a very long scheduling queue. This means that when the backoff runs out of retries (which happens in under 30 seconds with the call you linked), I lose my spot on the cluster and have to queue for a whole day or more. Ideally, I should be able to specify that I want to retry for 2 to 3 hours but with more and more time between requests, so that I can smooth over hours-long outages without a setback of days.
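As a possible stopgap for the long-queue scenario, and assuming the installed `datasets` version still exposes the module-level constants its streaming retry code reads (the names below exist in recent versions but may change), one could stretch the fixed-interval retries to cover a multi-hour window:

```python
import datasets.config

# ~3 hours of retrying: 360 attempts * 30 s fixed interval.
# Note: this only lengthens the existing fixed-interval retries,
# it does not add exponential backoff.
datasets.config.STREAMING_READ_MAX_RETRIES = 360
datasets.config.STREAMING_READ_RETRY_INTERVAL = 30
```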
My runs also crash a surprising amount because the dataloader fails when the Hub has issues; some way to address this would be nice.
@mariosasko The implementation for retries is still broken and there is still no exponential back-off. HuggingFace has a two-tiered back-off:
`src/datasets/utils/file_utils.py`, lines 822 to 841 at commit `65f6eb5`
This still does not catch the correct exceptions, so no backoff happens at all; as soon as the Hub is down for more than half a minute, processes start failing. Here is a stack trace of an uncaught exception:
`requests.exceptions.ReadTimeout` is not caught, and hence the code fails after 0 retries.
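For illustration, a sketch of the kind of wrapper this comment is asking for: the retry loop also treats `requests` timeouts/connection errors and `aiohttp` client errors as transient. The helper name and the exact exception tuple are assumptions, not the library's actual code:

```python
import time

import aiohttp
import requests

# Exceptions that should be considered transient and retried.
TRANSIENT_ERRORS = (
    requests.exceptions.ReadTimeout,
    requests.exceptions.ConnectionError,
    aiohttp.ClientError,
    TimeoutError,
)


def read_with_retries(read, max_retries=20, retry_interval=5):
    """Call `read`, retrying transient network errors with a fixed interval."""
    for attempt in range(1, max_retries + 1):
        try:
            return read()
        except TRANSIENT_ERRORS:
            if attempt == max_retries:
                raise
            time.sleep(retry_interval)
```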
I merged a fix for this, thanks for reporting! It will now retry on any [...]
Describe the bug
In light of the recent server outages, I decided to look into whether I could somehow wrap my IterableDataset streams to retry rather than error out immediately. To my surprise, `datasets` already supports retries. Since a commit by @lhoestq last week, that code lives here: https://github.com/huggingface/datasets/blob/fe2bea6a4b09b180bd23b88fe96dfd1a11191a4f/src/datasets/utils/file_utils.py#L1097C1-L1111C19
If GitHub code snippets still aren't working, here's a copy:
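The literal snippet isn't reproduced here; as an approximation only, the pattern described in this thread (a fixed sleep interval and a loop that catches only `ClientError` and `TimeoutError`) looks roughly like this, with illustrative names and defaults:

```python
import time

from aiohttp import ClientError


def add_retries_to_read(read, max_retries=20, retry_interval=5):
    """Wrap a file-object `read` so transient disconnects are retried."""
    def read_with_retries(*args, **kwargs):
        for attempt in range(1, max_retries + 1):
            try:
                return read(*args, **kwargs)
            # Only these two exception types are caught -- which is exactly
            # the limitation described below.
            except (ClientError, TimeoutError):
                if attempt == max_retries:
                    raise
                time.sleep(retry_interval)

    return read_with_retries
```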
With the latest outage, the end of my stack trace looked like this:
Indeed, the code for retries only catches `ClientError`s and `TimeoutError`s; all other exceptions, including HuggingFace's own custom HTTP error class, are not caught. Nothing is retried, and the exception is instead propagated upwards immediately.

Steps to reproduce the bug
Not sure how to reproduce this. Maybe unplug your Ethernet cable while streaming a dataset; the issue is pretty clear from the stack trace.
Expected behavior
All HTTP errors while iterating a streamable dataset should cause retries.
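For concreteness, a minimal sketch of the kind of loop that should survive a transient outage instead of raising (the dataset name is only an example):

```python
from datasets import load_dataset

# Any streamed dataset works here; "allenai/c4" is just an example.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

for example in stream:
    # A brief Hub outage at this point should trigger retries/backoff,
    # not an uncaught requests/aiohttp exception that kills the training run.
    ...
```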
Environment info
Output from `datasets-cli env`:

- `datasets` version: 2.18.0
- `huggingface_hub` version: 0.20.3
- `fsspec` version: 2023.10.0