IterableDataset raises exception instead of retrying #6843
Comments
Thanks for reporting! I've opened a PR with a fix.
Thanks, @mariosasko! Related question (although I guess this is a feature request): could we have some kind of exponential back-off for these retries? Here's my reasoning:
An implementation of (clipped) exponential backoff actually already exists in the HuggingFace suite (here), but I don't think it is used here. The requirements are basically an initial minimum waiting time and a maximum waiting time; with each retry, the waiting time is doubled until it hits the maximum. We don't want to overload your servers with needless retries, especially when they're down 😅
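To make the suggestion concrete, here is a minimal sketch of a clipped exponential backoff loop (not the actual huggingface_hub helper); `base_wait`, `max_wait`, `max_retries`, and the exception tuple are all illustrative choices:

```python
import random
import time


def call_with_backoff(fn, max_retries=10, base_wait=1.0, max_wait=64.0,
                      retry_on=(ConnectionError, TimeoutError)):
    """Call `fn`, retrying transient errors with a doubling, clipped wait."""
    wait = base_wait
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise
            # Add a little jitter so parallel workers don't retry in lockstep.
            time.sleep(wait + random.uniform(0, 1))
            wait = min(wait * 2, max_wait)
```

With these defaults the waits go 1, 2, 4, ... up to 64 seconds; raising `max_retries` (and `max_wait`) stretches the same pattern over hours instead of minutes.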
Oh, I've just remembered that we added retries to the [...]. I agree with the exponential backoff suggestion, so I'll open another PR.
@mariosasko The call you linked indeed points to the implementation I linked in my previous comment, yes, but it has no configurability. Arguably, you want to have this hidden backoff under the hood that catches small network disturbances on the time scale of seconds -- perhaps even with hardcoded limits as is the case currently -- but you also still want a separate backoff on top of that with the configurability suggested by @lhoestq in the comment I linked. My particular use case is that I'm streaming a dataset while training on a university cluster with a very long scheduling queue. This means that when the backoff runs out of retries (which happens in under 30 seconds with the call you linked), I lose my spot on the cluster and have to queue for a whole day or more. Ideally, I should be able to specify that I want to retry for 2 to 3 hours but with more and more time between requests, so that I can smooth over hours-long outages without a setback of days.
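As a possible stopgap for the long-queue scenario, and assuming the installed `datasets` version still exposes the module-level constants its streaming retry code reads (the names below exist in recent versions but may change), one could stretch the fixed-interval retries to cover a multi-hour window:

```python
import datasets.config

# ~3 hours of retrying: 360 attempts * 30 s fixed interval.
# Note: this only lengthens the existing fixed-interval retries,
# it does not add exponential backoff.
datasets.config.STREAMING_READ_MAX_RETRIES = 360
datasets.config.STREAMING_READ_RETRY_INTERVAL = 30
```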
My runs also crash a surprising amount because the dataloader fails when the Hub has issues; some way to address this would be nice.
@mariosasko The implementation for retries is still broken and there is still no exponential back-off. HuggingFace has a two-tiered back-off:
`src/datasets/utils/file_utils.py`, lines 822 to 841 at commit `65f6eb5`
This still does not catch the correct exceptions, so no backoff happens at all; as soon as the Hub is down for more than half a minute, processes start failing. Here is a stack trace of an uncaught exception:
`requests.exceptions.ReadTimeout` is not caught, and hence the code fails after 0 retries.
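For illustration, a sketch of the kind of wrapper this comment is asking for: the retry loop also treats `requests` timeouts/connection errors and `aiohttp` client errors as transient. The helper name and the exact exception tuple are assumptions, not the library's actual code:

```python
import time

import aiohttp
import requests

# Exceptions that should be considered transient and retried.
TRANSIENT_ERRORS = (
    requests.exceptions.ReadTimeout,
    requests.exceptions.ConnectionError,
    aiohttp.ClientError,
    TimeoutError,
)


def read_with_retries(read, max_retries=20, retry_interval=5):
    """Call `read`, retrying transient network errors with a fixed interval."""
    for attempt in range(1, max_retries + 1):
        try:
            return read()
        except TRANSIENT_ERRORS:
            if attempt == max_retries:
                raise
            time.sleep(retry_interval)
```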
I merged a fix for this, thanks for reporting! It will now retry on any [...]
Describe the bug
In light of the recent server outages, I decided to look into whether I could somehow wrap my IterableDataset streams to retry rather than error out immediately. To my surprise, `datasets` already supports retries. Since a commit by @lhoestq last week, that code lives here: https://github.com/huggingface/datasets/blob/fe2bea6a4b09b180bd23b88fe96dfd1a11191a4f/src/datasets/utils/file_utils.py#L1097C1-L1111C19
If GitHub code snippets still aren't working, here's a copy:
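The literal snippet isn't reproduced here; as an approximation only, the pattern described in this thread (a fixed sleep interval and a loop that catches only `ClientError` and `TimeoutError`) looks roughly like this, with illustrative names and defaults:

```python
import time

from aiohttp import ClientError


def add_retries_to_read(read, max_retries=20, retry_interval=5):
    """Wrap a file-object `read` so transient disconnects are retried."""
    def read_with_retries(*args, **kwargs):
        for attempt in range(1, max_retries + 1):
            try:
                return read(*args, **kwargs)
            # Only these two exception types are caught -- which is exactly
            # the limitation described below.
            except (ClientError, TimeoutError):
                if attempt == max_retries:
                    raise
                time.sleep(retry_interval)

    return read_with_retries
```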
With the latest outage, the end of my stack trace looked like this:
Indeed, the code for retries only catches `ClientError`s and `TimeoutError`s; all other exceptions, including HuggingFace's own custom HTTP error class, are not caught. Nothing is retried, and the exception is instead propagated upwards immediately.

Steps to reproduce the bug
Not sure how to reproduce this. Maybe unplug your Ethernet cable while streaming a dataset; the issue is pretty clear from the stack trace.
Expected behavior
All HTTP errors while iterating a streamable dataset should cause retries.
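For concreteness, a minimal sketch of the kind of loop that should survive a transient outage instead of raising (the dataset name is only an example):

```python
from datasets import load_dataset

# Any streamed dataset works here; "allenai/c4" is just an example.
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)

for example in stream:
    # A brief Hub outage at this point should trigger retries/backoff,
    # not an uncaught requests/aiohttp exception that kills the training run.
    ...
```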
Environment info
Output from `datasets-cli env`:

- `datasets` version: 2.18.0
- `huggingface_hub` version: 0.20.3
- `fsspec` version: 2023.10.0