Commit

Merge branch 'jobs-max-retry-on-failure' of github.com:skypilot-org/skypilot into jobs-max-retry-on-failure
Michaelvll committed Oct 28, 2024
2 parents da26fc1 + 087414b commit bea7fe0
Showing 1 changed file with 1 addition and 2 deletions.
3 changes: 1 addition & 2 deletions docs/source/examples/managed-jobs.rst
@@ -289,8 +289,7 @@ Jobs Restarts on User Code Failure
By default, SkyPilot will try to recover a job when its underlying cluster is preempted or failed. Any user code failures (non-zero exit codes) are not auto-recovered.

In some cases, you may want a job to automatically restart on its own failures, e.g., when a training job crashes due to a Nvidia driver issue or NCCL timeouts. To specify this, you
- can further set :code:`max_restarts_on_failure` in :code:`resources.job_recovery` in the job YAML file.
-
+ can set :code:`max_restarts_on_failure` in :code:`resources.job_recovery` in the job YAML file.
.. code-block:: yaml
resources:
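
(The hunk is cut off at this point.) As a rough illustration of the setting described in the paragraph above, a job YAML might nest the field as sketched here. The nesting follows the :code:`resources.job_recovery` path named in the text; the value ``3`` is an arbitrary example, not taken from this commit, and the rest of the file's ``resources`` section is not shown in the diff.

.. code-block:: yaml

    # Illustrative sketch only; the restart count is an assumed example value.
    resources:
      job_recovery:
        # Restart the job up to this many times when user code
        # exits with a non-zero code.
        max_restarts_on_failure: 3
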