From 087414b236f5e2140e1e01f6f91d450fa3d5eb1a Mon Sep 17 00:00:00 2001
From: Zhanghao Wu
Date: Sun, 27 Oct 2024 22:10:34 -0700
Subject: [PATCH] Update docs/source/examples/managed-jobs.rst

Co-authored-by: Romil Bhardwaj
---
 docs/source/examples/managed-jobs.rst | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/source/examples/managed-jobs.rst b/docs/source/examples/managed-jobs.rst
index 41535a1fa66..71acd8b0125 100644
--- a/docs/source/examples/managed-jobs.rst
+++ b/docs/source/examples/managed-jobs.rst
@@ -289,8 +289,7 @@ Jobs Restarts on User Code Failure
 By default, SkyPilot will try to recover a job when its underlying cluster is preempted or failed. Any user code failures (non-zero exit codes) are not auto-recovered.
 
 In some cases, you may want a job to automatically restart on its own failures, e.g., when a training job crashes due to a Nvidia driver issue or NCCL timeouts. To specify this, you
-can further set :code:`max_restarts_on_failure` in :code:`resources.job_recovery` in the job YAML file.
-
+can set :code:`max_restarts_on_failure` in :code:`resources.job_recovery` in the job YAML file.
 .. code-block:: yaml
 
   resources: