Merge remote-tracking branch 'upstream/advanced-dag' into dag-execute
andylizf committed Nov 3, 2024
2 parents 059ddf4 + 3f60b07 commit 9e0892d
Showing 119 changed files with 3,124 additions and 1,795 deletions.
24 changes: 9 additions & 15 deletions Dockerfile_k8s
@@ -1,4 +1,4 @@
-FROM continuumio/miniconda3:23.3.1-0
+FROM --platform=linux/amd64 continuumio/miniconda3:23.3.1-0

# TODO(romilb): Investigate if this image can be consolidated with the skypilot
# client image (`Dockerfile`)
@@ -33,21 +33,15 @@ ENV HOME /home/sky
# Set current working directory
WORKDIR /home/sky

-# Install SkyPilot pip dependencies preemptively to speed up provisioning time
-RUN conda init && \
-    pip install wheel Click colorama cryptography jinja2 jsonschema networkx \
-    oauth2client pandas pendulum PrettyTable rich tabulate filelock packaging \
-    'protobuf<4.0.0' pulp pycryptodome==3.12.0 docker kubernetes==28.1.0 \
-    grpcio==1.51.3 python-dotenv==1.0.1 ray[default]==2.9.3 && \
+# Install skypilot dependencies
+RUN conda init && export PIP_DISABLE_PIP_VERSION_CHECK=1 && \
+    python3 -m venv ~/skypilot-runtime && \
+    PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python && \
+    $PYTHON_EXEC -m pip install 'skypilot-nightly[remote,kubernetes]' 'ray[default]==2.9.3' 'pycryptodome==3.12.0' && \
+    $PYTHON_EXEC -m pip uninstall skypilot-nightly -y && \
    curl -LO "https://dl.k8s.io/release/v1.28.11/bin/linux/amd64/kubectl" && \
-    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
-
-# Add /home/sky/.local/bin/ to PATH
-RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
-
-# Copy SkyPilot code base. This is required for the ssh jump pod to find the
-# lifecycle management scripts
-COPY --chown=sky . /skypilot/sky/
+    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl && \
+    echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc

# Set PYTHONUNBUFFERED=1 to have Python print to stdout/stderr immediately
ENV PYTHONUNBUFFERED=1
19 changes: 7 additions & 12 deletions Dockerfile_k8s_gpu
@@ -41,19 +41,14 @@ RUN curl https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x8
    eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true && conda activate base && \
    grep "# >>> conda initialize >>>" ~/.bashrc || { conda init && source ~/.bashrc; } && \
    rm Miniconda3-Linux-x86_64.sh && \
-    pip install wheel Click colorama cryptography jinja2 jsonschema networkx \
-    oauth2client pandas pendulum PrettyTable rich tabulate filelock packaging \
-    'protobuf<4.0.0' pulp pycryptodome==3.12.0 docker kubernetes==28.1.0 \
-    grpcio==1.51.3 python-dotenv==1.0.1 ray[default]==2.9.3 && \
+    export PIP_DISABLE_PIP_VERSION_CHECK=1 && \
+    python3 -m venv ~/skypilot-runtime && \
+    PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python && \
+    $PYTHON_EXEC -m pip install 'skypilot-nightly[remote,kubernetes]' 'ray[default]==2.9.3' 'pycryptodome==3.12.0' && \
+    $PYTHON_EXEC -m pip uninstall skypilot-nightly -y && \
    curl -LO "https://dl.k8s.io/release/v1.28.11/bin/linux/amd64/kubectl" && \
-    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
-
-# Add /home/sky/.local/bin/ to PATH
-RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
-
-# Copy SkyPilot code base. This is required for the ssh jump pod to find the
-# lifecycle management scripts
-COPY --chown=sky . /skypilot/sky/
+    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl && \
+    echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc

# Set PYTHONUNBUFFERED=1 to have Python print to stdout/stderr immediately
ENV PYTHONUNBUFFERED=1
106 changes: 67 additions & 39 deletions docs/source/examples/managed-jobs.rst
Expand Up @@ -5,43 +5,30 @@ Managed Jobs

.. tip::

This feature is great for scaling out: running a single job for long durations, or running many jobs (pipelines).
This feature is great for scaling out: running a single job for long durations, or running many jobs in parallel.

SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any spot preemptions or hardware failures.
It can be used in three modes:
SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any underlying spot preemptions or hardware failures.
Managed jobs can be used in three modes:

#. :ref:`Managed Spot Jobs <spot-jobs>`: Jobs run on auto-recovering spot instances. This can **save significant costs** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
#. :ref:`On-demand <on-demand>`: Jobs run on auto-recovering on-demand instances. This is useful for jobs that require guaranteed resources.
#. :ref:`Pipelines <pipeline>`: Run pipelines that contain multiple tasks (which can have different resource requirements and ``setup``/``run`` commands). This is useful for running a sequence of tasks that depend on each other, e.g., data processing, training a model, and then running inference on it.
#. :ref:`Managed spot jobs <spot-jobs>`: Jobs run on auto-recovering spot instances. This **saves significant costs** (e.g., ~70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
#. :ref:`Managed on-demand/reserved jobs <on-demand>`: Jobs run on auto-recovering on-demand or reserved instances. Useful for jobs that require guaranteed resources.
#. :ref:`Managed pipelines <pipeline>`: Run pipelines that contain multiple tasks (which
can have different resource requirements and ``setup``/``run`` commands).
Useful for running a sequence of tasks that depend on each other, e.g., data
processing, training a model, and then running inference on it.


.. _spot-jobs:

Managed Spot Jobs
-----------------

In this mode, :code:`sky jobs launch --use-spot` is used to launch a managed spot job. SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
Any spot preemptions are automatically handled by SkyPilot without user intervention.
In this mode, jobs run on spot instances, and preemptions are auto-recovered by SkyPilot.

To launch a managed spot job, use :code:`sky jobs launch --use-spot`.
SkyPilot automatically finds available spot instances across regions and clouds to maximize availability.
Any spot preemptions are automatically handled by SkyPilot without user intervention.

Quick comparison between *unmanaged spot clusters* vs. *managed spot jobs*:

.. list-table::
:widths: 30 18 12 35
:header-rows: 1

* - Command
- Managed?
- SSH-able?
- Best for
* - :code:`sky launch --use-spot`
- Unmanaged spot cluster
- Yes
- Interactive dev on spot instances (especially for hardware with low preemption rates)
* - :code:`sky jobs launch --use-spot`
- Managed spot job (auto-recovery)
- No
- Scaling out long-running jobs (e.g., data processing, training, batch inference)

Here is an example of a BERT training job failing over different regions across AWS and GCP.

@@ -59,6 +46,25 @@ To use managed spot jobs, there are two requirements:
#. :ref:`Checkpointing <checkpointing>` (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`mounted cloud bucket <sky-storage>`. The program can reload the latest checkpoint when restarted.


+Quick comparison between *managed spot jobs* vs. *launching spot clusters*:
+
+.. list-table::
+   :widths: 30 18 12 35
+   :header-rows: 1
+
+   * - Command
+     - Managed?
+     - SSH-able?
+     - Best for
+   * - :code:`sky jobs launch --use-spot`
+     - Yes, preemptions are auto-recovered
+     - No
+     - Scaling out long-running jobs (e.g., data processing, training, batch inference)
+   * - :code:`sky launch --use-spot`
+     - No, preemptions are not handled
+     - Yes
+     - Interactive dev on spot instances (especially for hardware with low preemption rates)
+
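The checkpointing requirement above is usually met by mounting a cloud bucket and having the application write checkpoints there. A minimal task-YAML sketch (the bucket name, mount path, and training script are hypothetical):

```yaml
file_mounts:
  /checkpoints:
    name: my-ckpt-bucket   # hypothetical bucket; created if it does not exist
    mode: MOUNT

run: |
  # The application saves progress to /checkpoints periodically and,
  # when restarted after a preemption, reloads the latest checkpoint.
  python train.py --ckpt-dir /checkpoints
```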
.. _job-yaml:

Job YAML
@@ -93,7 +99,7 @@ We can launch it with the following:
setup: |
# Fill in your wandb key: copy from https://wandb.ai/authorize
# Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
-# to pass the key in the command line, during `sky spot launch`.
+# to pass the key in the command line, during `sky jobs launch`.
echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
pip install -e .
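As the comment in the snippet above notes, the key can also be supplied at launch time instead of being hard-coded in ``setup``. A sketch, assuming the task YAML declares the variable under ``envs`` (the YAML filename below is hypothetical):

```yaml
envs:
  WANDB_API_KEY: ""  # left empty here; supplied via --env at launch time
```

Then launch with, e.g., ``sky jobs launch --env WANDB_API_KEY=$WANDB_API_KEY job.yaml``.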
@@ -245,11 +251,11 @@ Real-World Examples

.. _on-demand:

-Using On-Demand Instances
---------------------------------
+Managed On-Demand/Reserved Jobs
+-------------------------------

The same ``sky jobs launch`` and YAML interfaces can run jobs on auto-recovering
-on-demand instances. This is useful to have SkyPilot monitor any underlying
+on-demand or reserved instances. This is useful to have SkyPilot monitor any underlying
machine failures and transparently recover the job.

To do so, simply set :code:`use_spot: false` in the :code:`resources` section, or override it with :code:`--use-spot false` in the CLI.
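A minimal sketch of that setting in a task YAML (the accelerator choice is illustrative):

```yaml
resources:
  accelerators: A100:8
  use_spot: false   # run on on-demand/reserved instances, with auto-recovery
```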
@@ -264,10 +270,10 @@ To do so, simply set :code:`use_spot: false` in the :code:`resources` section, o
interface, while ``sky launch`` is a cluster interface (that you can launch
tasks on, albeit not managed).

-Either Spot Or On-Demand
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Either Spot or On-Demand/Reserved
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-You can use ``any_of`` to specify either spot or on-demand instances as
+You can use ``any_of`` to specify either spot or on-demand/reserved instances as
candidate resources for a job. See documentation :ref:`here
<multiple-resources>` for more details.

@@ -280,12 +286,35 @@ candidate resources for a job. See documentation :ref:`here
- use_spot: false
In this example, SkyPilot will perform cost optimizations to select the resource to use, which almost certainly
-will be spot instances. If spot instances are not available, SkyPilot will fall back to launch on-demand instances.
+will be spot instances. If spot instances are not available, SkyPilot will fall back to launch on-demand/reserved instances.


+Jobs Restarts on User Code Failure
+-----------------------------------
+
+By default, SkyPilot will try to recover a job when its underlying cluster is preempted or fails. Any user code failures (non-zero exit codes) are not auto-recovered.
+
+In some cases, you may want a job to automatically restart on its own failures, e.g., when a training job crashes due to an NVIDIA driver issue or NCCL timeouts. To specify this, you
+can set :code:`max_restarts_on_errors` in :code:`resources.job_recovery` in the job YAML file.
+
+.. code-block:: yaml
+
+   resources:
+     accelerators: A100:8
+     job_recovery:
+       # Restart the job up to 3 times on user code errors.
+       max_restarts_on_errors: 3
+
+More advanced policies for resource selection, such as the `Can't Be Late
+<https://www.usenix.org/conference/nsdi24/presentation/wu-zhanghao>`__ (NSDI'24)
+paper, may be supported in the future.
+
+Running Many Parallel Jobs
+--------------------------
+
+For batch jobs such as **data processing** or **hyperparameter sweeps**, you can launch many jobs in parallel. See :ref:`many-jobs`.

Useful CLIs
-----------

@@ -323,11 +352,10 @@ Cancel a managed job:
If any failure happens for a managed job, you can check :code:`sky jobs queue -a` for the brief reason
of the failure. For more details, it would be helpful to check :code:`sky jobs logs --controller <job_id>`.


.. _pipeline:

-Job Pipelines
--------------
+Managed Pipelines
+-----------------

A pipeline is a managed job that contains a sequence of tasks running one after another.

@@ -414,8 +442,8 @@ To submit the pipeline, the same command :code:`sky jobs launch` is used. The pi



-Dashboard
----------
+Job Dashboard
+-------------

Use ``sky jobs dashboard`` to open a dashboard to see all jobs:

18 changes: 17 additions & 1 deletion docs/source/reference/faq.rst
@@ -38,7 +38,7 @@ How to ensure my workdir's ``.git`` is synced up for managed spot jobs?
Currently, there is a difference in whether ``.git`` is synced up depending on the command used:

- For regular ``sky launch``, the workdir's ``.git`` is synced up by default.
-- For managed spot jobs ``sky spot launch``, the workdir's ``.git`` is excluded by default.
+- For managed jobs ``sky jobs launch``, the workdir's ``.git`` is excluded by default.

In the second case, to ensure the workdir's ``.git`` is synced up for managed spot jobs, you can explicitly add a file mount to sync it up:

@@ -192,6 +192,22 @@ For example, if you have access to special regions of GCP, add the data to ``~/.
Also, you can update the catalog for a specific cloud by deleting the CSV file (e.g., ``rm ~/.sky/catalogs/<schema-version>/gcp.csv``).
SkyPilot will automatically download the latest catalog in the next run.

+Package Installation
+---------------------
+
+Unable to import PyTorch in a SkyPilot task.
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For `PyTorch <https://pytorch.org/>`_ installation, if you are using the default SkyPilot images (not passing in ``--image-id``), ``pip install torch`` should work.
+
+But if you use your own image which has an older NVIDIA driver (535.161.08 or lower) and you install the default PyTorch, you may encounter the following error:
+
+.. code-block:: bash
+
+   ImportError: /home/azureuser/miniconda3/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
+
+You will need to install a PyTorch version that is compatible with your NVIDIA driver, e.g., ``pip install torch --index-url https://download.pytorch.org/whl/cu121``.
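In a SkyPilot task, that workaround can go in the ``setup`` section; a sketch assuming the cu121 wheel index mentioned above (the image tag is hypothetical):

```yaml
resources:
  image_id: docker:my-old-driver-image:latest  # hypothetical image with driver <= 535.161.08

setup: |
  # Install a PyTorch build compiled against CUDA 12.1, which is
  # compatible with older NVIDIA drivers.
  pip install torch --index-url https://download.pytorch.org/whl/cu121
```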


Miscellaneous
-------------

Expand Down
20 changes: 15 additions & 5 deletions docs/source/reference/kubernetes/kubernetes-deployment.rst
@@ -147,10 +147,16 @@ Deploying on Google Cloud GKE
.. code-block:: console

   $ sky show-gpus --cloud kubernetes
-  GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
-  L4    1, 2, 3, 4    8           6
-  A100  1, 2          4           2
+  GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
+  L4    1, 2, 4                   8           6
+  A100  1, 2                      4           2
+
+  Kubernetes per node GPU availability
+  NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
+  my-cluster-0  L4        4           4
+  my-cluster-1  L4        4           2
+  my-cluster-2  A100      2           2
+  my-cluster-3  A100      2           0
.. note::
GKE autopilot clusters are currently not supported. Only GKE standard clusters are supported.
@@ -196,8 +202,12 @@ Deploying on Amazon EKS
.. code-block:: console

   $ sky show-gpus --cloud kubernetes
-  GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
-  A100  1, 2          4           2
+  GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
+  A100  1, 2                      4           2
+
+  Kubernetes per node GPU availability
+  NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
+  my-cluster-0  A100      2           2
.. _kubernetes-setup-onprem:

13 changes: 9 additions & 4 deletions docs/source/reference/kubernetes/kubernetes-getting-started.rst
@@ -156,9 +156,9 @@ You can also inspect the real-time GPU usage on the cluster with :code:`sky show
   $ sky show-gpus --cloud kubernetes
   Kubernetes GPUs
-  GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
-  L4    1, 2, 4       12          12
-  H100  1, 2, 4, 8    16          16
+  GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
+  L4    1, 2, 4                   12          12
+  H100  1, 2, 4, 8                16          16
   Kubernetes per node GPU availability
   NODE_NAME  GPU_NAME  TOTAL_GPUS  FREE_GPUS
@@ -174,7 +174,12 @@ You can also inspect the real-time GPU usage on the cluster with :code:`sky show

Using Custom Images
-------------------
-By default, we use and maintain a SkyPilot container image that has conda and a few other basic tools installed.
+By default, we maintain and use two SkyPilot container images for use on Kubernetes clusters:
+
+1. ``us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot``: used for CPU-only clusters (`Dockerfile <https://github.com/skypilot-org/skypilot/blob/master/Dockerfile_k8s>`__).
+2. ``us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot-gpu``: used for GPU clusters (`Dockerfile <https://github.com/skypilot-org/skypilot/blob/master/Dockerfile_k8s_gpu>`__).
+
+These images are pre-installed with SkyPilot dependencies for fast startup.

To use your own image, add :code:`image_id: docker:<your image tag>` to the :code:`resources` section of your task YAML.
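For example, a task-YAML sketch of such an override (the image tag is hypothetical):

```yaml
resources:
  image_id: docker:ghcr.io/myorg/my-image:latest  # hypothetical custom image tag
```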

6 changes: 3 additions & 3 deletions docs/source/reference/kubernetes/kubernetes-setup.rst
@@ -262,9 +262,9 @@ You can also check the GPUs available on your nodes by running:
   $ sky show-gpus --cloud kubernetes
   Kubernetes GPUs
-  GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
-  L4    1, 2, 4       12          12
-  H100  1, 2, 4, 8    16          16
+  GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
+  L4    1, 2, 4                   12          12
+  H100  1, 2, 4, 8                16          16
Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
