Merge remote-tracking branch 'upstream/advanced-dag' into dag-execute
andylizf committed Nov 3, 2024
2 parents 059ddf4 + 3f60b07 commit 9e0892d
Showing 119 changed files with 3,124 additions and 1,795 deletions.
24 changes: 9 additions & 15 deletions Dockerfile_k8s
@@ -1,4 +1,4 @@
-FROM continuumio/miniconda3:23.3.1-0
+FROM --platform=linux/amd64 continuumio/miniconda3:23.3.1-0

# TODO(romilb): Investigate if this image can be consolidated with the skypilot
# client image (`Dockerfile`)
@@ -33,21 +33,15 @@ ENV HOME /home/sky
# Set current working directory
WORKDIR /home/sky

-# Install SkyPilot pip dependencies preemptively to speed up provisioning time
-RUN conda init && \
-    pip install wheel Click colorama cryptography jinja2 jsonschema networkx \
-    oauth2client pandas pendulum PrettyTable rich tabulate filelock packaging \
-    'protobuf<4.0.0' pulp pycryptodome==3.12.0 docker kubernetes==28.1.0 \
-    grpcio==1.51.3 python-dotenv==1.0.1 ray[default]==2.9.3 && \
+# Install skypilot dependencies
+RUN conda init && export PIP_DISABLE_PIP_VERSION_CHECK=1 && \
+    python3 -m venv ~/skypilot-runtime && \
+    PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python && \
+    $PYTHON_EXEC -m pip install 'skypilot-nightly[remote,kubernetes]' 'ray[default]==2.9.3' 'pycryptodome==3.12.0' && \
+    $PYTHON_EXEC -m pip uninstall skypilot-nightly -y && \
    curl -LO "https://dl.k8s.io/release/v1.28.11/bin/linux/amd64/kubectl" && \
-    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
-
-# Add /home/sky/.local/bin/ to PATH
-RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
-
-# Copy SkyPilot code base. This is required for the ssh jump pod to find the
-# lifecycle management scripts
-COPY --chown=sky . /skypilot/sky/
+    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl && \
+    echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc

# Set PYTHONUNBUFFERED=1 to have Python print to stdout/stderr immediately
ENV PYTHONUNBUFFERED=1
19 changes: 7 additions & 12 deletions Dockerfile_k8s_gpu
@@ -41,19 +41,14 @@ RUN curl https://repo.anaconda.com/miniconda/Miniconda3-py310_23.11.0-2-Linux-x8
    eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true && conda activate base && \
    grep "# >>> conda initialize >>>" ~/.bashrc || { conda init && source ~/.bashrc; } && \
    rm Miniconda3-Linux-x86_64.sh && \
-    pip install wheel Click colorama cryptography jinja2 jsonschema networkx \
-    oauth2client pandas pendulum PrettyTable rich tabulate filelock packaging \
-    'protobuf<4.0.0' pulp pycryptodome==3.12.0 docker kubernetes==28.1.0 \
-    grpcio==1.51.3 python-dotenv==1.0.1 ray[default]==2.9.3 && \
+    export PIP_DISABLE_PIP_VERSION_CHECK=1 && \
+    python3 -m venv ~/skypilot-runtime && \
+    PYTHON_EXEC=$(echo ~/skypilot-runtime)/bin/python && \
+    $PYTHON_EXEC -m pip install 'skypilot-nightly[remote,kubernetes]' 'ray[default]==2.9.3' 'pycryptodome==3.12.0' && \
+    $PYTHON_EXEC -m pip uninstall skypilot-nightly -y && \
    curl -LO "https://dl.k8s.io/release/v1.28.11/bin/linux/amd64/kubectl" && \
-    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
-
-# Add /home/sky/.local/bin/ to PATH
-RUN echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc
-
-# Copy SkyPilot code base. This is required for the ssh jump pod to find the
-# lifecycle management scripts
-COPY --chown=sky . /skypilot/sky/
+    sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl && \
+    echo 'export PATH="$PATH:$HOME/.local/bin"' >> ~/.bashrc

# Set PYTHONUNBUFFERED=1 to have Python print to stdout/stderr immediately
ENV PYTHONUNBUFFERED=1
106 changes: 67 additions & 39 deletions docs/source/examples/managed-jobs.rst
Expand Up @@ -5,43 +5,30 @@ Managed Jobs

.. tip::

This feature is great for scaling out: running a single job for long durations, or running many jobs (pipelines).
This feature is great for scaling out: running a single job for long durations, or running many jobs in parallel.

SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any spot preemptions or hardware failures.
It can be used in three modes:
SkyPilot supports **managed jobs** (:code:`sky jobs`), which can automatically recover from any underlying spot preemptions or hardware failures.
Managed jobs can be used in three modes:

#. :ref:`Managed Spot Jobs <spot-jobs>`: Jobs run on auto-recovering spot instances. This can **save significant costs** (e.g., up to 70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
#. :ref:`On-demand <on-demand>`: Jobs run on auto-recovering on-demand instances. This is useful for jobs that require guaranteed resources.
#. :ref:`Pipelines <pipeline>`: Run pipelines that contain multiple tasks (which can have different resource requirements and ``setup``/``run`` commands). This is useful for running a sequence of tasks that depend on each other, e.g., data processing, training a model, and then running inference on it.
#. :ref:`Managed spot jobs <spot-jobs>`: Jobs run on auto-recovering spot instances. This **saves significant costs** (e.g., ~70\% for GPU VMs) by making preemptible spot instances useful for long-running jobs.
#. :ref:`Managed on-demand/reserved jobs <on-demand>`: Jobs run on auto-recovering on-demand or reserved instances. Useful for jobs that require guaranteed resources.
#. :ref:`Managed pipelines <pipeline>`: Run pipelines that contain multiple tasks (which
can have different resource requirements and ``setup``/``run`` commands).
Useful for running a sequence of tasks that depend on each other, e.g., data
processing, training a model, and then running inference on it.


.. _spot-jobs:

Managed Spot Jobs
-----------------

In this mode, :code:`sky jobs launch --use-spot` is used to launch a managed spot job. SkyPilot automatically finds available spot resources across regions and clouds to maximize availability.
Any spot preemptions are automatically handled by SkyPilot without user intervention.
In this mode, jobs run on spot instances, and preemptions are auto-recovered by SkyPilot.

To launch a managed spot job, use :code:`sky jobs launch --use-spot`.
SkyPilot automatically finds available spot instances across regions and clouds to maximize availability.
Any spot preemptions are automatically handled by SkyPilot without user intervention.

Quick comparison between *unmanaged spot clusters* vs. *managed spot jobs*:

.. list-table::
:widths: 30 18 12 35
:header-rows: 1

* - Command
- Managed?
- SSH-able?
- Best for
* - :code:`sky launch --use-spot`
- Unmanaged spot cluster
- Yes
- Interactive dev on spot instances (especially for hardware with low preemption rates)
* - :code:`sky jobs launch --use-spot`
- Managed spot job (auto-recovery)
- No
- Scaling out long-running jobs (e.g., data processing, training, batch inference)

Here is an example of a BERT training job failing over different regions across AWS and GCP.

@@ -59,6 +46,25 @@ To use managed spot jobs, there are two requirements:
#. :ref:`Checkpointing <checkpointing>` (optional): For job recovery due to preemptions, the user application code can checkpoint its progress periodically to a :ref:`mounted cloud bucket <sky-storage>`. The program can reload the latest checkpoint when restarted.


+Quick comparison between *managed spot jobs* vs. *launching spot clusters*:
+
+.. list-table::
+   :widths: 30 18 12 35
+   :header-rows: 1
+
+   * - Command
+     - Managed?
+     - SSH-able?
+     - Best for
+   * - :code:`sky jobs launch --use-spot`
+     - Yes, preemptions are auto-recovered
+     - No
+     - Scaling out long-running jobs (e.g., data processing, training, batch inference)
+   * - :code:`sky launch --use-spot`
+     - No, preemptions are not handled
+     - Yes
+     - Interactive dev on spot instances (especially for hardware with low preemption rates)
+
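The checkpointing requirement above is usually met by mounting a cloud bucket and having the application write checkpoints there. A minimal task-YAML sketch (the bucket name, mount path, and training script are hypothetical):

```yaml
file_mounts:
  /checkpoints:
    name: my-ckpt-bucket   # hypothetical bucket; created if it does not exist
    mode: MOUNT

run: |
  # The application saves progress to /checkpoints periodically and,
  # when restarted after a preemption, reloads the latest checkpoint.
  python train.py --ckpt-dir /checkpoints
```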
.. _job-yaml:

Job YAML
@@ -93,7 +99,7 @@ We can launch it with the following:
setup: |
# Fill in your wandb key: copy from https://wandb.ai/authorize
# Alternatively, you can use `--env WANDB_API_KEY=$WANDB_API_KEY`
-# to pass the key in the command line, during `sky spot launch`.
+# to pass the key in the command line, during `sky jobs launch`.
echo export WANDB_API_KEY=[YOUR-WANDB-API-KEY] >> ~/.bashrc
pip install -e .
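As the comment in the snippet above notes, the key can also be supplied at launch time instead of being hard-coded in ``setup``. A sketch, assuming the task YAML declares the variable under ``envs`` (the YAML filename below is hypothetical):

```yaml
envs:
  WANDB_API_KEY: ""  # left empty here; supplied via --env at launch time
```

Then launch with, e.g., ``sky jobs launch --env WANDB_API_KEY=$WANDB_API_KEY job.yaml``.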
@@ -245,11 +251,11 @@ Real-World Examples

.. _on-demand:

-Using On-Demand Instances
---------------------------------
+Managed On-Demand/Reserved Jobs
+-------------------------------

The same ``sky jobs launch`` and YAML interfaces can run jobs on auto-recovering
-on-demand instances. This is useful to have SkyPilot monitor any underlying
+on-demand or reserved instances. This is useful to have SkyPilot monitor any underlying
machine failures and transparently recover the job.

To do so, simply set :code:`use_spot: false` in the :code:`resources` section, or override it with :code:`--use-spot false` in the CLI.
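A minimal sketch of that setting in a task YAML (the accelerator choice is illustrative):

```yaml
resources:
  accelerators: A100:8
  use_spot: false   # run on on-demand/reserved instances, with auto-recovery
```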
@@ -264,10 +270,10 @@ To do so, simply set :code:`use_spot: false` in the :code:`resources` section, o
interface, while ``sky launch`` is a cluster interface (that you can launch
tasks on, albeit not managed).

-Either Spot Or On-Demand
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Either Spot or On-Demand/Reserved
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-You can use ``any_of`` to specify either spot or on-demand instances as
+You can use ``any_of`` to specify either spot or on-demand/reserved instances as
candidate resources for a job. See documentation :ref:`here
<multiple-resources>` for more details.

@@ -280,12 +286,35 @@ candidate resources for a job. See documentation :ref:`here
- use_spot: false
In this example, SkyPilot will perform cost optimizations to select the resource to use, which almost certainly
-will be spot instances. If spot instances are not available, SkyPilot will fall back to launch on-demand instances.
+will be spot instances. If spot instances are not available, SkyPilot will fall back to launch on-demand/reserved instances.


+Jobs Restarts on User Code Failure
+-----------------------------------
+
+By default, SkyPilot will try to recover a job when its underlying cluster is preempted or fails. Any user code failures (non-zero exit codes) are not auto-recovered.
+
+In some cases, you may want a job to automatically restart on its own failures, e.g., when a training job crashes due to an NVIDIA driver issue or NCCL timeouts. To specify this, you
+can set :code:`max_restarts_on_errors` in :code:`resources.job_recovery` in the job YAML file.
+
+.. code-block:: yaml
+
+   resources:
+     accelerators: A100:8
+     job_recovery:
+       # Restart the job up to 3 times on user code errors.
+       max_restarts_on_errors: 3
+
+More advanced policies for resource selection, such as the `Can't Be Late
+<https://www.usenix.org/conference/nsdi24/presentation/wu-zhanghao>`__ (NSDI'24)
+paper, may be supported in the future.
+
+Running Many Parallel Jobs
+--------------------------
+
+For batch jobs such as **data processing** or **hyperparameter sweeps**, you can launch many jobs in parallel. See :ref:`many-jobs`.

Useful CLIs
-----------

@@ -323,11 +352,10 @@ Cancel a managed job:
If any failure happens for a managed job, you can check :code:`sky jobs queue -a` for the brief reason
of the failure. For more details, it would be helpful to check :code:`sky jobs logs --controller <job_id>`.


.. _pipeline:

-Job Pipelines
--------------
+Managed Pipelines
+-----------------

A pipeline is a managed job that contains a sequence of tasks running one after another.

@@ -414,8 +442,8 @@ To submit the pipeline, the same command :code:`sky jobs launch` is used. The pi



-Dashboard
----------
+Job Dashboard
+-------------

Use ``sky jobs dashboard`` to open a dashboard to see all jobs:

18 changes: 17 additions & 1 deletion docs/source/reference/faq.rst
@@ -38,7 +38,7 @@ How to ensure my workdir's ``.git`` is synced up for managed spot jobs?
Currently, there is a difference in whether ``.git`` is synced up depending on the command used:

- For regular ``sky launch``, the workdir's ``.git`` is synced up by default.
-- For managed spot jobs ``sky spot launch``, the workdir's ``.git`` is excluded by default.
+- For managed jobs ``sky jobs launch``, the workdir's ``.git`` is excluded by default.

In the second case, to ensure the workdir's ``.git`` is synced up for managed spot jobs, you can explicitly add a file mount to sync it up:

@@ -192,6 +192,22 @@ For example, if you have access to special regions of GCP, add the data to ``~/.
Also, you can update the catalog for a specific cloud by deleting the CSV file (e.g., ``rm ~/.sky/catalogs/<schema-version>/gcp.csv``).
SkyPilot will automatically download the latest catalog in the next run.

+Package Installation
+---------------------
+
+Unable to import PyTorch in a SkyPilot task.
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For `PyTorch <https://pytorch.org/>`_ installation, if you are using the default SkyPilot images (not passing in ``--image-id``), ``pip install torch`` should work.
+
+But if you use your own image which has an older NVIDIA driver (535.161.08 or lower) and you install the default PyTorch, you may encounter the following error:
+
+.. code-block:: bash
+
+   ImportError: /home/azureuser/miniconda3/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
+
+You will need to install a PyTorch version that is compatible with your NVIDIA driver, e.g., ``pip install torch --index-url https://download.pytorch.org/whl/cu121``.
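In a SkyPilot task, that workaround can go in the ``setup`` section; a sketch assuming the cu121 wheel index mentioned above (the image tag is hypothetical):

```yaml
resources:
  image_id: docker:my-old-driver-image:latest  # hypothetical image with driver <= 535.161.08

setup: |
  # Install a PyTorch build compiled against CUDA 12.1, which is
  # compatible with older NVIDIA drivers.
  pip install torch --index-url https://download.pytorch.org/whl/cu121
```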


Miscellaneous
-------------

Expand Down
20 changes: 15 additions & 5 deletions docs/source/reference/kubernetes/kubernetes-deployment.rst
@@ -147,10 +147,16 @@ Deploying on Google Cloud GKE
.. code-block:: console

   $ sky show-gpus --cloud kubernetes
-  GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
-  L4    1, 2, 3, 4    8           6
-  A100  1, 2          4           2
+  GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
+  L4    1, 2, 4                   8           6
+  A100  1, 2                      4           2
+
+  Kubernetes per node GPU availability
+  NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
+  my-cluster-0  L4        4           4
+  my-cluster-1  L4        4           2
+  my-cluster-2  A100      2           2
+  my-cluster-3  A100      2           0
.. note::
GKE autopilot clusters are currently not supported. Only GKE standard clusters are supported.
@@ -196,8 +202,12 @@ Deploying on Amazon EKS
.. code-block:: console

   $ sky show-gpus --cloud kubernetes
-  GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
-  A100  1, 2          4           2
+  GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
+  A100  1, 2                      4           2
+
+  Kubernetes per node GPU availability
+  NODE_NAME     GPU_NAME  TOTAL_GPUS  FREE_GPUS
+  my-cluster-0  A100      2           2
.. _kubernetes-setup-onprem:

13 changes: 9 additions & 4 deletions docs/source/reference/kubernetes/kubernetes-getting-started.rst
@@ -156,9 +156,9 @@ You can also inspect the real-time GPU usage on the cluster with :code:`sky show
   $ sky show-gpus --cloud kubernetes
   Kubernetes GPUs
-  GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
-  L4    1, 2, 4       12          12
-  H100  1, 2, 4, 8    16          16
+  GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
+  L4    1, 2, 4                   12          12
+  H100  1, 2, 4, 8                16          16
   Kubernetes per node GPU availability
   NODE_NAME  GPU_NAME  TOTAL_GPUS  FREE_GPUS
@@ -174,7 +174,12 @@ You can also inspect the real-time GPU usage on the cluster with :code:`sky show

Using Custom Images
-------------------
-By default, we use and maintain a SkyPilot container image that has conda and a few other basic tools installed.
+By default, we maintain and use two SkyPilot container images for use on Kubernetes clusters:
+
+1. ``us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot``: used for CPU-only clusters (`Dockerfile <https://github.com/skypilot-org/skypilot/blob/master/Dockerfile_k8s>`__).
+2. ``us-central1-docker.pkg.dev/skypilot-375900/skypilotk8s/skypilot-gpu``: used for GPU clusters (`Dockerfile <https://github.com/skypilot-org/skypilot/blob/master/Dockerfile_k8s_gpu>`__).
+
+These images are pre-installed with SkyPilot dependencies for fast startup.

To use your own image, add :code:`image_id: docker:<your image tag>` to the :code:`resources` section of your task YAML.
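For example, a task-YAML sketch of such an override (the image tag is hypothetical):

```yaml
resources:
  image_id: docker:ghcr.io/myorg/my-image:latest  # hypothetical custom image tag
```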

6 changes: 3 additions & 3 deletions docs/source/reference/kubernetes/kubernetes-setup.rst
@@ -262,9 +262,9 @@ You can also check the GPUs available on your nodes by running:
   $ sky show-gpus --cloud kubernetes
   Kubernetes GPUs
-  GPU   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
-  L4    1, 2, 4       12          12
-  H100  1, 2, 4, 8    16          16
+  GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
+  L4    1, 2, 4                   12          12
+  H100  1, 2, 4, 8                16          16
Kubernetes per node GPU availability
NODE_NAME GPU_NAME TOTAL_GPUS FREE_GPUS
