
[k8s] On-demand single-host TPU support on GKE #3947

Open
wants to merge 70 commits into base: master
Conversation

@landscapepainter (Collaborator) commented Sep 16, 2024

One of our users requested a feature to use spot TPUs on GKE. This is an initial step toward that request, adding support for on-demand single-host TPUs.

This PR does not include support for:

  • multi-host TPU support
  • autoscaler support
  • spot TPU support

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Using a GKE cluster with 2 single-host TPU podslices of 1x1 topology and 2 CPU instances.
      • sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice --num-nodes 2 -y
      • sky launch --cloud kubernetes --cpus=2 -y
      • sky show-gpus --cloud kubernetes
    • Using a GKE cluster with 1 single-host TPU podslice of 2x2 topology and 2 CPU instances.
      • sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice:4 -y
  • All smoke tests: pytest tests/test_smoke.py, besides the ones also failing on the master branch:
    • test_managed_jobs_storage
    • test_multiple_accelerators_unordered_with_default
    • test_skyserve_user_bug_restart
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • pytest tests/test_smoke.py::test_tpu_pod_slice_gke --kubernetes
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

tpu_gke.yaml:

file_mounts:
  /result:
    store: gcs
    name: tpu-mount-test-dy
    mode: MOUNT

setup: |
  git clone https://github.com/google/flax.git --branch v0.8.2

  conda activate flax
  if [ $? -eq 0 ]; then
    echo 'conda env exists'
  else
    conda create -n flax python=3.10 -y
    conda activate flax
    # Make sure to install TPU related packages in a conda env to avoid package conflicts.
    pip install \
      -f https://storage.googleapis.com/jax-releases/libtpu_releases.html "jax[tpu]==0.4.25" \
      clu \
      tensorflow tensorflow-datasets
    pip install -e flax
  fi

run: |
  conda activate flax
  pip install clu
  cd flax/examples/mnist
  python3 main.py --workdir=/tmp/mnist \
    --config=configs/default.py \
    --config.learning_rate=0.05 \
    --config.num_epochs=10 >> /result/output.log 2>&1

@landscapepainter changed the title from "[k8s] TPU support on GKE" to "[k8s] on-demand TPU support on GKE" on Sep 16, 2024
@landscapepainter marked this pull request as draft on September 16, 2024 09:25
@landscapepainter (Collaborator, Author) commented Oct 23, 2024

@cblmemo

Also, the following command should fail before the provisioning happens?

So there are two cases where a lack of resources causes a launch to fail:

  1. When there's no node in the cluster that meets the requirement for the requested resource.
    --> In this case, SkyPilot fails early, before attempting to provision, via the check_instance_fits check.
  2. When there's a node in the cluster that meets the requirement for the requested resource, but it is already in use and therefore not available.
    --> This fails while provisioning, as we rely on the scheduling log to detect this scenario.

The case you mentioned seems to be an edge case between 1 and 2: there is a node that meets the requirement (1x1), but the early fit check does not account for the request actually needing 2 nodes of 1x1, so the failure happens while provisioning. This behavior is currently consistent with other non-TPU resource requests, as you can see below (a toy sketch of the two-phase check follows the output).

$ sky launch --cloud kubernetes --cpus 16 --num-nodes 8
Considered resources (8 nodes):
------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                  COST ($)   CHOSEN
------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   16CPU--16GB   16      16        -              gke_skypilot-375900_us-south1-a_mix-tpu-dy   0.00          ✔
------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-7410-gcpuser'. Proceed? [Y/n]: y
⚙︎ Launching on Kubernetes.
W 10-23 02:33:43 instance.py:704] run_instances: Error occurred when creating pods: sky.provision.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Required resources (cpu=16, memory=17179869184, nvidia.com/gpu=0) were not found in a single node. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`. Full error: 0/6 nodes are available: 2 Insufficient cpu, 4 node(s) had untolerated taint {google.com/tpu: present}. preemption: 0/6 nodes are available: 2 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling.
sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in gke_skypilot-375900_us-south1-a_mix-tpu-dy for {Kubernetes(cpus=16)}.

↺ Trying other potential resources.
⨯ Failed to provision resources. View logs at: ~/sky_logs/sky-2024-10-23-02-33-24-615729/provision.log

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 8x Kubernetes(cpus=16)
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
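As a side note, here is a toy, self-contained sketch of the two-phase check described above (illustration only, not SkyPilot's actual code; the class and helper names are made up for the example):

from dataclasses import dataclass

@dataclass
class Node:
    cpus: int
    free_cpus: int

def try_launch(nodes: list, cpus_per_node: int, num_nodes: int) -> str:
    # Case 1: early check -- no node type in the cluster is ever big enough,
    # so we fail before attempting to provision anything (check_instance_fits).
    if not any(n.cpus >= cpus_per_node for n in nodes):
        return 'failed early: no node can ever fit the per-node request'
    # Case 2: some node could fit, but not enough capacity is free right now
    # (or, as in the edge case above, fewer free nodes than num_nodes).
    # This is only detected while provisioning, from the scheduling events.
    free_nodes = sum(1 for n in nodes if n.free_cpus >= cpus_per_node)
    if free_nodes < num_nodes:
        return 'failed while provisioning: scheduler reports insufficient capacity'
    return 'provisioned'

# Example mirroring the output above: 8 nodes of 16 vCPUs requested, but only
# 2 nodes in the cluster currently have that much room.
nodes = [Node(cpus=16, free_cpus=16)] * 2 + [Node(cpus=16, free_cpus=2)] * 4
print(try_launch(nodes, cpus_per_node=16, num_nodes=8))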

@landscapepainter added this to the v0.7 milestone on Oct 23, 2024
@landscapepainter (Collaborator, Author) commented:
@cblmemo @romilbhardwaj Successfully re-ran the smoke tests for both TPU clusters and GPU clusters.

@cblmemo (Collaborator) left a comment:

Thanks for the fix @landscapepainter! Mostly looks good to me, and I'm testing it now 🫡 Figured it's better to send these comments first.

sky/cli.py Outdated
Comment on lines 3183 to 3190
# TODO(Doyoung): Update the message with the multi-host TPU
# support.
k8s_per_node_acc_message = (
    'Kubernetes per node accelerator availability ')
if kubernetes_utils.multi_host_tpu_exists_in_cluster(
        context):
    k8s_per_node_acc_message += (
        '(Note: Multi-host TPUs are not supported.)')
Collaborator commented:

Suggested change
- # TODO(Doyoung): Update the message with the multi-host TPU
- # support.
- k8s_per_node_acc_message = (
-     'Kubernetes per node accelerator availability ')
- if kubernetes_utils.multi_host_tpu_exists_in_cluster(
-         context):
-     k8s_per_node_acc_message += (
-         '(Note: Multi-host TPUs are not supported.)')
+ # TODO(Doyoung): Update the message with the multi-host TPU
+ # support.
+ maybe_tpu_multi_host_hint = ''
+ if kubernetes_utils.multi_host_tpu_exists_in_cluster(
+         context):
+     maybe_tpu_multi_host_hint = f'Detected {xxx} node...'

Should we say something like "detected xxx nodes that are using multi-host TPUs, skipping them from the display"?

@landscapepainter (Collaborator, Author) commented Nov 1, 2024:

@cblmemo I'm not convinced this is needed on top of the minimal note I already added, for the following reasons:

  1. There can be multiple multi-host TPUs in a user's GKE cluster. If there are, say, 10 of them, your suggestion would list all of them. I'm not sure that's the best UX, since we're trying to keep things concise. If it's important info we should add it, but..
  2. I'm wondering whether this is necessary to begin with, since users of TPUs on GKE would know what a multi-host TPU is and whether one exists in their cluster.

Collaborator commented:

My main concern here is that the message "Multi-host TPUs are not supported." does not convey that "we excluded some nodes from your cluster", which might confuse users.

One way is to count the number of nodes (or the number of TPUs) and show something like "xxx nodes with a multi-host TPU setup are excluded from the resources."

Collaborator Author commented:

@cblmemo I see. That makes sense and I agree with the concern. I extended the message so that users are notified that the multi-host TPU nodes in their GKE cluster are excluded from the display.

Kubernetes per node accelerator availability (Note: Multi-host TPUs are detected and excluded from the display as multi-host TPUs are not supported.)
NODE_NAME                                  GPU_NAME              TOTAL_GPUS  FREE_GPUS
gke-mix-tpu-dy-default-pool-ad5bdc4d-9lw4  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-bs86  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-cfxn  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-nr5x  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-qgjt  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-rl37  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-v4ts  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-zp2x  None                  0           0
gke-tpu-a3716138-984x                      tpu-v5-lite-podslice  4           0
gke-tpu-c5117ac4-qfzt                      tpu-v5-lite-podslice  1           1

sky/cli.py (resolved)
sky/templates/kubernetes-ray.yml.j2 (resolved)
Comment on lines 421 to 425
{% if tpu_requested %}
google.com/tpu: {{accelerator_count}}
{% else %}
nvidia.com/gpu: {{accelerator_count}}
{% endif %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump this again
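For context, the template branch quoted above only decides which Kubernetes extended-resource name the pod spec requests. A minimal Python illustration of that choice (not the PR's code):

def accelerator_resource(tpu_requested: bool, accelerator_count: int) -> dict:
    # GKE exposes TPUs as 'google.com/tpu' and NVIDIA GPUs as 'nvidia.com/gpu'.
    key = 'google.com/tpu' if tpu_requested else 'nvidia.com/gpu'
    return {key: str(accelerator_count)}

print(accelerator_resource(True, 4))   # {'google.com/tpu': '4'}
print(accelerator_resource(False, 8))  # {'nvidia.com/gpu': '8'}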

sky/templates/kubernetes-ray.yml.j2 (Outdated; resolved)
Comment on lines 724 to 735
if node_metadata_labels.get(
        label_formatter.TPU_LABEL_KEY) == acc_type:
    topology_label_key = (
        label_formatter.TPU_TOPOLOGY_LABEL_KEY)
    topology_value = node_metadata_labels.get(
        topology_label_key)
    assert topology_value is not None
    tpu_topology_chip_count = reduce_tpu_topology(
        topology_value)
    if tpu_topology_chip_count == acc_count:
        return (label, value, topology_label_key,
                topology_value)
Collaborator commented:

Is it possible to have a conflict between 4x1 and 2x2 topologies?

Collaborator Author commented:

@cblmemo Impossible; different topologies with an identical number of TPU chips are not available for a TPU instance.

Collaborator commented:

Could you elaborate on the "is not available"? Does that mean there is a one-to-one mapping from the number of cores to its topology? If that is the case, a comment plus a reference to some doc page would be helpful :))

@landscapepainter (Collaborator, Author) commented Nov 3, 2024:

@cblmemo I was not able to find a document that clarifies this, but you can observe it for single-host TPUs by looking at the topology options in the GCP console. There's a one-to-one mapping between the available topologies and the number of TPU chips. Left a comment.
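For reference, a sketch of what reduce_tpu_topology presumably does, plus the single-host v5e mapping it relies on (assumed implementation for illustration, not the PR's actual helper):

from functools import reduce

def reduce_tpu_topology(topology: str) -> int:
    # A topology string like '2x2' is reduced to its chip count by
    # multiplying the dimensions: '1x1' -> 1, '2x2' -> 4, '2x4' -> 8.
    return reduce(lambda x, y: x * y, (int(dim) for dim in topology.split('x')))

# Single-host tpu-v5-lite-podslice offers one topology per chip count, which
# is why matching on the reduced chip count is unambiguous.
for topo in ('1x1', '2x2', '2x4'):
    print(topo, '->', reduce_tpu_topology(topo))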

sky/provision/kubernetes/utils.py (4 outdated review threads, resolved)
@cblmemo (Collaborator) commented Oct 28, 2024

I tried to launch a task on an existing cluster but it failed with the following error. Manually commenting out the accelerators: tpu-v2-8 in the example YAML resolved the issue for me, but it seems we have an error when inferring the cloud for resources if the CLI and the task YAML have some inconsistencies. Could you take a look at what is happening here?

# Create a cluster
$ sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4
# Launch a task on it, manually override with --gpus
$ sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml
Task from YAML spec: /home/txia/skypilot/examples/tpu/tpuvm_mnist.yaml
Missing runtime_version in accelerator_args, using default (tpu-vm-base)
sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
  Requested:    {1x GCP({'tpu-v5-lite-podslice': 4}, accelerator_args={'runtime_version': 'tpu-vm-base'})}
  Existing:     1x Kubernetes(2CPU--8GB--4tpu-v5-lite-podslice, {'tpu-v5-lite-podslice': 4}, accelerator_args={})
To fix: specify a new cluster name, or down the existing cluster first: sky down gke-tpu-4

sky/resources.py (Outdated; resolved)
@cblmemo (Collaborator) commented Oct 28, 2024

When launching with --gpus tpu-v5-lite-podslice:1, it still detects 4 TPUs:

(tpuvm_mnist, pid=3324) I1028 20:50:43.360553 135160242983552 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]

Is this expected (e.g., the 4 of them form a single basic scheduling unit)? If so, should we prevent users from specifying fewer than 4 TPUs (or whatever the per-host TPU count is)?

Also, the above example failed for me. Could you check this as well?

(tpuvm_mnist, pid=3325) Requirement already satisfied: clu in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (0.0.12)
(tpuvm_mnist, pid=3325) Requirement already satisfied: absl-py in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (2.1.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: etils[epath] in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (1.10.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: flax in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.8.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: jax in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.4.25)
(tpuvm_mnist, pid=3325) Requirement already satisfied: jaxlib in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.4.25)
(tpuvm_mnist, pid=3325) Requirement already satisfied: ml-collections in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.1.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: numpy in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (2.0.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: packaging in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (24.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: typing-extensions in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (4.12.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: wrapt in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (1.16.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: fsspec in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from etils[epath]->clu) (2024.10.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: importlib_resources in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from etils[epath]->clu) (6.4.5)
(tpuvm_mnist, pid=3325) Requirement already satisfied: zipp in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from etils[epath]->clu) (3.20.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: msgpack in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (1.1.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: optax in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (0.2.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: orbax-checkpoint in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (0.5.18)
(tpuvm_mnist, pid=3325) Requirement already satisfied: tensorstore in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (0.1.67)
(tpuvm_mnist, pid=3325) Requirement already satisfied: rich>=11.1 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (13.9.3)
(tpuvm_mnist, pid=3325) Requirement already satisfied: PyYAML>=5.4.1 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (6.0.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: ml-dtypes>=0.2.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from jax->clu) (0.4.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: opt-einsum in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from jax->clu) (3.4.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: scipy>=1.9 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from jax->clu) (1.14.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: six in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from ml-collections->clu) (1.16.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: contextlib2 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from ml-collections->clu) (21.6.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: markdown-it-py>=2.2.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from rich>=11.1->flax->clu) (3.0.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from rich>=11.1->flax->clu) (2.18.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: chex>=0.1.86 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from optax->flax->clu) (0.1.86)
(tpuvm_mnist, pid=3325) Requirement already satisfied: nest_asyncio in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from orbax-checkpoint->flax->clu) (1.6.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: protobuf in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from orbax-checkpoint->flax->clu) (3.20.3)
(tpuvm_mnist, pid=3325) Requirement already satisfied: toolz>=0.9.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from chex>=0.1.86->optax->flax->clu) (1.0.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: mdurl~=0.1 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich>=11.1->flax->clu) (0.1.2)
(tpuvm_mnist, pid=3325) 
(tpuvm_mnist, pid=3325) A module that was compiled using NumPy 1.x cannot be run in
(tpuvm_mnist, pid=3325) NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
(tpuvm_mnist, pid=3325) versions of NumPy, modules must be compiled with NumPy 2.0.
(tpuvm_mnist, pid=3325) Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
(tpuvm_mnist, pid=3325) 
(tpuvm_mnist, pid=3325) If you are a user of the module, the easiest solution will be to
(tpuvm_mnist, pid=3325) downgrade to 'numpy<2' or try to upgrade the affected module.
(tpuvm_mnist, pid=3325) We expect that some modules will need time to support NumPy 2.
(tpuvm_mnist, pid=3325) 
(tpuvm_mnist, pid=3325) Traceback (most recent call last):  File "/home/sky/sky_workdir/flax/examples/mnist/main.py", line 25, in <module>
(tpuvm_mnist, pid=3325)     import jax
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/__init__.py", line 37, in <module>
(tpuvm_mnist, pid=3325)     import jax.core as _core
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/core.py", line 18, in <module>
(tpuvm_mnist, pid=3325)     from jax._src.core import (
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/core.py", line 38, in <module>
(tpuvm_mnist, pid=3325)     from jax._src import dtypes
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/dtypes.py", line 33, in <module>
(tpuvm_mnist, pid=3325)     from jax._src import config
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/config.py", line 27, in <module>
(tpuvm_mnist, pid=3325)     from jax._src import lib
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/lib/__init__.py", line 87, in <module>
(tpuvm_mnist, pid=3325)     import jaxlib.xla_client as xla_client
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jaxlib/xla_client.py", line 32, in <module>
(tpuvm_mnist, pid=3325)     from . import xla_extension as _xla
(tpuvm_mnist, pid=3325) AttributeError: _ARRAY_API not found
(tpuvm_mnist, pid=3325) 2024-10-28 20:47:33.959159: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(tpuvm_mnist, pid=3325) WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
(tpuvm_mnist, pid=3325) E0000 00:00:1730148453.973354    3798 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(tpuvm_mnist, pid=3325) E0000 00:00:1730148453.977656    3798 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(tpuvm_mnist, pid=3325) 2024-10-28 20:47:35.939498: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
(tpuvm_mnist, pid=3325) I1028 20:47:41.511022 138348441096832 main.py:51] JAX process: 0 / 1
(tpuvm_mnist, pid=3325) I1028 20:47:41.511178 138348441096832 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
(tpuvm_mnist, pid=3325) I1028 20:47:41.511398 138348441096832 local.py:45] Setting task status: process_index: 0, process_count: 1
(tpuvm_mnist, pid=3325) I1028 20:47:41.511511 138348441096832 local.py:50] Created artifact workdir of type ArtifactType.DIRECTORY and value /tmp/mnist.
(tpuvm_mnist, pid=3325) I1028 20:47:42.905525 138348441096832 dataset_info.py:805] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: mnist/3.0.1
(tpuvm_mnist, pid=3325) I1028 20:47:43.162729 138348441096832 dataset_info.py:617] Load dataset info from /tmp/tmpfaoy19getfds
(tpuvm_mnist, pid=3325) I1028 20:47:43.164796 138348441096832 dataset_info.py:709] For 'mnist/3.0.1': fields info.[citation, splits, supervised_keys, module_name] differ on disk and in the code. Keeping the one from code.
(tpuvm_mnist, pid=3325) I1028 20:47:43.165012 138348441096832 dataset_builder.py:644] Generating dataset mnist (/home/sky/tensorflow_datasets/mnist/3.0.1)
(tpuvm_mnist, pid=3325) Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /home/sky/tensorflow_datasets/mnist/3.0.1...
(tpuvm_mnist, pid=3325) I1028 20:47:43.289096 138348441096832 dataset_builder.py:693] Dataset mnist is hosted on GCS. It will automatically be downloaded to your
(tpuvm_mnist, pid=3325) local data directory. If you'd instead prefer to read directly from our public
(tpuvm_mnist, pid=3325) GCS bucket (recommended if you're running on GCP), you can instead pass
(tpuvm_mnist, pid=3325) `try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.
(tpuvm_mnist, pid=3325) 
Dl Completed...: 100%|██████████| 5/5 [00:00<00:00, 14.90 file/s]:00<00:00,  4.84 file/s]
(tpuvm_mnist, pid=3325) I1028 20:47:43.678423 138348441096832 dataset_info.py:617] Load dataset info from /home/sky/tensorflow_datasets/mnist/incomplete.1RCFAI_3.0.1/
(tpuvm_mnist, pid=3325) I1028 20:47:43.679890 138348441096832 dataset_info.py:709] For 'mnist/3.0.1': fields info.[citation, splits, supervised_keys, module_name, file_format] differ on disk and in the code. Keeping the one from code.
(tpuvm_mnist, pid=3325) Dataset mnist downloaded and prepared to /home/sky/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
(tpuvm_mnist, pid=3325) I1028 20:47:43.681097 138348441096832 reader.py:261] Creating a tf.data.Dataset reading 1 files located in folders: /home/sky/tensorflow_datasets/mnist/3.0.1.
(tpuvm_mnist, pid=3325) I1028 20:47:44.686084 138348441096832 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train, from /home/sky/tensorflow_datasets/mnist/3.0.1
(tpuvm_mnist, pid=3325) I1028 20:47:44.686940 138348441096832 reader.py:261] Creating a tf.data.Dataset reading 1 files located in folders: /home/sky/tensorflow_datasets/mnist/3.0.1.
(tpuvm_mnist, pid=3325) I1028 20:47:44.888163 138348441096832 logging_logger.py:49] Constructing tf.data.Dataset mnist for split test, from /home/sky/tensorflow_datasets/mnist/3.0.1
(tpuvm_mnist, pid=3325) Traceback (most recent call last):
(tpuvm_mnist, pid=3325)   File "/home/sky/sky_workdir/flax/examples/mnist/main.py", line 69, in <module>
(tpuvm_mnist, pid=3325)     app.run(main)
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/absl/app.py", line 308, in run
(tpuvm_mnist, pid=3325)     _run_main(main, args)
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
(tpuvm_mnist, pid=3325)     sys.exit(main(argv))
(tpuvm_mnist, pid=3325)   File "/home/sky/sky_workdir/flax/examples/mnist/main.py", line 64, in main
(tpuvm_mnist, pid=3325)     train.train_and_evaluate(FLAGS.config, FLAGS.workdir)
(tpuvm_mnist, pid=3325)   File "/home/sky/sky_workdir/flax/examples/mnist/train.py", line 130, in train_and_evaluate
(tpuvm_mnist, pid=3325)     train_ds, test_ds = get_datasets()
(tpuvm_mnist, pid=3325)   File "/home/sky/sky_workdir/flax/examples/mnist/train.py", line 105, in get_datasets
(tpuvm_mnist, pid=3325)     train_ds['image'] = jnp.float32(train_ds['image']) / 255.0
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/numpy/lax_numpy.py", line 152, in __call__
(tpuvm_mnist, pid=3325)     return asarray(x, dtype=self.dtype)
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/numpy/lax_numpy.py", line 2233, in asarray
(tpuvm_mnist, pid=3325)     return array(a, dtype=dtype, copy=bool(copy), order=order)  # type: ignore
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/numpy/lax_numpy.py", line 2174, in array
(tpuvm_mnist, pid=3325)     out = np.array(object, dtype=dtype, ndmin=ndmin, copy=False)  # type: ignore[arg-type]
(tpuvm_mnist, pid=3325) ValueError: Unable to avoid copy while creating an array as requested.
(tpuvm_mnist, pid=3325) If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
(tpuvm_mnist, pid=3325) For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.
ERROR: Job 1 failed with return code list: [1] 
✓ Job finished (status: FAILED).

@cblmemo (Collaborator) commented Oct 28, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not state the real reason (e.g., that only tpu:4 and tpu:1 are available).

$ sky launch --gpus tpu-v5-lite-podslice:2 -c gke-tpu-2
No resource satisfying Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Should we show a similar fuzzy result, like this?

$ sky launch --gpus A100:3                             
No resource satisfying <Cloud>({'A100': 3}) on [Kubernetes, Lambda, GCP, Azure, AWS, RunPod].
Did you mean: ['A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:4', 'A100-80GB:8', 'A100:16', 'A100:4', 'A100:8']
sky.exceptions.ResourcesUnavailableError: Catalog and kubernetes cluster does not contain any instances satisfying the request: 1x <Cloud>({'A100': 3}).
To fix: relax or change the resource requirements.
Try one of these offered accelerators: ['A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:4', 'A100-80GB:8', 'A100:16', 'A100:4', 'A100:8']

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

@romilbhardwaj removed this from the v0.7 milestone on Oct 29, 2024
@landscapepainter (Collaborator, Author) commented:
Also, when I try to launch with 2 TPUs, it seems the error does not state the real reason (e.g., that only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result, like this?

@cblmemo Seems like this is consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason, just without the fuzzy suggestions. Is it supposed to display the fuzzy result when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

@landscapepainter (Collaborator, Author) commented:
I tried to launch a task on an existing cluster but it failed with the following error. Manually commenting out the accelerators: tpu-v2-8 in the example YAML resolved the issue for me, but it seems we have an error when inferring the cloud for resources if the CLI and the task YAML have some inconsistencies. Could you take a look at what is happening here?

@cblmemo sky/task.py::Task.set_resources_override is setting new_resources to GCP({'tpu-v5-lite-podslice': 4}, accelerator_args={'runtime_version': 'tpu-vm-base'}), which does not exist, and this results in the issue you are encountering. It seems the resource override should update the cloud from GCP to Kubernetes as well, but such logic doesn't seem to exist. Do we currently allow this in SkyPilot?
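A toy illustration of the mismatch (not SkyPilot's code; the class and field names are made up): the override swaps the accelerators but keeps the cloud inferred from the YAML, so the requested resources can never equal the existing Kubernetes cluster's.

from dataclasses import dataclass, replace

@dataclass
class Res:
    cloud: str
    accelerators: dict

yaml_res = Res(cloud='gcp', accelerators={'tpu-v2-8': 1})  # cloud inferred from tpu-v2-8 in the YAML
requested = replace(yaml_res, accelerators={'tpu-v5-lite-podslice': 4})  # --gpus override keeps cloud='gcp'
existing = Res(cloud='kubernetes', accelerators={'tpu-v5-lite-podslice': 4})
print(requested == existing)  # False -> ResourcesMismatchError in the real flow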

@cblmemo (Collaborator) commented Nov 2, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not state the real reason (e.g., that only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result, like this?

@cblmemo Seems like this is consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason, just without the fuzzy suggestions. Is it supposed to display the fuzzy result when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

@cblmemo (Collaborator) commented Nov 2, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not state the real reason (e.g., that only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result, like this?

@cblmemo Seems like this is consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason, just without the fuzzy suggestions. Is it supposed to display the fuzzy result when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
(sky-serve) ➜  skypilot git:(new_provision_api) ✗ sky launch @temp/a.yaml --instance-type n2-standard-8
Task from YAML spec: @temp/a.yaml
ValueError: Invalid instance type 'n2-standard-8' for cloud AWS.
(sky-serve) ➜  skypilot git:(new_provision_api) ✗ cat @temp/a.yaml
resources:
  cloud: aws

At least we should show such error information? The current error is a little confusing to me...

Also, the current conflict is between two auto-filled clouds. If the user explicitly sets the cloud in the YAML and that causes a conflict, it sounds reasonable to me. But I would be surprised if I didn't set the cloud and two of SkyPilot's automatic cloud inferences conflicted with each other.

@landscapepainter (Collaborator, Author) commented Nov 2, 2024

When launching with --gpus tpu-v5-lite-podslice:1, it still detects 4 TPUs:

(tpuvm_mnist, pid=3324) I1028 20:50:43.360553 135160242983552 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
Is this expected (e.g., the 4 of them form a single basic scheduling unit)? If so, should we prevent users from specifying fewer than 4 TPUs (or whatever the per-host TPU count is)?

Also, the above example failed for me. Could you check this as well?

@cblmemo Not sure why that error was printing info about a TPU with 4 chips, but it's fixed in 688c0b4.
sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml does not work due to the issue mentioned here, but specifying the cloud as kubernetes works now: sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml:

$ sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml

...

(tpuvm_mnist, pid=2036) I1102 22:44:01.655345 136981829304960 train.py:148] epoch:  7, train_loss: 0.0167, train_accuracy: 99.47, test_loss: 0.0266, test_accuracy: 99.18
(tpuvm_mnist, pid=2036) I1102 22:44:03.135087 136981829304960 train.py:148] epoch:  8, train_loss: 0.0134, train_accuracy: 99.58, test_loss: 0.0260, test_accuracy: 99.16
(tpuvm_mnist, pid=2036) I1102 22:44:04.615064 136981829304960 train.py:148] epoch:  9, train_loss: 0.0117, train_accuracy: 99.65, test_loss: 0.0248, test_accuracy: 99.21
(tpuvm_mnist, pid=2036) I1102 22:44:06.100036 136981829304960 train.py:148] epoch: 10, train_loss: 0.0086, train_accuracy: 99.75, test_loss: 0.0268, test_accuracy: 99.14
✓ Job finished (status: SUCCEEDED).

📋 Useful Commands
Job ID: 1
├── To cancel the job:		sky cancel gke-tpu-4 1
├── To stream job logs:		sky logs gke-tpu-4 1
└── To view job queue:		sky queue gke-tpu-4

Cluster name: gke-tpu-4
├── To log into the head VM:	ssh gke-tpu-4
├── To submit a job:		sky exec gke-tpu-4 yaml_file
├── To stop the cluster:	sky stop gke-tpu-4
└── To teardown the cluster:	sky down gke-tpu-4
Tip: `sky down` will delete launched TPU(s) too.

Also, it no longer detects a TPU with 4 chips when a pod with 1 TPU chip is provisioned:

Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> print(jax.devices());
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0)]

@landscapepainter (Collaborator, Author) commented Nov 3, 2024

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

I just tested this again by specifying --cloud aws, and it behaves exactly the same as the Kubernetes error you saw on your end. We get the error below,

$ sky launch --gpus tpu-v5-lite-podslice:2 -c gke-tpu-2
No resource satisfying Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

, because SkyPilot knows that tpu-v5-lite-podslice is only available on Kubernetes, unlike A100.

So the fuzzy error message is supposed to appear only when the cloud is not specified and the accelerator is available in multiple clouds. I guess there isn't anything to file an issue for?

@landscapepainter (Collaborator, Author) commented:
@cblmemo @romilbhardwaj This is ready for another round. Thanks!!
