
[k8s] On-demand single-host TPU support on GKE #3947

Open
wants to merge 70 commits into base: master
Conversation

@landscapepainter (Collaborator) commented Sep 16, 2024

One of our users requested a feature to use spot TPUs on GKE. This is an initial step toward that request, adding support for on-demand single-host TPUs.

This PR does not include support for:

  • multi-host TPU support
  • autoscaler support
  • spot TPU support

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Using a GKE cluster with 2 single-host TPU podslices of 1x1 topology and 2 CPU instances.
      • sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice --num-nodes 2 -y
      • sky launch --cloud kubernetes --cpus=2 -y
      • sky show-gpus --cloud kubernetes
    • Using a GKE cluster with 1 single-host TPU podslice of 2x2 topology and 2 CPU instances.
      • sky launch tpu_gke.yaml --cloud kubernetes --gpus tpu-v5-lite-podslice:4 -y
  • All smoke tests: pytest tests/test_smoke.py, besides the ones also failing on the master branch:
    • test_managed_jobs_storage
    • test_multiple_accelerators_unordered_with_default
    • test_skyserve_user_bug_restart
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
    • pytest tests/test_smoke.py::test_tpu_pod_slice_gke --kubernetes
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

tpu_gke.yaml:

file_mounts:
  /result:
    store: gcs
    name: tpu-mount-test-dy
    mode: MOUNT

setup: |
  git clone https://github.com/google/flax.git --branch v0.8.2

  conda activate flax
  if [ $? -eq 0 ]; then
    echo 'conda env exists'
  else
    conda create -n flax python=3.10 -y
    conda activate flax
    # Make sure to install TPU related packages in a conda env to avoid package conflicts.
    pip install \
      -f https://storage.googleapis.com/jax-releases/libtpu_releases.html "jax[tpu]==0.4.25" \
      clu \
      tensorflow tensorflow-datasets
    pip install -e flax
  fi

run: |
  conda activate flax
  pip install clu
  cd flax/examples/mnist
  python3 main.py --workdir=/tmp/mnist \
    --config=configs/default.py \
    --config.learning_rate=0.05 \
    --config.num_epochs=10 >> /result/output.log 2>&1

@landscapepainter changed the title from "[k8s] TPU support on GKE" to "[k8s] on-demand TPU support on GKE" on Sep 16, 2024
@landscapepainter marked this pull request as draft on September 16, 2024 09:25
@landscapepainter (Collaborator, Author) commented Oct 23, 2024

@cblmemo

Also, the following command should fail before the provisioning happens?

So there are two cases where a lack of resources causes a launch to fail:

  1. When there's no node in the cluster that meets the requirement for the requested resource.
    --> In this case, SkyPilot fails early, before attempting to provision, via the check_instance_fits check.
  2. When there's a node in the cluster that meets the requirement for the requested resource, but it is already in use and therefore not available.
    --> This fails while provisioning, as we rely on the scheduling log to detect this scenario.

The case you mentioned seems to be an edge case between 1 and 2: there is a node that meets the requirement (1x1), but the early fit check does not account for the request actually needing 2 nodes of 1x1, so the failure happens while provisioning. This behavior is currently consistent with other non-TPU resource requests, as you can see below (a toy sketch of the two-phase check follows the output).

$ sky launch --cloud kubernetes --cpus 16 --num-nodes 8
Considered resources (8 nodes):
------------------------------------------------------------------------------------------------------------------------------
 CLOUD        INSTANCE      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE                                  COST ($)   CHOSEN
------------------------------------------------------------------------------------------------------------------------------
 Kubernetes   16CPU--16GB   16      16        -              gke_skypilot-375900_us-south1-a_mix-tpu-dy   0.00          ✔
------------------------------------------------------------------------------------------------------------------------------
Launching a new cluster 'sky-7410-gcpuser'. Proceed? [Y/n]: y
⚙︎ Launching on Kubernetes.
W 10-23 02:33:43 instance.py:704] run_instances: Error occurred when creating pods: sky.provision.kubernetes.config.KubernetesError: Insufficient CPU capacity on the cluster. Required resources (cpu=16, memory=17179869184, nvidia.com/gpu=0) were not found in a single node. Other SkyPilot tasks or pods may be using resources. Check resource usage by running `kubectl describe nodes`. Full error: 0/6 nodes are available: 2 Insufficient cpu, 4 node(s) had untolerated taint {google.com/tpu: present}. preemption: 0/6 nodes are available: 2 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling.
sky.exceptions.ResourcesUnavailableError: Failed to acquire resources in all zones in gke_skypilot-375900_us-south1-a_mix-tpu-dy for {Kubernetes(cpus=16)}.

↺ Trying other potential resources.
⨯ Failed to provision resources. View logs at: ~/sky_logs/sky-2024-10-23-02-33-24-615729/provision.log

sky.exceptions.ResourcesUnavailableError: Failed to provision all possible launchable resources. Relax the task's resource requirements: 8x Kubernetes(cpus=16)
To keep retrying until the cluster is up, use the `--retry-until-up` flag.
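As a side note, here is a toy, self-contained sketch of the two-phase check described above (illustration only, not SkyPilot's actual code; the class and helper names are made up for the example):

from dataclasses import dataclass

@dataclass
class Node:
    cpus: int
    free_cpus: int

def try_launch(nodes: list, cpus_per_node: int, num_nodes: int) -> str:
    # Case 1: early check -- no node type in the cluster is ever big enough,
    # so we fail before attempting to provision anything (check_instance_fits).
    if not any(n.cpus >= cpus_per_node for n in nodes):
        return 'failed early: no node can ever fit the per-node request'
    # Case 2: some node could fit, but not enough capacity is free right now
    # (or, as in the edge case above, fewer free nodes than num_nodes).
    # This is only detected while provisioning, from the scheduling events.
    free_nodes = sum(1 for n in nodes if n.free_cpus >= cpus_per_node)
    if free_nodes < num_nodes:
        return 'failed while provisioning: scheduler reports insufficient capacity'
    return 'provisioned'

# Example mirroring the output above: 8 nodes of 16 vCPUs requested, but only
# 2 nodes in the cluster currently have that much room.
nodes = [Node(cpus=16, free_cpus=16)] * 2 + [Node(cpus=16, free_cpus=2)] * 4
print(try_launch(nodes, cpus_per_node=16, num_nodes=8))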

@landscapepainter added this to the v0.7 milestone on Oct 23, 2024
@landscapepainter (Collaborator, Author) commented:
@cblmemo @romilbhardwaj Successfully re-ran the smoke tests for both TPU clusters and GPU clusters.

@cblmemo (Collaborator) left a comment:

Thanks for the fix @landscapepainter! Mostly looks good to me, and I'm testing it now 🫡 Figured it's better to send these comments first.

sky/cli.py Outdated
Comment on lines 3183 to 3190
# TODO(Doyoung): Update the message with the multi-host TPU
# support.
k8s_per_node_acc_message = (
    'Kubernetes per node accelerator availability ')
if kubernetes_utils.multi_host_tpu_exists_in_cluster(
        context):
    k8s_per_node_acc_message += (
        '(Note: Multi-host TPUs are not supported.)')
Collaborator commented:

Suggested change
- # TODO(Doyoung): Update the message with the multi-host TPU
- # support.
- k8s_per_node_acc_message = (
-     'Kubernetes per node accelerator availability ')
- if kubernetes_utils.multi_host_tpu_exists_in_cluster(
-         context):
-     k8s_per_node_acc_message += (
-         '(Note: Multi-host TPUs are not supported.)')
+ # TODO(Doyoung): Update the message with the multi-host TPU
+ # support.
+ maybe_tpu_multi_host_hint = ''
+ if kubernetes_utils.multi_host_tpu_exists_in_cluster(
+         context):
+     maybe_tpu_multi_host_hint = f'Detected {xxx} node...'

Should we say something like "detected xxx nodes that are using multi-host TPUs, skipping them from the display"?

@landscapepainter (Collaborator, Author) commented Nov 1, 2024:

@cblmemo I'm not convinced this is needed on top of the minimal note I already added, for the following reasons:

  1. There can be multiple multi-host TPUs in a user's GKE cluster. If there are, say, 10 of them, your suggestion would list all of them. I'm not sure that's the best UX, since we're trying to keep things concise. If it's important info we should add it, but..
  2. I'm wondering whether this is necessary to begin with, since users of TPUs on GKE would know what a multi-host TPU is and whether one exists in their cluster.

Collaborator commented:

My main concern here is that the message "Multi-host TPUs are not supported." does not convey that "we excluded some nodes from your cluster", which might confuse users.

One way is to count the number of nodes (or the number of TPUs) and show something like "xxx nodes with a multi-host TPU setup are excluded from the resources."

Collaborator Author commented:

@cblmemo I see. That makes sense and I agree with the concern. I extended the message so that users are notified that the multi-host TPU nodes in their GKE cluster are excluded from the display.

Kubernetes per node accelerator availability (Note: Multi-host TPUs are detected and excluded from the display as multi-host TPUs are not supported.)
NODE_NAME                                  GPU_NAME              TOTAL_GPUS  FREE_GPUS
gke-mix-tpu-dy-default-pool-ad5bdc4d-9lw4  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-bs86  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-cfxn  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-nr5x  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-qgjt  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-rl37  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-v4ts  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-zp2x  None                  0           0
gke-tpu-a3716138-984x                      tpu-v5-lite-podslice  4           0
gke-tpu-c5117ac4-qfzt                      tpu-v5-lite-podslice  1           1

sky/cli.py (resolved)
sky/templates/kubernetes-ray.yml.j2 (resolved)
Comment on lines 421 to 425
{% if tpu_requested %}
google.com/tpu: {{accelerator_count}}
{% else %}
nvidia.com/gpu: {{accelerator_count}}
{% endif %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump this again
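For context, the template branch quoted above only decides which Kubernetes extended-resource name the pod spec requests. A minimal Python illustration of that choice (not the PR's code):

def accelerator_resource(tpu_requested: bool, accelerator_count: int) -> dict:
    # GKE exposes TPUs as 'google.com/tpu' and NVIDIA GPUs as 'nvidia.com/gpu'.
    key = 'google.com/tpu' if tpu_requested else 'nvidia.com/gpu'
    return {key: str(accelerator_count)}

print(accelerator_resource(True, 4))   # {'google.com/tpu': '4'}
print(accelerator_resource(False, 8))  # {'nvidia.com/gpu': '8'}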

sky/templates/kubernetes-ray.yml.j2 (Outdated; resolved)
Comment on lines 724 to 735
if node_metadata_labels.get(
        label_formatter.TPU_LABEL_KEY) == acc_type:
    topology_label_key = (
        label_formatter.TPU_TOPOLOGY_LABEL_KEY)
    topology_value = node_metadata_labels.get(
        topology_label_key)
    assert topology_value is not None
    tpu_topology_chip_count = reduce_tpu_topology(
        topology_value)
    if tpu_topology_chip_count == acc_count:
        return (label, value, topology_label_key,
                topology_value)
Collaborator commented:

Is it possible to have a conflict between 4x1 and 2x2 topologies?

Collaborator Author commented:

@cblmemo Impossible; different topologies with an identical number of TPU chips are not available for a TPU instance.

Collaborator commented:

Could you elaborate on the "is not available"? Does that mean there is a one-to-one mapping from the number of cores to its topology? If that is the case, a comment plus a reference to some doc page would be helpful :))

@landscapepainter (Collaborator, Author) commented Nov 3, 2024:

@cblmemo I was not able to find a document that clarifies this, but you can observe it for single-host TPUs by looking at the topology options in the GCP console. There's a one-to-one mapping between the available topologies and the number of TPU chips. Left a comment.
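For reference, a sketch of what reduce_tpu_topology presumably does, plus the single-host v5e mapping it relies on (assumed implementation for illustration, not the PR's actual helper):

from functools import reduce

def reduce_tpu_topology(topology: str) -> int:
    # A topology string like '2x2' is reduced to its chip count by
    # multiplying the dimensions: '1x1' -> 1, '2x2' -> 4, '2x4' -> 8.
    return reduce(lambda x, y: x * y, (int(dim) for dim in topology.split('x')))

# Single-host tpu-v5-lite-podslice offers one topology per chip count, which
# is why matching on the reduced chip count is unambiguous.
for topo in ('1x1', '2x2', '2x4'):
    print(topo, '->', reduce_tpu_topology(topo))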

sky/provision/kubernetes/utils.py (4 outdated review threads, resolved)
@cblmemo (Collaborator) commented Oct 28, 2024

I tried to launch a task on an existing cluster but it failed with the following error. Manually commenting out the accelerators: tpu-v2-8 in the example YAML resolved the issue for me, but it seems we have an error when inferring the cloud for resources if the CLI and the task YAML have some inconsistencies. Could you take a look at what is happening here?

# Create a cluster
$ sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4
# Launch a task on it, manually override with --gpus
$ sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml
Task from YAML spec: /home/txia/skypilot/examples/tpu/tpuvm_mnist.yaml
Missing runtime_version in accelerator_args, using default (tpu-vm-base)
sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
  Requested:    {1x GCP({'tpu-v5-lite-podslice': 4}, accelerator_args={'runtime_version': 'tpu-vm-base'})}
  Existing:     1x Kubernetes(2CPU--8GB--4tpu-v5-lite-podslice, {'tpu-v5-lite-podslice': 4}, accelerator_args={})
To fix: specify a new cluster name, or down the existing cluster first: sky down gke-tpu-4

sky/resources.py (Outdated; resolved)
@cblmemo (Collaborator) commented Oct 28, 2024

When launching with --gpus tpu-v5-lite-podslice:1, it still detects 4 TPUs:

(tpuvm_mnist, pid=3324) I1028 20:50:43.360553 135160242983552 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]

Is this expected (e.g., the 4 of them form a single basic scheduling unit)? If so, should we prevent users from specifying fewer than 4 TPUs (or whatever the per-host TPU count is)?

Also, the above example failed for me. Could you check this as well?

(tpuvm_mnist, pid=3325) Requirement already satisfied: clu in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (0.0.12)
(tpuvm_mnist, pid=3325) Requirement already satisfied: absl-py in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (2.1.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: etils[epath] in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (1.10.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: flax in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.8.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: jax in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.4.25)
(tpuvm_mnist, pid=3325) Requirement already satisfied: jaxlib in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.4.25)
(tpuvm_mnist, pid=3325) Requirement already satisfied: ml-collections in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (0.1.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: numpy in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (2.0.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: packaging in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (24.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: typing-extensions in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (4.12.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: wrapt in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from clu) (1.16.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: fsspec in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from etils[epath]->clu) (2024.10.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: importlib_resources in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from etils[epath]->clu) (6.4.5)
(tpuvm_mnist, pid=3325) Requirement already satisfied: zipp in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from etils[epath]->clu) (3.20.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: msgpack in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (1.1.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: optax in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (0.2.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: orbax-checkpoint in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (0.5.18)
(tpuvm_mnist, pid=3325) Requirement already satisfied: tensorstore in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (0.1.67)
(tpuvm_mnist, pid=3325) Requirement already satisfied: rich>=11.1 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (13.9.3)
(tpuvm_mnist, pid=3325) Requirement already satisfied: PyYAML>=5.4.1 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from flax->clu) (6.0.2)
(tpuvm_mnist, pid=3325) Requirement already satisfied: ml-dtypes>=0.2.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from jax->clu) (0.4.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: opt-einsum in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from jax->clu) (3.4.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: scipy>=1.9 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from jax->clu) (1.14.1)
(tpuvm_mnist, pid=3325) Requirement already satisfied: six in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from ml-collections->clu) (1.16.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: contextlib2 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from ml-collections->clu) (21.6.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: markdown-it-py>=2.2.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from rich>=11.1->flax->clu) (3.0.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: pygments<3.0.0,>=2.13.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from rich>=11.1->flax->clu) (2.18.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: chex>=0.1.86 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from optax->flax->clu) (0.1.86)
(tpuvm_mnist, pid=3325) Requirement already satisfied: nest_asyncio in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from orbax-checkpoint->flax->clu) (1.6.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: protobuf in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from orbax-checkpoint->flax->clu) (3.20.3)
(tpuvm_mnist, pid=3325) Requirement already satisfied: toolz>=0.9.0 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from chex>=0.1.86->optax->flax->clu) (1.0.0)
(tpuvm_mnist, pid=3325) Requirement already satisfied: mdurl~=0.1 in /home/sky/miniconda3/envs/flax/lib/python3.10/site-packages (from markdown-it-py>=2.2.0->rich>=11.1->flax->clu) (0.1.2)
(tpuvm_mnist, pid=3325) 
(tpuvm_mnist, pid=3325) A module that was compiled using NumPy 1.x cannot be run in
(tpuvm_mnist, pid=3325) NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
(tpuvm_mnist, pid=3325) versions of NumPy, modules must be compiled with NumPy 2.0.
(tpuvm_mnist, pid=3325) Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
(tpuvm_mnist, pid=3325) 
(tpuvm_mnist, pid=3325) If you are a user of the module, the easiest solution will be to
(tpuvm_mnist, pid=3325) downgrade to 'numpy<2' or try to upgrade the affected module.
(tpuvm_mnist, pid=3325) We expect that some modules will need time to support NumPy 2.
(tpuvm_mnist, pid=3325) 
(tpuvm_mnist, pid=3325) Traceback (most recent call last):  File "/home/sky/sky_workdir/flax/examples/mnist/main.py", line 25, in <module>
(tpuvm_mnist, pid=3325)     import jax
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/__init__.py", line 37, in <module>
(tpuvm_mnist, pid=3325)     import jax.core as _core
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/core.py", line 18, in <module>
(tpuvm_mnist, pid=3325)     from jax._src.core import (
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/core.py", line 38, in <module>
(tpuvm_mnist, pid=3325)     from jax._src import dtypes
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/dtypes.py", line 33, in <module>
(tpuvm_mnist, pid=3325)     from jax._src import config
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/config.py", line 27, in <module>
(tpuvm_mnist, pid=3325)     from jax._src import lib
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/lib/__init__.py", line 87, in <module>
(tpuvm_mnist, pid=3325)     import jaxlib.xla_client as xla_client
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jaxlib/xla_client.py", line 32, in <module>
(tpuvm_mnist, pid=3325)     from . import xla_extension as _xla
(tpuvm_mnist, pid=3325) AttributeError: _ARRAY_API not found
(tpuvm_mnist, pid=3325) 2024-10-28 20:47:33.959159: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
(tpuvm_mnist, pid=3325) WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
(tpuvm_mnist, pid=3325) E0000 00:00:1730148453.973354    3798 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
(tpuvm_mnist, pid=3325) E0000 00:00:1730148453.977656    3798 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
(tpuvm_mnist, pid=3325) 2024-10-28 20:47:35.939498: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
(tpuvm_mnist, pid=3325) I1028 20:47:41.511022 138348441096832 main.py:51] JAX process: 0 / 1
(tpuvm_mnist, pid=3325) I1028 20:47:41.511178 138348441096832 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
(tpuvm_mnist, pid=3325) I1028 20:47:41.511398 138348441096832 local.py:45] Setting task status: process_index: 0, process_count: 1
(tpuvm_mnist, pid=3325) I1028 20:47:41.511511 138348441096832 local.py:50] Created artifact workdir of type ArtifactType.DIRECTORY and value /tmp/mnist.
(tpuvm_mnist, pid=3325) I1028 20:47:42.905525 138348441096832 dataset_info.py:805] Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: mnist/3.0.1
(tpuvm_mnist, pid=3325) I1028 20:47:43.162729 138348441096832 dataset_info.py:617] Load dataset info from /tmp/tmpfaoy19getfds
(tpuvm_mnist, pid=3325) I1028 20:47:43.164796 138348441096832 dataset_info.py:709] For 'mnist/3.0.1': fields info.[citation, splits, supervised_keys, module_name] differ on disk and in the code. Keeping the one from code.
(tpuvm_mnist, pid=3325) I1028 20:47:43.165012 138348441096832 dataset_builder.py:644] Generating dataset mnist (/home/sky/tensorflow_datasets/mnist/3.0.1)
(tpuvm_mnist, pid=3325) Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /home/sky/tensorflow_datasets/mnist/3.0.1...
(tpuvm_mnist, pid=3325) I1028 20:47:43.289096 138348441096832 dataset_builder.py:693] Dataset mnist is hosted on GCS. It will automatically be downloaded to your
(tpuvm_mnist, pid=3325) local data directory. If you'd instead prefer to read directly from our public
(tpuvm_mnist, pid=3325) GCS bucket (recommended if you're running on GCP), you can instead pass
(tpuvm_mnist, pid=3325) `try_gcs=True` to `tfds.load` or set `data_dir=gs://tfds-data/datasets`.
(tpuvm_mnist, pid=3325) 
Dl Completed...: 100%|██████████| 5/5 [00:00<00:00, 14.90 file/s]:00<00:00,  4.84 file/s]
(tpuvm_mnist, pid=3325) I1028 20:47:43.678423 138348441096832 dataset_info.py:617] Load dataset info from /home/sky/tensorflow_datasets/mnist/incomplete.1RCFAI_3.0.1/
(tpuvm_mnist, pid=3325) I1028 20:47:43.679890 138348441096832 dataset_info.py:709] For 'mnist/3.0.1': fields info.[citation, splits, supervised_keys, module_name, file_format] differ on disk and in the code. Keeping the one from code.
(tpuvm_mnist, pid=3325) Dataset mnist downloaded and prepared to /home/sky/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.
(tpuvm_mnist, pid=3325) I1028 20:47:43.681097 138348441096832 reader.py:261] Creating a tf.data.Dataset reading 1 files located in folders: /home/sky/tensorflow_datasets/mnist/3.0.1.
(tpuvm_mnist, pid=3325) I1028 20:47:44.686084 138348441096832 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train, from /home/sky/tensorflow_datasets/mnist/3.0.1
(tpuvm_mnist, pid=3325) I1028 20:47:44.686940 138348441096832 reader.py:261] Creating a tf.data.Dataset reading 1 files located in folders: /home/sky/tensorflow_datasets/mnist/3.0.1.
(tpuvm_mnist, pid=3325) I1028 20:47:44.888163 138348441096832 logging_logger.py:49] Constructing tf.data.Dataset mnist for split test, from /home/sky/tensorflow_datasets/mnist/3.0.1
(tpuvm_mnist, pid=3325) Traceback (most recent call last):
(tpuvm_mnist, pid=3325)   File "/home/sky/sky_workdir/flax/examples/mnist/main.py", line 69, in <module>
(tpuvm_mnist, pid=3325)     app.run(main)
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/absl/app.py", line 308, in run
(tpuvm_mnist, pid=3325)     _run_main(main, args)
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
(tpuvm_mnist, pid=3325)     sys.exit(main(argv))
(tpuvm_mnist, pid=3325)   File "/home/sky/sky_workdir/flax/examples/mnist/main.py", line 64, in main
(tpuvm_mnist, pid=3325)     train.train_and_evaluate(FLAGS.config, FLAGS.workdir)
(tpuvm_mnist, pid=3325)   File "/home/sky/sky_workdir/flax/examples/mnist/train.py", line 130, in train_and_evaluate
(tpuvm_mnist, pid=3325)     train_ds, test_ds = get_datasets()
(tpuvm_mnist, pid=3325)   File "/home/sky/sky_workdir/flax/examples/mnist/train.py", line 105, in get_datasets
(tpuvm_mnist, pid=3325)     train_ds['image'] = jnp.float32(train_ds['image']) / 255.0
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/numpy/lax_numpy.py", line 152, in __call__
(tpuvm_mnist, pid=3325)     return asarray(x, dtype=self.dtype)
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/numpy/lax_numpy.py", line 2233, in asarray
(tpuvm_mnist, pid=3325)     return array(a, dtype=dtype, copy=bool(copy), order=order)  # type: ignore
(tpuvm_mnist, pid=3325)   File "/home/sky/miniconda3/envs/flax/lib/python3.10/site-packages/jax/_src/numpy/lax_numpy.py", line 2174, in array
(tpuvm_mnist, pid=3325)     out = np.array(object, dtype=dtype, ndmin=ndmin, copy=False)  # type: ignore[arg-type]
(tpuvm_mnist, pid=3325) ValueError: Unable to avoid copy while creating an array as requested.
(tpuvm_mnist, pid=3325) If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
(tpuvm_mnist, pid=3325) For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.
ERROR: Job 1 failed with return code list: [1] 
✓ Job finished (status: FAILED).

@cblmemo (Collaborator) commented Oct 28, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not state the real reason (e.g., that only tpu:4 and tpu:1 are available).

$ sky launch --gpus tpu-v5-lite-podslice:2 -c gke-tpu-2
No resource satisfying Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Should we show a similar fuzzy result, like this?

$ sky launch --gpus A100:3                             
No resource satisfying <Cloud>({'A100': 3}) on [Kubernetes, Lambda, GCP, Azure, AWS, RunPod].
Did you mean: ['A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:4', 'A100-80GB:8', 'A100:16', 'A100:4', 'A100:8']
sky.exceptions.ResourcesUnavailableError: Catalog and kubernetes cluster does not contain any instances satisfying the request: 1x <Cloud>({'A100': 3}).
To fix: relax or change the resource requirements.
Try one of these offered accelerators: ['A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:4', 'A100-80GB:8', 'A100:16', 'A100:4', 'A100:8']

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

@romilbhardwaj removed this from the v0.7 milestone on Oct 29, 2024
@landscapepainter (Collaborator, Author) commented:
Also, when I try to launch with 2 TPUs, it seems the error does not state the real reason (e.g., that only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result, like this?

@cblmemo Seems like this is consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason, just without the fuzzy suggestions. Is it supposed to display the fuzzy result when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

@landscapepainter (Collaborator, Author) commented:
I tried to launch a task on an existing cluster but it failed with the following error. Manually commenting out the accelerators: tpu-v2-8 in the example YAML resolved the issue for me, but it seems we have an error when inferring the cloud for resources if the CLI and the task YAML have some inconsistencies. Could you take a look at what is happening here?

@cblmemo sky/task.py::Task.set_resources_override is setting new_resources to GCP({'tpu-v5-lite-podslice': 4}, accelerator_args={'runtime_version': 'tpu-vm-base'}), which does not exist, and this results in the issue you are encountering. It seems the resource override should update the cloud from GCP to Kubernetes as well, but such logic doesn't seem to exist. Do we currently allow this in SkyPilot?
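A toy illustration of the mismatch (not SkyPilot's code; the class and field names are made up): the override swaps the accelerators but keeps the cloud inferred from the YAML, so the requested resources can never equal the existing Kubernetes cluster's.

from dataclasses import dataclass, replace

@dataclass
class Res:
    cloud: str
    accelerators: dict

yaml_res = Res(cloud='gcp', accelerators={'tpu-v2-8': 1})  # cloud inferred from tpu-v2-8 in the YAML
requested = replace(yaml_res, accelerators={'tpu-v5-lite-podslice': 4})  # --gpus override keeps cloud='gcp'
existing = Res(cloud='kubernetes', accelerators={'tpu-v5-lite-podslice': 4})
print(requested == existing)  # False -> ResourcesMismatchError in the real flow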

@cblmemo (Collaborator) commented Nov 2, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not state the real reason (e.g., that only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result, like this?

@cblmemo Seems like this is consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason, just without the fuzzy suggestions. Is it supposed to display the fuzzy result when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

@cblmemo (Collaborator) commented Nov 2, 2024

Also, when I try to launch with 2 TPUs, it seems the error does not state the real reason (e.g., that only tpu:4 and tpu:1 are available). ... Should we show a similar fuzzy result, like this?

@cblmemo Seems like this is consistent behavior for Kubernetes in general, not just for TPU support. The error does state the real reason, just without the fuzzy suggestions. Is it supposed to display the fuzzy result when a single cloud is specified as well? Anyway, this seems to be out of scope for this PR.

$ sky launch --gpus A100:4  --cloud kubernetes
No resource satisfying Kubernetes({'A100': 4}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'A100': 4}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.
(sky-serve) ➜  skypilot git:(new_provision_api) ✗ sky launch @temp/a.yaml --instance-type n2-standard-8
Task from YAML spec: @temp/a.yaml
ValueError: Invalid instance type 'n2-standard-8' for cloud AWS.
(sky-serve) ➜  skypilot git:(new_provision_api) ✗ cat @temp/a.yaml
resources:
  cloud: aws

At least we should show such error information? The current error is a little confusing to me...

Also, the current conflict is between two auto-filled clouds. If the user explicitly sets the cloud in the YAML and that causes a conflict, it sounds reasonable to me. But I would be surprised if I didn't set the cloud and two of SkyPilot's automatic cloud inferences conflicted with each other.

@landscapepainter (Collaborator, Author) commented Nov 2, 2024

When launching with --gpus tpu-v5-lite-podslice:1, it still detects 4 TPUs:

(tpuvm_mnist, pid=3324) I1028 20:50:43.360553 135160242983552 main.py:52] JAX local devices: [TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0), TpuDevice(id=1, process_index=0, coords=(1,0,0), core_on_chip=0), TpuDevice(id=2, process_index=0, coords=(0,1,0), core_on_chip=0), TpuDevice(id=3, process_index=0, coords=(1,1,0), core_on_chip=0)]
Is this expected (e.g., the 4 of them form a single basic scheduling unit)? If so, should we prevent users from specifying fewer than 4 TPUs (or whatever the per-host TPU count is)?

Also, the above example failed for me. Could you check this as well?

@cblmemo Not sure why that error was printing info about a TPU with 4 chips, but it's fixed in 688c0b4.
sky launch --gpus tpu-v5-lite-podslice:4 -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml does not work due to the issue mentioned here, but specifying the cloud as kubernetes works now: sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml:

$ sky launch --gpus tpu-v5-lite-podslice:4 --cloud kubernetes -c gke-tpu-4 examples/tpu/tpuvm_mnist.yaml

...

(tpuvm_mnist, pid=2036) I1102 22:44:01.655345 136981829304960 train.py:148] epoch:  7, train_loss: 0.0167, train_accuracy: 99.47, test_loss: 0.0266, test_accuracy: 99.18
(tpuvm_mnist, pid=2036) I1102 22:44:03.135087 136981829304960 train.py:148] epoch:  8, train_loss: 0.0134, train_accuracy: 99.58, test_loss: 0.0260, test_accuracy: 99.16
(tpuvm_mnist, pid=2036) I1102 22:44:04.615064 136981829304960 train.py:148] epoch:  9, train_loss: 0.0117, train_accuracy: 99.65, test_loss: 0.0248, test_accuracy: 99.21
(tpuvm_mnist, pid=2036) I1102 22:44:06.100036 136981829304960 train.py:148] epoch: 10, train_loss: 0.0086, train_accuracy: 99.75, test_loss: 0.0268, test_accuracy: 99.14
✓ Job finished (status: SUCCEEDED).

📋 Useful Commands
Job ID: 1
├── To cancel the job:		sky cancel gke-tpu-4 1
├── To stream job logs:		sky logs gke-tpu-4 1
└── To view job queue:		sky queue gke-tpu-4

Cluster name: gke-tpu-4
├── To log into the head VM:	ssh gke-tpu-4
├── To submit a job:		sky exec gke-tpu-4 yaml_file
├── To stop the cluster:	sky stop gke-tpu-4
└── To teardown the cluster:	sky down gke-tpu-4
Tip: `sky down` will delete launched TPU(s) too.

Also, it no longer detects a TPU with 4 chips when a pod with 1 TPU chip is provisioned:

Python 3.10.15 (main, Oct  3 2024, 07:27:34) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import jax
>>> print(jax.devices());
[TpuDevice(id=0, process_index=0, coords=(0,0,0), core_on_chip=0)]

@landscapepainter (Collaborator, Author) commented Nov 3, 2024

Got it. Maybe worth filing an issue for this and implementing it elsewhere ;)

I just tested this again by specifying --cloud aws, and it behaves exactly the same as the Kubernetes error you saw on your end. We get the error below,

$ sky launch --gpus tpu-v5-lite-podslice:2 -c gke-tpu-2
No resource satisfying Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}) on Kubernetes.
sky.exceptions.ResourcesUnavailableError: Kubernetes cluster does not contain any instances satisfying the request: 1x Kubernetes({'tpu-v5-lite-podslice': 2}, accelerator_args={}).
To fix: relax or change the resource requirements.

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

, because SkyPilot knows that tpu-v5-lite-podslice is only available on Kubernetes, unlike A100.

So the fuzzy error message is supposed to appear only when the cloud is not specified and the accelerator is available in multiple clouds. I guess there isn't anything to file an issue for?

@landscapepainter (Collaborator, Author) commented:
@cblmemo @romilbhardwaj This is ready for another round. Thanks!!
