[k8s] On-demand single-host TPU support on GKE #3947

Open: wants to merge 70 commits into base: master. (Changes shown below are from 56 of the 70 commits.)

Commits (70, all by landscapepainter):
a929474  initial version of TPU support on GKE (Sep 16, 2024)
80e1877  revert unnecesary change (Sep 16, 2024)
70a07ab  revert (Sep 16, 2024)
0cba9a5  use TPU_LABEL_KEY constant (Sep 17, 2024)
17bcbd8  nit (Sep 17, 2024)
9233bf5  nit (Sep 17, 2024)
12e62c0  update detect_gpu_label_formatter() to use match_label_key() (Sep 17, 2024)
c795fe7  tidy get_gpu_label_key_value (Sep 17, 2024)
1c895f0  nit (Sep 17, 2024)
a8f5b6b  update method name (Sep 17, 2024)
bdb3469  update get_gke_accelerator_name to support TPU (Sep 17, 2024)
1d2d243  add support for get_label_keys method due to TPU label key (Sep 17, 2024)
92f4f38  syntax (Sep 17, 2024)
2662ec8  update get_tpu_topology_label_key_value (Sep 17, 2024)
58f8ad6  nit (Sep 20, 2024)
1cf82b6  refactor error surfacing methods to have it work with TPU support (Sep 20, 2024)
7b551c9  update toleration comment (Sep 21, 2024)
81a05ee  support listing available TPUs and show-gpus for TPUs (Sep 21, 2024)
e8764f1  nit (Sep 21, 2024)
3497aee  update help message (Sep 21, 2024)
724806a  Update /tmp/tpu_logs dir's write permission (Sep 22, 2024)
e8d73fe  nit (Sep 22, 2024)
7ac5036  nit (Sep 22, 2024)
4470dbe  comment update on TPU resource lackage error handling (Sep 22, 2024)
0860e45  Update to use global constant instead of hard coded string of nvidia.… (Sep 22, 2024)
35f3c80  add smoke test and make exec work on TPU pods (Sep 23, 2024)
2b56a9e  update smoke test to check if TPU is reachable. (Sep 24, 2024)
305705c  add comment (Sep 24, 2024)
c2b5bfc  nit (Sep 24, 2024)
2ba5537  Comment on number of requested TPU chips for multi- and single- host … (Sep 24, 2024)
92cd77d  update method to check GKE supported TPU name (Sep 24, 2024)
d085a5b  nit (Sep 24, 2024)
7860679  move is_tpu_pod_slice to kubernetes_utils (Sep 25, 2024)
96924a7  update get_accelerator_from_label_value to use is_tpu_pod_slice method (Sep 25, 2024)
1bbac21  nit (Sep 25, 2024)
4f7ea03  format (Sep 25, 2024)
16b6c29  nit (Sep 25, 2024)
ad5089f  Merge branch 'master' of https://github.com/landscapepainter/skypilot (Sep 26, 2024)
aa8efc3  Merge branch 'master' into k8s-tpu-support-on-gke (Sep 26, 2024)
e390843  check acc count support (Oct 18, 2024)
884f0a2  preemptive TPU check (Oct 18, 2024)
ee28466  Merge branch 'master' into k8s-tpu-support-on-gke (Oct 19, 2024)
11142e5  update check_tpu_fits (Oct 19, 2024)
de55663  error msg update (Oct 19, 2024)
a500555  merge get_tpu_topology_label_key_value into get_gpu_label_key_value (Oct 19, 2024)
bce8731  Update sky/provision/kubernetes/utils.py (Oct 19, 2024)
0e8366c  nit fixes (Oct 20, 2024)
f67ad0f  format (Oct 20, 2024)
05c37aa  nit (Oct 20, 2024)
06d3879  Implement method for reading acc counts from node/pod object (Oct 20, 2024)
9a2046c  assertion update for is_tpu_vm (Oct 20, 2024)
62b235f  Exclude multi-host TPUs to displayed from show-gpus (Oct 21, 2024)
4db1e63  Notify users that multi-host TPUs are not supported from 'sky show-gpus' (Oct 21, 2024)
5923f10  format (Oct 21, 2024)
fa2e670  nit (Oct 21, 2024)
c1ee117  display warning message from show-gpus conditionally (Oct 21, 2024)
cbce4d5  update sky show-gpus (Oct 23, 2024)
241efc0  update get_accelerator_label_key_value (Oct 25, 2024)
61b01d1  Merge branch 'master' into k8s-tpu-support-on-gke (Oct 25, 2024)
2fbb4eb  format (Oct 25, 2024)
5dc92f3  Merge branch 'master' into k8s-tpu-support-on-gke (Oct 26, 2024)
9e8d53d  format (Oct 26, 2024)
932e073  Merge branch 'master' into k8s-tpu-support-on-gke (Nov 1, 2024)
0a0eac2  format (Nov 1, 2024)
3bc95b9  Merge branch 'k8s-tpu-support-on-gke' of https://github.com/landscape… (Nov 1, 2024)
9dbaa72  update comment (Nov 1, 2024)
f5e1d37  resolve review comments (Nov 1, 2024)
688c0b4  update tpuvm_mnist.yaml (Nov 2, 2024)
2dec7f9  resolve comments (Nov 3, 2024)
dc23e88  update display message for show-gpus (Nov 4, 2024)
19 changes: 14 additions & 5 deletions sky/cli.py
@@ -3106,7 +3106,8 @@ def _get_kubernetes_realtime_gpu_table(
                    'in Kubernetes cluster. ')
        debug_msg = ('To show available accelerators on kubernetes,'
                     ' run: sky show-gpus --cloud kubernetes ')
-        full_err_msg = (err_msg + kubernetes_utils.NO_GPU_HELP_MESSAGE +
-                        debug_msg)
+        full_err_msg = (err_msg +
+                        kubernetes_utils.NO_ACCELERATOR_HELP_MESSAGE +
+                        debug_msg)
        raise ValueError(full_err_msg)
    for gpu, _ in sorted(counts.items()):
@@ -3123,9 +3124,9 @@ def _get_kubernetes_node_info_table(context: Optional[str]):
    node_info_dict = kubernetes_utils.get_kubernetes_node_info(context)
    for node_name, node_info in node_info_dict.items():
        node_table.add_row([
-            node_name, node_info.gpu_type,
-            node_info.total['nvidia.com/gpu'],
-            node_info.free['nvidia.com/gpu']
+            node_name, node_info.accelerator_type,
+            node_info.total['accelerator_count'],
+            node_info.free['accelerators_available']
        ])
    return node_table

@@ -3179,8 +3180,16 @@ def _output():
            yield from k8s_realtime_table.get_string()
            k8s_node_table = _get_kubernetes_node_info_table(context)
            yield '\n\n'
+            # TODO(Doyoung): Update the message with the multi-host TPU
+            # support.
+            k8s_per_node_acc_message = (
+                'Kubernetes per node accelerator availability ')
+            if kubernetes_utils.multi_host_tpu_exists_in_cluster(
+                    context):
+                k8s_per_node_acc_message += (
+                    '(Note: Multi-host TPUs are not supported.)')
cblmemo (Collaborator) commented:

Suggested change:
-            # TODO(Doyoung): Update the message with the multi-host TPU
-            # support.
-            k8s_per_node_acc_message = (
-                'Kubernetes per node accelerator availability ')
-            if kubernetes_utils.multi_host_tpu_exists_in_cluster(
-                    context):
-                k8s_per_node_acc_message += (
-                    '(Note: Multi-host TPUs are not supported.)')
+            # TODO(Doyoung): Update the message with the multi-host TPU
+            # support.
+            maybe_tpu_multi_host_hint = ''
+            if kubernetes_utils.multi_host_tpu_exists_in_cluster(
+                    context):
+                maybe_tpu_multi_host_hint = f'Detected {xxx} node...'

Should we say something like "detected xxx nodes that are using multi-host TPUs; skipped showing them"?
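(For concreteness, the proposed counting could look roughly like the sketch below. This is illustrative, not code from the PR: build_multi_host_tpu_hint is a hypothetical helper, while is_multi_host_tpu is the utility this PR adds elsewhere in the diff.)

def build_multi_host_tpu_hint(nodes) -> str:
    # Count the multi-host TPU nodes that show-gpus skips, so the hint can
    # say how many nodes were excluded instead of a bare "not supported".
    num_skipped = sum(
        1 for node in nodes
        if kubernetes_utils.is_multi_host_tpu(node.metadata.labels))
    if num_skipped == 0:
        return ''
    return (f'Detected {num_skipped} node(s) with a multi-host TPU setup; '
            'they are excluded from the display as multi-host TPUs are '
            'not supported.')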

landscapepainter (Collaborator, Author) replied on Nov 1, 2024:

@cblmemo I'm not convinced this is needed on top of the minimal note I already added, for the following reasons:

  1. There can be multiple multi-host TPUs in a user's GKE cluster. If there are, say, 10 of them, your suggestion would list out all of them. I'm not sure that is the best UX, as we are trying to keep things concise. If it's important info, we should add it, but..
  2. I also wonder whether this is necessary to begin with, since users of TPUs on GKE would know what a multi-host TPU is and whether one exists in their cluster.

cblmemo (Collaborator) replied:

My main concern here is that the message "Multi-host TPUs are not supported." does not convey that we excluded some nodes from the cluster listing, which might confuse users.

One way is to count the number of nodes (or the number of TPUs) and show something like "xxx nodes with a multi-host TPU setup are excluded from the resources".

landscapepainter (Collaborator, Author) replied:

@cblmemo I see. That makes sense, and I agree with the concern. I extended the message so that users are notified that the multi-host TPU nodes in their GKE cluster are excluded from the display.

Kubernetes per node accelerator availability (Note: Multi-host TPUs are detected and excluded from the display as multi-host TPUs are not supported.)
NODE_NAME                                  GPU_NAME              TOTAL_GPUS  FREE_GPUS
gke-mix-tpu-dy-default-pool-ad5bdc4d-9lw4  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-bs86  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-cfxn  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-nr5x  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-qgjt  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-rl37  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-v4ts  None                  0           0
gke-mix-tpu-dy-default-pool-ad5bdc4d-zp2x  None                  0           0
gke-tpu-a3716138-984x                      tpu-v5-lite-podslice  4           0
gke-tpu-c5117ac4-qfzt                      tpu-v5-lite-podslice  1           1

            yield (f'{colorama.Fore.CYAN}{colorama.Style.BRIGHT}'
-                   f'Kubernetes per node GPU availability'
+                   f'{k8s_per_node_acc_message}'
                    f'{colorama.Style.RESET_ALL}\n')
            yield from k8s_node_table.get_string()
            if kubernetes_autoscaling:
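The hunk above calls kubernetes_utils.multi_host_tpu_exists_in_cluster(), whose body is outside the lines shown. A plausible sketch of the underlying per-node check, assuming GKE's cloud.google.com/gke-tpu-topology node label (values like '2x4' or '2x2x4') and the google.com/tpu allocatable resource; the PR's actual helper may differ:

import math

def _node_is_multi_host_tpu(node) -> bool:
    # A TPU slice spans multiple hosts when the chip count implied by the
    # topology label exceeds the TPU chips allocatable on this single node.
    topology = node.metadata.labels.get('cloud.google.com/gke-tpu-topology')
    if topology is None:
        return False
    chips_in_slice = math.prod(int(dim) for dim in topology.split('x'))
    chips_on_node = int(node.status.allocatable.get('google.com/tpu', 0))
    return chips_on_node > 0 and chips_in_slice > chips_on_node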
17 changes: 14 additions & 3 deletions sky/clouds/kubernetes.py
@@ -365,11 +365,19 @@ def make_deploy_resources_variables(

        k8s_acc_label_key = None
        k8s_acc_label_value = None
+        k8s_topology_label_key = None
+        k8s_topology_label_value = None
+        tpu_requested = False

-        # If GPUs are requested, set node label to match the GPU type.
+        # If GPU/TPUs are requested, set node label to match the GPU/TPU type.
        if acc_count > 0 and acc_type is not None:
-            k8s_acc_label_key, k8s_acc_label_value = \
-                kubernetes_utils.get_gpu_label_key_value(context, acc_type)
+            (k8s_acc_label_key, k8s_acc_label_value, k8s_topology_label_key,
+             k8s_topology_label_value) = (
+                 kubernetes_utils.get_accelerator_label_key_value(
+                     context, acc_type, acc_count))
+            if (k8s_acc_label_key ==
+                    kubernetes_utils.GKELabelFormatter.TPU_LABEL_KEY):
+                tpu_requested = True

        port_mode = network_utils.get_port_mode(None)

@@ -431,6 +439,9 @@
            'k8s_skypilot_system_namespace': _SKYPILOT_SYSTEM_NAMESPACE,
            'k8s_spot_label_key': spot_label_key,
            'k8s_spot_label_value': spot_label_value,
+            'tpu_requested': tpu_requested,
+            'k8s_topology_label_key': k8s_topology_label_key,
+            'k8s_topology_label_value': k8s_topology_label_value,
            'image_id': image_id,
        }

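How these variables are consumed is outside this diff (the pod template), but the intent is roughly the following sketch; build_node_selector and the dict shape are illustrative, not the PR's template code, while the variable names come from the dict above:

def build_node_selector(deploy_vars: dict) -> dict:
    # GPU and TPU pods both pin the accelerator node label; TPU pods
    # additionally pin the slice topology so the pod lands in a node pool
    # with the matching slice shape.
    node_selector = {}
    if deploy_vars.get('k8s_acc_label_key'):
        node_selector[deploy_vars['k8s_acc_label_key']] = (
            deploy_vars['k8s_acc_label_value'])
    if deploy_vars.get('tpu_requested') and deploy_vars.get(
            'k8s_topology_label_key'):
        node_selector[deploy_vars['k8s_topology_label_key']] = (
            deploy_vars['k8s_topology_label_value'])
    return node_selector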
116 changes: 62 additions & 54 deletions sky/clouds/service_catalog/kubernetes_catalog.py
@@ -84,16 +84,16 @@ def list_accelerators_realtime(
    ) or not kubernetes_utils.check_credentials(context)[0]:
        return {}, {}, {}

-    has_gpu = kubernetes_utils.detect_gpu_resource(context)
+    has_gpu = kubernetes_utils.detect_accelerator_resource(context)
    if not has_gpu:
        return {}, {}, {}

-    label_formatter, _ = kubernetes_utils.detect_gpu_label_formatter(context)
-    if not label_formatter:
+    lf, _ = kubernetes_utils.detect_gpu_label_formatter(context)
+    if not lf:
        return {}, {}, {}

    accelerators_qtys: Set[Tuple[str, int]] = set()
-    key = label_formatter.get_label_key()
+    keys = lf.get_label_keys()
    nodes = kubernetes_utils.get_kubernetes_nodes(context)
    # Get the pods to get the real-time GPU usage
    pods = kubernetes_utils.get_all_pods_in_kubernetes_cluster(context)
@@ -104,56 +104,64 @@
    min_quantity_filter = quantity_filter if quantity_filter else 1

    for node in nodes:
-        if key in node.metadata.labels:
-            allocated_qty = 0
-            accelerator_name = label_formatter.get_accelerator_from_label_value(
-                node.metadata.labels.get(key))
-
-            # Check if name_filter regex matches the accelerator_name
-            regex_flags = 0 if case_sensitive else re.IGNORECASE
-            if name_filter and not re.match(
-                    name_filter, accelerator_name, flags=regex_flags):
-                continue
-
-            accelerator_count = int(
-                node.status.allocatable.get('nvidia.com/gpu', 0))
-
-            # Generate the GPU quantities for the accelerators
-            if accelerator_name and accelerator_count > 0:
-                for count in range(1, accelerator_count + 1):
-                    accelerators_qtys.add((accelerator_name, count))
-
-            for pod in pods:
-                # Get all the pods running on the node
-                if (pod.spec.node_name == node.metadata.name and
-                        pod.status.phase in ['Running', 'Pending']):
-                    # Iterate over all the containers in the pod and sum the
-                    # GPU requests
-                    for container in pod.spec.containers:
-                        if container.resources.requests:
-                            allocated_qty += int(
-                                container.resources.requests.get(
-                                    'nvidia.com/gpu', 0))
-
-            accelerators_available = accelerator_count - allocated_qty
-
-            if accelerator_count >= min_quantity_filter:
-                quantized_count = (min_quantity_filter *
-                                   (accelerator_count // min_quantity_filter))
-                if accelerator_name not in total_accelerators_capacity:
-                    total_accelerators_capacity[
-                        accelerator_name] = quantized_count
-                else:
-                    total_accelerators_capacity[
-                        accelerator_name] += quantized_count
-
-            if accelerator_name not in total_accelerators_available:
-                total_accelerators_available[accelerator_name] = 0
-            if accelerators_available >= min_quantity_filter:
-                quantized_availability = min_quantity_filter * (
-                    accelerators_available // min_quantity_filter)
-                total_accelerators_available[
-                    accelerator_name] += quantized_availability
+        for key in keys:
+            if key in node.metadata.labels:
+                allocated_qty = 0
+                accelerator_name = lf.get_accelerator_from_label_value(
+                    node.metadata.labels.get(key))
+
+                # Exclude multi-host TPUs from being processed.
+                # TODO(Doyoung): Remove the logic when adding support for
+                # multi-host TPUs.
+                if kubernetes_utils.is_multi_host_tpu(node.metadata.labels):
+                    continue
+
+                # Check if name_filter regex matches the accelerator_name
+                regex_flags = 0 if case_sensitive else re.IGNORECASE
+                if name_filter and not re.match(
+                        name_filter, accelerator_name, flags=regex_flags):
+                    continue
+
+                # Generate the GPU quantities for the accelerators
+                accelerator_count = (
+                    kubernetes_utils.get_node_accelerator_count(
+                        node.status.allocatable))
+                if accelerator_name and accelerator_count > 0:
+                    for count in range(1, accelerator_count + 1):
+                        accelerators_qtys.add((accelerator_name, count))
A collaborator commented:

Maybe a quick way of addressing the show-gpus issue is to change this logic to show only the exact count, not the range, if the accelerator type is a TPU:

                if accelerator_name and accelerator_count > 0:
                    if accelerator is TPU:
                        accelerators_qtys.add((accelerator_name, accelerator_count))
                    else:
                        for count in range(1, accelerator_count + 1):
                            accelerators_qtys.add((accelerator_name, count))
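(A concrete rendering of that pseudocode, assuming the PR's kubernetes_utils.is_tpu_on_gke() helper, which is used elsewhere in this diff, is the intended TPU check:)

                if accelerator_name and accelerator_count > 0:
                    if kubernetes_utils.is_tpu_on_gke(accelerator_name):
                        # TPU slices are requested whole: only the exact
                        # chip count of the slice is a valid quantity.
                        accelerators_qtys.add(
                            (accelerator_name, accelerator_count))
                    else:
                        for count in range(1, accelerator_count + 1):
                            accelerators_qtys.add((accelerator_name, count))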

landscapepainter (Collaborator, Author) replied on Oct 23, 2024:
@cblmemo @romilbhardwaj fixed at cbce4d5

$ sky show-gpus --cloud kubernetes
Kubernetes GPUs (context: gke_skypilot-375900_us-south1-a_mix-tpu-dy)
GPU                   QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
tpu-v5-lite-podslice  1, 4          5           5

Kubernetes per node accelerator availability (Note: Multi-host TPUs are not supported.)
NODE_NAME                                  GPU_NAME              TOTAL_GPUS  FREE_GPUS
gke-mix-tpu-dy-default-pool-439ab6e7-7vk4  None                  0           0
gke-mix-tpu-dy-default-pool-439ab6e7-fjdh  None                  0           0
gke-tpu-18503f8f-v441                      tpu-v5-lite-podslice  4           4
gke-tpu-5af36f0c-q74l                      tpu-v5-lite-podslice  1           1


+                for pod in pods:
+                    # Get all the pods running on the node
+                    if (pod.spec.node_name == node.metadata.name and
+                            pod.status.phase in ['Running', 'Pending']):
+                        # Iterate over all the containers in the pod and sum
+                        # the GPU requests
+                        for container in pod.spec.containers:
+                            if container.resources.requests:
+                                allocated_qty += (
+                                    kubernetes_utils.get_node_accelerator_count(
+                                        container.resources.requests))
+
+                accelerators_available = accelerator_count - allocated_qty
+
+                if accelerator_count >= min_quantity_filter:
+                    quantized_count = (
+                        min_quantity_filter *
+                        (accelerator_count // min_quantity_filter))
+                    if accelerator_name not in total_accelerators_capacity:
+                        total_accelerators_capacity[
+                            accelerator_name] = quantized_count
+                    else:
+                        total_accelerators_capacity[
+                            accelerator_name] += quantized_count
+
+                if accelerator_name not in total_accelerators_available:
+                    total_accelerators_available[accelerator_name] = 0
+                if accelerators_available >= min_quantity_filter:
+                    quantized_availability = min_quantity_filter * (
+                        accelerators_available // min_quantity_filter)
+                    total_accelerators_available[
+                        accelerator_name] += quantized_availability

result = []

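A worked example of the quantization arithmetic used above (values are illustrative): with quantity_filter=4, i.e. min_quantity_filter=4, a node with 7 free chips contributes one full group of 4, while a node with 3 free chips contributes nothing:

min_quantity_filter = 4
for accelerators_available in (7, 3):
    quantized = min_quantity_filter * (
        accelerators_available // min_quantity_filter)
    print(accelerators_available, '->', quantized)  # 7 -> 4, then 3 -> 0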
6 changes: 5 additions & 1 deletion sky/clouds/utils/gcp_utils.py
@@ -17,6 +17,7 @@
from sky import sky_logging
from sky import skypilot_config
from sky.provision.gcp import constants
+from sky.provision.kubernetes import utils as kubernetes_utils
from sky.utils import subprocess_utils

if typing.TYPE_CHECKING:
@@ -35,7 +36,10 @@ def is_tpu(resources: Optional['resources_lib.Resources']) -> bool:
def is_tpu_vm(resources: Optional['resources_lib.Resources']) -> bool:
    if not is_tpu(resources):
        return False
-    assert resources is not None
+    assert (resources is not None and len(resources.accelerators) == 1)
+    acc, _ = list(resources.accelerators.items())[0]
+    if kubernetes_utils.is_tpu_on_gke(acc):
+        return False
    if resources.accelerator_args is None:
        return True
    return resources.accelerator_args.get('tpu_vm', True)
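Illustrative behavior of the updated is_tpu_vm() (the accelerator names and Resources construction here are assumptions, not from this diff): a GKE TPU slice accelerator is routed to the Kubernetes code path and is not a Cloud TPU VM, while a GCP TPU accelerator still defaults to the TPU VM architecture:

from sky import resources as resources_lib

# Assumed names: 'tpu-v5-lite-podslice' is a GKE TPU slice; 'tpu-v4-8' is a
# GCP Cloud TPU.
gke_tpu = resources_lib.Resources(accelerators={'tpu-v5-lite-podslice': 4})
cloud_tpu = resources_lib.Resources(accelerators={'tpu-v4-8': 1})
assert not is_tpu_vm(gke_tpu)   # handled on GKE via Kubernetes, not a TPU VM
assert is_tpu_vm(cloud_tpu)     # accelerator_args defaults to tpu_vm=True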