
[Azure] Support fractional A10 instance types #3877

Merged: 29 commits into master, Oct 26, 2024

Conversation

cblmemo
Collaborator

@cblmemo cblmemo commented Aug 26, 2024

Closes #3708

This PR supports fractional A10 instance types via instance_type=xxx and accelerators=A10:{0.25,0.5,0.75}.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
# Launch cluster with
$ sky launch --instance-type Standard_NV6ads_A10_v5 -c sky-59bd-memory
# or
$ sky launch --gpus A10:0.25 -c sky-59bd-memory
# then
$ sky launch --gpus A10:0.24 -c sky-59bd-memory
sky.exceptions.ResourcesMismatchError: Task requested resources with fractional accelerator counts. For fractional counts, the required count must match the existing cluster. Got required accelerator A10:0.24 but the existing cluster has A10:0.25.
$ sky launch --gpus A10:1 -c sky-59bd-memory
sky.exceptions.ResourcesMismatchError: Requested resources do not match the existing cluster.
  Requested:    {1x <Cloud>({'A10': 1})}
  Existing:     1x Azure(Standard_NV6ads_A10_v5, {'A10': 0.25})
To fix: specify a new cluster name, or down the existing cluster first: sky down sky-59bd-memory
$ sky launch --gpus A10:0.25 -c sky-59bd-memory nvidia-smi
Task from command: nvidia-smi
Running task on cluster sky-59bd-memory...
W 08-28 11:16:57 cloud_vm_ray_backend.py:1937] Trying to launch an A10 cluster on Azure. This may take ~20 minutes due to driver installation.
I 08-28 11:16:57 cloud_vm_ray_backend.py:1314] To view detailed progress: tail -n100 -f /home/memory/sky_logs/sky-2024-08-28-11-16-56-005018/provision.log
I 08-28 11:16:58 provisioner.py:65] Launching on Azure eastus (all zones)
I 08-28 11:17:07 provisioner.py:450] Successfully provisioned or found existing instance.
I 08-28 11:17:19 provisioner.py:552] Successfully provisioned cluster: sky-59bd-memory
I 08-28 11:17:21 cloud_vm_ray_backend.py:3294] Job submitted with Job ID: 3
I 08-28 18:17:22 log_lib.py:412] Start streaming logs for job 3.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['10.8.0.4']
(sky-cmd, pid=15884) Wed Aug 28 18:17:23 2024       
(sky-cmd, pid=15884) +---------------------------------------------------------------------------------------+
(sky-cmd, pid=15884) | NVIDIA-SMI 535.161.08             Driver Version: 535.161.08   CUDA Version: 12.2     |
(sky-cmd, pid=15884) |-----------------------------------------+----------------------+----------------------+
(sky-cmd, pid=15884) | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
(sky-cmd, pid=15884) | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
(sky-cmd, pid=15884) |                                         |                      |               MIG M. |
(sky-cmd, pid=15884) |=========================================+======================+======================|
(sky-cmd, pid=15884) |   0  NVIDIA A10-4Q                  On  | 00000002:00:00.0 Off |                    0 |
(sky-cmd, pid=15884) | N/A   N/A    P0              N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
(sky-cmd, pid=15884) |                                         |                      |             Disabled |
(sky-cmd, pid=15884) +-----------------------------------------+----------------------+----------------------+
(sky-cmd, pid=15884)                                                                                          
(sky-cmd, pid=15884) +---------------------------------------------------------------------------------------+
(sky-cmd, pid=15884) | Processes:                                                                            |
(sky-cmd, pid=15884) |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
(sky-cmd, pid=15884) |        ID   ID                                                             Usage      |
(sky-cmd, pid=15884) |=======================================================================================|
(sky-cmd, pid=15884) |  No running processes found                                                           |
(sky-cmd, pid=15884) +---------------------------------------------------------------------------------------+
INFO: Job finished (status: SUCCEEDED).
Shared connection to 23.101.130.81 closed.
I 08-28 11:17:24 cloud_vm_ray_backend.py:3329] Job ID: 3
I 08-28 11:17:24 cloud_vm_ray_backend.py:3329] To cancel the job:       sky cancel sky-59bd-memory 3
I 08-28 11:17:24 cloud_vm_ray_backend.py:3329] To stream job logs:      sky logs sky-59bd-memory 3
I 08-28 11:17:24 cloud_vm_ray_backend.py:3329] To view the job queue:   sky queue sky-59bd-memory
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] 
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] Cluster name: sky-59bd-memory
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] To log into the head VM: ssh sky-59bd-memory
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] To submit a job:         sky exec sky-59bd-memory yaml_file
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] To stop the cluster:     sky stop sky-59bd-memory
I 08-28 11:17:24 cloud_vm_ray_backend.py:3425] To teardown the cluster: sky down sky-59bd-memory
Clusters
NAME             LAUNCHED        RESOURCES                                        STATUS  AUTOSTOP  COMMAND                       
sky-59bd-memory  a few secs ago  1x Azure(Standard_NV6ads_A10_v5, {'A10': 0.25})  UP      -         sky launch -c sky-59bd-me... 
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Comment on lines 151 to 153
# Filter out instance types that only contain a fraction of a GPU.
df_filtered = _df.loc[~_df['InstanceType'].isin(_FILTERED_A10_INSTANCE_TYPES)]
Collaborator

Instead of excluding the instances directly, can we print out some hints like the one when we specify sky launch --gpus L4:

Multiple AWS instances satisfy L4:1. The cheapest AWS(g6.xlarge, {'L4': 1}) is considered among:
I 08-27 06:09:54 optimizer.py:922] ['g6.xlarge', 'g6.2xlarge', 'g6.4xlarge', 'gr6.4xlarge', 'g6.8xlarge', 'gr6.8xlarge', 'g6.16xlarge'].

Collaborator Author

This hint is used to print instances with the same accelerator count. I'm wondering whether we should do the same for fractional GPUs.

@cblmemo cblmemo changed the title from "[Azure] Support fractional A10 instance types only from instance_type=xxx" to "[Azure] Support fractional A10 instance types" on Aug 28, 2024
@cblmemo
Collaborator Author

cblmemo commented Aug 28, 2024

Added support for launching with --gpus A10:0.25, allowing only strictly equal counts for fractional GPU requirements. Also updated the tests in the PR description. PTAL!

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the update @cblmemo! Mostly looks good to me, with a few minor issues.

Comment on lines 283 to 285
# Manually update the GPU count for fractional A10 instance types.
df_ret['AcceleratorCount'] = df_ret.apply(_upd_a10_gpu_count, axis='columns')
Collaborator

Could we say more in the comment for why we need to do it manually?

Collaborator Author

Good point! Added. PTAL

sky/clouds/service_catalog/scp_catalog.py
sky/clouds/azure.py
sky/resources.py Outdated
Comment on lines 1145 to 1154
if isinstance(self.accelerators[acc], float) or isinstance(
        other_accelerators[acc], float):
    # If the requested accelerator count is a float, we only
    # allow strictly equal counts since all of the floating point
    # accelerator counts are less than 1 (e.g., 0.1, 0.5), and
    # we want to avoid semantic ambiguity (e.g., launching
    # with --gpus A10:0.25 on an A10:0.75 cluster).
    if not math.isclose(self.accelerators[acc],
                        other_accelerators[acc]):
        return False
Collaborator

We should allow the requested resources to be a float while the existing accelerators are an int, as long as the requested count is <= the existing count.

That said, the

Suggested change
-if isinstance(self.accelerators[acc], float) or isinstance(
-        other_accelerators[acc], float):
+if isinstance(other_accelerators[acc], float) and not other_accelerators[acc].is_integer():
     # If the requested accelerator count is a float, we only
     # allow strictly equal counts since all of the float point
     # accelerator counts are less than 1 (e.g., 0.1, 0.5), and
     # we want to avoid semantic ambiguity (e.g. launching
     # with --gpus A10:0.25 on a A10:0.75 cluster).
     if not math.isclose(self.accelerators[acc],
                         other_accelerators[acc]):
         return False

Collaborator Author

Good point! Updated. Thanks!

Collaborator Author

Actually, on second thought, I think we should keep the original isinstance(self.accelerators[acc], float) or isinstance(other_accelerators[acc], float) condition. Consider the following case: the user submits jobs with --gpus A10:0.5 and the cluster has A10:1. The requirement 0.5 would be translated to 1, so the user can only run one A10:0.5 job instead of 2, which is confusing. The original condition captures this case, but the updated one (isinstance(other_accelerators[acc], float) and not other_accelerators[acc].is_integer()) does not.
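The condition being argued for here can be sketched as a minimal stand-alone check (simplified from the PR; counts_compatible is a hypothetical name, not the actual method):

```python
import math
from typing import Dict, Union

Count = Union[int, float]


def counts_compatible(requested: Dict[str, Count],
                      existing: Dict[str, Count]) -> bool:
    """True if the requested accelerator counts fit the existing cluster.

    If either side is a float, require strict (isclose) equality to
    avoid semantic ambiguity such as --gpus A10:0.25 on an A10:0.75
    cluster; otherwise allow requested <= existing as usual.
    """
    for acc, req in requested.items():
        if acc not in existing:
            return False
        have = existing[acc]
        if isinstance(req, float) or isinstance(have, float):
            if not math.isclose(req, have):
                return False
        elif req > have:
            return False
    return True
```

Note how the float-on-either-side rule also rejects a 0.5 request against a 1.0 cluster, which is exactly the confusing case described above.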

Collaborator

We did allow having two A10:0.5 jobs running on a single cluster with A10:1. Do you know when we changed this behavior? Or did we ever change it before this PR?

Collaborator

We need to fix this before we merge the PR

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for adding the support @cblmemo! It looks good to me. Please run some tests to make sure the changes do not cause issues with other clouds and other accelerator types (considering we have changed a significant number of places).

Comment on lines 2678 to 2683
'Task requested resources with fractional '
'accelerator counts. For fractional '
'counts, the required count must match the '
'existing cluster. Got required accelerator'
f' {acc}:{self_count} but the existing '
f'cluster has {acc}:{existing_count}.')
Collaborator

Isn't this error message inaccurate? Our check is on the accelerator count of the existing cluster, not the task's requested resources.

Collaborator Author

Please see the above comments 🤔

sky/resources.py Outdated
Comment on lines 1145 to 1154 (the same snippet quoted above)
Collaborator

We need to fix this before we merge the PR

sky/resources.py Outdated
sky/backends/cloud_vm_ray_backend.py Outdated
@cblmemo
Collaborator Author

cblmemo commented Sep 11, 2024

Just identified another bug: for an A10:0.5 cluster, the previous implementation would force using --gpus A10:0.5 with sky exec, which could actually have 2 jobs running simultaneously since the ray cluster has resources A10:1. Fixed by setting the GPU demand to its ceiling value (essentially 1) whenever we detect a fractional cluster.
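The fix described here amounts to rounding the scheduler-side demand up to the whole ray custom resource. A minimal sketch, assuming the remote ray cluster always registers the accelerator as 1 unit (the helper name is hypothetical):

```python
import math


def ray_gpu_demand(requested_count: float) -> int:
    """GPU demand to submit to ray for a job on a fractional cluster.

    A cluster launched with A10:0.5 registers A10:1 as a ray custom
    resource, so a job requesting the full fraction must claim the whole
    unit; otherwise two such jobs could run at once.
    """
    return math.ceil(requested_count)
```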

cblmemo and others added 2 commits September 11, 2024 00:04
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
@cblmemo
Collaborator Author

cblmemo commented Sep 11, 2024

Actually, on second thought, I think we could even allow requesting --gpus A10:0.25 on an A10:0.5 cluster - we just need to convert it. The actual required number of GPUs can be calculated as {required_count} / {cluster_acc_count} * 1, since we set the remote ray cluster's custom resources to A10:1. For this example it would require A10:0.5, so that two --gpus A10:0.25 jobs can run simultaneously on an A10:0.5 cluster.
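Under this refined scheme, the demand submitted to ray becomes the requested fraction divided by the cluster's fraction. A sketch, again assuming the remote ray custom resource is registered as 1 unit (the function name is illustrative):

```python
def translated_ray_demand(required_count: float,
                          cluster_acc_count: float) -> float:
    """Fraction of the ray custom resource (registered as 1) to request.

    E.g. --gpus A10:0.25 on an A10:0.5 cluster translates to a demand
    of 0.5, so two such jobs can run simultaneously.
    """
    return required_count / cluster_acc_count
```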

@cblmemo
Collaborator Author

cblmemo commented Sep 11, 2024

Another TODO: I just found that there are 1/6 and 1/3 A10 instance types. We need to decide on a precision for displaying such decimals.
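One possible convention is rounding to three decimal places, which matches the A10:0.167 / A10:0.333 strings in the later logs; the helper below is only a sketch of that idea, not the merged code:

```python
def format_acc_count(count: float) -> str:
    """Render an accelerator count with up to three decimal places.

    Three decimals are enough to distinguish 1/6 (0.167) from
    1/3 (0.333), while whole counts like 1 or 2 stay clean.
    """
    if float(count).is_integer():
        return str(int(count))
    # Trim trailing zeros (0.500 -> 0.5) and a dangling decimal point.
    return f'{count:.3f}'.rstrip('0').rstrip('.')
```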

@cblmemo
Collaborator Author

cblmemo commented Sep 12, 2024

All TODOs are done. It should be ready to merge after the smoke tests ;)

            return int(value)
        return float(value)

    return {acc_name: _convert(acc_count)}


def get_instance_type_for_accelerator_impl(
Collaborator

Should we update the acc_count type here? Also, when comparing acc_count, should we make sure every number within abs(df['AcceleratorCount'] - acc_count) <= 0.01 works? Otherwise, a user running sky launch --gpus A10:0.16 or sky launch --gpus A10:0.1666 would fail.

Collaborator Author

Should we update the acc_count type here?

Sorry, could you elaborate on this? Are you saying there is a better place to update the type?

Also, when comparing acc_count, should we make sure every number within abs(df['AcceleratorCount'] - acc_count) <= 0.01 works?

For this, I'm slightly concerned about the case where the user runs:

sky launch -c a10-frac --gpus A10:0.16 # detected as 0.167 in the catalog, so the cluster launches with 0.167 GPU
sky exec a10-frac --gpus A10:0.16 sleep 100000 # the user would think the cluster is full
sky exec a10-frac --gpus A10:0.007 sleep 100000 # however, this still runs, as the cluster was actually launched with 0.167 GPU

To deal with the failure, we currently show all valid instance types as fuzzy candidates, and the user can then modify their accelerator count:

sky launch --gpus A10:0.16
I 09-12 22:34:50 optimizer.py:1301] No resource satisfying <Cloud>({'A10': 0.16}) on [AWS, GCP, Azure, RunPod].
I 09-12 22:34:50 optimizer.py:1305] Did you mean: ['A100-80GB-SXM:1', 'A100-80GB-SXM:2', 'A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100-80GB:8', 'A100:1', 'A100:16', 'A100:2', 'A100:4', 'A100:8', 'A10:0.167', 'A10:0.333', 'A10:0.5', 'A10:1', 'A10:2', 'A10G:1', 'A10G:4', 'A10G:8']
sky.exceptions.ResourcesUnavailableError: Catalog does not contain any instances satisfying the request:
Task<name=sky-cmd>(run=<empty>)
  resources: <Cloud>({'A10': 0.16}).

To fix: relax or change the resource requirements.
Try one of these offered accelerators: ['A100-80GB-SXM:1', 'A100-80GB-SXM:2', 'A100-80GB-SXM:4', 'A100-80GB-SXM:8', 'A100-80GB:1', 'A100-80GB:2', 'A100-80GB:4', 'A100-80GB:8', 'A100:1', 'A100:16', 'A100:2', 'A100:4', 'A100:8', 'A10:0.167', 'A10:0.333', 'A10:0.5', 'A10:1', 'A10:2', 'A10G:1', 'A10G:4', 'A10G:8']

Hint: sky show-gpus to list available accelerators.
      sky check to check the enabled clouds.

Does that sound good to you?

Collaborator

Sorry, could you elaborate on this? Are you saying there is a better place to update the type?

acc_count is currently int in the type annotation below. Should we update that?

For this, I'm slightly concerned about the case where the user runs:
sky launch -c a10-frac --gpus A10:0.16 # detected as 0.167 in the catalog, so the cluster launches with 0.167 GPU
sky exec a10-frac --gpus A10:0.16 sleep 100000 # the user would think the cluster is full
sky exec a10-frac --gpus A10:0.007 sleep 100000 # however, this still runs, as the cluster was actually launched with 0.167 GPU
To deal with the failure, we currently show all valid instance types as fuzzy candidates, and the user can then modify their accelerator count:

This will only apply to the case where a user is actually creating an instance with A10:0.16, right? When launching an instance, once we return the instance type, we can round the request up to the actual acc_count in the catalog.
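The round-up agreed on here could look like the following sketch; the helper name and the 0.01 fuzz tolerance are assumptions drawn from this discussion, not the merged code:

```python
from typing import List, Optional


def round_up_to_catalog(requested: float,
                        catalog_counts: List[float],
                        fuzz: float = 0.01) -> Optional[float]:
    """Snap a requested fractional count to an actual catalog count.

    A request within `fuzz` below a catalog entry (e.g. A10:0.16
    against the catalog's 0.167) resolves to that entry; otherwise the
    smallest catalog count covering the request is chosen. Returns None
    if no catalog count is large enough.
    """
    for count in sorted(catalog_counts):
        if count >= requested - fuzz:
            return count
    return None
```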

Collaborator Author

@cblmemo cblmemo Oct 11, 2024

acc_count is currently int in the type annotation below. Should we update that?

Done!

This will only apply to the case where a user is actually creating an instance with A10:0.16, right? When launching an instance, once we return the instance type, we can round the request up to the actual acc_count in the catalog.

Good point! Done. PTAL!

@cblmemo
Collaborator Author

cblmemo commented Oct 1, 2024

@Michaelvll bump for this - will fix the conflict soon

@cblmemo
Collaborator Author

cblmemo commented Oct 10, 2024

bump for review @Michaelvll

sky/backends/cloud_vm_ray_backend.py Outdated
Collaborator

@Michaelvll Michaelvll left a comment

Thanks @cblmemo! LGTM. This should be good to go once the tests pass.

sky/clouds/service_catalog/common.py Outdated
cblmemo and others added 3 commits October 25, 2024 13:14
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
@cblmemo
Collaborator Author

cblmemo commented Oct 25, 2024

Manual test passed. Running smoke test now!

@cblmemo
Collaborator Author

cblmemo commented Oct 26, 2024

Most of the smoke tests passed. The remaining failures are #4192 (an AWS bucket permission issue) and the TPU tests, which fail due to quota constraints; neither should be relevant to this PR. Merging now.

@cblmemo cblmemo added this pull request to the merge queue Oct 26, 2024
Merged via the queue into master with commit 647fcea Oct 26, 2024
20 checks passed
@cblmemo cblmemo deleted the support-fractional-a10 branch October 26, 2024 20:39
Successfully merging this pull request may close these issues.

[Catalog] Special instance type on Azure only holds a fractional of GPU but tagged as one whole GPU in catalog