Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Add Azure ML Compute Instance Support #3905

Open
wants to merge 9 commits into
base: master
Choose a base branch
from
Open

[Core] Add Azure ML Compute Instance Support #3905

wants to merge 9 commits into from

Conversation

cblmemo
Copy link
Collaborator

@cblmemo cblmemo commented Sep 3, 2024

Add Support for Azure ML Compute Instance.

Related: skypilot-org/skypilot-catalog#84

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • sky launch, sky stop, sky exec, sky start works well for Azure ML
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@cblmemo cblmemo changed the title [WIP][Core] Add Azure ML Compute Instance Support [Core] Add Azure ML Compute Instance Support Sep 19, 2024
@cblmemo cblmemo marked this pull request as ready for review September 19, 2024 21:33
@cblmemo
Copy link
Collaborator Author

cblmemo commented Sep 19, 2024

Added AZ ML Catalog. This is ready for a review! @Michaelvll

@cblmemo
Copy link
Collaborator Author

cblmemo commented Oct 1, 2024

@Michaelvll bump for this

@cblmemo
Copy link
Collaborator Author

cblmemo commented Oct 10, 2024

bump for review @Michaelvll

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the support of Azure ML cluster @cblmemo! It seems we are adding some specific logic for Azure ml cluster in Resources. Can we avoid those?

Comment on lines +2620 to +2634
# Check whether Azure cluster uses Azure ML API
if launched_resources.cloud.is_same_cloud(clouds.Azure()):
task_use_az_ml = skypilot_config.get_nested(('azure', 'use_az_ml'),
False)
cluster_use_az_ml = launched_resources.use_az_ml
if cluster_use_az_ml != task_use_az_ml:
task_str = 'uses' if task_use_az_ml else 'does not use'
cluster_str = 'uses' if cluster_use_az_ml else 'does not use'
with ux_utils.print_exception_no_traceback():
raise exceptions.ResourcesMismatchError(
f'Task requirements {task_str} Azure ML API, but the '
f'specified cluster {cluster_name} {cluster_str} it. '
f'Please set azure.use_az_ml to {cluster_use_az_ml} in '
'~/.sky/config.yaml.')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use the existing cluster's setting even if the config has changed to use ML API. This is to be consistent with other part of the code, e.g., if a user specify GCP DWS, and having a existing cluster not using DWS, we should still allow task to be submitted to that cluster.

Comment on lines +3529 to +3542
if handle.launched_resources.cloud.is_same_cloud(clouds.Azure()):
task_use_az_ml = skypilot_config.get_nested(('azure', 'use_az_ml'),
False)
cluster_use_az_ml = handle.launched_resources.use_az_ml
if cluster_use_az_ml != task_use_az_ml:
task_str = 'uses' if task_use_az_ml else 'does not use'
cluster_str = 'uses' if cluster_use_az_ml else 'does not use'
with ux_utils.print_exception_no_traceback():
raise exceptions.ResourcesMismatchError(
f'Current setup {task_str} Azure ML API, but the '
f'specified cluster {cluster_name} to terminate '
f'{cluster_str} it. Please set azure.use_az_ml '
f'to {cluster_use_az_ml} in ~/.sky/config.yaml.')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above, we should just allow the teardown using the non-ML API if the existing cluster was launched using non-ML API

Comment on lines +47 to +48
def _use_az_ml() -> bool:
return skypilot_config.get_nested(('azure', 'use_az_ml'), False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of checking the config in skypilot config, it would be better to keep the config in cluster yaml (e.g., provider_config section) and create instances purely based on the cluster yaml. In this case, it is easier to allow old clusters to still work when user change skypiot config.

Comment on lines +224 to +225
self._use_az_ml = skypilot_config.get_nested(('azure', 'use_az_ml'),
False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a cloud-specific setting in Resources is not clean. Can we get rid of it?

Comment on lines +997 to +1015
def _try_validate_az_ml(self) -> None:
if not self.use_az_ml:
return
if self.ports is not None:
with ux_utils.print_exception_no_traceback():
raise ValueError(
'Open ports are not supported for Azure Machine Learning.')
if (self.disk_tier is not None or
self.disk_size != _DEFAULT_DISK_SIZE_GB):
# Azure ML does not support custom disk size and disk tier.
# Reference: https://stackoverflow.com/questions/66923216/change-disk-type-azure-ml # pylint: disable=line-too-long
with ux_utils.print_exception_no_traceback():
raise ValueError('Custom disk size and disk tier are not '
'supported for Azure Machine Learning.')
if self.image_id is not None and self.extract_docker_image() is None:
with ux_utils.print_exception_no_traceback():
raise ValueError(
'Custom image is not supported for Azure Machine Learning.')

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not clean. Why can't we add the check to unsupported_features of a specific cloud?

Comment on lines +1603 to +1604
if version < 20:
self._use_az_ml = state.pop('_use_az_ml', False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid this.

Comment on lines 440 to 448
# Already checked in resources
assert ssh_proxy_command is None, (
'ssh_proxy_command is not supported when using docker.')
self.ip = 'localhost'
self.ssh_user = docker_user
self.port = constants.DEFAULT_DOCKER_PORT
self._docker_ssh_proxy_command = lambda ssh: ' '.join(
ssh + ssh_options_list(ssh_private_key, None
ssh + ssh_options_list(ssh_private_key, None, port=port
) + ['-W', '%h:%p', f'{ssh_user}@{ip}'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this an Azure ML cluster-specific change? Should we have another PR for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants