Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Azure] Fix to sync NSG status while opening ports #3844

Conversation

landscapepainter
Copy link
Collaborator

@landscapepainter landscapepainter commented Aug 18, 2024

This resolves #3843.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py --azure besides the ones failing from master branch.
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@landscapepainter landscapepainter changed the title [Azure] Fix to update NSG status while opening ports [Azure] Fix to sync NSG status while opening ports Aug 19, 2024
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @landscapepainter! Left two comments. : )

@@ -714,62 +717,83 @@ def open_ports(
subscription_id = provider_config['subscription_id']
resource_group = provider_config['resource_group']
network_client = azure.get_client('network', subscription_id)
nsg_name = _NSG_NAME.format(cluster_name_on_cloud=cluster_name_on_cloud)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we avoid assuming the nsg name for a cluster to be more robust to future changes, such #3845? We can just check all the NSG in the same resource group as we are using one resource group per cluster.

By doing that, let's keep the while and for loop in the same order as the original.

Copy link
Collaborator Author

@landscapepainter landscapepainter Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Michaelvll Thanks! I have two questions to address:

  1. If I understand correctly, it doesn't seem like we can keep the original order of while and for loop. It seems necessary to call list_network_security_groups(resource_group) within the while loop (while not nsg_created:) so that the NSG status is updated on each iteration. This ensures that the check if nsg.provisioning_state not in ['Creating', 'Updating']: accurately reflects the latest state within the for loop (for nsg in nsg_list:). However, I might be overlooking something. Please let me know if that’s the case.

  2. I can remove the check for specific nsg name from this PR, but this was also intended to be used for [Azure] Allow resource group specifiation for Azure instance provisioning #3764 as well. As you mentioned, we currently use one resource group per cluster, but this changes when resource group specification is used with sky serve. Serve controller and all the replicas would be created under the user specified resource group, and in this case, there would be multiple NSGs under the resource group. This was causing issues as opening ports while launching replica 4 was checking on NSG creation for replica 5 and got stuck.

    If losing track of the NSG name for the cluster throughout the launch cycle is your concern, we can keep on track of the NSG name while passing the name as a parameter to our ARM template instead of directly creating it from the ARM("nsgName": "[concat('sky-', parameters('clusterId'), '-nsg')]") and perhaps have that stored in provider_config as well, so we can read from it instead of using _NSG_NAME.format().

    2-1. Is it possible that we may need multiple NSGs for a single cluster in near future? We don't seem to do this now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 1, can we check if there is an Azure API that get a single nsg instead of listing all of them, so that we can keep getting that nsg within the while loop? The current loop order is unintuitive, looping through the nsg list within a retry loop. It would be more readable to do the retry loop for each NSG (although we only select a single NSG).

For 2, it makes sense when a resource group can be specified. In that case, how about we just create a function in azure/config.py that generates the NSG name, so it can be reused in both config.py and here? We have to be careful with backward compatibility then, if we do the exact match of the NSG. How about we allow a list of NSG names to support legacy names, such as ray-{clusterId}-... and the one that will be added with #3848

Copy link
Collaborator Author

@landscapepainter landscapepainter Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Michaelvll Following your review, I found an API method .get to directly retrieve the nsg object. But there's an issue with using this with multiple possible NSG names(legacy names). Attempting to get an nsg instance with the following snippet takes about ~15s to return a ResourceNotFoundError when given a non-existent nsg name:

nsg = network_client.network_security_groups.get(resource_group_name, nsg_name)

And we have to provide non-existent nsg name at most twice with two legacy nsg names, which adds ~30s, to provision VM. So, I instead kept the listing method to iterate through the NSG list to find a matching name among our legacy and current nsg names. This is done in a separate method, _get_cluster_nsg, so it also resolves your concern on readability.

Also, following your comment, I created a method to create NSG name with our current naming convention and used it within config.py::bootstrap_instances and instance.py::_get_cluster_nsg, which is called by instance.py::open_ports.

These are done at befcbc1

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @landscapepainter! Mostly looks good to me with some minor fixes for the function structure

sky/provision/azure/instance.py Show resolved Hide resolved
sky/provision/azure/config.py Outdated Show resolved Hide resolved
sky/provision/azure/config.py Outdated Show resolved Hide resolved
sky/provision/azure/config.py Outdated Show resolved Hide resolved
sky/provision/azure/instance.py Outdated Show resolved Hide resolved
landscapepainter and others added 6 commits September 10, 2024 20:08
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
@landscapepainter
Copy link
Collaborator Author

Thanks @Michaelvll! This is ready for another look.

@landscapepainter landscapepainter added this to the v0.7 milestone Oct 23, 2024
Copy link
Collaborator

@yika-luo yika-luo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested performance and basic sky operations. LGTM!

sky/provision/azure/instance.py Show resolved Hide resolved
sky/provision/azure/instance.py Show resolved Hide resolved
sky/provision/azure/instance.py Show resolved Hide resolved
@landscapepainter landscapepainter added this pull request to the merge queue Oct 25, 2024
Merged via the queue into skypilot-org:master with commit 057bc4b Oct 25, 2024
20 checks passed
@landscapepainter landscapepainter deleted the auzre-open-port-wait-nsg-creation-fix branch October 25, 2024 05:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Azure] Fix check for NSG creation while opening port
3 participants