-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Azure] Fix to sync NSG status while opening ports #3844
Merged
landscapepainter
merged 16 commits into
skypilot-org:master
from
landscapepainter:auzre-open-port-wait-nsg-creation-fix
Oct 25, 2024
Merged
Changes from 3 commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
b818282
fix to update NSG status while opening ports
landscapepainter c061595
nit
landscapepainter d644551
format
landscapepainter 514e080
Merge branch 'master' into auzre-open-port-wait-nsg-creation-fix
landscapepainter b6edd1e
Merge branch 'master' into auzre-open-port-wait-nsg-creation-fix
landscapepainter befcbc1
refactor check for nsg creation
landscapepainter dc36eab
format
landscapepainter c561619
nit
landscapepainter 9789d47
format
landscapepainter 63efed5
Update sky/provision/azure/config.py
landscapepainter a859613
Update sky/provision/azure/instance.py
landscapepainter a90ed33
Update sky/provision/azure/config.py
landscapepainter 25994a1
Update sky/provision/azure/config.py
landscapepainter 8c2d387
format
landscapepainter 51a17d5
Merge branch 'master' into auzre-open-port-wait-nsg-creation-fix
landscapepainter 2735d79
additional TODO comments
landscapepainter File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we avoid assuming the nsg name for a cluster to be more robust to future changes, such #3845? We can just check all the NSG in the same resource group as we are using one resource group per cluster.
By doing that, let's keep the
while
andfor
loop in the same order as the original.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Michaelvll Thanks! I have two questions to address:
If I understand correctly, it doesn't seem like we can keep the original order of
while
andfor
loop. It seems necessary to calllist_network_security_groups(resource_group)
within the while loop (while not nsg_created:
) so that the NSG status is updated on each iteration. This ensures that the checkif nsg.provisioning_state not in ['Creating', 'Updating']:
accurately reflects the latest state within the for loop (for nsg in nsg_list:
). However, I might be overlooking something. Please let me know if that’s the case.I can remove the check for specific nsg name from this PR, but this was also intended to be used for [Azure] Allow resource group specifiation for Azure instance provisioning #3764 as well. As you mentioned, we currently use one resource group per cluster, but this changes when resource group specification is used with
sky serve
. Serve controller and all the replicas would be created under the user specified resource group, and in this case, there would be multiple NSGs under the resource group. This was causing issues as opening ports while launching replica 4 was checking on NSG creation for replica 5 and got stuck.If losing track of the NSG name for the cluster throughout the
launch
cycle is your concern, we can keep on track of the NSG name while passing the name as a parameter to our ARM template instead of directly creating it from the ARM("nsgName": "[concat('sky-', parameters('clusterId'), '-nsg')]"
) and perhaps have that stored inprovider_config
as well, so we can read from it instead of using_NSG_NAME.format()
.2-1. Is it possible that we may need multiple NSGs for a single cluster in near future? We don't seem to do this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 1, can we check if there is an Azure API that get a single
nsg
instead of listing all of them, so that we can keep getting that nsg within the while loop? The current loop order is unintuitive, looping through the nsg list within a retry loop. It would be more readable to do the retry loop for each NSG (although we only select a single NSG).For 2, it makes sense when a resource group can be specified. In that case, how about we just create a function in
azure/config.py
that generates the NSG name, so it can be reused in bothconfig.py
and here? We have to be careful with backward compatibility then, if we do the exact match of the NSG. How about we allow a list of NSG names to support legacy names, such asray-{clusterId}-...
and the one that will be added with #3848There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Michaelvll Following your review, I found an API method
.get
to directly retrieve the nsg object. But there's an issue with using this with multiple possible NSG names(legacy names). Attempting to get an nsg instance with the following snippet takes about ~15s to return aResourceNotFoundError
when given a non-existent nsg name:And we have to provide non-existent nsg name at most twice with two legacy nsg names, which adds ~30s, to provision VM. So, I instead kept the listing method to iterate through the NSG list to find a matching name among our legacy and current nsg names. This is done in a separate method,
_get_cluster_nsg
, so it also resolves your concern on readability.Also, following your comment, I created a method to create NSG name with our current naming convention and used it within
config.py::bootstrap_instances
andinstance.py::_get_cluster_nsg
, which is called byinstance.py::open_ports
.These are done at befcbc1