Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Azure] Fix to sync NSG status while opening ports #3844

Merged
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
120 changes: 72 additions & 48 deletions sky/provision/azure/instance.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,9 @@

_RESOURCE_GROUP_NOT_FOUND_ERROR_MESSAGE = 'ResourceGroupNotFound'
_POLL_INTERVAL = 1
# Naming convention we use to create Network Security Group for provisioned
# VM at our ARM template, sky/provision/azure/azure-config-template.json.
_NSG_NAME = 'sky-{cluster_name_on_cloud}-nsg'


class AzureInstanceStatus(enum.Enum):
Expand Down Expand Up @@ -714,62 +717,83 @@ def open_ports(
subscription_id = provider_config['subscription_id']
resource_group = provider_config['resource_group']
network_client = azure.get_client('network', subscription_id)
nsg_name = _NSG_NAME.format(cluster_name_on_cloud=cluster_name_on_cloud)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we avoid assuming the nsg name for a cluster to be more robust to future changes, such #3845? We can just check all the NSG in the same resource group as we are using one resource group per cluster.

By doing that, let's keep the while and for loop in the same order as the original.

Copy link
Collaborator Author

@landscapepainter landscapepainter Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Michaelvll Thanks! I have two questions to address:

  1. If I understand correctly, it doesn't seem like we can keep the original order of while and for loop. It seems necessary to call list_network_security_groups(resource_group) within the while loop (while not nsg_created:) so that the NSG status is updated on each iteration. This ensures that the check if nsg.provisioning_state not in ['Creating', 'Updating']: accurately reflects the latest state within the for loop (for nsg in nsg_list:). However, I might be overlooking something. Please let me know if that’s the case.

  2. I can remove the check for specific nsg name from this PR, but this was also intended to be used for [Azure] Allow resource group specifiation for Azure instance provisioning #3764 as well. As you mentioned, we currently use one resource group per cluster, but this changes when resource group specification is used with sky serve. Serve controller and all the replicas would be created under the user specified resource group, and in this case, there would be multiple NSGs under the resource group. This was causing issues as opening ports while launching replica 4 was checking on NSG creation for replica 5 and got stuck.

    If losing track of the NSG name for the cluster throughout the launch cycle is your concern, we can keep on track of the NSG name while passing the name as a parameter to our ARM template instead of directly creating it from the ARM("nsgName": "[concat('sky-', parameters('clusterId'), '-nsg')]") and perhaps have that stored in provider_config as well, so we can read from it instead of using _NSG_NAME.format().

    2-1. Is it possible that we may need multiple NSGs for a single cluster in near future? We don't seem to do this now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 1, can we check if there is an Azure API that get a single nsg instead of listing all of them, so that we can keep getting that nsg within the while loop? The current loop order is unintuitive, looping through the nsg list within a retry loop. It would be more readable to do the retry loop for each NSG (although we only select a single NSG).

For 2, it makes sense when a resource group can be specified. In that case, how about we just create a function in azure/config.py that generates the NSG name, so it can be reused in both config.py and here? We have to be careful with backward compatibility then, if we do the exact match of the NSG. How about we allow a list of NSG names to support legacy names, such as ray-{clusterId}-... and the one that will be added with #3848

Copy link
Collaborator Author

@landscapepainter landscapepainter Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Michaelvll Following your review, I found an API method .get to directly retrieve the nsg object. But there's an issue with using this with multiple possible NSG names(legacy names). Attempting to get an nsg instance with the following snippet takes about ~15s to return a ResourceNotFoundError when given a non-existent nsg name:

nsg = network_client.network_security_groups.get(resource_group_name, nsg_name)

And we have to provide non-existent nsg name at most twice with two legacy nsg names, which adds ~30s, to provision VM. So, I instead kept the listing method to iterate through the NSG list to find a matching name among our legacy and current nsg names. This is done in a separate method, _get_cluster_nsg, so it also resolves your concern on readability.

Also, following your comment, I created a method to create NSG name with our current naming convention and used it within config.py::bootstrap_instances and instance.py::_get_cluster_nsg, which is called by instance.py::open_ports.

These are done at befcbc1


update_network_security_groups = _get_azure_sdk_function(
client=network_client.network_security_groups,
function_name='create_or_update')
list_network_security_groups = _get_azure_sdk_function(
client=network_client.network_security_groups, function_name='list')
for nsg in list_network_security_groups(resource_group):
try:
# Wait the NSG creation to be finished before opening a port. The
# cluster provisioning triggers the NSG creation, but it may not be
# finished yet.
backoff = common_utils.Backoff(max_backoff_factor=1)
start_time = time.time()
while True:
if nsg.provisioning_state not in ['Creating', 'Updating']:
break
if time.time() - start_time > _WAIT_CREATION_TIMEOUT_SECONDS:
logger.warning(
f'Fails to wait for the creation of NSG {nsg.name} in '
f'{resource_group} within '
f'{_WAIT_CREATION_TIMEOUT_SECONDS} seconds. '
'Skip this NSG.')
backoff_time = backoff.current_backoff()
logger.info(f'NSG {nsg.name} is not created yet. Waiting for '
f'{backoff_time} seconds before checking again.')
time.sleep(backoff_time)

# Azure NSG rules have a priority field that determines the order
# in which they are applied. The priority must be unique across
# all inbound rules in one NSG.
priority = max(rule.priority
for rule in nsg.security_rules
if rule.direction == 'Inbound') + 1
nsg.security_rules.append(
azure.create_security_rule(
name=f'sky-ports-{cluster_name_on_cloud}-{priority}',
priority=priority,
protocol='Tcp',
access='Allow',
direction='Inbound',
source_address_prefix='*',
source_port_range='*',
destination_address_prefix='*',
destination_port_ranges=ports,
))
poller = update_network_security_groups(resource_group, nsg.name,
nsg)
poller.wait()
if poller.status() != 'Succeeded':

try:
# Wait for the NSG creation to be finished before opening a port. The
# cluster provisioning triggers the NSG creation, but it may not be
# finished yet.
nsg_to_open_ports = None
nsg_created = False
backoff = common_utils.Backoff(max_backoff_factor=1)
start_time = time.time()
while not nsg_created:
if time.time() - start_time > _WAIT_CREATION_TIMEOUT_SECONDS:
with ux_utils.print_exception_no_traceback():
raise ValueError(f'Failed to open ports {ports} in NSG '
f'{nsg.name}: {poller.status()}')
except azure.exceptions().HttpResponseError as e:
raise TimeoutError(
f'Timed out while waiting for the Network '
f'Security Group {nsg_name!r} to be ready for '
f'cluster {cluster_name_on_cloud!r} in '
f'resource group {resource_group!r}. The NSG '
f'did not reach a stable state '
'(Creating/Updating) within the allocated '
f'{_WAIT_CREATION_TIMEOUT_SECONDS} seconds. '
'Consequently, the operation to open ports '
f'{ports} failed.')

nsg_list = list_network_security_groups(resource_group)
for nsg in nsg_list:
if nsg_name == nsg.name:
if nsg.provisioning_state not in ['Creating', 'Updating']:
nsg_to_open_ports = nsg
nsg_created = True
break

backoff_time = backoff.current_backoff()
logger.info(
f'NSG {nsg_name} is not created yet. Waiting for '
f'{backoff_time} seconds before checking again.')
time.sleep(backoff_time)

# Azure NSG rules have a priority field that determines the order
# in which they are applied. The priority must be unique across
# all inbound rules in one NSG.
assert nsg_to_open_ports is not None
priority = max(rule.priority
for rule in nsg_to_open_ports.security_rules
if rule.direction == 'Inbound') + 1
nsg_to_open_ports.security_rules.append(
azure.create_security_rule(
name=f'sky-ports-{cluster_name_on_cloud}-{priority}',
priority=priority,
protocol='Tcp',
access='Allow',
direction='Inbound',
source_address_prefix='*',
source_port_range='*',
destination_address_prefix='*',
destination_port_ranges=ports,
))
poller = update_network_security_groups(resource_group,
nsg_to_open_ports.name,
nsg_to_open_ports)
poller.wait()
if poller.status() != 'Succeeded':
with ux_utils.print_exception_no_traceback():
raise ValueError(
f'Failed to open ports {ports} in NSG {nsg.name}.') from e
raise ValueError(f'Failed to open ports {ports} in NSG '
f'{nsg_to_open_ports.name}: {poller.status()}')

except azure.exceptions().HttpResponseError as e:
with ux_utils.print_exception_no_traceback():
raise ValueError(f'Failed to open ports {ports} in NSG for cluster '
f'{cluster_name_on_cloud!r} within resource group '
f'{resource_group!r}.') from e


def cleanup_ports(
Expand Down
Loading