GCP: Support for custom VPC. (#2764)
* GCP: Support for custom VPC.

* yapf

* format.sh

* Address comments.

* Update docs

* format

* comment
concretevitamin authored Nov 12, 2023
1 parent 31aa76c commit aecb2de
Showing 10 changed files with 224 additions and 41 deletions.
54 changes: 51 additions & 3 deletions docs/source/cloud-setup/cloud-permissions/gcp.rst
@@ -66,7 +66,7 @@ User
compute.firewalls.create
compute.firewalls.delete
compute.firewalls.get
compute.instances.create
compute.instances.create
compute.instances.delete
compute.instances.get
compute.instances.list
@@ -148,8 +148,8 @@ User

.. note::

The user created with the above minimal permissions will not be able to create service accounts to be assigned to SkyPilot instances.
The user created with the above minimal permissions will not be able to create service accounts to be assigned to SkyPilot instances.

The admin needs to follow the :ref:`instruction below <gcp-service-account-creation>` to create a service account to be shared by all users in the project.


@@ -182,3 +182,51 @@ Service Account
:align: center
:alt: Set Service Account Role


.. _gcp-minimum-firewall-rules:

Firewall Rules
~~~~~~~~~~~~~~~~~~~

By default, users do not need to set up any special firewall rules to start
using SkyPilot. If the default VPC does not satisfy the minimal required rules,
a new VPC ``skypilot-vpc`` with sufficient rules will be automatically created
and used.

However, if you manually set up and instruct SkyPilot to use a VPC (see
:ref:`here <config-yaml>`), ensure it has the following required firewall rules:

.. code-block:: python

    # Allow internal connections between SkyPilot VMs:
    #
    #   controller -> head node of a cluster
    #   head node of a cluster <-> worker node(s) of a cluster
    #
    # NOTE: these ports are more relaxed than absolute minimum, but the
    # sourceRanges restrict the traffic to internal IPs.
    {
        "direction": "INGRESS",
        "allowed": [
            {"IPProtocol": "tcp", "ports": ["0-65535"]},
            {"IPProtocol": "udp", "ports": ["0-65535"]},
        ],
        "sourceRanges": ["10.128.0.0/9"],
    },
    # Allow SSH connections from user machine(s)
    #
    # NOTE: This can be satisfied using the following relaxed sourceRanges
    # (0.0.0.0/0), but you can customize it if you want to restrict to certain
    # known public IPs (useful when using internal VPN or proxy solutions).
    {
        "direction": "INGRESS",
        "allowed": [
            {"IPProtocol": "tcp", "ports": ["22"]},
        ],
        "sourceRanges": ["0.0.0.0/0"],
    },

You can inspect and manage firewall rules at
``https://console.cloud.google.com/net-security/firewall-manager/firewall-policies/list?project=<your-project-id>``
or using any of GCP's SDKs.
39 changes: 33 additions & 6 deletions docs/source/reference/config.rst
@@ -57,7 +57,7 @@ Available fields and semantics:
# with this name (provisioner automatically looks for such regions).
# Regions without a VPC with this name will not be used to launch nodes.
#
# Default: None (use the default VPC in each region).
# Default: null (use the default VPC in each region).
vpc_name: skypilot-vpc
# Should instances be assigned private IPs only? (optional)
@@ -88,7 +88,7 @@ Available fields and semantics:
# and any SkyPilot nodes. (This option is not used between SkyPilot nodes,
# since they are behind the proxy / may not have such a proxy set up.)
#
# Optional; default: None.
# Optional; default: null.
### Format 1 ###
# A string; the same proxy command is used for all regions.
ssh_proxy_command: ssh -W %h:%p -i ~/.ssh/sky-key -o StrictHostKeyChecking=no ec2-user@<jump server public ip>
@@ -103,6 +103,24 @@ Available fields and semantics:
# Advanced GCP configurations (optional).
# Apply to all new instances but not existing ones.
gcp:
# VPC to use (optional).
#
# Default: null, which implies the following behavior. First, the VPC named
# 'default' is checked against minimal recommended firewall rules for
# SkyPilot to function. If it satisfies these rules, this VPC is used.
# Otherwise, a new VPC named 'skypilot-vpc' is automatically created with
# the minimal recommended firewall rules and will be used.
#
# If this field is set, SkyPilot will use the VPC with this name. Useful
# when users want to manually set up a VPC and precisely control its
# firewall rules. If no region restrictions are given, SkyPilot only
# provisions in regions for which a subnet of this VPC exists. An error is
# raised if a VPC with this name is not found. The VPC does not get modified
# in any way, except when opening ports (e.g., via `resources.ports`) in
# which case new firewall rules permitting public traffic to those ports
# will be added.
vpc_name: skypilot-vpc
# Reserved capacity (optional).
#
# The specific reservation to be considered when provisioning clusters on GCP.
@@ -117,15 +135,24 @@ Available fields and semantics:
# Advanced Kubernetes configurations (optional).
kubernetes:
# The networking mode for accessing SSH jump pod (optional).
# This must be either: 'nodeport' or 'portforward'. If not specified, defaults to 'portforward'.
#
# nodeport: Exposes the jump pod SSH service on a static port number on each Node, allowing external access to using <NodeIP>:<NodePort>. Using this mode requires opening multiple ports on nodes in the Kubernetes cluster.
# portforward: Uses `kubectl port-forward` to create a tunnel and directly access the jump pod SSH service in the Kubernetes cluster. Does not require opening ports the cluster nodes and is more secure. 'portforward' is used as default if 'networking' is not specified.
# This must be either: 'nodeport' or 'portforward'. If not specified,
# defaults to 'portforward'.
#
# nodeport: Exposes the jump pod SSH service on a static port number on each
# Node, allowing external access using <NodeIP>:<NodePort>. Using this
# mode requires opening multiple ports on nodes in the Kubernetes cluster.
#
# portforward: Uses `kubectl port-forward` to create a tunnel and directly
# access the jump pod SSH service in the Kubernetes cluster. Does not
# require opening ports on the cluster nodes and is more secure. 'portforward'
# is used as default if 'networking' is not specified.
networking: portforward
# Advanced OCI configurations (optional).
oci:
# A dict mapping region names to region-specific configurations, or `default` for the default configuration.
# A dict mapping region names to region-specific configurations, or
# `default` for the default configuration.
default:
# The OCID of the profile to use for launching instances (optional).
oci_config_profile: DEFAULT
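The ``gcp.vpc_name`` semantics documented in this file can be condensed into a small decision function. The following is only a sketch of the documented behavior, not the provisioner's code: ``resolve_gcp_vpc`` and its ``default_vpc_meets_rules`` parameter are hypothetical names, and the sketch leaves out the existence check and the actual creation of ``skypilot-vpc``.

```python
from typing import Optional

DEFAULT_VPC = "default"
SKYPILOT_VPC = "skypilot-vpc"


def resolve_gcp_vpc(configured_vpc_name: Optional[str],
                    default_vpc_meets_rules: bool) -> str:
    """Pick the VPC name according to the documented `gcp.vpc_name` rules.

    Hypothetical helper for illustration only.
    """
    if configured_vpc_name is not None:
        # User-specified VPC: used as-is, no firewall rule checks.
        return configured_vpc_name
    if default_vpc_meets_rules:
        # `vpc_name: null` and the 'default' VPC satisfies the rules.
        return DEFAULT_VPC
    # `vpc_name: null` and 'default' is insufficient: fall back to the
    # automatically created VPC.
    return SKYPILOT_VPC
```

Note that a user-specified name wins regardless of the state of the ``default`` VPC, which matches the statement that setting the field hands responsibility for firewall rules to the user.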
19 changes: 16 additions & 3 deletions sky/backends/backend_utils.py
@@ -1023,8 +1023,8 @@ def write_cluster_config(
'SKYPILOT_USER', '')),

# AWS only:
'vpc_name': skypilot_config.get_nested(('aws', 'vpc_name'),
None),
'aws_vpc_name': skypilot_config.get_nested(('aws', 'vpc_name'),
None),
'use_internal_ips': skypilot_config.get_nested(
('aws', 'use_internal_ips'), False),
# Not exactly AWS only, but we only test it's supported on AWS
@@ -1038,6 +1038,8 @@ def write_cluster_config(
'resource_group': f'{cluster_name}-{region_name}',

# GCP only:
'gcp_vpc_name': skypilot_config.get_nested(('gcp', 'vpc_name'),
None),
'gcp_project_id': gcp_project_id,
'specific_reservations': filtered_specific_reservations,
'num_specific_reserved_workers': num_specific_reserved_workers,
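To make the renamed template variables concrete, here is a minimal stand-in for the nested lookups in the two hunks above. The ``get_nested`` function below is a simplified, hypothetical re-implementation operating on a plain dict; it is not ``skypilot_config.get_nested`` itself, and the config values are made up.

```python
from typing import Any, Dict, Tuple


def get_nested(config: Dict[str, Any], keys: Tuple[str, ...],
               default: Any = None) -> Any:
    """Walk `keys` through nested dicts; return `default` if any is absent."""
    current: Any = config
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up user config with a per-cloud VPC name.
user_config = {
    "aws": {"vpc_name": "my-aws-vpc"},
    "gcp": {"vpc_name": "my-gcp-vpc"},
}

# Each cloud now gets its own template variable, so the AWS and GCP
# settings can no longer be confused with each other.
template_vars = {
    "aws_vpc_name": get_nested(user_config, ("aws", "vpc_name"), None),
    "gcp_vpc_name": get_nested(user_config, ("gcp", "vpc_name"), None),
    "use_internal_ips": get_nested(user_config, ("aws", "use_internal_ips"),
                                   False),
}
```

The rename from ``vpc_name`` to ``aws_vpc_name`` appears to exist precisely so that a second, GCP-specific key can sit next to it in the same template variable dict.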
@@ -1126,10 +1128,21 @@ def write_cluster_config(

user_file_dir = os.path.expanduser(f'{SKY_USER_FILE_PATH}/')

# We do not import the module under sky.skylet.providers globally as we
# need to avoid importing ray module (extras like skypilot[aws] has
# removed the Ray dependency).
# pylint: disable=import-outside-toplevel
from sky.skylet.providers.gcp import config as gcp_config
config = common_utils.read_yaml(os.path.expanduser(config_dict['ray']))
vpc_name = gcp_config.get_usable_vpc(config)
vpc_name = None
try:
vpc_name = gcp_config.get_usable_vpc(config)
except RuntimeError as e:
# A bootstrap-phase error was hit while launching a TPU; there is
# currently no point in failing over.
# TODO(zongheng): handle failover when multi-resource is added.
with ux_utils.print_exception_no_traceback():
raise e

scripts = []
for template_name in ('gcp-tpu-create.sh.j2', 'gcp-tpu-delete.sh.j2'):
37 changes: 29 additions & 8 deletions sky/backends/cloud_vm_ray_backend.py
@@ -664,16 +664,19 @@ def _update_blocklist_on_gcp_error(
self, launchable_resources: 'resources_lib.Resources',
region: 'clouds.Region', zones: Optional[List['clouds.Zone']],
stdout: str, stderr: str):

del region # unused
style = colorama.Style
assert zones and len(zones) == 1, zones
zone = zones[0]
splits = stderr.split('\n')
exception_list = [s for s in splits if s.startswith('Exception: ')]
httperror_str = [
s for s in splits
if s.startswith('googleapiclient.errors.HttpError: ')
# GCP API errors
if s.startswith('googleapiclient.errors.HttpError: ') or
# 'SKYPILOT_ERROR_NO_NODES_LAUNCHED': skypilot's changes to the
# underlying provisioner provider; for errors prior to provisioning
# like VPC setup.
'SKYPILOT_ERROR_NO_NODES_LAUNCHED: ' in s
]
if len(exception_list) == 1:
# Parse structured response {'errors': [...]}.
@@ -756,9 +759,21 @@ def _update_blocklist_on_gcp_error(
else:
assert False, error
elif len(httperror_str) >= 1:
logger.info(f'Got {httperror_str[0]}')
if ('Requested disk size cannot be smaller than the image size'
in httperror_str[0]):
messages = '\n\t'.join(httperror_str)
logger.warning(
f'Got error(s):\n\t{style.DIM}{messages}{style.RESET_ALL}')
if ('SKYPILOT_ERROR_NO_NODES_LAUNCHED: No VPC with name '
in stderr):
# User has specified a VPC that does not exist. On GCP, VPC is
# global. So we skip the entire cloud.
self._blocked_resources.add(
launchable_resources.copy(region=None, zone=None))
elif ('SKYPILOT_ERROR_NO_NODES_LAUNCHED: No subnet for region '
in stderr):
self._blocked_resources.add(
launchable_resources.copy(region=region.name, zone=None))
elif ('Requested disk size cannot be smaller than the image size'
in httperror_str[0]):
logger.info('Skipping all regions due to disk size issue.')
self._blocked_resources.add(
launchable_resources.copy(region=None, zone=None))
@@ -773,7 +788,6 @@ def _update_blocklist_on_gcp_error(
f'Details: {httperror_str[0]}')
self._blocked_resources.add(
launchable_resources.copy(region=None, zone=None))

else:
# Parse HttpError for unauthorized regions. Example:
# googleapiclient.errors.HttpError: <HttpError 403 when requesting ... returned "Location us-east1-d is not found or access is unauthorized.". # pylint: disable=line-too-long
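The marker-based branching added to ``_update_blocklist_on_gcp_error`` can be pictured with a standalone sketch. The functions ``error_lines`` and ``block_scope`` below are hypothetical and only mirror the string matching visible in the hunks above; the real method mutates ``self._blocked_resources`` instead of returning a value.

```python
from typing import List, Optional, Tuple

MARKER = "SKYPILOT_ERROR_NO_NODES_LAUNCHED: "
HTTP_ERROR = "googleapiclient.errors.HttpError: "


def error_lines(stderr: str) -> List[str]:
    """Keep GCP API errors and lines carrying the SkyPilot marker."""
    return [
        line for line in stderr.split("\n")
        if line.startswith(HTTP_ERROR) or MARKER in line
    ]


def block_scope(stderr: str,
                region: str) -> Optional[Tuple[str, Optional[str]]]:
    """Decide how much to block for a pre-provisioning VPC/subnet error.

    Returns ("cloud", None) when the named VPC does not exist (VPCs are
    global on GCP), ("region", region) when the VPC has no subnet in this
    region, and None when neither marker message is present.
    """
    if MARKER + "No VPC with name " in stderr:
        return ("cloud", None)
    if MARKER + "No subnet for region " in stderr:
        return ("region", region)
    return None
```

The asymmetry is the point of the change: a missing VPC rules out every region at once, whereas a missing subnet only rules out the region currently being tried, so failover can continue to other regions.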
@@ -4357,10 +4371,17 @@ def _check_existing_cluster(
# The cluster is recently terminated either by autostop or manually
# terminated on the cloud. We should use the previously terminated
# resources to provision the cluster.
#
# FIXME(zongheng): this assert can be hit by using two terminals.
# First, create a 'dbg' cluster. Then:
# Terminal 1: sky down dbg -y
# Terminal 2: sky launch -c dbg -- echo
# Run it in order. Terminal 2 will show this error after terminal 1
# succeeds in downing the cluster and releasing the lock.
assert isinstance(
handle_before_refresh, CloudVmRayResourceHandle), (
f'Trying to launch cluster {cluster_name!r} recently '
'terminated on the cloud, but the handle is not a '
'terminated on the cloud, but the handle is not a '
f'CloudVmRayResourceHandle ({handle_before_refresh}).')
status_before_refresh_str = None
if status_before_refresh is not None:
80 changes: 63 additions & 17 deletions sky/skylet/providers/gcp/config.py
@@ -71,6 +71,16 @@
# with ServiceAccounts.


def _skypilot_log_error_and_exit_for_failover(error: str) -> None:
"""Logs a message, then raises a specific RuntimeError to trigger failover.
Mainly used for handling VPC/subnet errors before nodes are launched.
"""
# NOTE: keep. The backend looks for this to know no nodes are launched.
prefix = "SKYPILOT_ERROR_NO_NODES_LAUNCHED: "
raise RuntimeError(prefix + error)


def get_node_type(node: dict) -> GCPNodeType:
"""Returns node type based on the keys in ``node``.
@@ -753,28 +763,56 @@ def _create_rules(config, compute, rules, VPC_NAME, PROJ_ID):
wait_for_compute_global_operation(config["provider"]["project_id"], op, compute)


def get_usable_vpc(config):
def get_usable_vpc(config) -> str:
"""Return a usable VPC.
If config['provider']['vpc_name'] is set, return the VPC with the name
(errors out if not found). When this field is set, no firewall rules
checking or overrides will take place; it is the user's responsibility to
properly set up the VPC.
If not found, create a new one with sufficient firewall rules.
Raises:
RuntimeError: if the user has specified a VPC name but the VPC is not found.
"""
_, _, compute, _ = construct_clients_from_provider_config(config["provider"])

# For backward compatibility, reuse the VPC if the VM is launched.
resource = GCPCompute(
compute,
config["provider"]["project_id"],
config["provider"]["availability_zone"],
config["cluster_name"],
)
node = resource._list_instances(label_filters=None, status_filter=None)
if len(node) > 0:
netInterfaces = node[0].get("networkInterfaces", [])
if len(netInterfaces) > 0:
vpc_name = netInterfaces[0]["network"].split("/")[-1]
return vpc_name

vpcnets_all = _list_vpcnets(config, compute)
specific_vpc_to_use = config["provider"].get("vpc_name", None)
if specific_vpc_to_use is None:
# For backward compatibility, reuse the VPC if the VM is launched.
resource = GCPCompute(
compute,
config["provider"]["project_id"],
config["provider"]["availability_zone"],
config["cluster_name"],
)
node = resource._list_instances(label_filters=None, status_filter=None)
if len(node) > 0:
netInterfaces = node[0].get("networkInterfaces", [])
if len(netInterfaces) > 0:
vpc_name = netInterfaces[0]["network"].split("/")[-1]
return vpc_name

vpcnets_all = _list_vpcnets(config, compute)
else:
vpcnets_all = _list_vpcnets(
config, compute, filter=f"name={specific_vpc_to_use}"
)
# On GCP, VPC names are unique, so it'd be 0 or 1 VPC found.
assert (
len(vpcnets_all) <= 1
), f"{len(vpcnets_all)} VPCs found with the same name {specific_vpc_to_use}"
if len(vpcnets_all) == 1:
# Skip checking any firewall rules if the user has specified a VPC.
logger.info(f"Using user-specified VPC {specific_vpc_to_use!r}.")
return specific_vpc_to_use
else:
# VPC with this name not found. Error out and let SkyPilot failover.
_skypilot_log_error_and_exit_for_failover(
f"No VPC with name {specific_vpc_to_use!r} is found. "
"To fix: specify a correct VPC name."
)
# Should not reach here.

usable_vpc_name = None
for vpc in vpcnets_all:
@@ -827,6 +865,14 @@ def _configure_subnet(config, compute):
# SkyPilot: make sure there's a usable VPC
usable_vpc_name = get_usable_vpc(config)
subnets = _list_subnets(config, compute, filter=f'(name="{usable_vpc_name}")')
if not subnets:
# This can happen when e.g., region A is specified but the VPC has no
# subnet in region A.
_skypilot_log_error_and_exit_for_failover(
f"No subnet for region {config['provider']['region']} found (VPC {usable_vpc_name!r}). "
f"Check the subnets of VPC {usable_vpc_name!r} at https://console.cloud.google.com/networking/networks"
)

default_subnet = subnets[0]

default_interfaces = [
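The subnet check added to ``_configure_subnet`` can be exercised in isolation with a sketch like the one below. It is an approximation under stated assumptions: ``pick_subnet`` and ``name_from_url`` are hypothetical helpers, the subnet dicts are made-up samples shaped like Compute Engine subnetwork resources (with ``network`` and ``region`` given as URLs), and the filtering happens in memory rather than through the API-side ``filter`` argument used in the diff.

```python
from typing import Any, Dict, List

PREFIX = "SKYPILOT_ERROR_NO_NODES_LAUNCHED: "


def name_from_url(url: str) -> str:
    """Return the last path component, e.g. the VPC or region name."""
    return url.split("/")[-1]


def pick_subnet(subnets: List[Dict[str, Any]], vpc_name: str,
                region: str) -> Dict[str, Any]:
    """Return the first subnet of `vpc_name` in `region`.

    Raises a RuntimeError carrying the failover marker if none exists,
    mirroring the message format in the diff above.
    """
    matching = [
        subnet for subnet in subnets
        if name_from_url(subnet["network"]) == vpc_name
        and name_from_url(subnet["region"]) == region
    ]
    if not matching:
        raise RuntimeError(
            f"{PREFIX}No subnet for region {region} found "
            f"(VPC {vpc_name!r}).")
    return matching[0]


# Made-up sample data; the project name is a placeholder.
BASE = "https://www.googleapis.com/compute/v1/projects/my-project"
sample_subnets = [{
    "name": "my-vpc",
    "network": f"{BASE}/global/networks/my-vpc",
    "region": f"{BASE}/regions/us-central1",
}]
```

Raising instead of indexing into an empty list turns what would have been an ``IndexError`` at ``subnets[0]`` into an error the backend can recognize by its prefix and use to move on to another region.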