Default to ubuntu for GCP and avoid key pair checking #1641
Conversation
The deep learning ubuntu image on GCP does not work well, as the ssh connection will be disconnected when the VM is first booted. Here are several things we have tried:
Due to the investigation above, we decided to defer this PR and probably host our own image in the future instead (e.g. we can create an image based on the ubuntu-os-cloud).
Just figured out that the interruption of the ssh was caused by the async installation of the
Force-pushed from 8d67b5f to 8aed703. Commits: use ubuntu; fix; Disable control master for GCP; fix test; no control master and longer alive interval; retry; run command runner with retry; add sleep; retry for ray status; Longer timeout; avoid control persist; format; set ClientAliveInterval; revert command runner; revert command_runner; fix ssh disconnection issue; format; fallback to debian for K80; fix k80; Backward compatibility for old debian based images; minor fix; Fix TPU VM; add clientaliveinterval; add markers for specific clouds; uninstall before install; uninstall -y; azure uninstall; longer wait time; remove apt related setting; fix ssh name
Thanks @Michaelvll! Still need to read the PR. But I just tried it out with a previously launched & autostopped GCP controller, and got:
```
» sky spot launch echo hi --cloud gcp
...
I 05-14 09:33:18 cloud_vm_ray_backend.py:1033] Cluster 'sky-spot-controller-8a3968f2' (status: STOPPED) was previously launched in GCP us-central1. Relaunching in that region.
...
  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/backends/cloud_vm_ray_backend.py", line 1259, in _retry_zones
    keep_launch_fields_in_existing_config=cluster_exists)
  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/backends/backend_utils.py", line 797, in write_cluster_config
    to_provision, region, zones)
  File "/Users/zongheng/Dropbox/workspace/riselab/sky-computing/sky/clouds/gcp.py", line 356, in make_deploy_resources_variables
    assert image_id is not None, (image_id, r)
AssertionError: (None, GCP(n2-standard-4, disk_size=50))
```
sky/templates/aws-ray.yml.j2
Outdated
```diff
@@ -177,7 +177,7 @@ setup_commands:
   (type -a pip | grep -q pip3) || echo 'alias pip=pip3' >> ~/.bashrc;
   (which conda > /dev/null 2>&1 && conda init > /dev/null) || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh -b && eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true);
   source ~/.bashrc;
-  (pip3 list | grep ray | grep {{ray_version}} 2>&1 > /dev/null || pip3 install -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app;
+  (pip3 list | grep ray | grep {{ray_version}} 2>&1 > /dev/null || pip3 uninstall -y ray ray-cpp && pip3 install -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app;
```
Q: since we do a `grep ray ... || ...`, this means `ray` is not found. Why is the `pip3 uninstall` necessary? Why add `ray-cpp`?
We use the `grep {{ray_version}}` to check that the correct ray version is installed on the remote VM. It is possible that ray with another version is pre-installed on the VM. For example, the ubuntu image on GCP has `ray==2.4.0` pre-installed, which will cause a `pip` problem if we directly `pip install -U ray[default]==2.0.1` (leaving the ray package corrupted).
Nice! Can we copy this comment into the j2 files? Something like "Ensure only one Ray version (which is our ray_version) is installed, regardless of if the image comes pre-installed with another Ray version."
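For concreteness, here is a minimal standalone sketch of that guard with such a comment baked in. The version pin is an assumption for illustration; the actual templates substitute `{{ray_version}}`:

```bash
RAY_VERSION="2.0.1"  # illustrative pin; the j2 templates substitute {{ray_version}}

# Ensure only one Ray version (our ray_version) is installed, regardless of
# whether the image comes pre-installed with another Ray version.
if ! pip3 list | grep ray | grep -q "${RAY_VERSION}"; then
    # Drop any pre-installed ray/ray-cpp first so pip does not mix two versions.
    pip3 uninstall -y ray ray-cpp
    pip3 install -U "ray[default]==${RAY_VERSION}"
fi
```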
sky/clouds/gcp.py
Outdated
# We use the debian image, as the ubuntu image has some connectivity
# issue when first booted.
remnant?
Since we are using the ubuntu image now, we use `grep ubuntu-2004` to find the correct images.
Sorry, meant that
# We use the debian image, as the ubuntu image has some connectivity
# issue when first booted.
could be removed.
Ahh, good catch! Removed. Thanks!
Thanks @concretevitamin! This PR requires the PR adding the ubuntu image to the GCP catalog, skypilot-org/skypilot-catalog#25. I am planning to increase the version number of the catalog, to make sure users automatically download the latest catalog and avoid the issue of a stale catalog.
Some comments; haven't got the chance to test it yet due to the V6 catalog.
sky/clouds/gcp.py
Outdated
```python
    if image_id.startswith('skypilot:'):
        return DEFAULT_GCP_IMAGE_GB

@classmethod
def get_image_infos(cls, image_id) -> Dict[str, Any]:
```
nit: get_image_info() and s/infos/info elsewhere?
```python
def get_image_size(self, image_id: str, region: Optional[str]) -> float:
    del region  # unused
    if image_id.startswith('skypilot:'):
        return DEFAULT_GCP_IMAGE_GB
```
Q: how do we guarantee that the ubuntu & debian tags have the same size, DEFAULT_GCP_IMAGE_GB?
The image size can be obtained using `gcloud compute images describe projects/deeplearning-platform-release/global/images/common-cu113-v20230501-ubuntu-2004-py37`, and it seems both of them have the same size of 50 GB. Added a comment for the hack.
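For reference, the size field can be pulled out directly with a `--format` projection (a standard `gcloud` flag; the image path is the one quoted above):

```bash
# Print just the diskSizeGb field (in GB) of the image.
gcloud compute images describe \
    projects/deeplearning-platform-release/global/images/common-cu113-v20230501-ubuntu-2004-py37 \
    --format='value(diskSizeGb)'
```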
sky/clouds/gcp.py
Outdated
```python
if 'diskSizeGb' not in image_infos:
    return DEFAULT_GCP_IMAGE_GB
```
I'm reading get_image_size() for the first time and this seems a bit unclear. If a user passes in a custom image that does not have diskSizeGb, returning a default 50 GB seems like a guess (maybe ok), rather than what the func name suggests.
All the images should have that field; this is just a safeguard to keep the function from raising an error, as the image size check is not critical. Added a comment.
sky/templates/gcp-ray.yml.j2
Outdated
```diff
@@ -170,8 +218,8 @@ setup_commands:
   test -f /home/gcpuser/miniconda3/etc/profile.d/conda.sh && source /home/gcpuser/miniconda3/etc/profile.d/conda.sh && conda activate base || true;
   pip3 install --upgrade google-api-python-client;
 {%- endif %}
-  (pip3 list | grep ray | grep {{ray_version}} 2>&1 > /dev/null || pip3 install -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app;
   (pip3 list | grep skypilot && [ "$(cat {{sky_remote_path}}/current_sky_wheel_hash)" == "{{sky_wheel_hash}}" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo {{sky_remote_path}}/{{sky_wheel_hash}}/skypilot-{{sky_version}}*.whl)[gcp]" && echo "{{sky_wheel_hash}}" > {{sky_remote_path}}/current_sky_wheel_hash || exit 1);
+  (pip3 list | grep ray | grep {{ray_version}} 2>&1 > /dev/null && python3 -c "import ray" || pip3 uninstall -y ray ray-cpp && pip3 install -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app;
```
Q: GCP j2 has an extra `python3 -c "import ray"` vs. other j2's. Any reason?
That is because the reboot caused by the `nvidia-driver` installation can interrupt `pip install` mid-execution, leading to a corrupted installation of `ray`. The `python3 -c "import ray"` is there to check the integrity of the ray package installation. Added a comment above. Thanks!
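In other words, a version match in `pip3 list` is treated as necessary but not sufficient. Sketched out standalone (version pin assumed, as above):

```bash
RAY_VERSION="2.0.1"  # illustrative pin; the template substitutes {{ray_version}}

# A matching entry in `pip3 list` can survive an install that the reboot
# interrupted, so also verify that the package actually imports.
if pip3 list | grep "ray " | grep -q "${RAY_VERSION}" && python3 -c "import ray"; then
    echo "ray ${RAY_VERSION} is present and importable; skipping reinstall"
else
    pip3 uninstall -y ray ray-cpp
    pip3 install -U "ray[default]==${RAY_VERSION}"
fi
```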
sky/templates/gcp-ray.yml.j2
Outdated
```
# Magic number for waiting the nvidia driver to be ready and the instance
# to be rebooted.
sleep 18
```
Suggested change:
```diff
-# Magic number for waiting the nvidia driver to be ready and the instance
+# Magic number for waiting for the nvidia driver to be ready and the instance
 # to be rebooted.
 sleep 18
```
By magic number, do you mean it's arbitrarily chosen?
Ahh, sorry for the confusion. The number was chosen by binary search: the smallest number that makes sure the ssh connection issue does not happen for the later commands.
```
head_start_ray_commands:
  # Start skylet daemon. (Should not place it in the head_setup_commands, otherwise it will run before sky is installed.)
  # NOTE: --disable-usage-stats in `ray start` saves 10 seconds of idle wait.
  # Line "which prlimit ..": increase the limit of the number of open files for the raylet process, as the `ulimit` may not take effect at this point, because it requires
  # all the sessions to be reloaded. This is a workaround.
  {%- if gpu is not none %}
  - |
    echo "Installing NVIDIA GPU driver." >> ~/.sky/nvlog
```
Is the reason we need to do this some peculiarities of the ubuntu image?
Yes, the reason we need this is that for the ubuntu image, after the nvidia driver is installed the machine is rebooted, causing the ssh connection issue for the later user script and the "ray cluster not existing" problem. We have this here to make sure the reboot has finished before we proceed to the next stages.
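A hypothetical way to make that wait explicit, rather than the tuned fixed sleep the PR actually uses, would be to poll the driver from the next stage's script (`nvidia-smi` and `seq` are standard tools; the bound is illustrative):

```bash
# Hypothetical polling guard (not the PR's code, which uses a tuned sleep):
# wait up to ~5 minutes for the driver to answer after the post-install reboot.
for _ in $(seq 60); do
    nvidia-smi > /dev/null 2>&1 && break
    sleep 5
done
```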
Just realized I have some pending comments. Will merge the master branch and test this PR with the smoke tests again: `pytest tests/test_smoke.py`
```
{%- if gpu is not none %}
{%- if 'tesla-k80' in gpu %}
```
Ahh, good catch! It should be `if... else`. Will run the tests again.
sky/templates/gcp-ray.yml.j2
Outdated
```
- path: /etc/ssh/sshd_config
  append: true
  content: |
    ClientAliveInterval 720
    ClientAliveCountMax 720
```
Ahh, that is a remnant. Just tested it again without it, and it seems to work well. Removed. Thanks!
Thanks for adding this feature! 🚀 Just went through the code and it looks good to me except for some minor details. Left some comments here:
```python
    SSHCommandRunner,
)


class SSHCommandRunnerWithRetry(SSHCommandRunner):
```
Maybe consider moving this to sky/skylet/providers/command_runner.py after #1910 is merged?
```python
if _image_df[_image_df['Tag'] == 'skypilot:cpu-ubuntu-2004'].empty:
    # Update the image catalog if it does not include the updated images
    # https://github.com/skypilot-org/skypilot-catalog/pull/25.
    _image_df = common.read_catalog('gcp/images.csv', pull_frequency_hours=0)
```
Just to make sure, after increasing our version number, we don't need this right?
```diff
@@ -187,7 +186,7 @@ setup_commands:
   (type -a pip | grep -q pip3) || echo 'alias pip=pip3' >> ~/.bashrc;
   (which conda > /dev/null 2>&1 && conda init > /dev/null) || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh -b && eval "$(~/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true);
   source ~/.bashrc;
-  (pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 install --exists-action w -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app;
+  (pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 uninstall -y ray ray-cpp && pip3 install --exists-action w -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app;
```
Shall we add a comment like sky/templates/gcp-ray.yml.j2, L192?
```diff
@@ -126,7 +126,7 @@ setup_commands:
   (type -a pip | grep -q pip3) || echo 'alias pip=pip3' >> ~/.bashrc;
   which conda > /dev/null 2>&1 || (wget -nc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh -b && eval "$(/home/azureuser/miniconda3/bin/conda shell.bash hook)" && conda init && conda config --set auto_activate_base true);
   source ~/.bashrc;
-  (pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 install --exists-action w -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful;
+  (pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 uninstall -y ray ray-cpp && pip3 install --exists-action w -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app && touch ~/.sudo_as_admin_successful;
```
Same as before
(pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null || pip3 install --exists-action w -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app; | ||
(pip3 list | grep "skypilot " && [ "$(cat {{sky_remote_path}}/current_sky_wheel_hash)" == "{{sky_wheel_hash}}" ]) || (pip3 uninstall skypilot -y; pip3 install "$(echo {{sky_remote_path}}/{{sky_wheel_hash}}/skypilot-{{sky_version}}*.whl)[gcp]" && echo "{{sky_wheel_hash}}" > {{sky_remote_path}}/current_sky_wheel_hash || exit 1); | ||
(pip3 list | grep "ray " | grep {{ray_version}} 2>&1 > /dev/null && python3 -c "import ray" || pip3 uninstall -y ray ray-cpp && pip3 install --exists-action w -U ray[default]=={{ray_version}}) && mkdir -p ~/sky_workdir && mkdir -p ~/.sky/sky_app; | ||
(pip3 list | grep "skypilot " && [ "$(cat {{sky_remote_path}}/current_sky_wheel_hash)" == "{{sky_wheel_hash}}" ]) && python3 -c "import sky" || (pip3 uninstall skypilot -y; pip3 install "$(echo {{sky_remote_path}}/{{sky_wheel_hash}}/skypilot-{{sky_version}}*.whl)[gcp]" && echo "{{sky_wheel_hash}}" > {{sky_remote_path}}/current_sky_wheel_hash || exit 1); |
Why do we need `import ray` here? IIUC ray with the correct version is installed in L211?
This PR is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This PR was closed because it has been stalled for 10 days with no activity.
Prerequisite: skypilot-org/skypilot-catalog#25

Main changes:
- Before the fix, the ssh connection would be disconnected when ray tries to set up the runtime dependencies, which is probably because the ssh connection is unstable when the cluster is just provisioned ([autoscaler][gcp] "ray up" on GCP isn't working, ray-project/ray#16539 (comment)). We added retry for the ssh commands executed by `ray up` (a bash sketch of the idea follows this description), which is ok since our setup commands are idempotent.
- Handle `apt install` via cloud-init (note: cloud-init is not installed on the debian image, so we could not do this for our previous debian image).
- This PR may cause some backward-compatibility issues for GCP users who created images from our previous debian-based image, as cloud-init is not installed on those images. ~~Trying to fix that now.~~ It should now be fixed.
- A good thing about the new image is that `ray==2.4.0` is installed by default in the image, so the provision time would be faster once #1734 is merged.

Tested (run the relevant ones):
- `sky launch --cloud gcp --gpus t4 test.yaml`
- `sky launch --cloud gcp --gpus k80 test.yaml`
- `sky launch -c test-new-image-1 --cloud gcp echo hi`
- `sky launch -c test-old-image --cloud gcp echo hi --image-id projects/deeplearning-platform-release/global/images/common-cpu-v20221215`
- `` for i in `seq 5`; do pytest tests/test_smoke.py::test_cancel_pytorch --terminate-on-failure; done `` (makes sure that the wait time is enough, i.e. the magic number in the `head_start_ray_commands`)
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py --aws`
- `bash tests/backward_comaptibility_tests.sh`
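As a minimal sketch of the retry idea above: this is illustrative only, not the PR's actual implementation (which is the `SSHCommandRunnerWithRetry` discussed in the review), and the host/user are placeholders. Retrying is safe precisely because the setup commands are idempotent:

```bash
# Retry an idempotent command a few times, since the ssh connection can drop
# while the freshly provisioned VM is still settling.
retry() {
    local attempts=3 i
    for i in $(seq "${attempts}"); do
        "$@" && return 0
        echo "Attempt ${i}/${attempts} failed: $*" >&2
        sleep 5
    done
    return 1
}

HEAD="ubuntu@1.2.3.4"  # placeholder head-node address
retry ssh -o ConnectTimeout=10 "${HEAD}" 'echo ok'
```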