-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mods to run from Modal #3908
base: master
Are you sure you want to change the base?
Mods to run from Modal #3908
Conversation
@Michaelvll could you provide some guidance on how you'd like to integrate these changes if you're open to accepting them? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this PR @sethkimmel3! We should probably not remove the control master
for all the other cases, as that can significantly increase the time for executing commands on the remote cluster.
sky/backends/wheel_utils.py
Outdated
if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))): | ||
shutil.move(str(wheel_path), wheel_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems indicating that the temp folder we are using in the code is not available for writing wheels to. Can we debug a bit for why the /tmp
folder is not working and whether it will work with tempfile.gettempdir()
instead of using the hardcoded /tmp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I revert to the original code, the error I'm getting is: shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists
. So I think it's actually that the wheel is persisting, and shutil.move
fails to overwrite if it exists. Is it atypical that the wheel would already be found at the destination?
I think there's a number of ways to fix this that are relatively harmless and idempotent. This could be checking if the file already exists and skipping (as I did here), or running a shutil
method that overwrites instead. Just let me know what you're preference is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is weird. The original code's destination, wheel_dir
, is a directory (supposedly to be /root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835
) but the error message you shown shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists
indicates the wheel_dir
somehow becomes the target file.
IIUC, if the destination is a dir, shutil
should correctly overwrite the file in it. Could you help check if that is the case in your case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example printout of the wheel_path, and wheel_dir:
Wheel path: /tmp/tmp1ca8zhil/skypilot-1.0.0.dev0-py3-none-any.whl
Wheel dir: /root/.sky/wheels/6b3f8d33bac9d114a388c31be0e1998e
I think it's possible that the 9p filesystem is again doing something odd here. Also from observation, it starts to fail the second time (not first time) the wheel hash changes. I probably need to do more digging here myself but let me know if that's helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the reason this is failing seems simple. It turns out that shutil.move
fails during a move across two filesystems. Since I'm moving the wheel from Modal's tmp
mount to my own .sky
mount, this makes sense. It seems like the best option is to use shutil.copy2
or check if the wheel already exists before attempting to move.
@Michaelvll I agree changing the default behavior of the control master isn't a good idea. However, modifying the behavior seems critical in this case. Do you see a good option to parameterize this, potentially in |
Another finding @Michaelvll - if I set the ssh control path to I believe this is due to Modal using a 9p filesystem under the hood, which seems to not interface well with the symlink behavior required to use ssh control master. Again, I'm not sure if this would be an acceptable change, or something that could be configurable by the user. |
sky/utils/command_runner.py
Outdated
@@ -401,6 +402,7 @@ def __init__( | |||
ssh_proxy_command: Optional[str] = None, | |||
docker_user: Optional[str] = None, | |||
disable_control_master: Optional[bool] = False, | |||
# disable_control_master: Optional[bool] = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# disable_control_master: Optional[bool] = True, |
Let's remove this unused comment : )
@@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]: | |||
if ssh_control_filename is None: | |||
return None | |||
user_hash = common_utils.get_user_hash() | |||
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}' | |||
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use tempfile.gettempdir()
instead of /dev/shm/
to be more general?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like tempfile.gettempdir()
returns /tmp
when run from a Modal container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems control master is only supported on some specific file system types and that seems to be why it fails on modal's container. I am thinking that we should check the file system type to decide whether we should use control master. Changing the path to /dev/shm/
feels a bit losing generality for other systems, for example
def _file_system_allow_ssh_control_master() -> bool:
file_system_type = None
partitions = psutil.disk_partitions()
for partition in partitions:
if '/' in partition.mountpoint and file_system_type is None:
file_system_type = partition.fstype
if '/tmp' in partition.mountpoint:
file_system_type = partition.fstype
break
# Allow unknown file system types, as well as ext4, xfs, btrfs, and apfs.
if file_system_type in ['ext4', 'xfs', 'btrfs', 'apfs', None]:
return True
return False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense. I'll check on my end to make sure it detects the 9p filesystem correctly. My only concern is losing some performance benefits of the ssh control master.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like your function captures the 9p filesystem for /tmp
. Since it's the only erroneous case that's specifically known as of now, would something like this work?
import subprocess
def is_tmp_9p_filesystem():
try:
# Use 'df' command to get filesystem information for /tmp
result = subprocess.run(['df', '-T', '/tmp'], capture_output=True, text=True)
if result.returncode != 0:
raise subprocess.CalledProcessError(result.returncode, result.args, result.stdout, result.stderr)
# Split the output into lines and get the second line (which contains the filesystem info)
filesystem_info = result.stdout.strip().split('\n')[1]
# The second column of this line is the filesystem type
filesystem_type = filesystem_info.split()[1]
# Check if the filesystem type is '9p'
return filesystem_type.lower() == '9p'
except subprocess.CalledProcessError as e:
print(f"Error running 'df' command: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
return False
If it's True, then we disable ssh_control_master
.
sky/backends/wheel_utils.py
Outdated
if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))): | ||
shutil.move(str(wheel_path), wheel_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is weird. The original code's destination, wheel_dir
, is a directory (supposedly to be /root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835
) but the error message you shown shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists
indicates the wheel_dir
somehow becomes the target file.
IIUC, if the destination is a dir, shutil
should correctly overwrite the file in it. Could you help check if that is the case in your case?
@Michaelvll here's where things stand:
|
sky/backends/wheel_utils.py
Outdated
shutil.copy2(str(wheel_path), wheel_dir) | ||
# if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))): | ||
# shutil.move(str(wheel_path), wheel_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change mainly caused by sharing the ~/.sky
folder across multiple functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it's because we're moving the wheel from a 9p
to tmpfs
type filesystem (or vice versa).
@@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]: | |||
if ssh_control_filename is None: | |||
return None | |||
user_hash = common_utils.get_user_hash() | |||
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}' | |||
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems control master is only supported on some specific file system types and that seems to be why it fails on modal's container. I am thinking that we should check the file system type to decide whether we should use control master. Changing the path to /dev/shm/
feels a bit losing generality for other systems, for example
def _file_system_allow_ssh_control_master() -> bool:
file_system_type = None
partitions = psutil.disk_partitions()
for partition in partitions:
if '/' in partition.mountpoint and file_system_type is None:
file_system_type = partition.fstype
if '/tmp' in partition.mountpoint:
file_system_type = partition.fstype
break
# Allow unknown file system types, as well as ext4, xfs, btrfs, and apfs.
if file_system_type in ['ext4', 'xfs', 'btrfs', 'apfs', None]:
return True
return False
@Michaelvll - I proposed an implementation here. Let me know what you think! |
The small changes in this PR were necessary to successfully deploy Skypilot from Modal. I believe that the root cause issue is a specialized filesystem that Modal uses and mounts to containers for faster provisioning and startup time.
I've marked it as a draft for now, as I'm assuming we want to let a user add a configuration, perhaps to the
config.yaml
advanced configurations to disable the ssh control master. There is an example inbackend_utils.py
that disables it for kubernetes, but I don't see a more general pattern to allow such an override:skypilot/sky/backends/backend_utils.py
Line 1273 in e3b9763
I believe the other change is relatively benign, but I'd be open to other suggestions to fix this too.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh