-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Mods to run from Modal #3908
base: master
Are you sure you want to change the base?
Mods to run from Modal #3908
Changes from 17 commits
d69848a
df30ac7
8d93356
f27df54
5be67a7
df2bc33
b69c583
7c57122
dab6ab3
d1b7b67
fb1b621
9395450
03d9c32
1805834
c60a5b5
a24b2df
5de6aee
e278cab
272f6ef
f234c5b
0ff8ba1
4246672
ed4570f
6ab5f31
0c8205f
c4c7c2c
a9fff3c
ff9b36a
7aa32ee
a22c884
48fe52b
6120952
63c7eb1
36dcbcb
f288ebe
5d04ba8
e1793be
468cf51
742875a
3613c38
7a1bdc3
dfd6e33
0386507
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -6,6 +6,7 @@ | |||
import shlex | ||||
import time | ||||
from typing import Any, Callable, Iterable, List, Optional, Tuple, Type, Union | ||||
import tempfile | ||||
|
||||
from sky import sky_logging | ||||
from sky.skylet import constants | ||||
|
@@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]: | |||
if ssh_control_filename is None: | ||||
return None | ||||
user_hash = common_utils.get_user_hash() | ||||
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}' | ||||
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}' | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is it possible to use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems control master is only supported on some specific file system types and that seems to be why it fails on modal's container. I am thinking that we should check the file system type to decide whether we should use control master. Changing the path to def _file_system_allow_ssh_control_master() -> bool:
file_system_type = None
partitions = psutil.disk_partitions()
for partition in partitions:
if '/' in partition.mountpoint and file_system_type is None:
file_system_type = partition.fstype
if '/tmp' in partition.mountpoint:
file_system_type = partition.fstype
break
# Allow unknown file system types, as well as ext4, xfs, btrfs, and apfs.
if file_system_type in ['ext4', 'xfs', 'btrfs', 'apfs', None]:
return True
return False There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think this makes sense. I'll check on my end to make sure it detects the 9p filesystem correctly. My only concern is losing some performance benefits of the ssh control master. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It doesn't look like your function captures the 9p filesystem for import subprocess
def is_tmp_9p_filesystem():
try:
# Use 'df' command to get filesystem information for /tmp
result = subprocess.run(['df', '-T', '/tmp'], capture_output=True, text=True)
if result.returncode != 0:
raise subprocess.CalledProcessError(result.returncode, result.args, result.stdout, result.stderr)
# Split the output into lines and get the second line (which contains the filesystem info)
filesystem_info = result.stdout.strip().split('\n')[1]
# The second column of this line is the filesystem type
filesystem_type = filesystem_info.split()[1]
# Check if the filesystem type is '9p'
return filesystem_type.lower() == '9p'
except subprocess.CalledProcessError as e:
print(f"Error running 'df' command: {e}")
except Exception as e:
print(f"Unexpected error: {e}")
return False If it's True, then we disable |
||||
os.makedirs(path, exist_ok=True) | ||||
return path | ||||
|
||||
|
@@ -401,6 +402,7 @@ def __init__( | |||
ssh_proxy_command: Optional[str] = None, | ||||
docker_user: Optional[str] = None, | ||||
disable_control_master: Optional[bool] = False, | ||||
# disable_control_master: Optional[bool] = True, | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Let's remove this unused comment : ) |
||||
): | ||||
"""Initialize SSHCommandRunner. | ||||
|
||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems indicating that the temp folder we are using in the code is not available for writing wheels to. Can we debug a bit for why the
/tmp
folder is not working and whether it will work withtempfile.gettempdir()
instead of using the hardcoded/tmp
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I revert to the original code, the error I'm getting is:
shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists
. So I think it's actually that the wheel is persisting, andshutil.move
fails to overwrite if it exists. Is it atypical that the wheel would already be found at the destination?I think there's a number of ways to fix this that are relatively harmless and idempotent. This could be checking if the file already exists and skipping (as I did here), or running a
shutil
method that overwrites instead. Just let me know what you're preference is.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, this is weird. The original code's destination,
wheel_dir
, is a directory (supposedly to be/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835
) but the error message you shownshutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists
indicates thewheel_dir
somehow becomes the target file.IIUC, if the destination is a dir,
shutil
should correctly overwrite the file in it. Could you help check if that is the case in your case?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example printout of the wheel_path, and wheel_dir:
Wheel path: /tmp/tmp1ca8zhil/skypilot-1.0.0.dev0-py3-none-any.whl
Wheel dir: /root/.sky/wheels/6b3f8d33bac9d114a388c31be0e1998e
I think it's possible that the 9p filesystem is again doing something odd here. Also from observation, it starts to fail the second time (not first time) the wheel hash changes. I probably need to do more digging here myself but let me know if that's helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the reason this is failing seems simple. It turns out that
shutil.move
fails during a move across two filesystems. Since I'm moving the wheel from Modal'stmp
mount to my own.sky
mount, this makes sense. It seems like the best option is to useshutil.copy2
or check if the wheel already exists before attempting to move.