Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mods to run from Modal #3908

Open
wants to merge 43 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
d69848a
Update command_runner.py
sethkimmel3 Sep 3, 2024
df30ac7
Update command_runner.py
sethkimmel3 Sep 3, 2024
8d93356
Update wheel_utils.py
sethkimmel3 Sep 3, 2024
f27df54
Update command_runner.py
sethkimmel3 Sep 3, 2024
5be67a7
Merge branch 'skypilot-org:master' into sethkimmel3-modify-ssh-contro…
sethkimmel3 Sep 5, 2024
df2bc33
Merge branch 'skypilot-org:master' into sethkimmel3-modify-ssh-contro…
sethkimmel3 Sep 9, 2024
b69c583
Merge branch 'skypilot-org:master' into sethkimmel3-modify-ssh-contro…
sethkimmel3 Sep 20, 2024
7c57122
Update wheel_utils.py
sethkimmel3 Sep 20, 2024
dab6ab3
Update wheel_utils.py
sethkimmel3 Sep 20, 2024
d1b7b67
Update command_runner.py
sethkimmel3 Sep 20, 2024
fb1b621
Update command_runner.py
sethkimmel3 Sep 20, 2024
9395450
Update command_runner.py
sethkimmel3 Sep 20, 2024
03d9c32
Update command_runner.py
sethkimmel3 Sep 20, 2024
1805834
Update command_runner.py
sethkimmel3 Sep 20, 2024
c60a5b5
Update command_runner.py
sethkimmel3 Sep 20, 2024
a24b2df
Update command_runner.py
sethkimmel3 Sep 20, 2024
5de6aee
Update command_runner.py
sethkimmel3 Sep 20, 2024
e278cab
Update command_runner.py
sethkimmel3 Sep 22, 2024
272f6ef
Update command_runner.py
sethkimmel3 Sep 22, 2024
f234c5b
Update wheel_utils.py
sethkimmel3 Sep 22, 2024
0ff8ba1
Update wheel_utils.py
sethkimmel3 Sep 22, 2024
4246672
Update wheel_utils.py
sethkimmel3 Sep 22, 2024
ed4570f
Update command_runner.py
sethkimmel3 Sep 23, 2024
6ab5f31
Update wheel_utils.py
sethkimmel3 Sep 23, 2024
0c8205f
Update wheel_utils.py
sethkimmel3 Sep 23, 2024
c4c7c2c
Update wheel_utils.py
sethkimmel3 Sep 23, 2024
a9fff3c
Update wheel_utils.py
sethkimmel3 Sep 23, 2024
ff9b36a
Update wheel_utils.py
sethkimmel3 Sep 23, 2024
7aa32ee
Update wheel_utils.py
sethkimmel3 Sep 23, 2024
a22c884
Merge branch 'skypilot-org:master' into sethkimmel3-modify-ssh-contro…
sethkimmel3 Sep 30, 2024
48fe52b
Merge branch 'skypilot-org:master' into sethkimmel3-modify-ssh-contro…
sethkimmel3 Oct 17, 2024
6120952
Create control_master_checks.py
sethkimmel3 Oct 27, 2024
63c7eb1
Update command_runner.py
sethkimmel3 Oct 27, 2024
36dcbcb
Update wheel_utils.py
sethkimmel3 Oct 27, 2024
f288ebe
Merge branch 'skypilot-org:master' into sethkimmel3-modify-ssh-contro…
sethkimmel3 Oct 27, 2024
5d04ba8
fix import
sethkimmel3 Oct 27, 2024
e1793be
fix some pylint stuff
sethkimmel3 Oct 27, 2024
468cf51
more pylint
sethkimmel3 Oct 28, 2024
742875a
more pylint
sethkimmel3 Oct 28, 2024
3613c38
more pylint
sethkimmel3 Oct 28, 2024
7a1bdc3
yapf'
sethkimmel3 Oct 28, 2024
dfd6e33
reorder imports
sethkimmel3 Oct 28, 2024
0386507
fix
sethkimmel3 Oct 28, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 3 additions & 1 deletion sky/backends/wheel_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -129,7 +129,9 @@ def _build_sky_wheel() -> pathlib.Path:

wheel_dir = WHEEL_DIR / hash_of_latest_wheel
wheel_dir.mkdir(parents=True, exist_ok=True)
shutil.move(str(wheel_path), wheel_dir)
# shutil.move(str(wheel_path), wheel_dir)
if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))):
shutil.move(str(wheel_path), wheel_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems indicating that the temp folder we are using in the code is not available for writing wheels to. Can we debug a bit for why the /tmp folder is not working and whether it will work with tempfile.gettempdir() instead of using the hardcoded /tmp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I revert to the original code, the error I'm getting is: shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists. So I think it's actually that the wheel is persisting, and shutil.move fails to overwrite if it exists. Is it atypical that the wheel would already be found at the destination?

I think there's a number of ways to fix this that are relatively harmless and idempotent. This could be checking if the file already exists and skipping (as I did here), or running a shutil method that overwrites instead. Just let me know what you're preference is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is weird. The original code's destination, wheel_dir, is a directory (supposedly to be /root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835) but the error message you shown shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists indicates the wheel_dir somehow becomes the target file.

IIUC, if the destination is a dir, shutil should correctly overwrite the file in it. Could you help check if that is the case in your case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example printout of the wheel_path, and wheel_dir:
Wheel path: /tmp/tmp1ca8zhil/skypilot-1.0.0.dev0-py3-none-any.whl
Wheel dir: /root/.sky/wheels/6b3f8d33bac9d114a388c31be0e1998e

I think it's possible that the 9p filesystem is again doing something odd here. Also from observation, it starts to fail the second time (not first time) the wheel hash changes. I probably need to do more digging here myself but let me know if that's helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason this is failing seems simple. It turns out that shutil.move fails during a move across two filesystems. Since I'm moving the wheel from Modal's tmp mount to my own .sky mount, this makes sense. It seems like the best option is to use shutil.copy2 or check if the wheel already exists before attempting to move.

return wheel_dir / wheel_path.name


Expand Down
4 changes: 3 additions & 1 deletion sky/utils/command_runner.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@
import shlex
import time
from typing import Any, Callable, Iterable, List, Optional, Tuple, Type, Union
import tempfile

from sky import sky_logging
from sky.skylet import constants
Expand Down Expand Up @@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]:
if ssh_control_filename is None:
return None
user_hash = common_utils.get_user_hash()
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}'
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use tempfile.gettempdir() instead of /dev/shm/ to be more general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like tempfile.gettempdir() returns /tmp when run from a Modal container.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems control master is only supported on some specific file system types and that seems to be why it fails on modal's container. I am thinking that we should check the file system type to decide whether we should use control master. Changing the path to /dev/shm/ feels a bit losing generality for other systems, for example

def _file_system_allow_ssh_control_master() -> bool:
    file_system_type = None
    partitions = psutil.disk_partitions()
    for partition in partitions:
        if '/' in partition.mountpoint and file_system_type is None:
            file_system_type = partition.fstype
        if '/tmp' in partition.mountpoint:
            file_system_type = partition.fstype
            break
    # Allow unknown file system types, as well as ext4, xfs, btrfs, and apfs.
    if file_system_type in ['ext4', 'xfs', 'btrfs', 'apfs', None]:
        return True
    return False

Copy link
Contributor Author

@sethkimmel3 sethkimmel3 Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense. I'll check on my end to make sure it detects the 9p filesystem correctly. My only concern is losing some performance benefits of the ssh control master.

Copy link
Contributor Author

@sethkimmel3 sethkimmel3 Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like your function captures the 9p filesystem for /tmp. Since it's the only erroneous case that's specifically known as of now, would something like this work?

import subprocess

def is_tmp_9p_filesystem():
    try:
        # Use 'df' command to get filesystem information for /tmp
        result = subprocess.run(['df', '-T', '/tmp'], capture_output=True, text=True)
        
        if result.returncode != 0:
            raise subprocess.CalledProcessError(result.returncode, result.args, result.stdout, result.stderr)

        # Split the output into lines and get the second line (which contains the filesystem info)
        filesystem_info = result.stdout.strip().split('\n')[1]
        
        # The second column of this line is the filesystem type
        filesystem_type = filesystem_info.split()[1]

        # Check if the filesystem type is '9p'
        return filesystem_type.lower() == '9p'

    except subprocess.CalledProcessError as e:
        print(f"Error running 'df' command: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    
    return False

If it's True, then we disable ssh_control_master.

os.makedirs(path, exist_ok=True)
return path

Expand Down Expand Up @@ -401,6 +402,7 @@ def __init__(
ssh_proxy_command: Optional[str] = None,
docker_user: Optional[str] = None,
disable_control_master: Optional[bool] = False,
# disable_control_master: Optional[bool] = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# disable_control_master: Optional[bool] = True,

Let's remove this unused comment : )

):
"""Initialize SSHCommandRunner.

Expand Down
Loading