Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mods to run from Modal #3908

Open
wants to merge 43 commits into
base: master
Choose a base branch
from

Conversation

sethkimmel3
Copy link
Contributor

@sethkimmel3 sethkimmel3 commented Sep 4, 2024

The small changes in this PR were necessary to successfully deploy Skypilot from Modal. I believe that the root cause issue is a specialized filesystem that Modal uses and mounts to containers for faster provisioning and startup time.

I've marked it as a draft for now, as I'm assuming we want to let a user add a configuration, perhaps to the config.yaml advanced configurations to disable the ssh control master. There is an example in backend_utils.py that disables it for kubernetes, but I don't see a more general pattern to allow such an override:

credentials['disable_control_master'] = True

I believe the other change is relatively benign, but I'd be open to other suggestions to fix this too.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@sethkimmel3 sethkimmel3 changed the title Sethkimmel3 modify ssh control path Mods to run from Modal Sep 4, 2024
@sethkimmel3
Copy link
Contributor Author

@Michaelvll could you provide some guidance on how you'd like to integrate these changes if you're open to accepting them?

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this PR @sethkimmel3! We should probably not remove the control master for all the other cases, as that can significantly increase the time for executing commands on the remote cluster.

Comment on lines 132 to 133
if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))):
shutil.move(str(wheel_path), wheel_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems indicating that the temp folder we are using in the code is not available for writing wheels to. Can we debug a bit for why the /tmp folder is not working and whether it will work with tempfile.gettempdir() instead of using the hardcoded /tmp.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I revert to the original code, the error I'm getting is: shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists. So I think it's actually that the wheel is persisting, and shutil.move fails to overwrite if it exists. Is it atypical that the wheel would already be found at the destination?

I think there's a number of ways to fix this that are relatively harmless and idempotent. This could be checking if the file already exists and skipping (as I did here), or running a shutil method that overwrites instead. Just let me know what you're preference is.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is weird. The original code's destination, wheel_dir, is a directory (supposedly to be /root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835) but the error message you shown shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists indicates the wheel_dir somehow becomes the target file.

IIUC, if the destination is a dir, shutil should correctly overwrite the file in it. Could you help check if that is the case in your case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's an example printout of the wheel_path, and wheel_dir:
Wheel path: /tmp/tmp1ca8zhil/skypilot-1.0.0.dev0-py3-none-any.whl
Wheel dir: /root/.sky/wheels/6b3f8d33bac9d114a388c31be0e1998e

I think it's possible that the 9p filesystem is again doing something odd here. Also from observation, it starts to fail the second time (not first time) the wheel hash changes. I probably need to do more digging here myself but let me know if that's helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reason this is failing seems simple. It turns out that shutil.move fails during a move across two filesystems. Since I'm moving the wheel from Modal's tmp mount to my own .sky mount, this makes sense. It seems like the best option is to use shutil.copy2 or check if the wheel already exists before attempting to move.

@sethkimmel3
Copy link
Contributor Author

@Michaelvll I agree changing the default behavior of the control master isn't a good idea. However, modifying the behavior seems critical in this case. Do you see a good option to parameterize this, potentially in config.yaml, so the user can change the behavior if necessary?

@sethkimmel3
Copy link
Contributor Author

Another finding @Michaelvll - if I set the ssh control path to f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}'rather than f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}', I can leave the default control master behavior.

I believe this is due to Modal using a 9p filesystem under the hood, which seems to not interface well with the symlink behavior required to use ssh control master. Again, I'm not sure if this would be an acceptable change, or something that could be configurable by the user.

@@ -401,6 +402,7 @@ def __init__(
ssh_proxy_command: Optional[str] = None,
docker_user: Optional[str] = None,
disable_control_master: Optional[bool] = False,
# disable_control_master: Optional[bool] = True,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# disable_control_master: Optional[bool] = True,

Let's remove this unused comment : )

@@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]:
if ssh_control_filename is None:
return None
user_hash = common_utils.get_user_hash()
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}'
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use tempfile.gettempdir() instead of /dev/shm/ to be more general?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like tempfile.gettempdir() returns /tmp when run from a Modal container.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems control master is only supported on some specific file system types and that seems to be why it fails on modal's container. I am thinking that we should check the file system type to decide whether we should use control master. Changing the path to /dev/shm/ feels a bit losing generality for other systems, for example

def _file_system_allow_ssh_control_master() -> bool:
    file_system_type = None
    partitions = psutil.disk_partitions()
    for partition in partitions:
        if '/' in partition.mountpoint and file_system_type is None:
            file_system_type = partition.fstype
        if '/tmp' in partition.mountpoint:
            file_system_type = partition.fstype
            break
    # Allow unknown file system types, as well as ext4, xfs, btrfs, and apfs.
    if file_system_type in ['ext4', 'xfs', 'btrfs', 'apfs', None]:
        return True
    return False

Copy link
Contributor Author

@sethkimmel3 sethkimmel3 Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense. I'll check on my end to make sure it detects the 9p filesystem correctly. My only concern is losing some performance benefits of the ssh control master.

Copy link
Contributor Author

@sethkimmel3 sethkimmel3 Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like your function captures the 9p filesystem for /tmp. Since it's the only erroneous case that's specifically known as of now, would something like this work?

import subprocess

def is_tmp_9p_filesystem():
    try:
        # Use 'df' command to get filesystem information for /tmp
        result = subprocess.run(['df', '-T', '/tmp'], capture_output=True, text=True)
        
        if result.returncode != 0:
            raise subprocess.CalledProcessError(result.returncode, result.args, result.stdout, result.stderr)

        # Split the output into lines and get the second line (which contains the filesystem info)
        filesystem_info = result.stdout.strip().split('\n')[1]
        
        # The second column of this line is the filesystem type
        filesystem_type = filesystem_info.split()[1]

        # Check if the filesystem type is '9p'
        return filesystem_type.lower() == '9p'

    except subprocess.CalledProcessError as e:
        print(f"Error running 'df' command: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    
    return False

If it's True, then we disable ssh_control_master.

Comment on lines 132 to 133
if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))):
shutil.move(str(wheel_path), wheel_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, this is weird. The original code's destination, wheel_dir, is a directory (supposedly to be /root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835) but the error message you shown shutil.Error: Destination path '/root/.sky/wheels/0b62cd7da6552f20e52d8a83687dd835/skypilot-1.0.0.dev0-py3-none-any.whl' already exists indicates the wheel_dir somehow becomes the target file.

IIUC, if the destination is a dir, shutil should correctly overwrite the file in it. Could you help check if that is the case in your case?

@sethkimmel3
Copy link
Contributor Author

@Michaelvll here's where things stand:

  1. To solve the ssh control path issue, we can do one of two things - use a tmpfs type directory; in Modal's case /dev/shm/, as a 9p filesystem seems incompatible with using ssh control paths. So it seems viable to let a user choose an ssh control path directory in config.yaml, or let a user control whether to disable the ssh control master.
  2. The second issue stems from trying to move across filesystems. We can use shutil.copy2 or simply check in advance whether the file already exists so shutil.move doesn't fail.

Comment on lines 132 to 134
shutil.copy2(str(wheel_path), wheel_dir)
# if not os.path.exists(os.path.join(wheel_dir, os.path.basename(wheel_path))):
# shutil.move(str(wheel_path), wheel_dir)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change mainly caused by sharing the ~/.sky folder across multiple functions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's because we're moving the wheel from a 9p to tmpfs type filesystem (or vice versa).

@@ -42,7 +43,7 @@ def _ssh_control_path(ssh_control_filename: Optional[str]) -> Optional[str]:
if ssh_control_filename is None:
return None
user_hash = common_utils.get_user_hash()
path = f'/tmp/skypilot_ssh_{user_hash}/{ssh_control_filename}'
path = f'/dev/shm/skypilot_ssh_{user_hash}/{ssh_control_filename}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems control master is only supported on some specific file system types and that seems to be why it fails on modal's container. I am thinking that we should check the file system type to decide whether we should use control master. Changing the path to /dev/shm/ feels a bit losing generality for other systems, for example

def _file_system_allow_ssh_control_master() -> bool:
    file_system_type = None
    partitions = psutil.disk_partitions()
    for partition in partitions:
        if '/' in partition.mountpoint and file_system_type is None:
            file_system_type = partition.fstype
        if '/tmp' in partition.mountpoint:
            file_system_type = partition.fstype
            break
    # Allow unknown file system types, as well as ext4, xfs, btrfs, and apfs.
    if file_system_type in ['ext4', 'xfs', 'btrfs', 'apfs', None]:
        return True
    return False

@sethkimmel3 sethkimmel3 marked this pull request as ready for review October 27, 2024 23:51
@sethkimmel3
Copy link
Contributor Author

@Michaelvll - I proposed an implementation here. Let me know what you think!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants