[ux] add sky jobs launch --fast #4231

Merged · 3 commits · Oct 31, 2024 (showing changes from 1 commit)
13 changes: 12 additions & 1 deletion sky/cli.py
@@ -3560,6 +3560,15 @@ def jobs():
default=False,
required=False,
help='Skip confirmation prompt.')
# TODO(cooperc): remove this flag once --fast can robustly detect cluster
# yaml config changes
@click.option('--fast',
default=False,
is_flag=True,
help='[Experimental] Launch the job more quickly, but skip some '
'initialization steps. If you update SkyPilot or your local '
'cloud credentials, they will not be reflected until you run '
'`sky jobs launch` at least once without this flag.')
cg505 marked this conversation as resolved.
@timeline.event
@usage_lib.entrypoint
def jobs_launch(
@@ -3586,6 +3595,7 @@ def jobs_launch(
detach_run: bool,
retry_until_up: bool,
yes: bool,
fast: bool,
):
"""Launch a managed job from a YAML or a command.

@@ -3669,7 +3679,8 @@
managed_jobs.launch(dag,
name,
detach_run=detach_run,
retry_until_up=retry_until_up)
retry_until_up=retry_until_up,
fast=fast)


@jobs.command('queue', cls=_DocumentedCodeCommand)
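The TODO above notes that `--fast` should eventually be replaced by robust detection of cluster yaml config changes. As a hypothetical sketch (not SkyPilot's actual implementation), such detection could hash the rendered cluster config and allow skipping re-initialization only when the hash is unchanged since the last full launch:

```python
import hashlib
from typing import Optional, Tuple


def should_skip_setup(cluster_yaml: str,
                      cached_hash: Optional[str]) -> Tuple[bool, str]:
    # Hash the rendered cluster config; setup may be skipped only when the
    # config is byte-for-byte identical to the one from the last full launch.
    new_hash = hashlib.sha256(cluster_yaml.encode('utf-8')).hexdigest()
    return cached_hash == new_hash, new_hash


# First launch: nothing cached, so full initialization must run.
skip, h = should_skip_setup('resources:\n  cpus: 2+\n', None)
print(skip)  # False

# Unchanged config on a later launch: initialization can be skipped.
skip_again, _ = should_skip_setup('resources:\n  cpus: 2+\n', h)
print(skip_again)  # True
```

A scheme like this would still miss out-of-band changes (e.g. updated cloud credentials), which is exactly why the flag's help text warns about them.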
6 changes: 6 additions & 0 deletions sky/jobs/core.py
@@ -36,6 +36,7 @@ def launch(
stream_logs: bool = True,
detach_run: bool = False,
retry_until_up: bool = False,
fast: bool = False,
) -> None:
# NOTE(dev): Keep the docstring consistent between the Python API and CLI.
"""Launch a managed job.
@@ -47,11 +48,15 @@
managed job.
name: Name of the managed job.
detach_run: Whether to detach the run.
fast: Whether to use sky.launch(fast=True) for the jobs controller. If
True, the SkyPilot wheel and the cloud credentials may not be updated
on the jobs controller.

Raises:
ValueError: cluster does not exist. Or, the entrypoint is not a valid
chain dag.
sky.exceptions.NotSupportedError: the feature is not supported.

cg505 marked this conversation as resolved.
"""
entrypoint = task
dag_uuid = str(uuid.uuid4().hex[:4])
@@ -138,6 +143,7 @@ def launch(
idle_minutes_to_autostop=skylet_constants.
CONTROLLER_IDLE_MINUTES_TO_AUTOSTOP,
retry_until_up=True,
fast=fast,
Collaborator
I tried with this script:

for i in {1..5}; do
  sky jobs launch -y --fast --cpus 2+ -- echo hi2 &
done
wait

The last job failed with FAILED_CONTROLLER. Have you seen this before? https://gist.github.com/romilbhardwaj/7d1871f1c18b3bb0ccd9141e14bd9fdd

Collaborator Author (@cg505, Oct 31, 2024)
I did not see this. I was able to run seq 100 | xargs -P 5 -n 1 bash -c 'sky jobs launch --fast -yd -n parallel-launch-$0 "echo $0"' without any issue.
It kind of looks like the controller just died while starting the job. Not sure what would cause this.

_disable_controller_check=True)
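The overall wiring of the change can be summarized with stand-in functions (hypothetical names and signatures, not SkyPilot's real internals): the Click flag is forwarded verbatim from the CLI entrypoint into the core launch() call, which passes it on to the controller launch.

```python
def core_launch(task, retry_until_up=False, fast=False):
    # Stand-in for sky/jobs/core.py:launch. With fast=True, the controller
    # launch reuses existing setup instead of re-running initialization.
    return {'task': task, 'retry_until_up': retry_until_up, 'fast': fast}


def jobs_launch_cli(entrypoint, fast=False):
    # Stand-in for the Click entrypoint in sky/cli.py: --fast is passed
    # through unchanged to the core API.
    return core_launch(entrypoint, retry_until_up=True, fast=fast)


result = jobs_launch_cli('echo hi', fast=True)
print(result['fast'])  # True
```

This pass-through design keeps the CLI flag and the Python API keyword in sync, so `sky.jobs.launch(..., fast=True)` and `sky jobs launch --fast` behave identically.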

