
[ux] add sky jobs launch --fast #4231

Merged: 3 commits merged into skypilot-org:master from fast-jobs-launch on Oct 31, 2024

Conversation

@cg505 (Collaborator) commented Oct 31, 2024

This flag will make the jobs controller launch use `sky launch --fast`. There
are a few known situations where this can cause misbehavior in the jobs
controller:

  • The SkyPilot wheel is outdated (due to changes in the SkyPilot code or a
    version upgrade).
  • The user's cloud credentials have changed. In this case the new credentials
    will not be synced, and if there are new clouds available in `sky check`, the
    cloud dependencies may not be correctly installed.

However, this does speed up `jobs launch` significantly, so provide it as a
dangerous option. Soon we will add robustness checks to `sky launch --fast` that
will fix the above caveats, and then we can remove this flag and enable the
behavior by default.
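
For reference, a minimal sketch of the new usage (the job name and command below are illustrative placeholders, not from this PR's tests):

# Reuse the existing jobs controller without re-provisioning or re-running
# setup on it (the --fast flag added by this PR):
sky jobs launch --fast -y -d -n my-job --cpus 2+ -- echo hi

# Default path (no --fast): the controller launch goes through the full
# provision/setup flow, which is slower but re-syncs the SkyPilot wheel and
# cloud credentials on the controller:
sky jobs launch -y -d -n my-job --cpus 2+ -- echo hi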

Tested (run the relevant ones):

  • Code formatting: `bash format.sh`
  • Manual tests
  • Relevant individual smoke tests: `pytest tests/test_smoke.py::test_managed_jobs`
  • Backward compatibility tests: `conda deactivate; bash -i tests/backward_compatibility_tests.sh`

@romilbhardwaj (Collaborator) left a comment:

Thanks @cg505!

sky/cli.py (review comment, outdated and resolved)
sky/jobs/core.py (review comment, outdated and resolved)
@@ -138,6 +143,7 @@ def launch(
idle_minutes_to_autostop=skylet_constants.
CONTROLLER_IDLE_MINUTES_TO_AUTOSTOP,
retry_until_up=True,
fast=fast,
@romilbhardwaj (Collaborator) commented:

I tried with this script:

for i in {1..5}; do
  sky jobs launch -y --fast --cpus 2+ -- echo hi2 &
done
wait

The last job failed with FAILED_CONTROLLER. Have you seen this before? https://gist.github.com/romilbhardwaj/7d1871f1c18b3bb0ccd9141e14bd9fdd

@cg505 (Collaborator, Author) replied on Oct 31, 2024:

I did not see this. I was able to run `seq 100 | xargs -P 5 -n 1 bash -c 'sky jobs launch --fast -yd -n parallel-launch-$0 "echo $0"'` without any issue.
It kind of looks like the controller just died while starting the job. Not sure what would cause this.

@romilbhardwaj (Collaborator) commented Oct 31, 2024:

Submitted 10 jobs in ~40s - nice!

for i in {1..10}; do
  sky jobs launch -d -y --fast --cpus 2+ -- echo hi2 &
done
wait

However, the controller runs only the first few jobs and then fails. Probably unrelated to this PR:

Managed jobs
No in-progress managed jobs.
ID  TASK  NAME     RESOURCES   SUBMITTED    TOT. DURATION  JOB DURATION  #RECOVERIES  STATUS
17  -     sky-cmd  1x[CPU:2+]  5 mins ago   4m 50s         -             0            FAILED_CONTROLLER
16  -     sky-cmd  1x[CPU:2+]  5 mins ago   4m 54s         -             0            FAILED_CONTROLLER
15  -     sky-cmd  1x[CPU:2+]  5 mins ago   5m 5s          -             0            FAILED_CONTROLLER
14  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 9s          -             0            FAILED_CONTROLLER
13  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 15s         -             0            FAILED_CONTROLLER
12  -     sky-cmd  1x[CPU:2+]  6 mins ago   5m 24s         -             0            FAILED_CONTROLLER
11  -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 1s          5s            0            SUCCEEDED
10  -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 2s          5s            0            SUCCEEDED
9   -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 4s          6s            0            SUCCEEDED
8   -     sky-cmd  1x[CPU:2+]  6 mins ago   1m 13s         6s            0            SUCCEEDED
7   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 47s         -             0            FAILED_CONTROLLER
6   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 1s          5s            0            SUCCEEDED
5   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 2s          5s            0            SUCCEEDED
4   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 3s          5s            0            SUCCEEDED
3   -     sky-cmd  1x[CPU:2+]  13 mins ago  1m 5s          5s            0            SUCCEEDED
2   -     sky-cmd  1x[CPU:1+]  18 mins ago  58s            4s            0            SUCCEEDED
1   -     sky-cmd  1x[CPU:1+]  21 mins ago  1m 12s         4s            0            SUCCEEDED

`sky jobs logs --controller` isn't very helpful:

(base) ➜  ~ sky jobs logs --controller 16
D 10-31 12:29:36 skypilot_config.py:228] Using config path: /Users/romilb/.sky/config.yaml
D 10-31 12:29:36 skypilot_config.py:233] Config loaded:
D 10-31 12:29:36 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
D 10-31 12:29:36 skypilot_config.py:233]  'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
D 10-31 12:29:36 skypilot_config.py:233]  'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
D 10-31 12:29:36 skypilot_config.py:233]                                                                  'value': 'my_value'}]}]}}}}
D 10-31 12:29:36 skypilot_config.py:245] Config syntax check passed.
D 10-31 12:29:37 backend_utils.py:1937] Refreshing status: Failed get the lock for cluster 'sky-jobs-controller-2ea485ea'. Using the cached status.
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:228] Using config path: /home/gcpuser/.sky/managed_jobs/sky-cmd-3008.config_yaml
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] Config loaded:
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233]  'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233]  'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:233]                                                                  'value': 'my_value'}]}]}}}}
(sky-cmd, pid=15103) D 10-31 19:22:01 skypilot_config.py:245] Config syntax check passed.
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:228] Using config path: /home/gcpuser/.sky/managed_jobs/sky-cmd-3008.config_yaml
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] Config loaded:
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233] {'allowed_clouds': ['aws', 'gcp'],
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233]  'jobs': {'controller': {'resources': {'cpus': '4+', 'memory': '4+'}}},
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233]  'kubernetes': {'pod_config': {'spec': {'containers': [{'env': [{'name': 'MY_ENV_VAR',
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:233]                                                                  'value': 'my_value'}]}]}}}}
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:245] Config syntax check passed.
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53] DAG:
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53] [Task<name=sky-cmd>(run='echo hi2')
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:53]   resources: <Cloud>(cpus=2+)]
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:180] Submitted managed job 16 (task: 0, name: 'sky-cmd'); SKYPILOT_TASK_ID: sky-managed-2024-10-31-19-22-02-711936_sky-cmd_16-0
(sky-cmd, pid=15103) I 10-31 19:22:02 controller.py:184] Started monitoring.
(sky-cmd, pid=15103) I 10-31 19:22:02 state.py:337] Launching the spot cluster...
(sky-cmd, pid=15103) D 10-31 19:22:02 skypilot_config.py:146] User config: allowed_clouds -> ['aws', 'gcp']
(sky-cmd, pid=15103) D 10-31 19:22:02 optimizer.py:292] #### Task<name=sky-cmd>(run='echo hi2')
(sky-cmd, pid=15103) D 10-31 19:22:02 optimizer.py:292]   resources: <Cloud>(cpus=2+) ####

@romilbhardwaj (Collaborator) left a comment:

Discussed and tested offline with 100 jobs submitted in parallel. Discovered other bottlenecks unrelated to this PR, which we should file as issues, and we should suggest best practices such as using xargs to limit parallelism.
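
As a reference for that best practice, a minimal sketch of the xargs pattern already used earlier in this thread (job count, parallelism level, and job names are illustrative):

# Submit 100 managed jobs while capping client-side parallelism at 5
# concurrent `sky jobs launch` invocations:
seq 100 | xargs -P 5 -n 1 bash -c 'sky jobs launch --fast -yd -n parallel-launch-$0 "echo $0"'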

@cg505 enabled auto-merge on October 31, 2024, 21:04
@cg505 added this pull request to the merge queue on Oct 31, 2024
Merged via the queue into skypilot-org:master with commit 599e155 on Oct 31, 2024
20 checks passed
@cg505 deleted the fast-jobs-launch branch on October 31, 2024, 21:15