-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ux] add sky jobs launch --fast #4231
Conversation
This flag will make the jobs controller launch use `sky launch --fast`. There are a few known situations where this can cause misbehavior in the jobs controller: - The SkyPilot wheel is outdated (due to changes in the SkyPilot code or a version upgrade). - The user's cloud credentials have changed. In this case the new credentials will not be synced, and if there are new clouds available in `sky check`, the cloud depedencies may not be correctly installed. However, this does speed up `jobs launch` _significantly_, so provide it as a dangerous option. Soon we will add robustness checks to `sky launch --fast` that will fix the above caveats, and we can remove this flag and just enable the behavior by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @cg505!
@@ -138,6 +143,7 @@ def launch( | |||
idle_minutes_to_autostop=skylet_constants. | |||
CONTROLLER_IDLE_MINUTES_TO_AUTOSTOP, | |||
retry_until_up=True, | |||
fast=fast, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried with this script:
for i in {1..5}; do
sky jobs launch -y --fast --cpus 2+ -- echo hi2 &
done
wait
The last job failed with FAILED_CONTROLLER
. Have you seen this before? https://gist.github.com/romilbhardwaj/7d1871f1c18b3bb0ccd9141e14bd9fdd
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not see this. Was able to run seq 100 | xargs -P 5 -n 1 bash -c 'sky jobs launch --fast -yd -n parallel-launch-$0 "echo $0"'
without any issue.
It kind of looks like the controller just died while starting the job. Not sure what would cause this.
Submitted 10 jobs in ~40s - nice!
However, the controller runs only first few jobs then fails. Probably unrelated to this PR:
|
Co-authored-by: Romil Bhardwaj <romil.bhardwaj@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed and tested offline with 100 jobs submitted in parallel. Discovered other bottlenecks unrelated to this PR, which we should file as issues + suggest best practices using xargs
to limit parallelism.
This flag will make the jobs controller launch use
sky launch --fast
. Thereare a few known situations where this can cause misbehavior in the jobs
controller:
version upgrade).
will not be synced, and if there are new clouds available in
sky check
, thecloud depedencies may not be correctly installed.
However, this does speed up
jobs launch
significantly, so provide it as adangerous option. Soon we will add robustness checks to
sky launch --fast
thatwill fix the above caveats, and we can remove this flag and just enable the
behavior by default.
Tested (run the relevant ones):
bash format.sh
conda deactivate; bash -i tests/backward_compatibility_tests.sh