[Core] Fix job race condition. #4193

Merged
merged 8 commits into from
Oct 30, 2024

@cblmemo cblmemo commented Oct 26, 2024

Fixes #4133.

For examples/multi_echo.py on the latest master, the failure rate is about 2% (5 out of 256 jobs). With this PR, no jobs fail.

(sky) ➜  skypilot git:(master) ✗ sky queue multi-echo-fixx | grep FAILED 
(sky) ➜  skypilot git:(master) ✗ sky queue multi-echo-masterr | grep FAILED
210  -     27 secs ago     -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-51-39-047737  
126  -     1 min ago       -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-51-05-346333  
111  -     1 min ago       -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-51-00-301504  
104  -     1 min ago       -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-50-57-334723  
84   -     1 min ago       -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-50-49-238716  
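
As a rough check of the quoted rate, the FAILED rows above can be counted with a few lines of Python. This is only a sketch: `failure_rate` is a hypothetical helper, not part of SkyPilot, and it assumes the status appears as a standalone whitespace-separated token as in the output above.

```python
def failure_rate(queue_output: str, total_jobs: int) -> float:
    """Return the fraction of jobs whose row contains a FAILED status token."""
    failed = sum(
        1 for line in queue_output.splitlines()
        # Token-level match so that a log path can never count as a status.
        if "FAILED" in line.split()
    )
    return failed / total_jobs


# The five FAILED rows reported for the master run above.
sample = """\
210  -     27 secs ago     -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-51-39-047737
126  -     1 min ago       -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-51-05-346333
111  -     1 min ago       -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-51-00-301504
104  -     1 min ago       -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-50-57-334723
84   -     1 min ago       -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-29-21-50-49-238716
"""

print(f"{failure_rate(sample, 256):.2%}")  # prints 1.95%
```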

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
    • Run examples/multi_echo.py with 256 jobs
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

cblmemo commented Oct 26, 2024

Update: It seems that after the fix, the multi-echo example still hits random FAILED statuses. Will investigate more.

$ sky queue             
Fetching and parsing job queue...

Job queue of cluster test-multi-echo-memory-24cf
ID  NAME  SUBMITTED    STARTED         DURATION  RESOURCES   STATUS     LOG                                        
32  -     35 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-04-00-765241  
31  -     35 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-04-00-731798  
30  -     35 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-04-00-889227  
29  -     35 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-04-00-833135  
28  -     35 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-04-00-763923  
27  -     35 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-04-00-235621  
26  -     36 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-04-00-109088  
25  -     37 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-59-018059  
24  -     41 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-54-950481  
23  -     41 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-54-881690  
22  -     41 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-54-615505  
21  -     41 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-54-639262  
20  -     41 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-54-464287  
19  -     41 secs ago  -               -         1x[T4:0.5]  FAILED     ~/sky_logs/sky-2024-10-26-14-03-54-275645  
18  -     42 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-54-088479  
17  -     43 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-52-866934  
16  -     48 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-47-254966  
15  -     48 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-47-378664
14  -     48 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-47-254666
13  -     48 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-47-250899
12  -     48 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-47-268683
11  -     48 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-47-256435
10  -     49 secs ago  -               -         1x[T4:0.5]  PENDING    ~/sky_logs/sky-2024-10-26-14-03-47-205305
9   -     49 secs ago  a few secs ago  1s        1x[T4:0.5]  RUNNING    ~/sky_logs/sky-2024-10-26-14-03-47-172039
8   -     1 min ago    a few secs ago  5s        1x[T4:0.5]  RUNNING    ~/sky_logs/sky-2024-10-26-14-03-35-218274
7   -     1 min ago    a few secs ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-26-14-03-35-186347
6   -     1 min ago    13 secs ago     6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-26-14-03-35-197969
5   -     1 min ago    17 secs ago     6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-26-14-03-35-186602
4   -     1 min ago    21 secs ago     6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-26-14-03-35-208915
3   -     1 min ago    25 secs ago     6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-26-14-03-35-218412
2   -     1 min ago    28 secs ago     6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-26-14-03-35-179637
1   -     1 min ago    37 secs ago     12s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-26-14-03-35-182412

if ray_job_id in job_details:
    ray_status = job_details[ray_job_id].status
    status = _RAY_TO_JOB_STATUS_MAP[ray_status]
if job_id in pending_jobs:
Collaborator

This change is not effective, since we are still getting the pending_jobs outside the lock?

Collaborator Author

Oh, that is a good point. Refactored, PTAL. Testing now!
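
The reviewer's concern can be illustrated with a toy model. This is a sketch under stated assumptions, not the code in sky/skylet/job_lib.py: `JobTable` and its methods are hypothetical, an in-process `threading.Lock` stands in for whatever lock the PR uses, and marking a job FAILED is just one way a stale read could go wrong.

```python
import threading


class JobTable:
    """Toy in-memory stand-in for a job table guarded by one lock."""

    def __init__(self, pending):
        self._lock = threading.Lock()
        self._pending = set(pending)
        self.status = {job_id: "PENDING" for job_id in pending}

    def start(self, job_id):
        # A scheduler moves a job from PENDING to RUNNING.
        with self._lock:
            self._pending.discard(job_id)
            self.status[job_id] = "RUNNING"

    def pending_snapshot(self):
        with self._lock:
            return set(self._pending)

    def fail_if_pending_stale(self, job_id, snapshot):
        # Racy: `snapshot` was read before this lock was taken, so it may
        # no longer match the table by the time the write happens.
        with self._lock:
            if job_id in snapshot:
                self.status[job_id] = "FAILED"

    def fail_if_pending(self, job_id):
        # Safe: the membership check and the write share one critical section.
        with self._lock:
            if job_id in self._pending:
                self._pending.discard(job_id)
                self.status[job_id] = "FAILED"


# Stale read: the job starts between the snapshot and the update.
racy = JobTable([1])
snapshot = racy.pending_snapshot()
racy.start(1)
racy.fail_if_pending_stale(1, snapshot)
print(racy.status[1])  # FAILED, although the job is actually running

# Check and update under the same lock: the running job is left alone.
safe = JobTable([1])
safe.start(1)
safe.fail_if_pending(1)
print(safe.status[1])  # RUNNING
```

Holding the lock only around the write is not enough; the read that the decision depends on has to happen inside the same critical section.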

@cblmemo cblmemo marked this pull request as ready for review October 28, 2024 20:25

cblmemo commented Oct 28, 2024

Tested with 192 jobs on the multi-echo example; all passed. It should be ready!

sky queue
Fetching and parsing job queue...

Job queue of cluster multi-echo
ID   NAME  SUBMITTED    STARTED      DURATION  RESOURCES   STATUS     LOG                                        
192  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-13-02-167356  
191  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-13-00-986457  
190  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-52-456049  
189  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-51-830444  
188  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-50-965526  
187  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-50-572829  
186  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-49-701496  
185  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-48-968773  
184  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-45-636131  
183  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-44-594621  
182  -     17 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-33-089645  
181  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-32-741923  
180  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-31-247856  
179  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-31-692901  
178  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-31-076652  
177  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-30-326527  
176  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-28-969599  
175  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-28-909274  
174  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-21-499441  
173  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-21-558663  
172  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-18-179116  
171  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-17-903096  
170  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-17-408714  
169  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-16-779542  
168  -     17 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-16-124634  
167  -     17 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-13-391503  
166  -     18 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-06-601122  
165  -     18 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-05-591666  
164  -     18 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-04-398142  
163  -     18 mins ago  6 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-04-621360  
162  -     18 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-03-139993  
161  -     18 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-02-634026  
160  -     18 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-12-00-631111  
159  -     18 mins ago  6 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-57-936300  
158  -     18 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-53-502853  
157  -     18 mins ago  6 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-52-560115  
156  -     18 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-49-118061  
155  -     18 mins ago  6 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-49-353285  
154  -     18 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-48-829075  
153  -     18 mins ago  7 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-48-014756  
152  -     18 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-45-442742  
151  -     18 mins ago  7 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-43-412023  
150  -     18 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-42-492645  
149  -     18 mins ago  7 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-40-542518  
148  -     18 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-31-721143  
147  -     18 mins ago  7 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-31-234180  
146  -     18 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-31-049774  
145  -     18 mins ago  7 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-31-284445  
144  -     18 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-30-660781  
143  -     18 mins ago  7 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-27-664178  
142  -     18 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-26-530728  
141  -     18 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-26-081697  
140  -     18 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-14-372342  
139  -     18 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-12-763917  
138  -     18 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-12-709629  
137  -     18 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-12-202589  
136  -     18 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-11-730009  
135  -     18 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-11-104493  
134  -     18 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-10-808894  
133  -     18 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-10-347721  
132  -     19 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-02-520762  
131  -     19 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-11-00-541923  
130  -     19 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-56-579451  
129  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-56-292527  
128  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-56-156926  
127  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-55-453671  
126  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-55-828501  
125  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-55-392873  
124  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-53-626311  
123  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-50-869824  
122  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-40-233392  
121  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-40-054055  
120  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-40-209306  
119  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-39-318618  
118  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-38-915442  
117  -     19 mins ago  9 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-38-874179  
116  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-37-653385  
115  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-36-342203  
114  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-22-550121  
113  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-22-434417  
112  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-22-136161  
111  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-21-931887  
110  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-22-186426  
109  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-21-577886  
108  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-21-326984  
107  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-20-772396  
106  -     19 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-12-924320  
105  -     19 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-06-386910  
104  -     19 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-06-309217  
103  -     19 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-06-617857  
102  -     19 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-06-581169  
101  -     20 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-05-576467  
100  -     20 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-06-164092  
99   -     20 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-05-870981  
98   -     20 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-10-03-301800  
97   -     20 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-50-863627  
96   -     20 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-50-254479  
95   -     20 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-50-055437  
94   -     20 mins ago  11 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-49-907777  
93   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-49-224937  
92   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-49-353230  
91   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-48-924397  
90   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-48-288230  
89   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-40-337984  
88   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-35-168591  
87   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-34-016081  
86   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-34-057309  
85   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-33-962353  
84   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-33-874441  
83   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-33-180690  
82   -     20 mins ago  12 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-33-160709  
81   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-31-798447  
80   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-24-754484  
79   -     20 mins ago  13 mins ago  8s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-24-816213  
78   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-24-525971  
77   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-24-681552  
76   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-24-641127  
75   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-24-128906  
74   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-24-250361  
73   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-22-942971  
72   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-14-993527  
71   -     20 mins ago  13 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-14-999093  
70   -     20 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-14-987326  
69   -     20 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-14-988361  
68   -     20 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-14-986726  
67   -     20 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-14-959532  
66   -     20 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-14-981732  
65   -     20 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-14-960040  
64   -     21 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-06-010258  
63   -     21 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-09-06-103875  
62   -     21 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-58-383485  
61   -     21 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-53-351615  
60   -     21 mins ago  14 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-53-173593  
59   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-52-936163  
58   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-52-912080  
57   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-52-649761  
56   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-52-391837  
55   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-51-249945  
54   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-50-990618  
53   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-44-414374  
52   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-39-115994  
51   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-38-209638  
50   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-38-530270  
49   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-38-193650  
48   -     21 mins ago  15 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-37-821802  
47   -     21 mins ago  16 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-37-277792  
46   -     21 mins ago  16 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-37-367870  
45   -     21 mins ago  16 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-37-176952  
44   -     21 mins ago  16 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-31-079026  
43   -     21 mins ago  16 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-31-525364  
42   -     21 mins ago  16 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-30-559086  
41   -     21 mins ago  16 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-30-141740  
40   -     21 mins ago  16 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-24-093076  
39   -     21 mins ago  16 mins ago  8s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-24-049486  
38   -     21 mins ago  16 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-24-046632  
37   -     21 mins ago  16 mins ago  11s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-24-043714  
36   -     21 mins ago  17 mins ago  13s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-24-040252  
35   -     21 mins ago  17 mins ago  28s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-24-047743  
34   -     21 mins ago  17 mins ago  12s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-24-015548  
33   -     21 mins ago  18 mins ago  11s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-08-23-976819  
32   -     22 mins ago  18 mins ago  41s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-17-658069  
31   -     22 mins ago  18 mins ago  11s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-17-638542  
30   -     22 mins ago  18 mins ago  11s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-17-444940  
29   -     22 mins ago  19 mins ago  23s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-17-046370  
28   -     22 mins ago  19 mins ago  9s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-16-253004  
27   -     22 mins ago  19 mins ago  10s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-11-055884  
26   -     22 mins ago  19 mins ago  22s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-10-865600  
25   -     22 mins ago  20 mins ago  10s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-10-820502  
24   -     22 mins ago  20 mins ago  24s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-10-326193  
23   -     22 mins ago  20 mins ago  8s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-10-090867  
22   -     22 mins ago  20 mins ago  23s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-09-892731  
21   -     22 mins ago  20 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-10-071996  
20   -     22 mins ago  21 mins ago  8s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-09-257016  
19   -     23 mins ago  21 mins ago  8s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-03-617943  
18   -     23 mins ago  21 mins ago  43s       1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-03-333942  
17   -     23 mins ago  21 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-07-03-139840  
16   -     23 mins ago  21 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-57-330274  
15   -     23 mins ago  21 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-57-358138  
14   -     23 mins ago  21 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-57-294934  
13   -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-57-214243  
12   -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-57-195159  
11   -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-57-086991  
10   -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-57-247200  
9    -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-57-006863  
8    -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-45-521903  
7    -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-45-402809  
6    -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-45-395510  
5    -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-45-395694  
4    -     23 mins ago  22 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-45-391061  
3    -     23 mins ago  22 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-45-396063  
2    -     23 mins ago  22 mins ago  9s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-45-392166  
1    -     23 mins ago  23 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-2024-10-28-13-06-45-391832 

@Michaelvll Michaelvll left a comment

Thanks a lot for fixing this @cblmemo! Left a comment about the performance.

For testing this PR, is it possible that we run the multi_echo many times and see if we can find a way to increase the concurrency there to fully test this? Also, it would be good to run all the smoke tests for this to avoid any regression : )

  statuses = []
- for job_id, status in zip(job_ids, job_statuses):
+ for job_id in job_ids:
Collaborator

Should we update the ray==2.4.0 in the docstr above?

Collaborator Author

I updated to ray >= 2.4.0 as IIUC this is a newly added feature in ray 2.4.0?

sky/skylet/job_lib.py (outdated, resolved)

cblmemo commented Oct 29, 2024

> Thanks a lot for fixing this @cblmemo! Left a comment about the performance.
>
> For testing this PR, is it possible that we run the multi_echo many times and see if we can find a way to increase the concurrency there to fully test this? Also, it would be good to run all the smoke tests for this to avoid any regression : )

Good point! I updated the max concurrency to 32; let's see the performance. For the smoke tests, I think Romil is running them for the v0.7 release. Do you think it is possible to run them for this PR at the same time? cc @romilbhardwaj

cblmemo commented Oct 29, 2024

Oh I think when we increase the concurrency, the SSH failed first..

E 10-28 17:18:48 subprocess_utils.py:85] mux_client_request_session: session request failed: Session open refused by peer
E 10-28 17:18:48 subprocess_utils.py:85] kex_exchange_identification: read: Connection reset by peer
E 10-28 17:18:48 subprocess_utils.py:85]

@Michaelvll
Collaborator

> Oh I think when we increase the concurrency, the SSH failed first..
>
> E 10-28 17:18:48 subprocess_utils.py:85] mux_client_request_session: session request failed: Session open refused by peer
> E 10-28 17:18:48 subprocess_utils.py:85] kex_exchange_identification: read: Connection reset by peer
> E 10-28 17:18:48 subprocess_utils.py:85]

I suppose this can be fixed by disabling the SSH control master. Or can we just simulate the concurrent job submission on the remote cluster, i.e., generate a lot of task apps on the remote cluster and simultaneously run the command that the job submission script executes on the remote?
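
The simulation idea could be sketched as follows: start many copies of one command directly on the machine, so no SSH connection is involved, and collect their exit codes. Everything here is an assumption for illustration. `run_concurrently` is a hypothetical helper, and `placeholder_cmd` stands in for the real command run by the job submission script, which this thread does not show.

```python
import subprocess
import sys


def run_concurrently(cmd, copies):
    """Start `copies` processes of `cmd` back to back, then wait for all."""
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for _ in range(copies)
    ]
    outputs = []
    for proc in procs:
        out, _ = proc.communicate()  # waits for this process to exit
        outputs.append((proc.returncode, out.strip()))
    return outputs


# Placeholder command; replace with the actual submission command.
placeholder_cmd = [sys.executable, "-c", "print('submitted')"]

results = run_concurrently(placeholder_cmd, copies=8)
failures = [r for r in results if r[0] != 0]
print(len(results), len(failures))
```

Because all processes are spawned before any of them is waited on, they overlap in time, which is the property needed to exercise the race.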

@romilbhardwaj romilbhardwaj added this to the v0.7 milestone Oct 29, 2024

cblmemo commented Oct 29, 2024

Just tested with 128 jobs concurrently submitted to the cluster, and it works!

sky queue
Fetching and parsing job queue...

Job queue of cluster t-multi-echo
ID   NAME     SUBMITTED    STARTED      DURATION  RESOURCES   STATUS     LOG                                        
130  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-458796099        
129  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-535835570        
128  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-815629051        
127  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-867653101        
126  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-394400685        
125  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-389785135        
124  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-018900992        
123  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-755551186        
122  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-385794973        
121  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-969498972        
120  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-101400629        
119  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-293284469        
118  sky-cmd  12 mins ago  2 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-289634392        
117  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-466862257        
116  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-758000335        
115  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-438764491        
114  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-276099579        
113  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-017898999        
112  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-164411403        
111  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-860967090        
110  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-170806416        
109  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-287758911        
108  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-979868870        
107  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-840845623        
106  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-521760666        
105  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-000261083        
104  sky-cmd  12 mins ago  3 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-081235373        
103  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-760515548        
102  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-221834421        
101  sky-cmd  12 mins ago  3 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-818800957        
100  sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-967526310        
99   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-130284141        
98   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-169090472        
97   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-078802642        
96   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-143847823        
95   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-809577207        
94   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-769950108        
93   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-359808016        
92   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-158964465        
91   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-960828980        
90   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-769797168        
89   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-015588439        
88   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-508783014        
87   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-979128860        
86   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-861623317        
85   sky-cmd  12 mins ago  4 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-041982627        
84   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-773851310        
83   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-458920059        
82   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-789945828        
81   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-976351809        
80   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-429309100        
79   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-923770202        
78   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-851627131        
77   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-371620506        
76   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-822339383        
75   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-295110263        
74   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-767133123        
73   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-756071274        
72   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-855964327        
71   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-838838861        
70   sky-cmd  12 mins ago  5 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-436820357        
69   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-771270967        
68   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-229042113        
67   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-843375038        
66   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-297864600        
65   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-796136091        
64   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-920939611        
63   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-040181697        
62   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-189478444        
61   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-415815623        
60   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-827107651        
59   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-363755437        
58   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-280384024        
57   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-847906293        
56   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-207925089        
55   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-013562788        
54   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-280776183        
53   sky-cmd  12 mins ago  6 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-223811559        
52   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-433794013        
51   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-751455325        
50   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-086873008        
49   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-794905572        
48   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-167085868        
47   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-864924090        
46   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-446836871        
45   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-310185851        
44   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-310790162        
43   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-908960204        
42   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-899625716        
41   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-361800334        
40   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-888880276        
39   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-796611576        
38   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-202954705        
37   sky-cmd  12 mins ago  7 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-753260322        
36   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-824394249        
35   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-933808239        
34   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-803870670        
33   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-033906207        
32   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-757421981        
31   sky-cmd  12 mins ago  10 mins ago  7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-516785396        
30   sky-cmd  12 mins ago  8 mins ago   7s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-034736416        
29   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-911820007        
28   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-159799820        
27   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-759455241        
26   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-932884279        
25   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-083763376        
24   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-524548302        
23   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-079312861        
22   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-881186127        
21   sky-cmd  12 mins ago  8 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-096913143        
20   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-132760102        
19   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-139847681        
18   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-931096723        
17   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-970157418        
16   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-357840684        
15   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-144199747        
14   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-913760721        
13   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-872838430        
12   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-922907060        
11   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-146186793        
10   sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-768559702        
9    sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-878908464        
8    sky-cmd  12 mins ago  10 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-773364119        
7    sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-904813969        
6    sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-214890160        
5    sky-cmd  12 mins ago  10 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-793735664        
4    sky-cmd  12 mins ago  9 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-988767355        
3    sky-cmd  12 mins ago  10 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238197-766337147        
2    sky-cmd  18 mins ago  17 mins ago  6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730237902-602957752        
1    sky-cmd  19 mins ago  19 mins ago  2s        2x[T4:1]    SUCCEEDED  ~/sky_logs/sky-2024-10-29-14-34-54-678435
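
To confirm that no job ended in a state other than SUCCEEDED, output like the table above can be tallied by status. This is a minimal sketch, assuming the row layout shown above (numeric ID first, STATUS immediately before the `~/sky_logs/...` path); `count_statuses` is a hypothetical helper, not part of SkyPilot.

```python
import re
from collections import Counter

# One data row of the queue table: a numeric job ID at the start, and the
# STATUS column as the last token before the log path.
ROW_RE = re.compile(r"^\s*(\d+)\s+.*?(\S+)\s+~/sky_logs/\S+\s*$")


def count_statuses(queue_text: str) -> Counter:
    """Count jobs per status in pasted queue output."""
    counts: Counter = Counter()
    for line in queue_text.splitlines():
        match = ROW_RE.match(line)
        if match:  # Header and banner lines do not match and are skipped.
            counts[match.group(2)] += 1
    return counts


# Two rows copied from the table above, plus the header line.
sample = """\
ID   NAME     SUBMITTED    STARTED      DURATION  RESOURCES   STATUS     LOG
130  sky-cmd  12 mins ago  2 mins ago   6s        1x[T4:0.5]  SUCCEEDED  ~/sky_logs/sky-1730238198-458796099
1    sky-cmd  19 mins ago  19 mins ago  2s        2x[T4:1]    SUCCEEDED  ~/sky_logs/sky-2024-10-29-14-34-54-678435
"""

counts = count_statuses(sample)
not_succeeded = sum(n for status, n in counts.items() if status != "SUCCEEDED")
print(dict(counts), "non-succeeded:", not_succeeded)
```

Feeding the full table into `count_statuses` gives a quick check that complements `grep FAILED`, since it also surfaces any other non-SUCCEEDED status.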

The script I used (the internal IPs need to be set manually) is as follows:

# run_all.sh
for i in {1..128}; do
    bash ./run.sh &
done
wait  # Waits for all background processes to finish

# run.sh
TIMESTAMP=$(date +%s-%N)
RAY_TASK_LOG_DIR=~/sky_logs/sky-$TIMESTAMP
INTERNAL_IPS_1=10.128.0.123
INTERNAL_IPS_2=10.128.0.124
RAY_JOB_ID_ENV_VAR=$(/home/gcpuser/skypilot-runtime/bin/python -u -c 'import os;import getpass;from sky.skylet import job_lib, log_lib, constants;job_id = job_lib.add_job('"'"'sky-cmd'"'"', '"'"'memory'"'"', '"'"'sky-'"${TIMESTAMP}""'"', '"'"'1x[T4:0.5]'"'"');print("Job ID: " + str(job_id), flush=True)' | grep -oP 'Job ID: \K[0-9]+')
echo Submitting job $RAY_JOB_ID_ENV_VAR
cd ~/sky_workdir && mkdir -p $RAY_TASK_LOG_DIR && touch $RAY_TASK_LOG_DIR/run.log && { echo 'import getpass
import copy
import hashlib
import io
import multiprocessing.pool
import os
import pathlib
import selectors
import shlex
import subprocess
import sys
import tempfile
import textwrap
import time
from typing import Dict, List, Optional, Tuple, Union

import ray
import ray.util as ray_util

from sky.skylet import autostop_lib
from sky.skylet import constants
from sky.skylet import job_lib
from sky.utils import log_utils
from sky.utils import subprocess_utils

SKY_REMOTE_WORKDIR = '"'"'~/sky_workdir'"'"'

kwargs = dict()
# Only set the `_temp_dir` to SkyPilot'"'"'s ray cluster directory when
# the directory exists for backward compatibility for the VM
# launched before #1790.
if os.path.exists('"'"'/tmp/ray_skypilot'"'"'):
    kwargs['"'"'_temp_dir'"'"'] = '"'"'/tmp/ray_skypilot'"'"'
ray.init(
    address='"'"'auto'"'"',
    namespace='"'"'__sky__'"${RAY_JOB_ID_ENV_VAR}"'__'"'"',
    log_to_driver=True,
    **kwargs
)
def get_or_fail(futures, pg) -> List[int]:
    """Wait for tasks, if any fails, cancel all unready."""
    returncodes = [1] * len(futures)
    # Wait for 1 task to be ready.
    ready = []
    # Keep invoking ray.wait if ready is empty. This is because
    # ray.wait with timeout=None will only wait for 10**6 seconds,
    # which will cause tasks running for more than 12 days to return
    # before becoming ready.
    # (Such tasks are common in serving jobs.)
    # Reference: https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/_private/worker.py#L2845-L2846
    while not ready:
        ready, unready = ray.wait(futures)
    idx = futures.index(ready[0])
    returncodes[idx] = ray.get(ready[0])
    while unready:
        if returncodes[idx] != 0:
            for task in unready:
                # ray.cancel without force fails to kill tasks.
                # We use force=True to kill unready tasks.
                ray.cancel(task, force=True)
                # Use SIGKILL=128+9 to indicate the task is forcibly
                # killed.
                idx = futures.index(task)
                returncodes[idx] = 137
            break
        ready, unready = ray.wait(unready)
        idx = futures.index(ready[0])
        returncodes[idx] = ray.get(ready[0])
    # Remove the placement group after all tasks are done, so that
    # the next job can be scheduled on the released resources
    # immediately.
    ray_util.remove_placement_group(pg)
    sys.stdout.flush()
    return returncodes

run_fn = None
futures = []

class _ProcessingArgs:
    """Arguments for processing logs."""

    def __init__(self,
                 log_path: str,
                 stream_logs: bool,
                 start_streaming_at: str = '"'"''"'"',
                 end_streaming_at: Optional[str] = None,
                 skip_lines: Optional[List[str]] = None,
                 replace_crlf: bool = False,
                 line_processor: Optional[log_utils.LineProcessor] = None,
                 streaming_prefix: Optional[str] = None) -> None:
        self.log_path = log_path
        self.stream_logs = stream_logs
        self.start_streaming_at = start_streaming_at
        self.end_streaming_at = end_streaming_at
        self.skip_lines = skip_lines
        self.replace_crlf = replace_crlf
        self.line_processor = line_processor
        self.streaming_prefix = streaming_prefix

def _handle_io_stream(io_stream, out_stream, args: _ProcessingArgs):
    """Process the stream of a process."""
    out_io = io.TextIOWrapper(io_stream,
                              encoding='"'"'utf-8'"'"',
                              newline='"'"''"'"',
                              errors='"'"'replace'"'"',
                              write_through=True)

    start_streaming_flag = False
    end_streaming_flag = False
    streaming_prefix = args.streaming_prefix if args.streaming_prefix else '"'"''"'"'
    line_processor = (log_utils.LineProcessor()
                      if args.line_processor is None else args.line_processor)

    out = []
    with open(args.log_path, '"'"'a'"'"', encoding='"'"'utf-8'"'"') as fout:
        with line_processor:
            while True:
                line = out_io.readline()
                if not line:
                    break
                # start_streaming_at logic in processor.process_line(line)
                if args.replace_crlf and line.endswith('"'"'\r\n'"'"'):
                    # Replace CRLF with LF to avoid ray logging to the same
                    # line due to separating lines with '"'"'\n'"'"'.
                    line = line[:-2] + '"'"'\n'"'"'
                if (args.skip_lines is not None and
                        any(skip in line for skip in args.skip_lines)):
                    continue
                if args.start_streaming_at in line:
                    start_streaming_flag = True
                if (args.end_streaming_at is not None and
                        args.end_streaming_at in line):
                    # Keep executing the loop, only stop streaming.
                    # E.g., this is used for `sky bench` to hide the
                    # redundant messages of `sky launch` while
                    # saving them in log files.
                    end_streaming_flag = True
                if (args.stream_logs and start_streaming_flag and
                        not end_streaming_flag):
                    print(streaming_prefix + line,
                          end='"'"''"'"',
                          file=out_stream,
                          flush=True)
                if args.log_path != '"'"'/dev/null'"'"':
                    fout.write(line)
                    fout.flush()
                line_processor.process_line(line)
                out.append(line)
    return '"'"''"'"'.join(out)

def process_subprocess_stream(proc, args: _ProcessingArgs) -> Tuple[str, str]:
    """Redirect the process'"'"'s filtered stdout/stderr to both stream and file"""
    if proc.stderr is not None:
        # Asyncio does not work as the output processing can be executed in a
        # different thread.
        # selectors could handle the multiplexing of stdout/stderr,
        # but it introduces buffering, so the output is no longer streamed.
        with multiprocessing.pool.ThreadPool(processes=1) as pool:
            err_args = copy.copy(args)
            err_args.line_processor = None
            stderr_fut = pool.apply_async(_handle_io_stream,
                                          args=(proc.stderr, sys.stderr,
                                                err_args))
            # Do not launch a thread for stdout as the rich.status does not
            # work in a thread, which is used in
            # log_utils.RayUpLineProcessor.
            stdout = _handle_io_stream(proc.stdout, sys.stdout, args)
            stderr = stderr_fut.get()
    else:
        stdout = _handle_io_stream(proc.stdout, sys.stdout, args)
        stderr = '"'"''"'"'
    return stdout, stderr

def run_with_log(
    cmd: Union[List[str], str],
    log_path: str,
    *,
    require_outputs: bool = False,
    stream_logs: bool = False,
    start_streaming_at: str = '"'"''"'"',
    end_streaming_at: Optional[str] = None,
    skip_lines: Optional[List[str]] = None,
    shell: bool = False,
    with_ray: bool = False,
    process_stream: bool = True,
    line_processor: Optional[log_utils.LineProcessor] = None,
    streaming_prefix: Optional[str] = None,
    **kwargs,
) -> Union[int, Tuple[int, str, str]]:
    """Runs a command and logs its output to a file.

    Args:
        cmd: The command to run.
        log_path: The path to the log file.
        stream_logs: Whether to stream the logs to stdout/stderr.
        require_outputs: Whether to return the stdout/stderr of the command.
        process_stream: Whether to post-process the stdout/stderr of the
            command, such as replacing or skipping lines on the fly. If
            enabled, lines are printed only when '"'"'\r'"'"' or '"'"'\n'"'"' is found.

    Returns the returncode, or (returncode, stdout, stderr) of the command
      when require_outputs is True. Note that stdout and stderr are already
      decoded.
    """
    assert process_stream or not require_outputs, (
        process_stream, require_outputs,
        '"'"'require_outputs should be False when process_stream is False'"'"')

    log_path = os.path.expanduser(log_path)
    dirname = os.path.dirname(log_path)
    os.makedirs(dirname, exist_ok=True)
    # Redirect stderr to stdout when using ray, to preserve the order of
    # stdout and stderr.
    stdout_arg = stderr_arg = None
    if process_stream:
        stdout_arg = subprocess.PIPE
        stderr_arg = subprocess.PIPE if not with_ray else subprocess.STDOUT
    with subprocess.Popen(cmd,
                          stdout=stdout_arg,
                          stderr=stderr_arg,
                          start_new_session=True,
                          shell=shell,
                          **kwargs) as proc:
        try:
            # The proc can be defunct if the python program is killed. Here we
            # open a new subprocess to gracefully kill the proc, SIGTERM
            # and then SIGKILL the process group.
            # Adapted from ray/dashboard/modules/job/job_manager.py#L154
            parent_pid = os.getpid()
            daemon_script = os.path.join(
                os.path.dirname(os.path.abspath(job_lib.__file__)),
                '"'"'subprocess_daemon.py'"'"')
            python_path = subprocess.check_output(
                constants.SKY_GET_PYTHON_PATH_CMD,
                shell=True,
                stderr=subprocess.DEVNULL,
                encoding='"'"'utf-8'"'"').strip()
            daemon_cmd = [
                python_path,
                daemon_script,
                '"'"'--parent-pid'"'"',
                str(parent_pid),
                '"'"'--proc-pid'"'"',
                str(proc.pid),
            ]

            # We do not need to set `start_new_session=True` here, as the
            # daemon script will detach itself from the parent process with
            # fork to avoid being killed by ray job. See the reason we
            # daemonize the process in `sky/skylet/subprocess_daemon.py`.
            subprocess.Popen(
                daemon_cmd,
                # Suppress output
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                # Disable input
                stdin=subprocess.DEVNULL,
            )
            stdout = '"'"''"'"'
            stderr = '"'"''"'"'

            if process_stream:
                if skip_lines is None:
                    skip_lines = []
                # Skip these lines caused by the `-i` option of bash. Failed to
                # find another way to turn off these two warnings.
                # https://stackoverflow.com/questions/13300764/how-to-tell-bash-not-to-issue-warnings-cannot-set-terminal-process-group-and # pylint: disable=line-too-long
                # `ssh -T -i -tt` still causes the problem.
                skip_lines += [
                    '"'"'bash: cannot set terminal process group'"'"',
                    '"'"'bash: no job control in this shell'"'"',
                ]
                # We need this even if the log_path is '"'"'/dev/null'"'"' to ensure the
                # progress bar is shown.
                # NOTE: Lines are printed only when '"'"'\r'"'"' or '"'"'\n'"'"' is found.
                args = _ProcessingArgs(
                    log_path=log_path,
                    stream_logs=stream_logs,
                    start_streaming_at=start_streaming_at,
                    end_streaming_at=end_streaming_at,
                    skip_lines=skip_lines,
                    line_processor=line_processor,
                    # Replace CRLF when the output is logged to driver by ray.
                    replace_crlf=with_ray,
                    streaming_prefix=streaming_prefix,
                )
                stdout, stderr = process_subprocess_stream(proc, args)
            proc.wait()
            if require_outputs:
                return proc.returncode, stdout, stderr
            return proc.returncode
        except KeyboardInterrupt:
            # Kill the subprocess directly, otherwise, the underlying
            # process will only be killed after the python program exits,
            # causing the stream handling stuck at `readline`.
            subprocess_utils.kill_children_processes()
            raise

def make_task_bash_script(codegen: str,
                          env_vars: Optional[Dict[str, str]] = None) -> str:
    # set -a is used for exporting all variables and functions to the
    # environment so that bash `user_script` can access `conda activate`.
    # Detail: #436.
    # Reference: https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html # pylint: disable=line-too-long
    # DEACTIVATE_SKY_REMOTE_PYTHON_ENV: Deactivate the SkyPilot runtime env, as
    # the ray cluster is started within the runtime env, which may cause the
    # user program to run in that env as well.
    # PYTHONUNBUFFERED is used to disable python output buffering.
    script = [
        textwrap.dedent(f"""\
            #!/bin/bash
            source ~/.bashrc
            set -a
            . $(conda info --base 2> /dev/null)/etc/profile.d/conda.sh > /dev/null 2>&1 || true
            set +a
            {constants.DEACTIVATE_SKY_REMOTE_PYTHON_ENV}
            export PYTHONUNBUFFERED=1
            cd {constants.SKY_REMOTE_WORKDIR}"""),
    ]
    if env_vars is not None:
        for k, v in env_vars.items():
            script.append(f'"'"'export {k}={shlex.quote(str(v))}'"'"')
    script += [
        codegen,
        '"'"''"'"',  # New line at EOF.
    ]
    script = '"'"'\n'"'"'.join(script)
    return script

def add_ray_env_vars(
        env_vars: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # Adds Ray-related environment variables.
    if env_vars is None:
        env_vars = {}
    ray_env_vars = [
        '"'"'CUDA_VISIBLE_DEVICES'"'"', '"'"'RAY_CLIENT_MODE'"'"', '"'"'RAY_JOB_ID'"'"',
        '"'"'RAY_RAYLET_PID'"'"', '"'"'OMP_NUM_THREADS'"'"'
    ]
    env_dict = dict(os.environ)
    for env_var in ray_env_vars:
        if env_var in env_dict:
            env_vars[env_var] = env_dict[env_var]
    return env_vars

def run_bash_command_with_log(bash_command: str,
                              log_path: str,
                              env_vars: Optional[Dict[str, str]] = None,
                              stream_logs: bool = False,
                              with_ray: bool = False):
    with tempfile.NamedTemporaryFile('"'"'w'"'"', prefix='"'"'sky_app_'"'"',
                                     delete=False) as fp:
        bash_command = make_task_bash_script(bash_command, env_vars=env_vars)
        fp.write(bash_command)
        fp.flush()
        script_path = fp.name

        # Need this `-i` option to make sure `source ~/.bashrc` work.
        inner_command = f'"'"'/bin/bash -i {script_path}'"'"'

        subprocess_cmd: Union[str, List[str]]
        subprocess_cmd = inner_command

        return run_with_log(
            subprocess_cmd,
            log_path,
            stream_logs=stream_logs,
            with_ray=with_ray,
            # Disable input to avoid blocking.
            stdin=subprocess.DEVNULL,
            shell=True)

run_bash_command_with_log = ray.remote(run_bash_command_with_log)
if hasattr(autostop_lib, '"'"'set_last_active_time_to_now'"'"'):
    autostop_lib.set_last_active_time_to_now()

job_lib.set_status('"${RAY_JOB_ID_ENV_VAR}"', job_lib.JobStatus.PENDING)
pg = ray_util.placement_group([{"CPU": 0.5, "T4": 0.5, "GPU": 0.5}], '"'"'STRICT_SPREAD'"'"')
plural = '"'"'s'"'"' if 1 > 1 else '"'"''"'"'
node_str = f'"'"'1 node{plural}'"'"'

# We have this `INFO: Tip:` message only for backward
# compatibility, because if a cluster has the old SkyPilot version,
# it relies on this message to start log streaming.
# This message will be skipped for new clusters, because we use
# start_streaming_at for the `Waiting for task resources on`
# message.
# TODO: Remove this message in v0.9.0.
message = ('"'"'\x1b[2m├── \x1b[0m\x1b[2mINFO: '"'"'
           '"'"'Tip: use Ctrl-C to exit log streaming, not kill '"'"'
           '"'"'the job.\x1b[0m\n'"'"')
message += ('"'"'\x1b[2m├── \x1b[0m\x1b[2m'"'"'
            '"'"'Waiting for task resources on '"'"'
           f'"'"'{node_str}.\x1b[0m'"'"')
print(message, flush=True)
# FIXME: This will print the error message from autoscaler if
# it is waiting for other task to finish. We should hide the
# error message.
ray.get(pg.ready())
print('"'"'\x1b[2m└── \x1b[0mJob started. Streaming logs... \x1b[2m(Ctrl-C to exit log streaming; job will not be killed)\x1b[0m'"'"', flush=True)

job_lib.set_job_started('"${RAY_JOB_ID_ENV_VAR}"')
job_lib.scheduler.schedule_step()
@ray.remote
def check_ip():
    return ray.util.get_node_ip_address()
gang_scheduling_id_to_ip = ray.get([
    check_ip.options(
            num_cpus=0.5,
            scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(
                placement_group=pg,
                placement_group_bundle_index=i
            )).remote()
    for i in range(pg.bundle_count)
])

cluster_ips_to_node_id = {ip: i for i, ip in enumerate(['"'${INTERNAL_IPS_1}'"', '"'${INTERNAL_IPS_2}'"'])}
job_ip_rank_list = sorted(gang_scheduling_id_to_ip, key=cluster_ips_to_node_id.get)
job_ip_rank_map = {ip: i for i, ip in enumerate(job_ip_rank_list)}
job_ip_list_str = '"'"'\n'"'"'.join(job_ip_rank_list)

sky_env_vars_dict = {}
sky_env_vars_dict['"'"'SKYPILOT_NODE_IPS'"'"'] = job_ip_list_str
# Backward compatibility: Environment starting with `SKY_` is
# deprecated. Remove it in v0.9.0.
sky_env_vars_dict['"'"'SKY_NODE_IPS'"'"'] = job_ip_list_str
sky_env_vars_dict['"'"'SKYPILOT_NUM_NODES'"'"'] = len(job_ip_rank_list)

sky_env_vars_dict['"'"'SKYPILOT_TASK_ID'"'"'] = '"'"'sky-2024-10-29-10-09-37-475191_multi-echo-test_1'"'"'
sky_env_vars_dict['"'"'SKYPILOT_CLUSTER_INFO'"'"'] = '"'"'{"cluster_name": "multi-echo-test", "cloud": "GCP", "region": "us-central1", "zone": "us-central1-a"}'"'"'
script = '"'"'echo 0; sleep 5'"'"'
if run_fn is not None:
    script = run_fn(0, gang_scheduling_id_to_ip)


if script is not None:
    sky_env_vars_dict['"'"'SKYPILOT_NUM_GPUS_PER_NODE'"'"'] = 1
    # Backward compatibility: Environment starting with `SKY_` is
    # deprecated. Remove it in v0.9.0.
    sky_env_vars_dict['"'"'SKY_NUM_GPUS_PER_NODE'"'"'] = 1

    ip = gang_scheduling_id_to_ip[0]
    rank = job_ip_rank_map[ip]

    if len(cluster_ips_to_node_id) == 1: # Single-node task on single-node cluster
        name_str = '"'"'None,'"'"' if None != None else '"'"'task,'"'"'
        log_path = os.path.expanduser(os.path.join('"'""${RAY_TASK_LOG_DIR}/tasks""'"', '"'"'run.log'"'"'))
    else: # Single-node or multi-node task on multi-node cluster
        idx_in_cluster = cluster_ips_to_node_id[ip]
        if cluster_ips_to_node_id[ip] == 0:
            node_name = '"'"'head'"'"'
        else:
            node_name = f'"'"'worker{idx_in_cluster}'"'"'
        name_str = f'"'"'{node_name}, rank={rank},'"'"'
        log_path = os.path.expanduser(os.path.join('"'""${RAY_TASK_LOG_DIR}/tasks""'"', f'"'"'{rank}-{node_name}.log'"'"'))
    sky_env_vars_dict['"'"'SKYPILOT_NODE_RANK'"'"'] = rank
    # Backward compatibility: Environment starting with `SKY_` is
    # deprecated. Remove it in v0.9.0.
    sky_env_vars_dict['"'"'SKY_NODE_RANK'"'"'] = rank

    sky_env_vars_dict['"'"'SKYPILOT_INTERNAL_JOB_ID'"'"'] = '"${RAY_JOB_ID_ENV_VAR}"'
    # Backward compatibility: Environment starting with `SKY_` is
    # deprecated. Remove it in v0.9.0.
    sky_env_vars_dict['"'"'SKY_INTERNAL_JOB_ID'"'"'] = '"${RAY_JOB_ID_ENV_VAR}"'

    futures.append(run_bash_command_with_log \
            .options(name=name_str, num_cpus=0.5, resources={"T4": 0.5}, num_gpus=0.5, scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)) \
            .remote(
                script,
                log_path,
                env_vars=sky_env_vars_dict,
                stream_logs=True,
                with_ray=True,
            ))
returncodes = get_or_fail(futures, pg)
if sum(returncodes) != 0:
    job_lib.set_status('"${RAY_JOB_ID_ENV_VAR}"', job_lib.JobStatus.FAILED)
    # Schedule the next pending job immediately to make the job
    # scheduling more efficient.
    job_lib.scheduler.schedule_step()
    # This waits for all streaming logs to finish.
    time.sleep(0.5)
    reason = '"'"''"'"'
    # 139 is the return code of SIGSEGV, i.e. Segmentation Fault.
    if any(r == 139 for r in returncodes):
        reason = '"'"'(likely due to Segmentation Fault)'"'"'
    print('"'"'ERROR: \x1b[31mJob '"${RAY_JOB_ID_ENV_VAR}"' failed with '"'"'
          '"'"'return code list:\x1b[0m'"'"',
          returncodes,
          reason,
          flush=True)
    # Need this to set the job status in ray job to be FAILED.
    sys.exit(1)
else:
    job_lib.set_status('"${RAY_JOB_ID_ENV_VAR}"', job_lib.JobStatus.SUCCEEDED)
    # Schedule the next pending job immediately to make the job
    # scheduling more efficient.
    job_lib.scheduler.schedule_step()
    # This waits for all streaming logs to finish.
    time.sleep(0.5)
' > ~/.sky/sky_app/sky_job_$RAY_JOB_ID_ENV_VAR; } && /home/gcpuser/skypilot-runtime/bin/python -u -c 'import os;import getpass;from sky.skylet import job_lib, log_lib, constants;job_lib.scheduler.queue('"${RAY_JOB_ID_ENV_VAR}"','"'"'cd ~/sky_workdir && /home/gcpuser/skypilot-runtime/bin/python /home/gcpuser/skypilot-runtime/bin/ray job submit --address=http://127.0.0.1:8266 --submission-id '"${RAY_JOB_ID_ENV_VAR}"'-$(whoami) --no-wait "/home/gcpuser/skypilot-runtime/bin/python -u ~/.sky/sky_app/sky_job_'"${RAY_JOB_ID_ENV_VAR}"' > '"${RAY_TASK_LOG_DIR}"'/run.log 2> /dev/null"'"'"')'
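
The rank assignment in the generated driver above can be hard to follow through the shell quoting. The sketch below isolates that step: the IPs returned by the placement group are sorted by their position in the cluster IP list, and the resulting order gives each node its rank. The IP addresses and the two-bundle placement are made-up values for illustration (the dump above uses a single bundle), and `node_name` is a hypothetical helper that mirrors the `head` / `worker{idx}` naming in the driver.

```python
from typing import Dict, List

# Made-up cluster IPs for illustration: head node first, then the worker.
cluster_ips: List[str] = ['10.0.0.2', '10.0.0.3']
# Made-up result of the per-bundle `check_ip` calls; the order follows the
# placement group bundles, not the cluster order.
gang_scheduling_id_to_ip: List[str] = ['10.0.0.3', '10.0.0.2']

# Same three steps as in the driver: index the cluster IPs, sort the job's
# IPs by that index, then number them to get the rank.
cluster_ips_to_node_id: Dict[str, int] = {
    ip: i for i, ip in enumerate(cluster_ips)
}
job_ip_rank_list = sorted(gang_scheduling_id_to_ip,
                          key=cluster_ips_to_node_id.get)
job_ip_rank_map = {ip: i for i, ip in enumerate(job_ip_rank_list)}


def node_name(ip: str) -> str:
    """Hypothetical helper mirroring the head/worker naming in the driver."""
    idx_in_cluster = cluster_ips_to_node_id[ip]
    return 'head' if idx_in_cluster == 0 else f'worker{idx_in_cluster}'


for ip in gang_scheduling_id_to_ip:
    print(f'{node_name(ip)}, rank={job_ip_rank_map[ip]}')
```

The rank therefore depends only on the cluster's IP order, not on which bundle Ray happened to schedule first.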

Collaborator

@Michaelvll Michaelvll left a comment

Thanks a lot for fixing this @cblmemo! This fixes an important issue a user was facing. LGTM!

Comment on lines 555 to 557
for job_detail in job_detail_lists:
    if job_detail.submission_id in ray_job_ids_set:
        job_details[job_detail.submission_id] = job_detail
Collaborator

Just curious: why don't we keep only the jobs that are in ray_job_ids_set? It's quite minor, but it may save some memory : )

Collaborator Author

Good point! This came from refactoring back to querying the job list instead of querying each job's status independently. Changed it back now!
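
For readers following this thread, here is a minimal sketch of the filtering being discussed: list all jobs once, then keep only the entries whose submission ID is in the set of interest. `JobDetail` is a hypothetical stand-in for whatever objects the job listing returns, and `filter_job_details` is an illustrative name, not a function from the PR; only `submission_id`, `ray_job_ids_set`, and `job_details` come from the snippet above.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass
class JobDetail:
    """Hypothetical stand-in for one entry of the job listing."""
    submission_id: str
    status: str


def filter_job_details(job_detail_list: Iterable[JobDetail],
                       ray_job_ids_set: Set[str]) -> Dict[str, JobDetail]:
    """Keep only the listed jobs whose submission ID is in the set.

    Membership tests against a set are O(1) on average, so this is a
    single pass over the listing.
    """
    return {
        job_detail.submission_id: job_detail
        for job_detail in job_detail_list
        if job_detail.submission_id in ray_job_ids_set
    }


# Made-up listing: two jobs we care about and one unrelated job.
listed = [
    JobDetail('1-gcpuser', 'RUNNING'),
    JobDetail('2-gcpuser', 'PENDING'),
    JobDetail('99-other', 'SUCCEEDED'),
]
wanted = {'1-gcpuser', '2-gcpuser', '3-gcpuser'}

job_details = filter_job_details(listed, wanted)
print(sorted(job_details))
```

IDs that are in the set but missing from the listing (here `3-gcpuser`) simply do not appear in the result, so a caller has to decide how to treat a job that is absent.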

Collaborator Author

Reran the multi-echo test and still found no errors :)) Will merge after all smoke tests pass!

@cblmemo
Collaborator Author

cblmemo commented Oct 30, 2024

All smoke tests besides #4211 (which also failed on master) passed! Merging now.

@cblmemo cblmemo added this pull request to the merge queue Oct 30, 2024
Merged via the queue into master with commit 9f055f4 Oct 30, 2024
20 checks passed
@cblmemo cblmemo deleted the fix-job-race-condition branch October 30, 2024 08:40
Development

Successfully merging this pull request may close these issues.

[Core] A potential critical race condition for job scheduling within a cluster
3 participants