[Core] Fix job race condition. #4193
Update: It seems that even after the fix, the multi-echo example still hits a random FAILED status (job 19 below). Will investigate more.
$ sky queue
Fetching and parsing job queue...
Job queue of cluster test-multi-echo-memory-24cf
ID NAME SUBMITTED STARTED DURATION RESOURCES STATUS LOG
32 - 35 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-04-00-765241
31 - 35 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-04-00-731798
30 - 35 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-04-00-889227
29 - 35 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-04-00-833135
28 - 35 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-04-00-763923
27 - 35 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-04-00-235621
26 - 36 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-04-00-109088
25 - 37 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-59-018059
24 - 41 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-54-950481
23 - 41 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-54-881690
22 - 41 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-54-615505
21 - 41 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-54-639262
20 - 41 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-54-464287
19 - 41 secs ago - - 1x[T4:0.5] FAILED ~/sky_logs/sky-2024-10-26-14-03-54-275645
18 - 42 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-54-088479
17 - 43 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-52-866934
16 - 48 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-47-254966
15 - 48 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-47-378664
14 - 48 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-47-254666
13 - 48 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-47-250899
12 - 48 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-47-268683
11 - 48 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-47-256435
10 - 49 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-47-205305
9 - 49 secs ago a few secs ago 1s 1x[T4:0.5] RUNNING ~/sky_logs/sky-2024-10-26-14-03-47-172039
8 - 1 min ago a few secs ago 5s 1x[T4:0.5] RUNNING ~/sky_logs/sky-2024-10-26-14-03-35-218274
7 - 1 min ago a few secs ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-26-14-03-35-186347
6 - 1 min ago 13 secs ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-26-14-03-35-197969
5 - 1 min ago 17 secs ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-26-14-03-35-186602
4 - 1 min ago 21 secs ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-26-14-03-35-208915
3 - 1 min ago 25 secs ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-26-14-03-35-218412
2 - 1 min ago 28 secs ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-26-14-03-35-179637
1 - 1 min ago 37 secs ago 12s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-26-14-03-35-182412
sky/skylet/job_lib.py
if ray_job_id in job_details:
    ray_status = job_details[ray_job_id].status
    status = _RAY_TO_JOB_STATUS_MAP[ray_status]
if job_id in pending_jobs:
This change is not effective, since we are still fetching pending_jobs outside the lock?
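To illustrate the concern raised here, the following is a minimal sketch and not the actual job_lib.py code. `JobTable`, `reconcile_with_snapshot`, and `reconcile_locked` are hypothetical names, and `threading.Lock` stands in for whatever lock the real job table uses. It shows how a pending set read before the lock is taken can be stale by the time it is used, and how reading it under the same lock avoids that.

```python
import threading
from typing import Dict, Iterable, Set


class JobTable:
    """Hypothetical in-memory stand-in for the job table (assumption)."""

    def __init__(self, pending: Iterable[int]) -> None:
        self.lock = threading.Lock()
        self.pending: Set[int] = set(pending)
        self.status: Dict[int, str] = {job_id: 'PENDING' for job_id in self.pending}

    def start(self, job_id: int) -> None:
        # A scheduler moves the job out of the pending set.
        with self.lock:
            self.pending.discard(job_id)
            self.status[job_id] = 'RUNNING'


def reconcile_with_snapshot(table: JobTable, job_id: int, snapshot: Set[int]) -> None:
    # Racy: `snapshot` was read before the lock was taken, so it may be stale.
    with table.lock:
        if job_id in snapshot:
            table.status[job_id] = 'PENDING'


def reconcile_locked(table: JobTable, job_id: int) -> None:
    # Safe: the pending set is read and acted upon under the same lock.
    with table.lock:
        if job_id in table.pending:
            table.status[job_id] = 'PENDING'
```

If a job starts between the snapshot and the lock acquisition, the first variant overwrites its RUNNING status with PENDING; the second variant leaves it untouched.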
Oh, that is a good point. Refactored. PTAL; testing now!
Tested with 192 jobs on the multi-echo example, and all of them passed. It should be ready!
$ sky queue
Fetching and parsing job queue...
Job queue of cluster multi-echo
ID NAME SUBMITTED STARTED DURATION RESOURCES STATUS LOG
192 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-13-02-167356
191 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-13-00-986457
190 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-52-456049
189 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-51-830444
188 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-50-965526
187 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-50-572829
186 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-49-701496
185 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-48-968773
184 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-45-636131
183 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-44-594621
182 - 17 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-33-089645
181 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-32-741923
180 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-31-247856
179 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-31-692901
178 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-31-076652
177 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-30-326527
176 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-28-969599
175 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-28-909274
174 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-21-499441
173 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-21-558663
172 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-18-179116
171 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-17-903096
170 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-17-408714
169 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-16-779542
168 - 17 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-16-124634
167 - 17 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-13-391503
166 - 18 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-06-601122
165 - 18 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-05-591666
164 - 18 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-04-398142
163 - 18 mins ago 6 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-04-621360
162 - 18 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-03-139993
161 - 18 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-02-634026
160 - 18 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-12-00-631111
159 - 18 mins ago 6 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-57-936300
158 - 18 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-53-502853
157 - 18 mins ago 6 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-52-560115
156 - 18 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-49-118061
155 - 18 mins ago 6 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-49-353285
154 - 18 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-48-829075
153 - 18 mins ago 7 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-48-014756
152 - 18 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-45-442742
151 - 18 mins ago 7 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-43-412023
150 - 18 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-42-492645
149 - 18 mins ago 7 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-40-542518
148 - 18 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-31-721143
147 - 18 mins ago 7 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-31-234180
146 - 18 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-31-049774
145 - 18 mins ago 7 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-31-284445
144 - 18 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-30-660781
143 - 18 mins ago 7 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-27-664178
142 - 18 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-26-530728
141 - 18 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-26-081697
140 - 18 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-14-372342
139 - 18 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-12-763917
138 - 18 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-12-709629
137 - 18 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-12-202589
136 - 18 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-11-730009
135 - 18 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-11-104493
134 - 18 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-10-808894
133 - 18 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-10-347721
132 - 19 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-02-520762
131 - 19 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-11-00-541923
130 - 19 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-56-579451
129 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-56-292527
128 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-56-156926
127 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-55-453671
126 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-55-828501
125 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-55-392873
124 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-53-626311
123 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-50-869824
122 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-40-233392
121 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-40-054055
120 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-40-209306
119 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-39-318618
118 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-38-915442
117 - 19 mins ago 9 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-38-874179
116 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-37-653385
115 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-36-342203
114 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-22-550121
113 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-22-434417
112 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-22-136161
111 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-21-931887
110 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-22-186426
109 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-21-577886
108 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-21-326984
107 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-20-772396
106 - 19 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-12-924320
105 - 19 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-06-386910
104 - 19 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-06-309217
103 - 19 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-06-617857
102 - 19 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-06-581169
101 - 20 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-05-576467
100 - 20 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-06-164092
99 - 20 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-05-870981
98 - 20 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-10-03-301800
97 - 20 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-50-863627
96 - 20 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-50-254479
95 - 20 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-50-055437
94 - 20 mins ago 11 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-49-907777
93 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-49-224937
92 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-49-353230
91 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-48-924397
90 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-48-288230
89 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-40-337984
88 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-35-168591
87 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-34-016081
86 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-34-057309
85 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-33-962353
84 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-33-874441
83 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-33-180690
82 - 20 mins ago 12 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-33-160709
81 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-31-798447
80 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-24-754484
79 - 20 mins ago 13 mins ago 8s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-24-816213
78 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-24-525971
77 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-24-681552
76 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-24-641127
75 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-24-128906
74 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-24-250361
73 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-22-942971
72 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-14-993527
71 - 20 mins ago 13 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-14-999093
70 - 20 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-14-987326
69 - 20 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-14-988361
68 - 20 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-14-986726
67 - 20 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-14-959532
66 - 20 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-14-981732
65 - 20 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-14-960040
64 - 21 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-06-010258
63 - 21 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-09-06-103875
62 - 21 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-58-383485
61 - 21 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-53-351615
60 - 21 mins ago 14 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-53-173593
59 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-52-936163
58 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-52-912080
57 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-52-649761
56 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-52-391837
55 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-51-249945
54 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-50-990618
53 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-44-414374
52 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-39-115994
51 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-38-209638
50 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-38-530270
49 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-38-193650
48 - 21 mins ago 15 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-37-821802
47 - 21 mins ago 16 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-37-277792
46 - 21 mins ago 16 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-37-367870
45 - 21 mins ago 16 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-37-176952
44 - 21 mins ago 16 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-31-079026
43 - 21 mins ago 16 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-31-525364
42 - 21 mins ago 16 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-30-559086
41 - 21 mins ago 16 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-30-141740
40 - 21 mins ago 16 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-24-093076
39 - 21 mins ago 16 mins ago 8s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-24-049486
38 - 21 mins ago 16 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-24-046632
37 - 21 mins ago 16 mins ago 11s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-24-043714
36 - 21 mins ago 17 mins ago 13s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-24-040252
35 - 21 mins ago 17 mins ago 28s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-24-047743
34 - 21 mins ago 17 mins ago 12s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-24-015548
33 - 21 mins ago 18 mins ago 11s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-08-23-976819
32 - 22 mins ago 18 mins ago 41s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-17-658069
31 - 22 mins ago 18 mins ago 11s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-17-638542
30 - 22 mins ago 18 mins ago 11s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-17-444940
29 - 22 mins ago 19 mins ago 23s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-17-046370
28 - 22 mins ago 19 mins ago 9s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-16-253004
27 - 22 mins ago 19 mins ago 10s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-11-055884
26 - 22 mins ago 19 mins ago 22s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-10-865600
25 - 22 mins ago 20 mins ago 10s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-10-820502
24 - 22 mins ago 20 mins ago 24s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-10-326193
23 - 22 mins ago 20 mins ago 8s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-10-090867
22 - 22 mins ago 20 mins ago 23s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-09-892731
21 - 22 mins ago 20 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-10-071996
20 - 22 mins ago 21 mins ago 8s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-09-257016
19 - 23 mins ago 21 mins ago 8s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-03-617943
18 - 23 mins ago 21 mins ago 43s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-03-333942
17 - 23 mins ago 21 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-07-03-139840
16 - 23 mins ago 21 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-57-330274
15 - 23 mins ago 21 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-57-358138
14 - 23 mins ago 21 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-57-294934
13 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-57-214243
12 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-57-195159
11 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-57-086991
10 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-57-247200
9 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-57-006863
8 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-45-521903
7 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-45-402809
6 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-45-395510
5 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-45-395694
4 - 23 mins ago 22 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-45-391061
3 - 23 mins ago 22 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-45-396063
2 - 23 mins ago 22 mins ago 9s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-45-392166
1 - 23 mins ago 23 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-2024-10-28-13-06-45-391832
Thanks a lot for fixing this @cblmemo! Left a comment about the performance.
For testing this PR, is it possible to run multi_echo many times and find a way to increase the concurrency there, so that we fully test this? Also, it would be good to run all the smoke tests to avoid any regression : )
sky/skylet/job_lib.py
statuses = []
for job_id, status in zip(job_ids, job_statuses):
for job_id in job_ids:
Should we update the ray==2.4.0 in the docstr above?
I updated it to ray >= 2.4.0, as IIUC this is a newly added feature in ray 2.4.0?
Good point! I updated the max concurrency to 32; let's see the performance. For smoke tests, I think Romil is running the smoke tests for the v0.7 release. Do you think it is possible to run this PR simultaneously? cc @romilbhardwaj
Oh, I think when we increase the concurrency, SSH fails first:
E 10-28 17:18:48 subprocess_utils.py:85] mux_client_request_session: session request failed: Session open refused by peer
E 10-28 17:18:48 subprocess_utils.py:85] kex_exchange_identification: read: Connection reset by peer
E 10-28 17:18:48 subprocess_utils.py:85]
I suppose this can be fixed by disabling the SSH control master. Or can we just simulate the concurrent job submission on the remote cluster, i.e., generate a lot of task apps on the remote cluster and simultaneously run the command that the job submission script executes on the remote?
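One way to approximate the second idea is sketched below. This is an illustration and not the script used for this PR: `run_concurrently` is a hypothetical helper, and the command passed to it is a placeholder for whatever the job submission script actually runs on the remote. A barrier holds all worker threads until every one is ready, so the commands launch at nearly the same moment without going through SSH.

```python
import subprocess
import sys
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def run_concurrently(cmd: Sequence[str], n: int) -> List[int]:
    """Run `cmd` in `n` processes released together; return their exit codes."""
    barrier = threading.Barrier(n)

    def worker(_: int) -> int:
        barrier.wait()  # hold every thread until all n are ready
        return subprocess.run(cmd, capture_output=True, check=False).returncode

    # One thread per submission, so that all n can reach the barrier.
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(worker, range(n)))


if __name__ == '__main__':
    # Placeholder command; substitute the real submission command here.
    codes = run_concurrently([sys.executable, '-c', 'print("submitted")'], 16)
    print(f'{codes.count(0)}/{len(codes)} submissions exited with 0')
```

Running this directly on the head node would stress the job table without opening any SSH sessions, which keeps the SSH failures above out of the picture.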
Just tested with 128 jobs concurrently submitted to the cluster, and it works!
$ sky queue
Fetching and parsing job queue...
Job queue of cluster t-multi-echo
ID NAME SUBMITTED STARTED DURATION RESOURCES STATUS LOG
130 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-458796099
129 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-535835570
128 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-815629051
127 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-867653101
126 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-394400685
125 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-389785135
124 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-018900992
123 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-755551186
122 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-385794973
121 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-969498972
120 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-101400629
119 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-293284469
118 sky-cmd 12 mins ago 2 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-289634392
117 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-466862257
116 sky-cmd 12 mins ago 2 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-758000335
115 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-438764491
114 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-276099579
113 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-017898999
112 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-164411403
111 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-860967090
110 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-170806416
109 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-287758911
108 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-979868870
107 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-840845623
106 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-521760666
105 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-000261083
104 sky-cmd 12 mins ago 3 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-081235373
103 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-760515548
102 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-221834421
101 sky-cmd 12 mins ago 3 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-818800957
100 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-967526310
99 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-130284141
98 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-169090472
97 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-078802642
96 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-143847823
95 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-809577207
94 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-769950108
93 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-359808016
92 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-158964465
91 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-960828980
90 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-769797168
89 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-015588439
88 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-508783014
87 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-979128860
86 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-861623317
85 sky-cmd 12 mins ago 4 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-041982627
84 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-773851310
83 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-458920059
82 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-789945828
81 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-976351809
80 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-429309100
79 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-923770202
78 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-851627131
77 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-371620506
76 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-822339383
75 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-295110263
74 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-767133123
73 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-756071274
72 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-855964327
71 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-838838861
70 sky-cmd 12 mins ago 5 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-436820357
69 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-771270967
68 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-229042113
67 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-843375038
66 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-297864600
65 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-796136091
64 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-920939611
63 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-040181697
62 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-189478444
61 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-415815623
60 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-827107651
59 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-363755437
58 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-280384024
57 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-847906293
56 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-207925089
55 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-013562788
54 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-280776183
53 sky-cmd 12 mins ago 6 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-223811559
52 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-433794013
51 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-751455325
50 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-086873008
49 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-794905572
48 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-167085868
47 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-864924090
46 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-446836871
45 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-310185851
44 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-310790162
43 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-908960204
42 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-899625716
41 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-361800334
40 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-888880276
39 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-796611576
38 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-202954705
37 sky-cmd 12 mins ago 7 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-753260322
36 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-824394249
35 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-933808239
34 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-803870670
33 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-033906207
32 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-757421981
31 sky-cmd 12 mins ago 10 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-516785396
30 sky-cmd 12 mins ago 8 mins ago 7s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-034736416
29 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-911820007
28 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-159799820
27 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-759455241
26 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-932884279
25 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-083763376
24 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-524548302
23 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-079312861
22 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-881186127
21 sky-cmd 12 mins ago 8 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-096913143
20 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-132760102
19 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-139847681
18 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-931096723
17 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-970157418
16 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-357840684
15 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-144199747
14 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-913760721
13 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-872838430
12 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-922907060
11 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-146186793
10 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-768559702
9 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-878908464
8 sky-cmd 12 mins ago 10 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-773364119
7 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-904813969
6 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238198-214890160
5 sky-cmd 12 mins ago 10 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-793735664
4 sky-cmd 12 mins ago 9 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-988767355
3 sky-cmd 12 mins ago 10 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730238197-766337147
2 sky-cmd 18 mins ago 17 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730237902-602957752
1 sky-cmd 19 mins ago 19 mins ago 2s 2x[T4:1] SUCCEEDED ~/sky_logs/sky-2024-10-29-14-34-54-678435

The script I use (the internal IPs need to be set manually) is as follows:

# run_all.sh
for i in {1..128}; do
  bash ./run.sh &
done
wait # Waits for all background processes to finish

# run.sh
TIMESTAMP=$(date +%s-%N)
RAY_TASK_LOG_DIR=~/sky_logs/sky-$TIMESTAMP
INTERNAL_IPS_1=10.128.0.123
INTERNAL_IPS_2=10.128.0.124
RAY_JOB_ID_ENV_VAR=$(/home/gcpuser/skypilot-runtime/bin/python -u -c 'import os;import getpass;from sky.skylet import job_lib, log_lib, constants;job_id = job_lib.add_job('"'"'sky-cmd'"'"', '"'"'memory'"'"', '"'"'sky-'"${TIMESTAMP}""'"', '"'"'1x[T4:0.5]'"'"');print("Job ID: " + str(job_id), flush=True)' | grep -oP 'Job ID: \K[0-9]+')
echo Submitting job $RAY_JOB_ID_ENV_VAR
cd ~/sky_workdir && mkdir -p $RAY_TASK_LOG_DIR && touch $RAY_TASK_LOG_DIR/run.log && { echo 'import getpass
import hashlib
import io
import os
import pathlib
import selectors
import shlex
import subprocess
import sys
import tempfile
import textwrap
import time
from typing import Dict, List, Optional, Tuple, Union
import ray
import ray.util as ray_util
from sky.skylet import autostop_lib
from sky.skylet import constants
from sky.skylet import job_lib
from sky.utils import log_utils
SKY_REMOTE_WORKDIR = '"'"'~/sky_workdir'"'"'
kwargs = dict()
# Only set the `_temp_dir` to SkyPilot'"'"'s ray cluster directory when
# the directory exists for backward compatibility for the VM
# launched before #1790.
if os.path.exists('"'"'/tmp/ray_skypilot'"'"'):
kwargs['"'"'_temp_dir'"'"'] = '"'"'/tmp/ray_skypilot'"'"'
ray.init(
address='"'"'auto'"'"',
namespace='"'"'__sky__'"${RAY_JOB_ID_ENV_VAR}"'__'"'"',
log_to_driver=True,
**kwargs
)
def get_or_fail(futures, pg) -> List[int]:
    """Wait for tasks, if any fails, cancel all unready."""
    returncodes = [1] * len(futures)
    # Wait for 1 task to be ready.
    ready = []
    # Keep invoking ray.wait if ready is empty. This is because
    # ray.wait with timeout=None will only wait for 10**6 seconds,
    # which will cause tasks running for more than 12 days to return
    # before becoming ready.
    # (Such tasks are common in serving jobs.)
    # Reference: https://github.com/ray-project/ray/blob/ray-2.9.3/python/ray/_private/worker.py#L2845-L2846
    while not ready:
        ready, unready = ray.wait(futures)
    idx = futures.index(ready[0])
    returncodes[idx] = ray.get(ready[0])
    while unready:
        if returncodes[idx] != 0:
            for task in unready:
                # ray.cancel without force fails to kill tasks.
                # We use force=True to kill unready tasks.
                ray.cancel(task, force=True)
                # Use SIGKILL=128+9 to indicate the task is forcibly
                # killed.
                idx = futures.index(task)
                returncodes[idx] = 137
            break
        ready, unready = ray.wait(unready)
        idx = futures.index(ready[0])
        returncodes[idx] = ray.get(ready[0])
    # Remove the placement group after all tasks are done, so that
    # the next job can be scheduled on the released resources
    # immediately.
    ray_util.remove_placement_group(pg)
    sys.stdout.flush()
    return returncodes
run_fn = None
futures = []
class _ProcessingArgs:
"""Arguments for processing logs."""
def __init__(self,
log_path: str,
stream_logs: bool,
start_streaming_at: str = '"'"''"'"',
end_streaming_at: Optional[str] = None,
skip_lines: Optional[List[str]] = None,
replace_crlf: bool = False,
line_processor: Optional[log_utils.LineProcessor] = None,
streaming_prefix: Optional[str] = None) -> None:
self.log_path = log_path
self.stream_logs = stream_logs
self.start_streaming_at = start_streaming_at
self.end_streaming_at = end_streaming_at
self.skip_lines = skip_lines
self.replace_crlf = replace_crlf
self.line_processor = line_processor
self.streaming_prefix = streaming_prefix
def _handle_io_stream(io_stream, out_stream, args: _ProcessingArgs):
"""Process the stream of a process."""
out_io = io.TextIOWrapper(io_stream,
encoding='"'"'utf-8'"'"',
newline='"'"''"'"',
errors='"'"'replace'"'"',
write_through=True)
start_streaming_flag = False
end_streaming_flag = False
streaming_prefix = args.streaming_prefix if args.streaming_prefix else '"'"''"'"'
line_processor = (log_utils.LineProcessor()
if args.line_processor is None else args.line_processor)
out = []
with open(args.log_path, '"'"'a'"'"', encoding='"'"'utf-8'"'"') as fout:
with line_processor:
while True:
line = out_io.readline()
if not line:
break
# start_streaming_at logic in processor.process_line(line)
if args.replace_crlf and line.endswith('"'"'\r\n'"'"'):
# Replace CRLF with LF to avoid ray logging to the same
# line due to separating lines with '"'"'\n'"'"'.
line = line[:-2] + '"'"'\n'"'"'
if (args.skip_lines is not None and
any(skip in line for skip in args.skip_lines)):
continue
if args.start_streaming_at in line:
start_streaming_flag = True
if (args.end_streaming_at is not None and
args.end_streaming_at in line):
# Keep executing the loop, only stop streaming.
# E.g., this is used for `sky bench` to hide the
# redundant messages of `sky launch` while
# saving them in log files.
end_streaming_flag = True
if (args.stream_logs and start_streaming_flag and
not end_streaming_flag):
print(streaming_prefix + line,
end='"'"''"'"',
file=out_stream,
flush=True)
if args.log_path != '"'"'/dev/null'"'"':
fout.write(line)
fout.flush()
line_processor.process_line(line)
out.append(line)
return '"'"''"'"'.join(out)
def process_subprocess_stream(proc, args: _ProcessingArgs) -> Tuple[str, str]:
"""Redirect the process'"'"'s filtered stdout/stderr to both stream and file"""
if proc.stderr is not None:
# Asyncio does not work as the output processing can be executed in a
# different thread.
# selectors is possible to handle the multiplexing of stdout/stderr,
# but it introduces buffering making the output not streaming.
with multiprocessing.pool.ThreadPool(processes=1) as pool:
err_args = copy.copy(args)
err_args.line_processor = None
stderr_fut = pool.apply_async(_handle_io_stream,
args=(proc.stderr, sys.stderr,
err_args))
# Do not launch a thread for stdout as the rich.status does not
# work in a thread, which is used in
# log_utils.RayUpLineProcessor.
stdout = _handle_io_stream(proc.stdout, sys.stdout, args)
stderr = stderr_fut.get()
else:
stdout = _handle_io_stream(proc.stdout, sys.stdout, args)
stderr = '"'"''"'"'
return stdout, stderr
def run_with_log(
cmd: Union[List[str], str],
log_path: str,
*,
require_outputs: bool = False,
stream_logs: bool = False,
start_streaming_at: str = '"'"''"'"',
end_streaming_at: Optional[str] = None,
skip_lines: Optional[List[str]] = None,
shell: bool = False,
with_ray: bool = False,
process_stream: bool = True,
line_processor: Optional[log_utils.LineProcessor] = None,
streaming_prefix: Optional[str] = None,
**kwargs,
) -> Union[int, Tuple[int, str, str]]:
"""Runs a command and logs its output to a file.
Args:
cmd: The command to run.
log_path: The path to the log file.
stream_logs: Whether to stream the logs to stdout/stderr.
require_outputs: Whether to return the stdout/stderr of the command.
process_stream: Whether to post-process the stdout/stderr of the
command, such as replacing or skipping lines on the fly. If
enabled, lines are printed only when '"'"'\r'"'"' or '"'"'\n'"'"' is found.
Returns the returncode or returncode, stdout and stderr of the command.
Note that the stdout and stderr is already decoded.
"""
assert process_stream or not require_outputs, (
process_stream, require_outputs,
'"'"'require_outputs should be False when process_stream is False'"'"')
log_path = os.path.expanduser(log_path)
dirname = os.path.dirname(log_path)
os.makedirs(dirname, exist_ok=True)
# Redirect stderr to stdout when using ray, to preserve the order of
# stdout and stderr.
stdout_arg = stderr_arg = None
if process_stream:
stdout_arg = subprocess.PIPE
stderr_arg = subprocess.PIPE if not with_ray else subprocess.STDOUT
with subprocess.Popen(cmd,
stdout=stdout_arg,
stderr=stderr_arg,
start_new_session=True,
shell=shell,
**kwargs) as proc:
try:
# The proc can be defunct if the python program is killed. Here we
# open a new subprocess to gracefully kill the proc, SIGTERM
# and then SIGKILL the process group.
# Adapted from ray/dashboard/modules/job/job_manager.py#L154
parent_pid = os.getpid()
daemon_script = os.path.join(
os.path.dirname(os.path.abspath(job_lib.__file__)),
'"'"'subprocess_daemon.py'"'"')
python_path = subprocess.check_output(
constants.SKY_GET_PYTHON_PATH_CMD,
shell=True,
stderr=subprocess.DEVNULL,
encoding='"'"'utf-8'"'"').strip()
daemon_cmd = [
python_path,
daemon_script,
'"'"'--parent-pid'"'"',
str(parent_pid),
'"'"'--proc-pid'"'"',
str(proc.pid),
]
# We do not need to set `start_new_session=True` here, as the
# daemon script will detach itself from the parent process with
# fork to avoid being killed by ray job. See the reason we
# daemonize the process in `sky/skylet/subprocess_daemon.py`.
subprocess.Popen(
daemon_cmd,
# Suppress output
stdout=subprocess.DEVNULL,
stderr=subprocess.DEVNULL,
# Disable input
stdin=subprocess.DEVNULL,
)
stdout = '"'"''"'"'
stderr = '"'"''"'"'
if process_stream:
if skip_lines is None:
skip_lines = []
# Skip these lines caused by `-i` option of bash. Failed to
# find other way to turn off these two warning.
# https://stackoverflow.com/questions/13300764/how-to-tell-bash-not-to-issue-warnings-cannot-set-terminal-process-group-and # pylint: disable=line-too-long
# `ssh -T -i -tt` still cause the problem.
skip_lines += [
'"'"'bash: cannot set terminal process group'"'"',
'"'"'bash: no job control in this shell'"'"',
]
# We need this even if the log_path is '"'"'/dev/null'"'"' to ensure the
# progress bar is shown.
# NOTE: Lines are printed only when '"'"'\r'"'"' or '"'"'\n'"'"' is found.
args = _ProcessingArgs(
log_path=log_path,
stream_logs=stream_logs,
start_streaming_at=start_streaming_at,
end_streaming_at=end_streaming_at,
skip_lines=skip_lines,
line_processor=line_processor,
# Replace CRLF when the output is logged to driver by ray.
replace_crlf=with_ray,
streaming_prefix=streaming_prefix,
)
stdout, stderr = process_subprocess_stream(proc, args)
proc.wait()
if require_outputs:
return proc.returncode, stdout, stderr
return proc.returncode
except KeyboardInterrupt:
# Kill the subprocess directly, otherwise, the underlying
# process will only be killed after the python program exits,
# causing the stream handling stuck at `readline`.
subprocess_utils.kill_children_processes()
raise
def make_task_bash_script(codegen: str,
env_vars: Optional[Dict[str, str]] = None) -> str:
# set -a is used for exporting all variables functions to the environment
# so that bash `user_script` can access `conda activate`. Detail: #436.
# Reference: https://www.gnu.org/software/bash/manual/html_node/The-Set-Builtin.html # pylint: disable=line-too-long
# DEACTIVATE_SKY_REMOTE_PYTHON_ENV: Deactivate the SkyPilot runtime env, as
# the ray cluster is started within the runtime env, which may cause the
# user program to run in that env as well.
# PYTHONUNBUFFERED is used to disable python output buffering.
script = [
textwrap.dedent(f"""\
#!/bin/bash
source ~/.bashrc
set -a
. $(conda info --base 2> /dev/null)/etc/profile.d/conda.sh > /dev/null 2>&1 || true
set +a
{constants.DEACTIVATE_SKY_REMOTE_PYTHON_ENV}
export PYTHONUNBUFFERED=1
cd {constants.SKY_REMOTE_WORKDIR}"""),
]
if env_vars is not None:
for k, v in env_vars.items():
script.append(f'"'"'export {k}={shlex.quote(str(v))}'"'"')
script += [
codegen,
'"'"''"'"', # New line at EOF.
]
script = '"'"'\n'"'"'.join(script)
return script
def add_ray_env_vars(
        env_vars: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    # Adds Ray-related environment variables.
    if env_vars is None:
        env_vars = {}
    ray_env_vars = [
        '"'"'CUDA_VISIBLE_DEVICES'"'"', '"'"'RAY_CLIENT_MODE'"'"', '"'"'RAY_JOB_ID'"'"',
        '"'"'RAY_RAYLET_PID'"'"', '"'"'OMP_NUM_THREADS'"'"'
    ]
    env_dict = dict(os.environ)
    for env_var in ray_env_vars:
        if env_var in env_dict:
            env_vars[env_var] = env_dict[env_var]
    return env_vars
def run_bash_command_with_log(bash_command: str,
                              log_path: str,
                              env_vars: Optional[Dict[str, str]] = None,
                              stream_logs: bool = False,
                              with_ray: bool = False):
    with tempfile.NamedTemporaryFile('"'"'w'"'"', prefix='"'"'sky_app_'"'"',
                                     delete=False) as fp:
        bash_command = make_task_bash_script(bash_command, env_vars=env_vars)
        fp.write(bash_command)
        fp.flush()
        script_path = fp.name
        # Need this `-i` option to make sure `source ~/.bashrc` work.
        inner_command = f'"'"'/bin/bash -i {script_path}'"'"'
        subprocess_cmd: Union[str, List[str]]
        subprocess_cmd = inner_command
        return run_with_log(
            subprocess_cmd,
            log_path,
            stream_logs=stream_logs,
            with_ray=with_ray,
            # Disable input to avoid blocking.
            stdin=subprocess.DEVNULL,
            shell=True)
run_bash_command_with_log = ray.remote(run_bash_command_with_log)
if hasattr(autostop_lib, '"'"'set_last_active_time_to_now'"'"'):
autostop_lib.set_last_active_time_to_now()
job_lib.set_status('"${RAY_JOB_ID_ENV_VAR}"', job_lib.JobStatus.PENDING)
pg = ray_util.placement_group([{"CPU": 0.5, "T4": 0.5, "GPU": 0.5}], '"'"'STRICT_SPREAD'"'"')
plural = '"'"'s'"'"' if 1 > 1 else '"'"''"'"'
node_str = f'"'"'1 node{plural}'"'"'
# We have this `INFO: Tip:` message only for backward
# compatibility, because if a cluster has the old SkyPilot version,
# it relies on this message to start log streaming.
# This message will be skipped for new clusters, because we use
# start_streaming_at for the `Waiting for task resources on`
# message.
# TODO: Remove this message in v0.9.0.
message = ('"'"'\x1b[2m├── \x1b[0m\x1b[2mINFO: '"'"'
'"'"'Tip: use Ctrl-C to exit log streaming, not kill '"'"'
'"'"'the job.\x1b[0m\n'"'"')
message += ('"'"'\x1b[2m├── \x1b[0m\x1b[2m'"'"'
'"'"'Waiting for task resources on '"'"'
f'"'"'{node_str}.\x1b[0m'"'"')
print(message, flush=True)
# FIXME: This will print the error message from autoscaler if
# it is waiting for other task to finish. We should hide the
# error message.
ray.get(pg.ready())
print('"'"'\x1b[2m└── \x1b[0mJob started. Streaming logs... \x1b[2m(Ctrl-C to exit log streaming; job will not be killed)\x1b[0m'"'"', flush=True)
job_lib.set_job_started('"${RAY_JOB_ID_ENV_VAR}"')
job_lib.scheduler.schedule_step()
@ray.remote
def check_ip():
return ray.util.get_node_ip_address()
gang_scheduling_id_to_ip = ray.get([
check_ip.options(
num_cpus=0.5,
scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(
placement_group=pg,
placement_group_bundle_index=i
)).remote()
for i in range(pg.bundle_count)
])
cluster_ips_to_node_id = {ip: i for i, ip in enumerate(['"'${INTERNAL_IPS_1}'"', '"'${INTERNAL_IPS_2}'"'])}
job_ip_rank_list = sorted(gang_scheduling_id_to_ip, key=cluster_ips_to_node_id.get)
job_ip_rank_map = {ip: i for i, ip in enumerate(job_ip_rank_list)}
job_ip_list_str = '"'"'\n'"'"'.join(job_ip_rank_list)
sky_env_vars_dict = {}
sky_env_vars_dict['"'"'SKYPILOT_NODE_IPS'"'"'] = job_ip_list_str
# Backward compatibility: Environment starting with `SKY_` is
# deprecated. Remove it in v0.9.0.
sky_env_vars_dict['"'"'SKY_NODE_IPS'"'"'] = job_ip_list_str
sky_env_vars_dict['"'"'SKYPILOT_NUM_NODES'"'"'] = len(job_ip_rank_list)
sky_env_vars_dict['"'"'SKYPILOT_TASK_ID'"'"'] = '"'"'sky-2024-10-29-10-09-37-475191_multi-echo-test_1'"'"'
sky_env_vars_dict['"'"'SKYPILOT_CLUSTER_INFO'"'"'] = '"'"'{"cluster_name": "multi-echo-test", "cloud": "GCP", "region": "us-central1", "zone": "us-central1-a"}'"'"'
script = '"'"'echo 0; sleep 5'"'"'
if run_fn is not None:
    script = run_fn(0, gang_scheduling_id_to_ip)
if script is not None:
    sky_env_vars_dict['"'"'SKYPILOT_NUM_GPUS_PER_NODE'"'"'] = 1
    # Backward compatibility: Environment starting with `SKY_` is
    # deprecated. Remove it in v0.9.0.
    sky_env_vars_dict['"'"'SKY_NUM_GPUS_PER_NODE'"'"'] = 1
    ip = gang_scheduling_id_to_ip[0]
    rank = job_ip_rank_map[ip]
    if len(cluster_ips_to_node_id) == 1: # Single-node task on single-node cluster
        name_str = '"'"'None,'"'"' if None != None else '"'"'task,'"'"'
        log_path = os.path.expanduser(os.path.join('"'""${RAY_TASK_LOG_DIR}/tasks""'"', '"'"'run.log'"'"'))
    else: # Single-node or multi-node task on multi-node cluster
        idx_in_cluster = cluster_ips_to_node_id[ip]
        if cluster_ips_to_node_id[ip] == 0:
            node_name = '"'"'head'"'"'
        else:
            node_name = f'"'"'worker{idx_in_cluster}'"'"'
        name_str = f'"'"'{node_name}, rank={rank},'"'"'
        log_path = os.path.expanduser(os.path.join('"'""${RAY_TASK_LOG_DIR}/tasks""'"', f'"'"'{rank}-{node_name}.log'"'"'))
    sky_env_vars_dict['"'"'SKYPILOT_NODE_RANK'"'"'] = rank
    # Backward compatibility: Environment starting with `SKY_` is
    # deprecated. Remove it in v0.9.0.
    sky_env_vars_dict['"'"'SKY_NODE_RANK'"'"'] = rank
    sky_env_vars_dict['"'"'SKYPILOT_INTERNAL_JOB_ID'"'"'] = '"${RAY_JOB_ID_ENV_VAR}"'
    # Backward compatibility: Environment starting with `SKY_` is
    # deprecated. Remove it in v0.9.0.
    sky_env_vars_dict['"'"'SKY_INTERNAL_JOB_ID'"'"'] = '"${RAY_JOB_ID_ENV_VAR}"'
    futures.append(run_bash_command_with_log \
            .options(name=name_str, num_cpus=0.5, resources={"T4": 0.5}, num_gpus=0.5, scheduling_strategy=ray.util.scheduling_strategies.PlacementGroupSchedulingStrategy(placement_group=pg, placement_group_bundle_index=0)) \
            .remote(
                script,
                log_path,
                env_vars=sky_env_vars_dict,
                stream_logs=True,
                with_ray=True,
            ))
returncodes = get_or_fail(futures, pg)
if sum(returncodes) != 0:
    job_lib.set_status('"${RAY_JOB_ID_ENV_VAR}"', job_lib.JobStatus.FAILED)
    # Schedule the next pending job immediately to make the job
    # scheduling more efficient.
    job_lib.scheduler.schedule_step()
    # This waits for all streaming logs to finish.
    time.sleep(0.5)
    reason = '"'"''"'"'
    # 139 is the return code of SIGSEGV, i.e. Segmentation Fault.
    if any(r == 139 for r in returncodes):
        reason = '"'"'(likely due to Segmentation Fault)'"'"'
    print('"'"'ERROR: \x1b[31mJob '"${RAY_JOB_ID_ENV_VAR}"' failed with '"'"'
          '"'"'return code list:\x1b[0m'"'"',
          returncodes,
          reason,
          flush=True)
    # Need this to set the job status in ray job to be FAILED.
    sys.exit(1)
else:
    job_lib.set_status('"${RAY_JOB_ID_ENV_VAR}"', job_lib.JobStatus.SUCCEEDED)
    # Schedule the next pending job immediately to make the job
    # scheduling more efficient.
    job_lib.scheduler.schedule_step()
    # This waits for all streaming logs to finish.
    time.sleep(0.5)
' > ~/.sky/sky_app/sky_job_$RAY_JOB_ID_ENV_VAR; } && /home/gcpuser/skypilot-runtime/bin/python -u -c 'import os;import getpass;from sky.skylet import job_lib, log_lib, constants;job_lib.scheduler.queue('"${RAY_JOB_ID_ENV_VAR}"','"'"'cd ~/sky_workdir && /home/gcpuser/skypilot-runtime/bin/python /home/gcpuser/skypilot-runtime/bin/ray job submit --address=http://127.0.0.1:8266 --submission-id '"${RAY_JOB_ID_ENV_VAR}"'-$(whoami) --no-wait "/home/gcpuser/skypilot-runtime/bin/python -u ~/.sky/sky_app/sky_job_'"${RAY_JOB_ID_ENV_VAR}"' > '"${RAY_TASK_LOG_DIR}"'/run.log 2> /dev/null"'"'"')'
Thanks a lot for fixing this @cblmemo! This fixes an important issue a user was facing. LGTM!
sky/skylet/job_lib.py
Outdated
for job_detail in job_detail_lists:
    if job_detail.submission_id in ray_job_ids_set:
        job_details[job_detail.submission_id] = job_detail
Just curious: why don't we keep only the jobs in ray_job_ids_set? It's quite minor, but it may save some memory : )
Good point! This is a leftover from refactoring back to querying the job list instead of each job's status independently. Changed back now!
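For readers skimming the thread, here is a minimal sketch of the filtering discussed above: keep only the job details whose submission ID is in the set of IDs being queried. `JobDetail` and `filter_job_details` are hypothetical stand-ins written for illustration; they are not the actual types or helpers in `sky/skylet/job_lib.py`.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass
class JobDetail:
    """Hypothetical stand-in for one record in the job list."""
    submission_id: str
    status: str


def filter_job_details(job_detail_list: Iterable[JobDetail],
                       ray_job_ids_set: Set[str]) -> Dict[str, JobDetail]:
    """Map submission ID to detail, dropping jobs that were not asked for."""
    return {
        detail.submission_id: detail
        for detail in job_detail_list
        if detail.submission_id in ray_job_ids_set
    }


if __name__ == '__main__':
    all_jobs = [
        JobDetail('1-gcpuser', 'SUCCEEDED'),
        JobDetail('2-gcpuser', 'PENDING'),
        JobDetail('3-gcpuser', 'RUNNING'),
    ]
    wanted = {'2-gcpuser', '3-gcpuser'}
    print(sorted(filter_job_details(all_jobs, wanted)))
```

Membership tests against a set are constant time on average, so the filter stays linear in the length of the job list, and the resulting dict holds only the requested entries.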
Reran the multi-echo test and still found no errors :)) Will merge after all smoke tests pass!
All smoke tests except #4211 (which also fails on master) passed! Merging now.
Fixes #4133.
For examples/multi_echo.py on the latest master, the failure rate is about 2% (5 out of 256 jobs). With this PR, no failures were observed.
Tested (run the relevant ones):
bash format.sh
examples/multi_echo.py with 256 jobs
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh
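To check a failure rate like the one quoted above without counting rows by hand, something along these lines could tally statuses from saved `sky queue` output. This is only a sketch: `count_statuses` and `failure_rate` are hypothetical helpers, and the parsing assumes the layout shown in this thread, where each job row starts with a numeric ID and the status is the second-to-last whitespace-separated field (the log path being last).

```python
from collections import Counter


def count_statuses(queue_output: str) -> Counter:
    """Count job statuses in text copied from `sky queue`."""
    counts: Counter = Counter()
    for line in queue_output.splitlines():
        tokens = line.split()
        # Job rows start with a numeric ID; skip headers and banners.
        if len(tokens) < 3 or not tokens[0].isdigit():
            continue
        # Assumed layout: ... RESOURCES STATUS LOG
        counts[tokens[-2]] += 1
    return counts


def failure_rate(counts: Counter) -> float:
    """Fraction of counted jobs whose status is FAILED."""
    total = sum(counts.values())
    return counts['FAILED'] / total if total else 0.0


if __name__ == '__main__':
    sample = """\
ID NAME SUBMITTED STARTED DURATION RESOURCES STATUS LOG
20 - 41 secs ago - - 1x[T4:0.5] PENDING ~/sky_logs/sky-2024-10-26-14-03-54-464287
19 - 41 secs ago - - 1x[T4:0.5] FAILED ~/sky_logs/sky-2024-10-26-14-03-54-275645
2 sky-cmd 18 mins ago 17 mins ago 6s 1x[T4:0.5] SUCCEEDED ~/sky_logs/sky-1730237902-602957752
"""
    counts = count_statuses(sample)
    print(dict(counts), f'{failure_rate(counts):.1%}')
```

For the numbers reported in this PR, 5 failures out of 256 jobs is 5/256, roughly 1.95%, which matches the "about 2%" figure.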