Fix panic issue with proportional scheduling #2968
/assign @Thor-wl
dc2acb6 to fa7a78a
/hold
@lowang-bh Hi, can you describe why you set the hold label? Are there any problems with this PR? Can I change it to make it better?
Hi @Cdayz, can you see this comment? In my opinion, a more elegant fix is to replace the compare function with comparing CPU and memory independently.
@@ -33,6 +33,13 @@ func checkNodeResourceIsProportional(task *api.TaskInfo, node *api.NodeInfo, pro
			return status, nil
		}
	}

	if node.Idle.LessPartly(task.Resreq, api.Zero) {
I think a better approach is not to use the compare function at all. Just compare CPU and memory separately at this position:
volcano/pkg/scheduler/plugins/predicates/proportional.go
Lines 40 to 42 in 5302995
r := node.Idle.Clone()
r = r.Sub(task.Resreq)
if r.MilliCPU < cpuReserved || r.Memory < memoryReserved {
Oh, that is probably not the right decision. This PR fixes a bug where a node has ANY resource lower than what the task requests, which leads to a panic on this line:
r = r.Sub(task.Resreq)
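The failure mode described above can be illustrated with a minimal sketch. The `Resource` type, its `Sub` method, and the `subPanics` helper are simplified, hypothetical stand-ins written for this example. They assume, as this thread describes, that subtraction asserts the receiver is at least as large as the operand in every dimension; they are not the Volcano implementation.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for the scheduler's
// resource type. Only the two dimensions discussed in this thread are modeled.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// Sub subtracts rr from r in place. It panics when any dimension of r is
// smaller than the matching dimension of rr (assumed behavior, modeled on
// the assertion described in this thread).
func (r *Resource) Sub(rr *Resource) *Resource {
	if r.MilliCPU < rr.MilliCPU || r.Memory < rr.Memory {
		panic(fmt.Sprintf("insufficient resource: %+v sub %+v", *r, *rr))
	}
	r.MilliCPU -= rr.MilliCPU
	r.Memory -= rr.Memory
	return r
}

// subPanics reports whether idle.Sub(req) panics. It works on copies of its
// arguments and recovers, so the caller can keep running.
func subPanics(idle, req Resource) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	idle.Sub(&req)
	return false
}

func main() {
	// Memory is plentiful, but idle CPU is below the request, so Sub panics.
	idle := Resource{MilliCPU: 500, Memory: 4096}
	req := Resource{MilliCPU: 1000, Memory: 1024}
	fmt.Println("Sub panics:", subPanics(idle, req))
}
```

Under these assumptions, a single short dimension is enough to trigger the panic, even when every other dimension has room to spare.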
Also, I think this PR may be made in the wrong place. For example, in the allocate action of the scheduler, the check that node resources are sufficient to run the task is made before the other checks run.
volcano/pkg/scheduler/actions/allocate/allocate.go
Lines 100 to 118 in 5302995
predicateFn := func(task *api.TaskInfo, node *api.NodeInfo) ([]*api.Status, error) {
	// Check for Resource Predicate
	if ok, reason := task.InitResreq.LessEqualWithReason(node.FutureIdle(), api.Zero); !ok {
		return nil, api.NewFitError(task, node, reason)
	}
	var statusSets util.StatusSets
	statusSets, err := ssn.PredicateFn(task, node)
	if err != nil {
		return nil, fmt.Errorf("predicates failed in allocate for task <%s/%s> on node <%s>: %v",
			task.Namespace, task.Name, node.Name, err)
	}
	if statusSets.ContainsUnschedulable() || statusSets.ContainsUnschedulableAndUnresolvable() ||
		statusSets.ContainsErrorSkipOrWait() {
		return nil, fmt.Errorf("predicates failed in allocate for task <%s/%s> on node <%s>, status is not success",
			task.Namespace, task.Name, node.Name)
	}
	return nil, nil
}
Maybe we can add this check to predicateFn in all actions?
@lowang-bh check my comments please
> Maybe we can add this check to all actions in predicateFn ?
Not every action needs the resource comparison in predicateFn, e.g., the preempt and backfill actions.
How about using:
if r.MilliCPU - task.Resreq.MilliCPU < cpuReserved || r.Memory - task.Resreq.Memory < memoryReserved
to replace
r := node.Idle.Clone()
r = r.Sub(task.Resreq)
if r.MilliCPU < cpuReserved || r.Memory < memoryReserved {
@lowang-bh Hey, I changed it as you suggested.
/hold cancel
Signed-off-by: Nikita <capitan.crazy.dayz@gmail.com>
fa7a78a to 46e014c
r = r.Sub(task.Resreq)
if r.MilliCPU < cpuReserved || r.Memory < memoryReserved {

if node.Idle.MilliCPU-task.Resreq.MilliCPU < cpuReserved || node.Idle.Memory-task.Resreq.Memory < memoryReserved {
@Cdayz Hi, I'm sorry for the late review. Can you demonstrate how this introduces a panic? I haven't gotten the point here.
If node.idle has some dimension less than the request, an assertion will fail.
> If node.idle has some dimension less than the request, an assertion will fail.
OK. That's reasonable. I'm OK with the fix.
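The approach agreed on in this thread, comparing CPU and memory independently with plain subtraction, can be sketched as follows. The `Resource` type and the `fitsWithReserve` function are hypothetical names introduced for illustration; the parameter names `cpuReserved` and `memoryReserved` follow the snippets quoted above, and everything else is an assumption rather than the merged code.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for the scheduler's
// resource type, limited to the two dimensions discussed in this thread.
type Resource struct {
	MilliCPU float64
	Memory   float64
}

// fitsWithReserve reports whether placing req on a node with the given idle
// resources would still leave at least cpuReserved and memoryReserved free.
// It uses plain float subtraction per dimension, so a shortfall yields a
// negative remainder and a false result instead of a failed assertion.
func fitsWithReserve(idle, req Resource, cpuReserved, memoryReserved float64) bool {
	if idle.MilliCPU-req.MilliCPU < cpuReserved || idle.Memory-req.Memory < memoryReserved {
		return false
	}
	return true
}

func main() {
	req := Resource{MilliCPU: 1000, Memory: 1024}

	// Idle CPU is below the request: rejected without panicking.
	fmt.Println(fitsWithReserve(Resource{MilliCPU: 500, Memory: 4096}, req, 200, 512))

	// Enough of both dimensions remains after the request.
	fmt.Println(fitsWithReserve(Resource{MilliCPU: 2000, Memory: 4096}, req, 200, 512))
}
```

The design choice here is that the check degrades to "node does not fit" when a dimension is short, which is the outcome a predicate is expected to report anyway.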
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: Thor-wl