fix job controller reports duplicate warnings #3746

Merged: 1 commit, Sep 27, 2024

Conversation

liuyuanchun11
Contributor

@liuyuanchun11 liuyuanchun11 commented Sep 24, 2024

When syncJob runs in the job_controller, it now compares the condition timestamps of the podGroup and records a warning event only if the latest condition is Unschedulable. If the latest condition is Scheduled, no warning event is logged, even if an Unschedulable condition is still present in the conditions.

fix issue #3745
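
As an illustration of the check described above, here is a minimal Go sketch. It assumes a simplified, hypothetical `Condition` type and hypothetical helper names (`latestCondition`, `shouldWarn`, `sampleConditions`); none of these are taken from the Volcano code base, and the merged change may be structured differently.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified, hypothetical stand-in for a podGroup condition.
type Condition struct {
	Type               string
	Message            string
	LastTransitionTime time.Time
}

// latestCondition returns the condition with the most recent
// LastTransitionTime, or false if there are no conditions.
func latestCondition(conds []Condition) (Condition, bool) {
	if len(conds) == 0 {
		return Condition{}, false
	}
	latest := conds[0]
	for _, c := range conds[1:] {
		if c.LastTransitionTime.After(latest.LastTransitionTime) {
			latest = c
		}
	}
	return latest, true
}

// shouldWarn reports whether a warning event should be recorded: only when
// the latest condition (by timestamp) is of type Unschedulable.
func shouldWarn(conds []Condition) (string, bool) {
	latest, ok := latestCondition(conds)
	if !ok || latest.Type != "Unschedulable" {
		return "", false
	}
	return latest.Message, true
}

// sampleConditions builds an illustrative history: Unschedulable first,
// then Scheduled two seconds later.
func sampleConditions() []Condition {
	t0 := time.Date(2024, 9, 25, 2, 15, 18, 0, time.UTC)
	return []Condition{
		{Type: "Unschedulable", Message: "pod group is not ready", LastTransitionTime: t0},
		{Type: "Scheduled", LastTransitionTime: t0.Add(2 * time.Second)},
	}
}

func main() {
	if msg, warn := shouldWarn(sampleConditions()); warn {
		fmt.Println("warning:", msg)
	} else {
		fmt.Println("no warning")
	}
}
```

With this shape, a stale Unschedulable entry no longer triggers a warning once a newer Scheduled entry exists, which is the controller-restart case the PR targets.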

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 24, 2024
var latestConditionMsg string

// Get the latest condition by timestamp
for _, condition := range podGroup.Status.Conditions {
Member

A small doubt here: will each new Unschedulable condition be appended after the existing Unschedulable condition? What if we just overwrite the Unschedulable condition? The pg is already schedulable, so there is no need to care about the Unschedulable condition any more.

Contributor Author

This change only addresses existing podgroups repeatedly logging warning events when the controller restarts. The condition update issue needs to be fixed separately.

Member

@Monokaix Monokaix Sep 27, 2024

Maybe we can add a UT here.

Contributor Author

A UT test case has been added.

@Monokaix
Member

It seems the main cause is that there are two overlapping condition types, Unschedulable and Scheduled. Both are used to indicate the scheduling event, which I think is redundant; keeping just one of them is enough. Another problem is that every pg will have an Unschedulable condition like the following, even when the pg can be scheduled on the first attempt, which is confusing and unnecessary.

status:
  conditions:
  - lastTransitionTime: "2024-09-25T02:15:18Z"
    message: '1/0 tasks in gang unschedulable: pod group is not ready, 1 minAvailable'
    reason: NotEnoughResources
    status: "True"
    transitionID: b2adeea6-9812-4315-a3a4-6411c4b364c9
    type: Unschedulable
  - lastTransitionTime: "2024-09-25T02:15:20Z"
    reason: tasks in gang are ready to be scheduled
    status: "True"
    transitionID: f9b8c487-41ad-49a6-9474-3e732b5ae29e
    type: Scheduled
  phase: Running
  running: 1

So maybe we can add a new condition type such as Enqueueable and update only that condition, with no need to update the Scheduled condition. PR #3045 does something similar, and we can consider the two together.

@Monokaix
Member

cc @lowang-bh

@liuyuanchun11
Contributor Author

I've reworked the function logic a bit, and it can now accommodate new unschedulable condition types added later.

@Monokaix
Member

We should open another issue to track the following problems:

  • Duplicated condition updates with Unschedulable and Scheduled; retaining just the Scheduled one is enough.
  • Add an extended condition to express the Unenqueueable event.
  • Every pg records an Unschedulable event first, regardless of whether resources are sufficient, and is then scheduled successfully.
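
One possible reading of the first two items can be sketched in Go as follows. This is only an illustration under stated assumptions: the `Condition` type and the `setCondition` helper are hypothetical, not part of the Volcano API, and representing the unschedulable state as a Scheduled condition with status "False" is an assumption rather than something decided in this thread.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a podGroup condition.
type Condition struct {
	Type   string
	Status string
	Reason string
}

// setCondition overwrites the existing condition of the same Type, or
// appends the condition if that Type is not present yet. Each Type therefore
// appears at most once instead of accumulating over time.
func setCondition(conds []Condition, next Condition) []Condition {
	for i := range conds {
		if conds[i].Type == next.Type {
			conds[i] = next
			return conds
		}
	}
	return append(conds, next)
}

func main() {
	var conds []Condition
	// Assumption: "not yet schedulable" is expressed as Scheduled=False.
	conds = setCondition(conds, Condition{Type: "Scheduled", Status: "False", Reason: "NotEnoughResources"})
	// Once the gang is ready, the same entry is overwritten in place.
	conds = setCondition(conds, Condition{Type: "Scheduled", Status: "True"})
	// A separate, extended condition type can track the enqueue state.
	conds = setCondition(conds, Condition{Type: "Enqueueable", Status: "True"})
	fmt.Println(len(conds), conds[0].Type, conds[0].Status)
}
```

Overwriting by type keeps a single entry per condition type, so a stale Unschedulable record cannot linger next to a newer Scheduled one.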

@JesseStutler
Contributor

Yes, I think we are OK with this PR not recording warning events if the latest condition is not Unschedulable, but we should keep working on these problems.

@liuyuanchun11
Contributor Author

I submitted an issue for the podgroup unschedulable condition: #3749

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Sep 27, 2024
Signed-off-by: liuyuanchun <superpig13@hotmail.com>
@Monokaix
Member

/lgtm
/approve

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Sep 27, 2024
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix


@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 27, 2024
@volcano-sh-bot volcano-sh-bot merged commit 2d768d6 into volcano-sh:master Sep 27, 2024
14 checks passed
@liuyuanchun11 liuyuanchun11 deleted the fixPgWarning branch September 30, 2024 06:28
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
4 participants