feature: Add podgroups statistics #3751

Open
wants to merge 1 commit into
base: master

Conversation

JesseStutler
Contributor

@JesseStutler JesseStutler commented Sep 26, 2024

fix #3597, related proposal pr: #3750

Implementation

  • In syncQueue of the queue controller, concurrent UpdateStatus calls cause conflicts, so the controller now uses ApplyStatus instead. This requires granting vc-controller the patch verb on queues/status. The number of podgroups in each state is recorded per queue as metrics for export, and is no longer written to the queue's status.
  • vcctl get -n [name] and vcctl list used to display the per-state podgroup statistics directly from the queue's status. They now take one more step: query the podgroups owned by the queue, count the podgroups in each state on the vcctl side, and then display the counts.
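
The client-side counting described in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: PodGroup, PodGroupPhase, and countByPhase are simplified, hypothetical stand-ins for the Volcano API types and the vcctl helper, and only the phase names mentioned in this conversation are modeled.

```go
package main

import "fmt"

// PodGroupPhase and PodGroup are simplified, hypothetical stand-ins for the
// Volcano scheduling API types; only the fields needed here are modeled.
type PodGroupPhase string

const (
	Pending   PodGroupPhase = "Pending"
	Running   PodGroupPhase = "Running"
	Unknown   PodGroupPhase = "Unknown"
	Inqueue   PodGroupPhase = "Inqueue"
	Completed PodGroupPhase = "Completed"
)

type PodGroup struct {
	Name  string
	Queue string
	Phase PodGroupPhase
}

// countByPhase tallies the podgroups that belong to the given queue,
// keyed by phase. Podgroups from other queues are ignored.
func countByPhase(queue string, pgs []PodGroup) map[PodGroupPhase]int {
	counts := make(map[PodGroupPhase]int)
	for _, pg := range pgs {
		if pg.Queue != queue {
			continue
		}
		counts[pg.Phase]++
	}
	return counts
}

func main() {
	// In vcctl the list would come from the API server; here it is hard-coded.
	pgs := []PodGroup{
		{Name: "pg-a", Queue: "default", Phase: Pending},
		{Name: "pg-b", Queue: "default", Phase: Running},
		{Name: "pg-c", Queue: "test", Phase: Running},
	}
	counts := countByPhase("default", pgs)
	fmt.Printf("pending=%d running=%d unknown=%d inqueue=%d completed=%d\n",
		counts[Pending], counts[Running], counts[Unknown], counts[Inqueue], counts[Completed])
}
```

A phase with no podgroups simply reads as zero from the map, which matches the "no podgroups in queue" case in the verification below.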

Verification

  • vcctl list
    • When there are no podgroups in queue, list is normal:
      image
    • When there are podgroups in queue, list can also get the statistics of podgroups:
      image
  • vcctl get
    • get is normal:
      image
  • metrics
    image
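
For readers without the screenshots, the exported statistics can be pictured as per-queue, per-phase gauges in the Prometheus text format. The sketch below renders such samples by hand with the standard library only; it does not use the Prometheus client library, and the metric name queue_pod_group_count and the label names queue_name and phase are hypothetical, not necessarily the ones this PR registers.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderQueueMetrics formats per-queue, per-phase podgroup counts as
// Prometheus text-format gauge samples. The metric and label names are
// hypothetical and chosen only for illustration.
func renderQueueMetrics(counts map[string]map[string]int) string {
	var b strings.Builder
	b.WriteString("# TYPE queue_pod_group_count gauge\n")

	// Sort queue names so the output is deterministic.
	queues := make([]string, 0, len(counts))
	for q := range counts {
		queues = append(queues, q)
	}
	sort.Strings(queues)

	for _, q := range queues {
		// Sort phase names within each queue for the same reason.
		phases := make([]string, 0, len(counts[q]))
		for p := range counts[q] {
			phases = append(phases, p)
		}
		sort.Strings(phases)
		for _, p := range phases {
			fmt.Fprintf(&b, "queue_pod_group_count{queue_name=%q,phase=%q} %d\n", q, p, counts[q][p])
		}
	}
	return b.String()
}

func main() {
	fmt.Print(renderQueueMetrics(map[string]map[string]int{
		"default": {"Pending": 1, "Running": 2},
	}))
}
```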

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hzxuzhonghu
You can assign the PR to them by writing /assign @hzxuzhonghu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 26, 2024
@@ -177,3 +188,10 @@ func isControllerEnabled(name string, controllers []string) bool {
// if we get here, there was no explicit inclusion or exclusion
return hasStar
}

func promHandler() http.Handler {
Member

This is duplicated with the func in scheduler.

@@ -58,7 +58,7 @@ rules:
     verbs: ["get", "create", "delete", "update"]
   - apiGroups: ["scheduling.incubator.k8s.io", "scheduling.volcano.sh"]
     resources: ["podgroups", "queues", "queues/status"]
-    verbs: ["get", "list", "watch", "create", "delete", "update"]
+    verbs: ["get", "list", "watch", "create", "delete", "update", "patch"]
Member

Is update still needed?

Member

@hwdef hwdef left a comment

Is there something I'm missing? Where is the logic for deleting the update queue status?

@@ -37,6 +37,7 @@ const (
defaultMaxRequeueNum = 15
defaultSchedulerName = "volcano"
defaultHealthzAddress = ":11251"
defaultListenAddress = ":8080"
Member

Would a different port be better?

  1. Make a distinction with the scheduler port.
  2. Port 8080 is too common and prone to conflicts.
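
One way to address both points is to keep the address configurable and pick a less common default. The sketch below uses the standard flag package; the flag name listen-address, the FlagSet name, and the default ":8081" are assumptions made for illustration, not values taken from this PR (which currently uses ":8080").

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// defaultListenAddress is a hypothetical non-8080 default chosen only for
// illustration; the PR under review uses ":8080".
const defaultListenAddress = ":8081"

// parseListenAddress reads the metrics listen address from command-line
// arguments, falling back to defaultListenAddress when the flag is absent.
func parseListenAddress(args []string) (string, error) {
	fs := flag.NewFlagSet("controller-manager", flag.ContinueOnError)
	addr := fs.String("listen-address", defaultListenAddress,
		"address the metrics HTTP server listens on")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *addr, nil
}

func main() {
	addr, err := parseListenAddress(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Println("metrics listen address:", addr)
}
```

With this shape, a cluster where the default port collides with something else can override it on the command line without a rebuild.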

@JesseStutler
Contributor Author

Is there something I'm missing? Where is the logic for deleting the update queue status?

Sorry, I don't get the point. Do you mean that when a queue is deleted, its metrics also need to be deleted?

@hwdef
Member

hwdef commented Oct 9, 2024

Sorry, I didn't express myself clearly.
In our previous discussion we agreed not to update some fields of the queue status, such as the number of pending and running podgroups in the queue.
So should we delete that part of the code?

@JesseStutler
Contributor Author

Oh yes, the pending/running/unknown/inqueue/completed podgroup statistics are no longer persisted to etcd; see here:

// Build an apply configuration that carries only the state and the allocated resources.
queueStatusApply := v1beta1apply.QueueStatus().WithState(queueStatus.State).WithAllocated(queueStatus.Allocated)
queueApply := v1beta1apply.Queue(queue.Name).WithStatus(queueStatusApply)
// Apply the status; fields not set above are not part of this request.
if _, err := c.vcClient.SchedulingV1beta1().Queues().ApplyStatus(context.TODO(), queueApply, metav1.ApplyOptions{FieldManager: controllerName}); err != nil {
	klog.Errorf("Failed to apply status of Queue %s: %v.", queue.Name, err)
	return err
}

As shown above, queueStatusApply carries only the state and the allocated resources, so those statistics are no longer updated.
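
To illustrate why concurrent full updates conflict while an apply-style write that names only its own fields does not, here is a toy in-memory model. It is not the Kubernetes API server: store, queueStatus, and statusPatch are hypothetical types, the version counter only mimics resourceVersion checks, and real server-side apply additionally tracks field ownership per field manager, which this sketch ignores.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// queueStatus is a simplified, hypothetical stand-in for a Queue's status.
type queueStatus struct {
	State     string
	Allocated string
	Pending   int
}

// store is a toy in-memory object store with a version counter.
type store struct {
	mu      sync.Mutex
	version int
	status  queueStatus
}

var errConflict = errors.New("conflict: object has been modified")

// get returns the current status together with the version it was read at.
func (s *store) get() (queueStatus, int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.status, s.version
}

// update replaces the whole status, but only if the caller read the latest
// version (optimistic concurrency).
func (s *store) update(st queueStatus, readVersion int) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if readVersion != s.version {
		return errConflict
	}
	s.status = st
	s.version++
	return nil
}

// statusPatch lists only the fields the caller wants to set; nil means
// "leave this field untouched".
type statusPatch struct {
	State     *string
	Allocated *string
}

// apply sets only the fields present in the patch and needs no prior read.
func (s *store) apply(p statusPatch) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if p.State != nil {
		s.status.State = *p.State
	}
	if p.Allocated != nil {
		s.status.Allocated = *p.Allocated
	}
	s.version++
}

func main() {
	s := &store{}

	// Two workers read the same version, then both attempt a full update.
	_, vA := s.get()
	_, vB := s.get()
	fmt.Println("worker A update:", s.update(queueStatus{State: "Open"}, vA))
	fmt.Println("worker B update:", s.update(queueStatus{State: "Open"}, vB))

	// Apply-style writes name only their own fields and need no prior read.
	state, allocated := "Open", "cpu=4"
	s.apply(statusPatch{State: &state, Allocated: &allocated})
	st, v := s.get()
	fmt.Printf("after apply: %+v (version %d)\n", st, v)
}
```

In this model the second worker's update is rejected because its read is stale, whereas the apply call succeeds regardless of what was written in between and leaves Pending untouched.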

@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 25, 2024
Signed-off-by: JesseStutler <chenzicong4@huawei.com>
Development

Successfully merging this pull request may close these issues.

The volcano controller may have memory leak issues in large-scale clusters
4 participants