fix: update the volcano metric document. #3782

Open
wants to merge 1 commit into
base: master
fengruotj
Contributor

fix: the metric doc is out of date (Issue 3118)

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 21, 2024
@fengruotj
Contributor Author

I've already updated the Volcano metric document based on pkg/scheduler/metrics/metrics.go. Are there any other metrics files I should include?

@fengruotj
Contributor Author

/assign @Monokaix

| unschedule_task_count | Counter | `job`=<job_id> | The number of tasks failed to schedule |
| unschedule_job_counts | Counter | | The number of job failed to schedule in each iteration |
| job_retry_counts | Counter | `job`=<job_id> | The number of retry times of one job |
| **Metric Name** | **Metric Type** | **Labels** | **Description** |
Member

Are pod_schedule_errors and pod_schedule_successes deleted?

Contributor Author

I see. I couldn't find metrics such as pod_schedule_errors and pod_schedule_successes.

Contributor Author

@fengruotj fengruotj Oct 23, 2024

Maybe Volcano uses the schedule_attempts_total counter metric.

@@ -1,39 +1,40 @@
## Scheduler Monitoring

## Introduction
Currently users can leverage controller logs and job events to monitor scheduler. While useful for debugging, none of this options is particularly practical for monitoring kube-batch behaviour over time. There's also requirement like to monitor kube-batch in one view to resolve critical performance issue in time [#427](https://github.com/kubernetes-sigs/kube-batch/issues/427).
Currently users can leverage controller logs and job events to monitor scheduler. While useful for debugging, none of this options is particularly practical for monitoring volcano behaviour over time. There's also requirement like to monitor volcano in one view to resolve critical performance issue in time [#427](https://github.com/kubernetes-sigs/volcano/issues/427).
Member

It seems the issue link should not be changed.

Contributor Author

Yes, I've already fixed it. In addition, I added more metric definitions.

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign monokaix
You can assign the PR to them by writing /assign @monokaix in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fengruotj fengruotj force-pushed the doc-volcano-metrics branch 3 times, most recently from 58f0921 to d40ef51 Compare October 22, 2024 05:12
@JesseStutler
Contributor

Could you also add the metrics from this PR? #3650
It adds two counter metrics, job_completed_phase_count and job_failed_phase_count, which record information about failed and completed vcjobs: when the vc-controller encounters a failed or completed vcjob, it records the corresponding metric.
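
As a rough illustration of the behaviour described above, the sketch below counts jobs that reach a terminal phase. It is not the code from #3650: the `phaseCounters` type, its methods, the per-job key, and the `"Completed"`/`"Failed"` phase strings are assumptions made for this example, and a plain map stands in for a real Prometheus counter so that the snippet has no external dependencies.

```go
package main

import (
	"fmt"
	"sync"
)

// phaseCounters is a hypothetical stand-in for a labeled counter metric.
// It only illustrates the counting logic, not Volcano's actual implementation.
type phaseCounters struct {
	mu     sync.Mutex
	counts map[string]map[string]int // metric name -> job name -> count
}

func newPhaseCounters() *phaseCounters {
	return &phaseCounters{counts: map[string]map[string]int{}}
}

// observe increments the counter that matches a terminal job phase and
// ignores every other phase. The phase strings are assumed for this sketch.
func (p *phaseCounters) observe(jobName, phase string) {
	var metric string
	switch phase {
	case "Completed":
		metric = "job_completed_phase_count"
	case "Failed":
		metric = "job_failed_phase_count"
	default:
		return // non-terminal phases are not recorded
	}
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.counts[metric] == nil {
		p.counts[metric] = map[string]int{}
	}
	p.counts[metric][jobName]++
}

// value returns the current count for one metric and job (zero if unseen).
func (p *phaseCounters) value(metric, jobName string) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.counts[metric][jobName]
}

func main() {
	c := newPhaseCounters()
	// Simulated controller events: only terminal phases are counted.
	c.observe("job-a", "Running")
	c.observe("job-a", "Completed")
	c.observe("job-b", "Failed")
	fmt.Println("job-a completed:", c.value("job_completed_phase_count", "job-a"))
	fmt.Println("job-b failed:", c.value("job_failed_phase_count", "job-b"))
}
```

In a real controller the map would presumably be replaced by the project's metrics library, and the label set would follow whatever #3650 actually defines.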

Signed-off-by: tanjie.master <tanjiemaster@gmail.com>
@fengruotj
Contributor Author

Could you also add the metrics from this PR? #3650 It adds two counter metrics, job_completed_phase_count and job_failed_phase_count, which record information about failed and completed vcjobs: when the vc-controller encounters a failed or completed vcjob, it records the corresponding metric.

@fengruotj fengruotj closed this Oct 23, 2024
@fengruotj
Contributor Author

fengruotj commented Oct 23, 2024

Could you also add the metrics from this PR? #3650 It adds two counter metrics, job_completed_phase_count and job_failed_phase_count, which record information about failed and completed vcjobs: when the vc-controller encounters a failed or completed vcjob, it records the corresponding metric.

Yes, I've already added it.

@fengruotj fengruotj reopened this Oct 23, 2024
@Monokaix
Member

/ok-to-test
/lgtm

@volcano-sh-bot volcano-sh-bot added lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. labels Oct 23, 2024
@Monokaix
Member

cc @hwdef @lowang-bh

Labels
lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. retest-not-required-docs-only size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
4 participants