Build health metrics dimensions #952

Open
mrh666 opened this issue Sep 23, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@mrh666

mrh666 commented Sep 23, 2024

What feature do you want to see added?

@cyrille-leclerc on that page https://plugins.jenkins.io/opentelemetry/ there is a screenshot of kibana https://raw.githubusercontent.com/jenkinsci/opentelemetry-plugin/master/docs/images/kibana_jenkins_overview_dashboard.png with all graphs I need, e.g. job duration, failed steps, long steps, etc. How can we get those metrics exported to Dynatrace?

Upstream changes

No response

Are you interested in contributing this feature?

No response

@mrh666 mrh666 added the enhancement New feature or request label Sep 23, 2024
@cyrille-leclerc
Contributor

cyrille-leclerc commented Sep 24, 2024

Report per pipeline

A solution we are looking at could be to produce a histogram metric per pipeline and per result. It would be inspired by the standardized http.server.request.duration metric.

I was waiting for the OTel CI/CD SIG to standardize such a metric, work is in progress with:

⚠️ I'm worried about the cardinality of such a metric as we can potentially produce 5 x count(pipeline) histograms which is a lot.

@christophe-kamphaus-jemmic have you given any thought to such metrics?

The metric could look like

ci.pipeline.run.duration: histogram {
   // Pipeline full name
   // See org.jenkinsci.plugins.workflow.job.WorkflowJob#getFullName()
   ci.pipeline.id="/my-team/my-war/master",
   // see hudson.model.Run#getResult() 
   // SUCCESS, UNSTABLE, FAILURE, NOT_BUILT, ABORTED
   ci.pipeline.run.result="SUCCESS"
}
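
To make the "5 x count(pipeline)" estimate concrete, here is a minimal sketch that counts time series as distinct (ci.pipeline.id, ci.pipeline.run.result) pairs. The class and method names are hypothetical and are not part of the plugin; the five result names are the ones listed above.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch only: models one histogram time series per (pipeline id, run result) pair.
// The class and method names are hypothetical, not part of the plugin.
final class PipelineSeriesCardinality {

    // Result names as listed in the comment above.
    static final List<String> RESULTS =
            List.of("SUCCESS", "UNSTABLE", "FAILURE", "NOT_BUILT", "ABORTED");

    // Upper bound on the number of series: every pipeline could report every result.
    static long maxSeries(int pipelineCount) {
        return (long) RESULTS.size() * pipelineCount;
    }

    // Series actually created so far, given observed (pipeline id, result) pairs.
    static int observedSeries(List<Map.Entry<String, String>> runs) {
        Set<Map.Entry<String, String>> distinct = new HashSet<>(runs);
        return distinct.size();
    }
}
```

With this model, a few dozen pipelines stay in the low hundreds of series, while an instance with thousands of pipelines can reach tens of thousands, and that is before a backend expands each histogram into its individual buckets.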

Report per pipeline step

High-cardinality problems look like even more of a risk here. I'm wondering whether we should instead stick to solving this with metrics queries on the traces, similar to what TraceQL metrics queries offer.

Controlling cardinality

I'm thinking of helping Jenkins admins control cardinality of such metrics enabling allow & deny lists of pipeline names as we have seen Jenkins instances with thousands of pipelines.

@mrh666 Is this the kind of idea you had in mind?

@mrh666
Author

mrh666 commented Sep 24, 2024

@cyrille-leclerc that's exactly what I have in mind!

I'm worried about the cardinality of such a metric as we can potentially produce 5 x count(pipeline) histograms which is a lot.

Your worries about cardinality are reasonable. In the InfluxDB world it can easily kill the performance of the whole DB. But just make it optional, with something like otel.exporter.otlp.metrics.build_health.enabled

if we should not stick to solve this doing metrics queries on the traces similar to

In the Dynatrace world it's impossible, or close to impossible. I've been digging into such functionality and have not achieved any results.

@cyrille-leclerc
Contributor

Thanks @mrh666. Can you please share with us:

  • Total count of pipelines
  • Count of pipelines for which you want performance metrics
  • Could it be possible with allow/deny lists based on regex to just collect metrics on the pipelines that matter to you?

Same question for build steps

@mrh666
Author

mrh666 commented Sep 24, 2024

In the current project:
We have 24 pipelines running at the moment.
12 of those require pipeline metrics.

Could it be possible with allow/deny lists based on regex to just collect metrics on the pipelines that matter to you?

This one is really important!

@cyrille-leclerc
Contributor

cyrille-leclerc commented Oct 1, 2024

Here is a proposal:

  • Allow and Deny lists using regex to specify the job names for which we create a time series to control cardinality
  • Histogram metric
ci.pipeline.run.duration: unit=second {
   ci.pipeline.id: (in-allow-list && !in-deny-list) ?
      hudson.model.Job.getParent().getFullName() :
      "#other#"
   ci.pipeline.run.result: hudson.model.Result
   ci.pipeline.run.completed: hudson.model.Result.isCompleted()

}
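
The allow/deny logic in the proposal could be sketched roughly as follows. This is only an illustration: the class name, the constructor, and the choice of full-match regex semantics are assumptions, not the plugin's actual API. An empty allow list is treated here as "nothing allowed", which is also an assumption.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Sketch only: hypothetical helper, not part of the plugin.
// Maps a job full name to the ci.pipeline.id attribute value.
final class PipelineIdResolver {

    // Placeholder value used for pipelines that do not get their own time series.
    static final String OTHER = "#other#";

    private final List<Pattern> allowList;
    private final List<Pattern> denyList;

    PipelineIdResolver(List<String> allowRegexes, List<String> denyRegexes) {
        this.allowList = allowRegexes.stream().map(Pattern::compile).collect(Collectors.toList());
        this.denyList = denyRegexes.stream().map(Pattern::compile).collect(Collectors.toList());
    }

    // Returns the job name if it is allowed and not denied, otherwise "#other#".
    String resolve(String jobFullName) {
        boolean allowed = matchesAny(allowList, jobFullName);
        boolean denied = matchesAny(denyList, jobFullName);
        return (allowed && !denied) ? jobFullName : OTHER;
    }

    // Assumption: a pattern must match the whole name, not just a substring.
    private static boolean matchesAny(List<Pattern> patterns, String value) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(value).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

Collapsing every non-selected pipeline into a single "#other#" value keeps the number of time series bounded by the size of the allowed set plus one, per result.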

See:

Feedback welcome cc @mrh666

@christophe-kamphaus-jemmic
Contributor

I was waiting for the OTel CI/CD SIG to standardize such a metric

Indeed work in the OTel CI/CD SIG related to metrics is in progress.
We are currently standardizing metrics related to VCS and we plan to follow that up with metrics related to pipelines, queues and agents.
ci.pipeline.run.duration: histogram sounds like a good metric. I will propose it in the SIG.

One issue I can see with using a histogram is that the chosen buckets might not give enough insight to take any action. Some jobs might be of very short duration, while others could take hours or even days to complete.
So how would you define the buckets?
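
To illustrate the concern, here is a small sketch that places a duration in seconds into explicit histogram buckets. It uses the OpenTelemetry SDK's default explicit bucket boundaries, interpreted as seconds, and compares them with a hypothetical geometric series of boundaries. The class, the method names, and the alternative boundaries are illustrative assumptions, not a proposal from the SIG or the plugin.

```java
// Sketch only: hypothetical helper to reason about bucket choices.
final class DurationBuckets {

    // Default explicit bucket boundaries of the OpenTelemetry SDK histogram
    // aggregation, read here as seconds.
    static final double[] DEFAULT_BOUNDARIES =
            {0, 5, 10, 25, 50, 75, 100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000};

    // Index of the bucket a value falls into. Upper bounds are inclusive;
    // an index equal to boundaries.length is the overflow bucket (last boundary, +Inf).
    static int bucketIndex(double[] boundaries, double value) {
        for (int i = 0; i < boundaries.length; i++) {
            if (value <= boundaries[i]) {
                return i;
            }
        }
        return boundaries.length;
    }

    // Hypothetical alternative: boundaries that grow by a constant factor.
    static double[] exponentialBoundaries(double start, double factor, int count) {
        double[] boundaries = new double[count];
        double current = start;
        for (int i = 0; i < count; i++) {
            boundaries[i] = current;
            current *= factor;
        }
        return boundaries;
    }
}
```

With the default boundaries, a 6-hour job (21600 s) and a 2-day job (172800 s) both land in the overflow bucket above 10000 s, so they cannot be told apart. With `exponentialBoundaries(1, 4, 10)` they fall into different buckets, at the cost of coarser resolution for short jobs.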

I had very good success using metrics queries on traces/spans for job duration as well as stage duration, using the steps I introduced in #827 (example use in a pipeline here: #811 (comment)). This gave me very detailed statistics (e.g. average duration per day or per job) that can be filtered per job.

I'm thinking of helping Jenkins admins control cardinality of such metrics enabling allow & deny lists of pipeline names as we have seen Jenkins instances with thousands of pipelines.

Cardinality is certainly an issue when the number of time series scales with a dynamic value like the number of jobs managed by Jenkins. It's not as bad as having a separate time series per build, but it still needs to be managed. (prometheus-plugin has per-build metrics guarded by a checkbox config option.)

I think controlling on the Jenkins side which jobs generate this metric is a very good option.

Alternatively it's also possible to filter later:

  • using metric relabelling rules in Prometheus or ServiceMonitor
  • with opentelemetry-collector a filterprocessor could be used
    processors:
      filter/ci:
        error_mode: ignore
        metrics:
          # data point context: attributes such as ci.pipeline.id live on data points
          datapoint:
              - 'metric.name == "ci.pipeline.run.duration" and not(IsMatch(attributes["ci.pipeline.id"], "my-otel-pipelines.*"))'
    

@cyrille-leclerc
Contributor

Cc @miraccan00
