Add distributed tracing for k6-backed checks #923

Open · wants to merge 5 commits into main from http-runner-tracing

Conversation

@roobre (Collaborator) commented Oct 2, 2024

This PR adds opt-in distributed tracing support to the agent, using the opentelemetry-go SDK and OTLP transport. These traces are meant to help probe operators diagnose and understand private probes. They are not sent to the stacks where the checks are running, but rather to a single OTLP endpoint configured by the probe operator.


For now, tracing is enabled only for k6-backed checks executed by remote runners. However, the mechanism can easily be extended to other check types and local k6 runners.

The code follows the opentelemetry tracing conventions, relying on context.Context to propagate spans across the code: a function creates a span, adds it to the context, and passes that context down, so the process can repeat downstream and spans are correctly linked.
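
For readers unfamiliar with that pattern, a minimal sketch (the function and span names here are illustrative, not code from this PR):

package example

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

// outer starts a span, stores it in the returned context, and passes that
// context down the call chain.
func outer(ctx context.Context, tracer trace.Tracer) {
    ctx, span := tracer.Start(ctx, "outer operation")
    defer span.End()

    inner(ctx, tracer)
}

// inner starts its own span; because ctx carries the parent span, the new
// span is automatically linked as a child of "outer operation".
func inner(ctx context.Context, tracer trace.Tracer) {
    _, span := tracer.Start(ctx, "inner operation")
    defer span.End()
}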

Additionally, this PR adds a convenience tracing package. This package creates a TracerProvider configured to output OTLP traces using variables from the environment (see the upstream docs) and injects it into the context. Further down the stack, probers can retrieve this TracerProvider from that context. This is inspired by the same pattern as above, but it allows propagating the TracerProvider even when no span is active yet. The alternative would be to manually pass a TracerProvider down to every object that initiates traces (i.e. creates root spans), but given that quite a lot of them already satisfy established interfaces, I thought this option would be less intrusive.
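
The shape of that helper is roughly the following sketch. Only the TracerProvider accessor appears later in this diff; the injection function name and the internals here are illustrative, not the actual implementation:

package tracing

import (
    "context"

    "go.opentelemetry.io/otel/trace"
    "go.opentelemetry.io/otel/trace/noop"
)

type ctxKey struct{}

// ContextWithTracerProvider returns a copy of ctx carrying tp.
// (Hypothetical name; the real injection helper may differ.)
func ContextWithTracerProvider(ctx context.Context, tp trace.TracerProvider) context.Context {
    return context.WithValue(ctx, ctxKey{}, tp)
}

// TracerProvider returns the TracerProvider stored in ctx, falling back to a
// no-op provider if none was injected, so callers can always create spans.
func TracerProvider(ctx context.Context) trace.TracerProvider {
    if tp, ok := ctx.Value(ctxKey{}).(trace.TracerProvider); ok {
        return tp
    }
    return noop.NewTracerProvider()
}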

@roobre force-pushed the http-runner-tracing branch 3 times, most recently from f4faba0 to 7f71def, on October 2, 2024 16:40
@roobre changed the title from "k6runner/http: add tracing support" to "Add distributed tracing support" on Oct 2, 2024
@roobre force-pushed the http-runner-tracing branch 5 times, most recently from e1e7ac3 to 6e5d732, on October 3, 2024 14:49
@roobre changed the title from "Add distributed tracing support" to "Add distributed tracing for k6-backed checks" on Oct 3, 2024
@roobre force-pushed the http-runner-tracing branch 5 times, most recently from c01bb9d to b386066, on October 4, 2024 13:57
@roobre marked this pull request as ready for review on October 4, 2024 14:10
@roobre requested a review from a team as a code owner on October 4, 2024 14:10
@roobre force-pushed the http-runner-tracing branch 4 times, most recently from 4f8df52 to a054992, on October 8, 2024 10:06
@mem (Contributor) left a comment

Thank you.

Sorry about the comments all over the place. I really dislike the otel API; it's messy, complicated, and has very low discoverability. A lot of the comments come from that place.

internal/tracing/tracing.go (outdated; resolved)
internal/tracing/tracing.go (outdated; resolved)
cmd/synthetic-monitoring-agent/main.go (outdated; resolved)
cmd/synthetic-monitoring-agent/main.go (outdated; resolved)
@@ -60,12 +62,18 @@ func (p Prober) Name() string {
}

func (p Prober) Probe(ctx context.Context, target string, registry *prometheus.Registry, logger logger.Logger) (bool, float64) {
ctx, probeSpan := tracing.TracerProvider(ctx).Tracer("").Start(ctx, "browser check")
Contributor

I almost want to say the span should start one level up.

The fact that the code is repeated across a bunch of different probers makes me think that's probably correct.

Tracer("") is one of those "ugh!" otel things...

and Tracer(...) is the other one, where Java slipped into Go. What that thing is doing is going to a global (!!!) registry of tracers and pulling one out. I have always found it so confusing that I almost want to say that's what should be in the context instead of the tracer provider.

But that's the reason why I think the span should start one level up: it would be useful to have the check ID, tenant ID, region ID, probe name, and possibly a couple of other bits of info in the trace, and that should be added to the span.

It might make sense for the tracer to have the name of the probe as an attribute, but I never remember which bit of info ends up where. I think the cluster and instance are getting added by Alloy?
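
As a rough sketch of what that could look like (a hypothetical helper, not code from this PR; the attribute keys and the idea of passing the IDs as plain values are assumptions):

package sketch

import (
    "context"

    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/trace"
)

// startCheckSpan starts the per-check root span one level up from the prober
// and records check metadata on it. Hypothetical; shown only to illustrate
// the suggestion above.
func startCheckSpan(ctx context.Context, tp trace.TracerProvider, checkID, tenantID int64, probeName string) (context.Context, trace.Span) {
    ctx, span := tp.Tracer("").Start(ctx, "check")
    span.SetAttributes(
        attribute.Int64("check_id", checkID),
        attribute.Int64("tenant_id", tenantID),
        attribute.String("probe", probeName),
    )
    return ctx, span
}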

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the prober knows its own name; you should use that instead of hard-coding "browser check".

Collaborator Author

I almost want to say the span should start one level up.

I'm fine with moving it one level up; what I'm not sure about is whether we actually want traces for non-k6 checks. The possibility is there, but do we have a use for them?

@roobre (Collaborator, Author) commented Oct 9, 2024

and Tracer(...) is the other one, where Java slipped into Go. What that thing is doing is going to a global (!!!) registry of tracers and pulling one out. I have always found it so confusing that I almost want to say that's what should be in the context instead of the tracer provider.

I don't think that's true. Each TracerProvider keeps a list of Tracers indexed by name, and we only create one TracerProvider:
https://github.com/open-telemetry/opentelemetry-go/blob/sdk/v1.30.0/sdk/trace/provider.go#L132

That's a bit awful, but it means that calling TracerProvider(ctx).Tracer("") is cheap: the tracer is created only once. That's why it is repeated all over the code.

I'm not trying to say the otel design is good or bad, just that this is how they intend it to be used.
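
To illustrate (a tiny standalone sketch, not code from this PR; per the linked SDK code, both calls should resolve to the same cached tracer):

package main

import (
    "fmt"

    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
    tp := sdktrace.NewTracerProvider()

    // The provider caches tracers by instrumentation scope name, so asking
    // for the unnamed tracer repeatedly does not build a new one each time.
    t1 := tp.Tracer("")
    t2 := tp.Tracer("")
    fmt.Println(t1 == t2) // expected: true
}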

Contributor

I'm fine with moving it one level up; what I'm not sure about is whether we actually want traces for non-k6 checks. The possibility is there, but do we have a use for them?

I would rather have the consistency.

If traces for protocol checks aren't useful, we can configure a sampling policy to drop those.
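
One way such a policy could look (a hedged sketch; the sdktrace.Sampler interface is real, but the span names matched here are assumptions about what this PR ends up using):

package sketch

import (
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// k6OnlySampler keeps traces whose root span corresponds to a k6-backed
// check and drops everything else.
type k6OnlySampler struct{}

func (k6OnlySampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
    switch p.Name {
    case "scripted check", "browser check", "multihttp check": // assumed span names
        return sdktrace.SamplingResult{Decision: sdktrace.RecordAndSample}
    default:
        return sdktrace.SamplingResult{Decision: sdktrace.Drop}
    }
}

func (k6OnlySampler) Description() string { return "k6OnlySampler" }

Wiring it up would be something like sdktrace.NewTracerProvider(sdktrace.WithSampler(sdktrace.ParentBased(k6OnlySampler{}))), so child spans follow the decision made for the root span.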

internal/k6runner/http.go (outdated; resolved)
Comment on lines +176 to +182
httpClient := http.Client{
    Transport: otelhttp.NewTransport(
        http.DefaultTransport,
        otelhttp.WithTracerProvider(trace.SpanFromContext(ctx).TracerProvider()),
        // Span names do not include method and path by default to avoid cardinality explosion with paths containing
        // IDs. As this is not the case with this endpoint, we use a custom formatter that includes both.
        otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string {
            return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
        }),
        otelhttp.WithPropagators(propagation.TraceContext{}), // Send TraceIDs in outgoing requests.
    ),
}
Contributor

I distrust otel a lot...

Isn't this doing a lot of extra work over and over again?

Wouldn't it be better to create the client once (e.g. with sync.OnceValue)?

Also, this line:

otelhttp.WithTracerProvider(trace.SpanFromContext(ctx).TracerProvider()),

Isn't the tracer provider stored in the context? You don't need to extract the span to get its provider; you can ask for the provider directly from the context. I guess there are situations where you have multiple tracer providers, but unless I'm missing something this isn't one of them.
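
For the sync.OnceValue part, a sketch of what I mean (assumptions: Go 1.21+, and that the client's options can be fixed at construction time; without WithTracerProvider this falls back to whichever provider is registered globally, which ties into the provider discussion below):

package sketch

import (
    "fmt"
    "net/http"
    "sync"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel/propagation"
)

// tracedClient builds the instrumented client exactly once and returns the
// same instance on every call, so the transport and its connection pool are reused.
var tracedClient = sync.OnceValue(func() *http.Client {
    return &http.Client{
        Transport: otelhttp.NewTransport(
            http.DefaultTransport,
            otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string {
                return fmt.Sprintf("%s %s", r.Method, r.URL.Path)
            }),
            otelhttp.WithPropagators(propagation.TraceContext{}),
        ),
    }
})

Call sites would then use tracedClient().Do(req), and the client is only created the first time it is needed.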

Collaborator Author

Isn't this doing a lot of extra work over and over again?

Theoretically, yes. We should be able to store this client somewhere, but figuring out where was more work than just creating it anew.

@roobre (Collaborator, Author) commented Oct 9, 2024

Isn't the tracer provider stored in the context? You don't need to extract the span to get its provider; you can ask for the provider directly from the context. I guess there are situations where you have multiple tracer providers, but unless I'm missing something this isn't one of them.

Also possible. I would do it that way if I were creating the http.Client only once.

My justification is as follows: Storing the span in the context and retrieving the tracerprovider from it is "the otel way" (Example). I will do it like that everywhere I can. Storing the tracerprovider in the context directly is a hack I made up for two reasons:

  • It doesn't make sense, in our codebase, for a root span to be created before a scrape is started (i.e. there is no global span anywhere).
  • To avoid adding the tracerprovider to the constructor of every scraper and prober in the code.

I will only use the tracing hack where necessary, so that if at some point we want to drop the hack (the tracing package), we need to change as little code as possible. That's what I tried to convey with the comment here:

// Note: This function is useful to obtain a TracerProvider to generate a root span. If a span already exists in the
// context, the idiomatic way of retrieving it is using:
// trace.SpanFromContext(ctx).TracerProvider()

Contributor

Theoretically, yes. We should be able to store this client somewhere, but figuring out where was more work than just creating it anew.

Note there's a subtle change here: when using DefaultClient, the client is not getting created over and over again. Clients are reusable. If I remember correctly, reusing clients allows the underlying code to reuse connections.

My justification is as follows: Storing the span in the context and retrieving the tracerprovider from it is "the otel way"

I know. I'm not debating that. My point is that the otel way doesn't buy you anything in this case. In otel it's theoretically possible to have multiple tracer providers (I challenge you to find something in the documentation that explains why you might want to do that), but within the actual use case we have here, I cannot see that happening.

To avoid adding the tracerprovider to the constructor of every scraper and prober in the code.

Hmm...

The current implementation is confusing.

  • The tracerprovider is in the context, but you don't want to use it.
  • Looking at the line of code I flagged, I'm not sure whether that context has a span or not.

The tracerprovider should be added to the scraper and passed down from there to the point where the root span should be created, which is probably the runProber function in scraper.go. There a context is created, which is where the span should be added.

If you do that, the only way to get a tracer provider is from the span, and that resolves the issue of why it's done one way or the other.
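
Roughly (a sketch of the suggestion; the field name and span name here are assumptions, while runProber and scraper.go are the real names mentioned above):

package sketch

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

// Scraper owns the tracer provider it was constructed with.
type Scraper struct {
    tracerProvider trace.TracerProvider
}

// runProber creates the per-check root span; everything below it can recover
// the provider from the span via trace.SpanFromContext(ctx).TracerProvider().
func (s *Scraper) runProber(ctx context.Context) {
    ctx, span := s.tracerProvider.Tracer("").Start(ctx, "check")
    defer span.End()

    _ = ctx // pass ctx down to the prober from here
}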

Collaborator Author

The current implementation is confusing.

I will grant you that; it's not the clearest. I was hoping the comment would help, but it's true that it creates some ambiguity. Let me try to drop the context thingie and pass the tracerprovider everywhere. It will be a noisy PR.

@roobre (Collaborator, Author) commented Oct 10, 2024

I've reworked things. At this point it didn't make sense to keep going with fixup! commits, so I've rebased the entire thing. The changes are roughly:

  • Objects now receive the TracerProvider normally, not through the context. I can hardly find words for how painful it was to wire this in everywhere. I spent one hour getting the thing to compile and another hour debugging nil pointer dereferences, and I'm still not sure all of them are solved.
  • Root spans are now initiated by the scraper and the adhoc handler. Probers no longer know what a trace is. This means we lose the ability to record probe errors in a span, as Probers hide that behind the success boolean. We could create a child span from the Prober with just that info, but that would mean duplicating it in every prober. Your call.

internal/k6runner/k6runner.go (resolved)
@roobre force-pushed the http-runner-tracing branch 4 times, most recently from d3a6d18 to 3b2a355, on October 9, 2024 12:15
@roobre requested a review from mem on October 9, 2024 12:41
@roobre force-pushed the http-runner-tracing branch 2 times, most recently from fde200d to 2973a75, on October 9, 2024 13:07
defer cancel()

var tracerProvider trace.TracerProvider = noop.NewTracerProvider()
if os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT") != "" || os.Getenv("OTEL_EXPORTER_OTLP_TRACES_ENDPOINT") != "" {
Contributor

This is still a problem.

The implicit behavior that otel favors is problematic.

  • Take the URL from a command line option.
  • Ignore the environment variables and just use otlptracehttp.WithEndpointURL.
  • If no option is passed, use the noop tracer provider.
  • Regardless of any of this, set the global tracer provider to noop, too.

That last point might be contentious: just like the code you have here, there are libraries out there that go through the same dance of looking at the environment and setting up tracing all by themselves, or, even worse, allow otel to do all of that on its own. Libraries should not define any of this. They should simply look for a span in the context and act accordingly.

And I don't want to use a code review to beat a dead horse, but I hope that this allows you to realize how problematic the functional options pattern is. There is an option to override the behavior with the environment variables; it's just very well hidden by that pattern.
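
Concretely, the explicit version could look something like this (a sketch; the flag plumbing and function name are hypothetical, while otlptracehttp.WithEndpointURL, the SDK constructors, and the noop provider are real APIs):

package sketch

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    "go.opentelemetry.io/otel/trace"
    "go.opentelemetry.io/otel/trace/noop"
)

// setupTracing builds a tracer provider from an endpoint URL taken from a
// command line flag, ignoring the OTEL_* environment variables entirely.
func setupTracing(ctx context.Context, endpointURL string) (trace.TracerProvider, error) {
    // Regardless of what we return, make sure nothing traces via the global provider.
    otel.SetTracerProvider(noop.NewTracerProvider())

    if endpointURL == "" {
        return noop.NewTracerProvider(), nil
    }

    exporter, err := otlptracehttp.New(ctx, otlptracehttp.WithEndpointURL(endpointURL))
    if err != nil {
        return nil, err
    }

    return sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter)), nil
}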

Collaborator Author

This is still a problem.

I don't see the problem. What you suggest is what I would do if this SDK were used in a vacuum, but it is not. The entire ecosystem is built around the expectation that you set those environment variables and it works. It's not just the trace exporter: the sampling algorithm, its parameters, all of it is configured automatically from environment variables. We would be going out of our way to change how one thing is configured, and by doing that we would make it inconsistent with the rest of the ecosystem, and also with the rest of the things that can be configured in this very project. I really don't think that's a good idea.

I hope that this allows you to realize how problematic the functional options pattern is.

I do; I don't like it either. Big fan of opts structs myself.

Contributor

The entire ecosystem is built around the expectation that you set those environment variables and it works.

The ecosystem is built around an ill-defined notion of "ease of use". I have had to debug issues with Otel-using libraries that obtain configuration from the environment "because it's there".

I do not want stuff happening behind our backs, and more importantly I do not want stuff happening behind our user's backs. I don't want users accidentally pushing traces. If they do want to push traces, they can explicitly ask for it.

I have added a commit on top of this PR. If you are not OK with that, remove it.

internal/k6runner/k6runner.go (resolved)
internal/k6runner/k6runner.go (resolved)
@roobre force-pushed the http-runner-tracing branch 3 times, most recently from 2661413 to 6ebe479, on October 10, 2024 10:13
@roobre requested a review from mem on October 10, 2024 11:25
defer checkSpan.End()
checkSpan.SetAttributes(
    attribute.String("type", s.prober.Name()),
    attribute.String("region", s.probe.Region),
Collaborator Author

This seems to be the probe region (as expected), not the tenant region. Not sure where to get the latter from.

@roobre force-pushed the http-runner-tracing branch 2 times, most recently from 6a4187e to 8d91f8f, on October 10, 2024 13:27
Signed-off-by: Marcelo E. Magallon <marcelo.magallon@grafana.com>