Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OTel Context is getting lost in GraphQL manual instrumentation #6583

Open
govi20 opened this issue Jul 16, 2024 · 14 comments
Open

OTel Context is getting lost in GraphQL manual instrumentation #6583

govi20 opened this issue Jul 16, 2024 · 14 comments
Labels
Bug Something isn't working

Comments

@govi20
Copy link

govi20 commented Jul 16, 2024

I have an async GraphQL resolver which I am using along with a GraphQLdata loader. The GraphQL service performs manual instrumentation.
The problem I am facing here is that OpenTelemetry context is getting lost in the load method. This is happening even before offloading the task on the context-wrapped executor.

Steps to reproduce

  1. Clone: https://github.com/govi20/dgs-otel
  2. Build and run the project.
  3. Access localhost:9090/graphiql
  4. execute the following GraphQL query and see the logs where I've printed Context in DepartmentDataLoader
query emploees {
  employees {
    id
    name
    department {
      id
      name
    }
  }
}

What did you expect to see?
Otel Span and context should be available in the DepartmentDataLoader's load() method as there is no thread switch in between. the span's end method is not called so I assume the span is not closed either

What did you see instead?
OTel Context is not getting propagated.

What version and what artifacts are you using?
Version: 1.38.0, I use custom implementation of SpanExporter.

Environment
This is not environment specific issue, it's reproducible on MacOS as well as CentOS.

Additional context
I've reported this issue on GraphQL DGS Framework that I use: Netflix/dgs-framework#1928
The Netflix DGS Folks have recommended me to check with tracing framework team.

@govi20 govi20 added the Bug Something isn't working label Jul 16, 2024
@govi20
Copy link
Author

govi20 commented Jul 16, 2024

Let me know if a sample project with minimal reproducible code is required.

@jack-berg
Copy link
Member

Let me know if a sample project with minimal reproducible code is required.

That would be very useful 🙂

@govi20
Copy link
Author

govi20 commented Jul 21, 2024

@jack-berg here is a sample project with a reproducible example https://github.com/govi20/dgs-otel

This is where the OTel Context gets lost => DepartmentDataloader

and the data loaders get called from here => EmployeeDataFetcher and there is no thread switch in between. The thread pool that I have configured executor that wraps the task using Context.taskWrapping API.

I've added steps to reproduce in the bug report.

@jkwatson
Copy link
Contributor

It looks like dgs is based on graphql-java, which has library instrumentation (https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/graphql-java/graphql-java-20.0/library). I wonder if you just plug in that instrumentation library if a bunch of these issues would go away. Worth a shot, at least, to see how it does!

@govi20
Copy link
Author

govi20 commented Jul 21, 2024

@jkwatson does this library perform auto instrumentation? I need to rely on manual instrumentation logic

@jkwatson
Copy link
Contributor

The "library" instrumentation doesn't require the javaagent. You can just plug it in programmatically. You might need to figure out how to hook it into dgs, but it doesn't need the agent.

@jkwatson
Copy link
Contributor

Given this documentation, I would guess it'll work just fine: https://netflix.github.io/dgs/advanced/instrumentation/

@govi20
Copy link
Author

govi20 commented Jul 22, 2024

@jkwatson That's exactly what I am doing in my code, manually instrumenting using SimpleInstrumention implementation

@jkwatson
Copy link
Contributor

@jkwatson That's exactly what I am doing in my code, manually instrumenting using SimpleInstrumention implementation

I recommend trying the library instrumentation then, and raising issues in the instrumentation repo if it has holes.

@govi20
Copy link
Author

govi20 commented Jul 22, 2024

@jkwatson Unfortunately, I can't use the library because it lacks the ability to instrument GraphQL data resolvers.

But still, I've tried it out, and the issue can indeed be reproducible with this library. I plan to report this issue to the instrumentation repo. However, I'm unsure if the problem lies within the core libraries or in the instrumentation.

@jkwatson
Copy link
Contributor

The issue won't be with the core libraries. It's definitely an issue with making the instrumentation correctly propagate the context where it needs to go.

@harshitrjpt
Copy link

Continuing @kilink 's comment Netflix/dgs-framework#1928 (comment) ,
I don't think it's a bug in otel java instrumentation either.
We know that to pass context in CompletableFuture we need to wrap Executor in Context.taskWrapping(), which is already done for the dgsAsyncTaskExecutor.
However the catch is:
since batch loading is in action here, there is another CompletableFuture, which uses default executor, for the Batch Loader wrapping the DepartmentLoader. This is the reason why context is reset. If you debug line 39 EmployeeDataFetcher.department() and keep going in, you can see DataLoaderHelper.queueOrInvokeLoader()

Screenshot 2024-09-10 at 9 11 55 PM

To me it seems parameters.get(0).getContext().makeCurrent() is the only way to propagate context, without changing the current code.

@govi20
Copy link
Author

govi20 commented Sep 10, 2024

since batch loading is in action here, there is another CompletableFuture, which uses default executor

If I remember correctly, it doesn’t work even if the batch contains only 1 parameter.

To me it seems parameters.get(0).getContext().makeCurrent() is the only way to propagate context

Yes that fixes the issue but it is a workaround, doesn’t look elegant because ‘parameter’ is a domain object.

@harshitrjpt
Copy link

harshitrjpt commented Sep 10, 2024

If I remember correctly, it doesn’t work even if the batch contains only 1 parameter.

Doesn't matter if it's 1 or more parameter. It's the DataLoader framework which is spitting out the CompletableFuture with fixed default executor.

Yes that fixes the issue but it is a workaround, doesn’t look elegant because ‘parameter’ is a domain object.

Doesn't seem that this is in opentelemetry control. Opentelemetry already provides context propagation by customizing CompletableFuture with otel context wrapped executor, but the DataLoader framework is using fixed default executor in between.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants