WaitHandleWait profiler #5814

Open
verdie-g opened this issue Jul 26, 2024 · 3 comments

Assignees: gleocadie
Labels: area:profiler (issues related to the continuous profiler)

Comments

@verdie-g (Contributor)

Is your feature request related to a problem? Please describe.

Thread pool starvation is a very common performance issue in my company's services. It is usually caused by sync-over-async code. It is relatively easy to find when the issue is continuous, but it gets tricky when it occurs in the callback of a 15-minute timer.
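
To make the pattern concrete, here is a minimal, hypothetical repro of sync-over-async (illustrative only, not code from our services):

```csharp
using System;
using System.Threading.Tasks;

class SyncOverAsyncDemo
{
    static async Task<string> FetchAsync()
    {
        await Task.Delay(100); // stands in for a real async I/O call
        return "done";
    }

    // Sync-over-async: .Result blocks the current thread-pool thread until
    // the task completes. Under load, every pool thread can end up parked
    // like this; the pool injects new threads slowly, so queued work
    // (including the continuations these threads are waiting on) starves.
    static void Handler() => Console.WriteLine(FetchAsync().Result);

    static void Main() => Handler();
}
```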

Describe the solution you'd like

In .NET 9+, I added an event (dotnet/runtime#94737) that is emitted whenever a managed thread blocks (Monitor.Wait, Task.Wait, ManualResetEventSlim.Wait, SemaphoreSlim.Wait, ...). It would be awesome if this event were collected just like the exception, allocation, and contention profilers collect theirs.
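
For reference, a minimal in-process sketch of consuming these events with an EventListener. The event names (WaitHandleWaitStart/WaitHandleWaitStop) and the keyword value are from my recollection of the runtime definitions, so treat them as assumptions and verify them against your runtime version:

```csharp
using System;
using System.Diagnostics.Tracing;

// Sketch: subscribe to the .NET 9+ wait events from the runtime event source.
// 0x40000000000 is assumed to be the WaitHandle keyword.
sealed class WaitHandleWaitListener : EventListener
{
    private const EventKeywords WaitHandleKeyword = (EventKeywords)0x40000000000;

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Microsoft-Windows-DotNETRuntime")
            EnableEvents(source, EventLevel.Verbose, WaitHandleKeyword);
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventName is "WaitHandleWaitStart" or "WaitHandleWaitStop")
            Console.WriteLine($"{e.TimeStamp:O} {e.EventName} thread={e.OSThreadId}");
    }
}
```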

One annoying thing about this event is that most occurrences will happen on non-thread-pool threads that spend most of their time blocked. For that, I opened dotnet/runtime#102326.

Describe alternatives you've considered

Currently, I take very long traces of the WaitHandleWait events and try to find the issue in PerfView, but it's not reliable.
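
For anyone wanting to reproduce: I collect those traces with something like `dotnet-trace collect -p <pid> --providers Microsoft-Windows-DotNETRuntime:0x40000000000:5` (assuming 0x40000000000 is the WaitHandle keyword; 5 is the verbose level) and open the resulting .nettrace file in PerfView.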

@gleocadie (Collaborator)

Hi @verdie-g,

Thanks for the FR, that's really interesting. Thread pool starvation is exactly the kind of thing we are trying to detect and report in our UI.
Currently, we push a metric on the number of managed threads, which can help detect thread pool starvation (a continuously increasing thread count), but nothing more (no call stacks, no additional info).
We want to add more information, and thanks to your PR we could add another level of detail (thanks for that :) ).
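
(For context, a rough sketch of the kind of signal that metric captures, using the public ThreadPool counters; illustrative only, not our actual collector:)

```csharp
using System;
using System.Threading;

// Illustrative probe: a continuously climbing ThreadCount together with a
// growing PendingWorkItemCount is the starvation signature described above.
class ThreadPoolProbe
{
    static void Main()
    {
        while (true)
        {
            Console.WriteLine(
                $"threads={ThreadPool.ThreadCount} queued={ThreadPool.PendingWorkItemCount}");
            Thread.Sleep(1000);
        }
    }
}
```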

We can collect this event, and it will be done for all managed threads. As of today, the filtering is done in the UI (using the thread name/id, etc.).

Just to be sure, and to make this useful for everyone: as a user, how would you like that information displayed so that it gives you actionable ways to investigate and fix the issue (call stacks, metrics, monitors...)?

We'll let you know about the status of this FR.
Thanks for the proposal.

gleocadie added the area:profiler label Jul 31, 2024

@verdie-g (Contributor, Author)

> Just to be sure, and to make this useful for everyone: as a user, how would you like that information displayed?

I would like to see a flamegraph of the waits, just like the contention one.

Ideally, the unit would be the wait time; the number of events can be extremely misleading. Unlike the contention event, it was decided not to include the duration in the wait event, so in my custom PerfView I use the event's thread id to match the start and the stop.

As a metric, the total wait time could also be interesting. For example, when a latency spike is observed between two services and the calling service shows a spike in wait time, that helps establish that the issue is not in the called service.
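
In case it helps, a rough sketch of that matching logic (same assumed event names and keyword value as in my sketch above):

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics.Tracing;

// Sketch: compute wait durations by pairing each WaitHandleWaitStop with the
// pending WaitHandleWaitStart on the same OS thread, since the stop event
// itself carries no duration.
sealed class WaitDurationListener : EventListener
{
    private readonly ConcurrentDictionary<long, DateTime> _pendingStarts = new();

    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "Microsoft-Windows-DotNETRuntime")
            EnableEvents(source, EventLevel.Verbose, (EventKeywords)0x40000000000); // assumed WaitHandle keyword
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventName == "WaitHandleWaitStart")
        {
            _pendingStarts[e.OSThreadId] = e.TimeStamp;
        }
        else if (e.EventName == "WaitHandleWaitStop" &&
                 _pendingStarts.TryRemove(e.OSThreadId, out DateTime start))
        {
            TimeSpan waited = e.TimeStamp - start;
            Console.WriteLine($"thread {e.OSThreadId} waited {waited.TotalMilliseconds:F1} ms");
        }
    }
}
```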

@gleocadie (Collaborator)

> I would like to see a flamegraph of the waits, just like the contention one. [...] As a metric, the total wait time could also be interesting.

Thanks, I'll create a FR in our backlog and we'll plan it.

gleocadie self-assigned this Aug 1, 2024