Performance doesn't scale with more cores #1977
Yes, we need some code examples of what you are doing here. Please do post an example of what you are actually doing, as it is only guesswork otherwise.
We have a similar issue even without Remote.
OK, then we need a code example to reproduce this.
Sorry for the delay; we are still going to provide an example. Sharing this requires some time to get approval from the company.
The source code to reproduce the problem is in the attachment.
To run from IDE:
To run remotely:
@rogeralsing please reopen.
I'm running the example right now, and the first thing that comes to mind is that you are probably queueing up a lot of fire-and-forget tasks on the thread pool.
The increasing latency might mean the thread pool is busy with other tasks.
Eventually, the entire thread-pool queue might be filled with tasks of this kind. I'll dig deeper later today, but the increasing latency is very suspicious.
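To illustrate the fire-and-forget concern (this is a hedged sketch, not the reporter's actual code), unawaited `Task.Run` calls pile up in the global thread-pool queue, and any latency-sensitive work queued afterwards has to wait behind that backlog:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class FireAndForgetDemo
{
    static void Main()
    {
        // Queue many fire-and-forget work items; nothing awaits them,
        // so they accumulate in the thread-pool queue.
        for (int i = 0; i < 100_000; i++)
        {
            _ = Task.Run(() => Thread.SpinWait(10_000));
        }

        // Work queued now must wait behind the backlog above,
        // which shows up as growing latency in measurements.
        var sw = Stopwatch.StartNew();
        Task.Run(() => { }).Wait();
        Console.WriteLine($"Queue delay: {sw.ElapsedMilliseconds} ms");
    }
}
```

The delay printed grows with the size of the backlog, which is why a saturated thread-pool queue can masquerade as rising request latency.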
That's a pretty big latency. Do you run in the Debug configuration or with a debugger attached? I wouldn't rely on these latency numbers. As you can see in the screenshots above, with all optimizations and 16+ cores, latency doesn't change that much for us.
In this repro, no additional executions are added to the list in
@rogeralsing, we use a Semaphore to limit the number of concurrent requests. The waiting time should be as long as needed for the system to process a request and free its "slot"; that's not a problem at all. It's only one thread, and it doesn't even belong to the thread pool. We already profiled the thing, and I already saw what's in the screenshots. For example, WorkerThreadStart is not a new-thread startup but a loop that picks up tasks from the thread-pool queue. By the way, this method also uses a Semaphore.
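A minimal sketch of the throttling pattern described, assuming `SemaphoreSlim`-based gating (the limit and handler names here are hypothetical, not taken from the attachment):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottleDemo
{
    // Hypothetical concurrency limit; the real benchmark runs up to 256 parallel requests.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(8);

    static async Task HandleRequestAsync(int id)
    {
        await Gate.WaitAsync();      // wait for a free "slot"
        try
        {
            await Task.Delay(10);    // stand-in for the real request processing
        }
        finally
        {
            Gate.Release();          // free the slot for the next request
        }
    }

    static async Task Main()
    {
        await Task.WhenAll(Enumerable.Range(0, 64).Select(HandleRequestAsync));
        Console.WriteLine("done");
    }
}
```

With this pattern, time spent inside `WaitAsync` is intentional backpressure, not lost throughput, which matches the comment's point that the semaphore wait itself is not the problem.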
@rogeralsing my guess is that too much GC is going on in generations 0 and 1. The garbage is produced by tasks and async state machines. Unlike gen 2, such collections are always stop-the-world. It looks like, at some point of vertical scaling, GC time grows faster than the additional load that can be processed, so we see no improvement from adding cores.
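One hedged way to check this hypothesis is to sample per-generation collection counts around an allocation-heavy async window; this is an illustrative sketch, not the reporter's benchmark:

```csharp
using System;
using System.Threading.Tasks;

class GcPressureDemo
{
    static async Task Main()
    {
        int gen0Before = GC.CollectionCount(0);
        int gen1Before = GC.CollectionCount(1);
        int gen2Before = GC.CollectionCount(2);

        // Allocation-heavy async work: each Task and async state machine
        // produces short-lived garbage that dies in gen 0/1.
        for (int i = 0; i < 1000; i++)
        {
            await Task.WhenAll(
                Task.Run(() => new byte[10_000]),
                Task.Run(() => new byte[10_000]));
        }

        Console.WriteLine($"gen0 collections: {GC.CollectionCount(0) - gen0Before}");
        Console.WriteLine($"gen1 collections: {GC.CollectionCount(1) - gen1Before}");
        Console.WriteLine($"gen2 collections: {GC.CollectionCount(2) - gen2Before}");
    }
}
```

If gen0/gen1 counts dominate while gen2 stays near zero, that supports the theory that short-lived allocation pressure, not old-generation collections, is eating the extra cores.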
We've performed several benchmarks on 8, 16, 32, and 48 cores on AWS and discovered these results:
8->16 cores: almost +100% RPS
16->32 cores: +20% RPS
32->48 cores: +1% RPS
Take a look at these results:
16 cores: (screenshot)
32 cores: (screenshot)
48 cores: (screenshot)
The benchmark runs up to 256 parallel requests.
The profiler shows that most work is done by the thread pool's WorkerThreadStart method, inside its loop where it waits for tasks and calls Semaphore.Wait. We tried running different configurations with 1-2 clients and 1-2 servers, changing the parallel request count and the dispatcher throughput, but nothing showed any significant improvement.
What can cause this?
We'll try to prepare a reproducible example if you are willing to investigate this.