Performance doesn't scale with more cores #1977
Yes, we need some code examples of what you are doing here. Please do post an example of what you are actually doing, as it is only guesswork otherwise.
We have a similar issue even without Remote.
OK, then we need a code example to reproduce this.
Sorry for the delay; we are still going to provide an example. Sharing this requires some time to get approval from the company.
The source code to reproduce the problem is in the attachment.
To run from IDE:
To run remotely:
@rogeralsing please reopen.
I'm running the example right now, and the first thing that comes to mind is that you are probably queueing up a lot of fire-and-forget tasks on the thread pool.
The increasing latency might mean the thread pool is busy with other tasks.
Eventually, the entire thread-pool queue might be filled with tasks of this kind. I'll dig deeper later today, but the increasing latency is very suspicious.
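To illustrate the fire-and-forget concern (this is a hedged sketch, not the reporter's actual code), unawaited `Task.Run` calls pile up in the global thread-pool queue, and any latency-sensitive work queued afterwards has to wait behind that backlog:

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class FireAndForgetDemo
{
    static void Main()
    {
        // Queue many fire-and-forget work items; nothing awaits them,
        // so they accumulate in the thread-pool queue.
        for (int i = 0; i < 100_000; i++)
        {
            _ = Task.Run(() => Thread.SpinWait(10_000));
        }

        // Work queued now must wait behind the backlog above,
        // which shows up as growing latency in measurements.
        var sw = Stopwatch.StartNew();
        Task.Run(() => { }).Wait();
        Console.WriteLine($"Queue delay: {sw.ElapsedMilliseconds} ms");
    }
}
```

The delay printed grows with the size of the backlog, which is why a saturated thread-pool queue can masquerade as rising request latency.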
That's a pretty big latency. Do you run in the Debug configuration or with a debugger attached? I wouldn't rely on these latency numbers. As you can see in the screenshots above, with all optimizations and 16+ cores, latency doesn't change that much for us.
In this repro, no additional executions are added to the list in
@rogeralsing, we use a Semaphore to limit the number of concurrent requests. The waiting time should be as long as needed for the system to process a request and free its "slot"; that's not a problem at all. It's only one thread, and it doesn't even belong to the thread pool. We already profiled the thing, and I already saw what's in the screenshots. For example, WorkerThreadStart is not a new-thread startup but a loop that picks up tasks from the thread-pool queue. By the way, this method also uses a Semaphore.
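A minimal sketch of the throttling pattern described, assuming `SemaphoreSlim`-based gating (the limit and handler names here are hypothetical, not taken from the attachment):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottleDemo
{
    // Hypothetical concurrency limit; the real benchmark runs up to 256 parallel requests.
    static readonly SemaphoreSlim Gate = new SemaphoreSlim(8);

    static async Task HandleRequestAsync(int id)
    {
        await Gate.WaitAsync();      // wait for a free "slot"
        try
        {
            await Task.Delay(10);    // stand-in for the real request processing
        }
        finally
        {
            Gate.Release();          // free the slot for the next request
        }
    }

    static async Task Main()
    {
        await Task.WhenAll(Enumerable.Range(0, 64).Select(HandleRequestAsync));
        Console.WriteLine("done");
    }
}
```

With this pattern, time spent inside `WaitAsync` is intentional backpressure, not lost throughput, which matches the comment's point that the semaphore wait itself is not the problem.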
@rogeralsing my guess is that too much GC is going on in generations 0 and 1. The garbage is produced by tasks and async state machines. Unlike gen 2, such collections are always stop-the-world. It looks like, at some point of vertical scaling, GC time grows faster than the additional load that can be processed, so we see no improvement from adding cores.
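One hedged way to check this hypothesis is to sample per-generation collection counts around an allocation-heavy async window; this is an illustrative sketch, not the reporter's benchmark:

```csharp
using System;
using System.Threading.Tasks;

class GcPressureDemo
{
    static async Task Main()
    {
        int gen0Before = GC.CollectionCount(0);
        int gen1Before = GC.CollectionCount(1);
        int gen2Before = GC.CollectionCount(2);

        // Allocation-heavy async work: each Task and async state machine
        // produces short-lived garbage that dies in gen 0/1.
        for (int i = 0; i < 1000; i++)
        {
            await Task.WhenAll(
                Task.Run(() => new byte[10_000]),
                Task.Run(() => new byte[10_000]));
        }

        Console.WriteLine($"gen0 collections: {GC.CollectionCount(0) - gen0Before}");
        Console.WriteLine($"gen1 collections: {GC.CollectionCount(1) - gen1Before}");
        Console.WriteLine($"gen2 collections: {GC.CollectionCount(2) - gen2Before}");
    }
}
```

If gen0/gen1 counts dominate while gen2 stays near zero, that supports the theory that short-lived allocation pressure, not old-generation collections, is eating the extra cores.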
We've performed several benchmarks on 8, 16, 32, and 48 cores on AWS and discovered these results:
8->16 cores: almost +100% RPS
16->32 cores: +20% RPS
32->48 cores: +1% RPS
Take a look at these results:
16 cores: (screenshot)
32 cores: (screenshot)
48 cores: (screenshot)
The benchmark runs up to 256 parallel requests.
The profiler shows that most work is done by the thread pool's WorkerThreadStart method, inside its loop where it waits for tasks and calls Semaphore.Wait. We tried running different configurations with 1-2 clients and 1-2 servers, changing the parallel request count and the dispatcher throughput, but nothing showed any significant improvement.
What can cause this?
We'll try to prepare a reproducible example if you are willing to investigate this.