
Upgrade dependency opensearch-java 2.1.0 to newer #4821

Open · vttranlina opened this issue Jul 18, 2023 · 19 comments
@vttranlina commented Jul 18, 2023

The update of opensearch-java 2.1.0 -> 2.6.0 makes a test case fail.

Investigate why and complete the update.

apache#1647 (comment)

apache#1647 (comment)

EDIT @Arsnael:

We also need to:

  • try refactoring the code, as the new OpenSearch Java client ships its own HTTP transport class; if it is not too problematic, we could then drop the opensearch-rest-client dependency (see the sketch after this list)
  • run a perf test of that refactoring and compare whether the results are better, similar, or worse
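For illustration, a minimal sketch of that refactoring idea, assuming opensearch-java 2.6.x with its bundled httpclient5 transport (the factory class, host and port are hypothetical; ApacheHttpClient5TransportBuilder is the transport builder the new client ships):

    import org.apache.hc.core5.http.HttpHost;
    import org.opensearch.client.opensearch.OpenSearchClient;
    import org.opensearch.client.transport.OpenSearchTransport;
    import org.opensearch.client.transport.httpclient5.ApacheHttpClient5TransportBuilder;

    public class Http5ClientFactory {
        // Builds the client on the httpclient5-based transport bundled with
        // opensearch-java, with no dependency on opensearch-rest-client.
        public static OpenSearchClient create(String host, int port) {
            OpenSearchTransport transport = ApacheHttpClient5TransportBuilder
                .builder(new HttpHost("http", host, port))
                .build();
            return new OpenSearchClient(transport);
        }
    }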
@Arsnael commented Jul 18, 2023

JIRA: https://issues.apache.org/jira/browse/JAMES-3929

Arsnael self-assigned this Jul 18, 2023
@Arsnael commented Jul 18, 2023

PR: apache#1648

Locally it seemed to pass, but let's see with the CI. Then I will run some perf tests.

@Arsnael commented Jul 24, 2023

opensearch.zip

Flame graphs and thread dump... Everything seems blocked, but it's really hard to see what's wrong. I think I should capture these a bit earlier after starting a run; I waited too long and James was completely bloated. Will try again.

@quantranhong1999

Can you try to get a heap dump (maybe upon OOM, the JVM has an option for this) and share it with the team?
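For reference, these are the standard JVM flags for that (the dump path and jar name are illustrative):

    java -XX:+HeapDumpOnOutOfMemoryError \
         -XX:HeapDumpPath=/tmp/james-heap.hprof \
         -jar james.jar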

@chibenwa

A CPU spending 80% of its time allocating arrays: there's something wrong!

@Arsnael commented Jul 25, 2023

> A CPU spending 80% of its time allocating arrays: there's something wrong!

Noticed that; not sure where to look yet though ^^'

> Can you try to get a heap dump (maybe upon OOM, the JVM has an option for this) and share it with the team?

I did actually, I just forgot to upload it yesterday, as it was too big for the GitHub zip.

@Arsnael commented Jul 26, 2023

https://tdrive.linagora.com/shared/111111114AxHLNjGvEGJpA/drive/wAFGZg85kbTkqyFQa4XcL7/t/43aa80f3b2a79faa34d4635e6627be9d0f17d63d

This time I tried to capture it earlier, even if it gets bloated pretty quickly again.

In the OpenSearch driver threads, I think what's interesting is:

	at org.apache.hc.core5.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:54)
	at org.opensearch.client.transport.httpclient5.internal.HeapBufferedAsyncEntityConsumer.data(HeapBufferedAsyncEntityConsumer.java:91)
	at org.apache.hc.core5.http.nio.entity.AbstractBinDataConsumer.consume(AbstractBinDataConsumer.java:75)
	at org.apache.hc.core5.http.nio.support.AbstractAsyncResponseConsumer.consume(AbstractAsyncResponseConsumer.java:141)
	at org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.consume(HttpAsyncMainClientExec.java:227)
	at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamHandler.consumeData(ClientHttp1StreamHandler.java:265)
	at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamDuplexer.consumeData(ClientHttp1StreamDuplexer.java:354)
	at org.apache.hc.core5.http.impl.nio.AbstractHttp1StreamDuplexer.onInput(AbstractHttp1StreamDuplexer.java:325)
	at org.apache.hc.core5.http.impl.nio.AbstractHttp1IOEventHandler.inputReady(AbstractHttp1IOEventHandler.java:64)
	at org.apache.hc.core5.http.impl.nio.ClientHttp1IOEventHandler.inputReady(ClientHttp1IOEventHandler.java:41)
	at org.apache.hc.core5.reactor.ssl.SSLIOSession.decryptData(SSLIOSession.java:551)
	at org.apache.hc.core5.reactor.ssl.SSLIOSession.access$400(SSLIOSession.java:72)
	at org.apache.hc.core5.reactor.ssl.SSLIOSession$1.inputReady(SSLIOSession.java:172)
	at org.apache.hc.core5.reactor.InternalDataChannel.onIOEvent(InternalDataChannel.java:133)
	at org.apache.hc.core5.reactor.InternalChannel.handleIOEvent(InternalChannel.java:51)
	at org.apache.hc.core5.reactor.SingleCoreIOReactor.processEvents(SingleCoreIOReactor.java:178)
	at org.apache.hc.core5.reactor.SingleCoreIOReactor.doExecute(SingleCoreIOReactor.java:127)
	at org.apache.hc.core5.reactor.AbstractSingleCoreIOReactor.execute(AbstractSingleCoreIOReactor.java:85)
	at org.apache.hc.core5.reactor.IOReactorWorker.run(IOReactorWorker.java:44)
	at java.lang.Thread.runWith(java.base@20.0.1/Unknown Source)
	at java.lang.Thread.run(java.base@20.0.1/Unknown Source)

We can see it hitting errors on ByteArrayBuffer, but that comes from Apache's core5 lib. I'm not sure how far I can dig into it, given how hard it is for me to fully focus on a task these days. But I will try.

@quantranhong1999 commented Jul 27, 2023

https://tdrive.linagora.com/shared/111111114AxHLNjGvEGJpA/drive/wAFGZg85kbTkqyFQa4XcL7/t/43aa80f3b2a79faa34d4635e6627be9d0f17d63d

Q: This heap dump was obtained from the JVM running the tests, right?

A bit of analysis: https://heaphero.io/heap-report-wc.jsp?p=RThCZ0lOakJ3K2dCd3FiSEc4VEhsdEkxUGFsRW8rR2tWWjNxdm5rZjVTeFBmUEVYc0xHVEpCK0NkSEF4MVpoYlFnaHBWNkU1dXJKVHpHdnFxOW1Xamc9PQ==

5 times "One instance of "org.apache.hc.core5.reactor.InternalDataChannel" loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x740050000" occupies 104,961,680 (15.68%) bytes".

That matches exactly the 5 tests in WithCassandraBlobStoreTest.

So my guess is that the test memory blows up because we do not release the hc5 client gracefully after each test / upon James shutdown.

I checked the production code: the ClientProvider bean is a SINGLETON, so we should be safe from OOM issues in prod.
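If that guess is right, the cleanup would look something like this sketch (class and field names are hypothetical; OpenSearchTransport is Closeable, and closing it tears down the underlying hc5 client):

    import java.io.IOException;
    import org.junit.jupiter.api.AfterEach;
    import org.opensearch.client.transport.OpenSearchTransport;

    class OpenSearchTestBase {
        OpenSearchTransport transport; // assumed to be created per test in a @BeforeEach

        @AfterEach
        void releaseClient() throws IOException {
            if (transport != null) {
                // Closing the transport shuts down the hc5 IO reactor threads
                // and lets their buffers be reclaimed between tests.
                transport.close();
            }
        }
    }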

@Arsnael commented Jul 28, 2023

Honestly, I first tried to reproduce it by running WithCassandraBlobStoreTest, even with a higher number of messages, and I could not reproduce it locally ^^' That's why I switched back to the sandbox.

@Arsnael commented Jul 28, 2023

I really can't get around this. I've been trying to check the code in opensearch-java, but I hardly understand all those shenanigans.

What I think, though, is that if you look here: https://heaphero.io/heap-report-wc.jsp?p=RThCZ0lOakJ3K2dCd3FiSEc4VEhsdEkxUGFsRW8rR2tWWjNxdm5rZjVTeFBmUEVYc0xHVEpCK0NkSEF4MVpoYlFnaHBWNkU1dXJKVHpHdnFxOW1Xamc9PQ== (cool online tool btw @quantranhong1999), the value of one of the byte arrays reported as a leak suspect (actually the same for all of them) is:

{"_scroll_id":"FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFnAzVm5fQmdqUWltOFN1RGlDbTQ2cGcAAAAAAAWREBZYcjBBdjlwYlNST3ktd3kzSlBkcWlR","took":40,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":7179,"relation":"eq"},"max_score":null,"hits":[{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:403","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[403]},"sort":[403]},{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:404","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[404]},"sort":[404]},{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:405","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[405]},"sort":[405]},{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:406","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[406]},"sort":[406]},{"_index":"mailbox_v2","_id":"a29f0

So it seems to happen with scrolls. Maybe it doesn't clean those up until the scroll is finished, and because we have lots of records in preprod, it goes boom before the end? Something along those lines, I guess.
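If that hypothesis holds, the fix would be along the lines of clearing the scroll context explicitly once iteration is done, so the server can drop it early instead of holding it until expiry. A hedged sketch (the helper class is hypothetical; clearScroll is part of the opensearch-java API):

    import java.io.IOException;
    import org.opensearch.client.opensearch.OpenSearchClient;
    import org.opensearch.client.opensearch.core.ClearScrollRequest;

    class ScrollCleanup {
        // Asks OpenSearch to drop the scroll context identified by scrollId.
        static void closeScroll(OpenSearchClient client, String scrollId) throws IOException {
            client.clearScroll(ClearScrollRequest.of(b -> b.scrollId(scrollId)));
        }
    }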

Arsnael removed their assignment Aug 9, 2023
Arsnael self-assigned this Feb 5, 2024
@Arsnael commented Feb 7, 2024

IMAP:

[Image: IMAP performance results]

JMAP:

[Image: JMAP performance results]

Perf is similar to release 0.8.2; no impact on the normal simulation tests.

Will try the full IMAP simulation with more occurrences of the searches. I should actually also disable the searchOverrides that are enabled on our staging environments to fully test the impact (just thought about it).

@Arsnael commented Feb 15, 2024

Well, this is a run with searchOverrides off, IMAP simulation only.

0.8.2:

[Image: 0.8.2 IMAP search results]

http5 update:

[Image: http5 update IMAP search results]

We can see quite a difference here on the searches. With the work on this PR, dropping the rest client, we lose 35% on searchDeleted and 48% on searchUnseen at P99. There is also a 10% perf loss on the mean for searchDeleted and searchUnseen, compared to 0.8.2.

I think the work done on the java client is still not good enough to drop the rest client for transport yet.

I'm in favor of waiting and not merging this work yet.

Other feedback welcome :)

@chibenwa

Agree

@quantranhong1999

> I'm in favor of waiting and not merging this work yet.

+1

@Arsnael commented Feb 16, 2024

Alright, putting it back in the backlog then. It was worth a try; it already seems better than last time, just not good enough yet :)

@Arsnael commented Jul 4, 2024

Latest master, IMAP:

[Image: latest master IMAP results]

Latest master with the OpenSearch http5 transport client:

[Image: http5 transport client IMAP results]

With searchOverrides disabled of course, to force the hits on OpenSearch.

Huge perf loss... Actually much worse than a few months ago, which is rather odd...

Also, the Docker image size is the same for both, so there is no gain there either...

The switch is not worth it at all.

@chibenwa commented Jul 4, 2024

May I ask for flame graphs of the terrible run?

@Arsnael commented Jul 4, 2024

When I finish with the release runs.

@Arsnael commented Jul 5, 2024

flame_graphs.zip

The core5 lib used for the OpenSearch transport is clearly taking around 30% of CPU and 50% of memory. Badly optimized?

4 participants