[common/thrift_client_pool] close connections that are in bad state to avoid accumulating CLOSE_WAIT #563

Conversation
int n = 0;
while (itor != channels_.end() &&
       n++ < FLAGS_channel_max_checking_size) {
  auto c = itor->second.first.lock();
Is the lock here to ensure that the weak ptr is not released while its status is checked in the next few lines?
Oh lock() converts the weak pointer to a shared pointer: https://en.cppreference.com/w/cpp/memory/weak_ptr/lock
And we need to convert to a shared pointer to be able to access the underlying object.
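For reference, a minimal standalone sketch of the lock() semantics -- nothing pool-specific here, just standard std::weak_ptr behavior:

```cpp
#include <cassert>
#include <memory>

int main() {
  auto sp = std::make_shared<int>(42);
  std::weak_ptr<int> wp = sp;  // non-owning observer

  // lock() atomically promotes the weak_ptr to a shared_ptr; it returns
  // a non-empty pointer only while the object is still alive.
  if (auto locked = wp.lock()) {
    assert(*locked == 42);  // safe to dereference through the shared_ptr
  }

  sp.reset();  // last owner gone, the int is destroyed
  assert(wp.lock() == nullptr);  // lock() now yields an empty shared_ptr
}
```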
this creates a shared_ptr every time, for FLAGS_channel_max_checking_size number of times, is that intended?
Yes, it's intended. itor will be advanced either via erase() or ++ below. So it creates a shared_ptr for a different channel each time.
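A sketch of that erase-or-advance idiom, in case it helps; Channel and the flattened weak_ptr map value are simplified stand-ins for the pool's actual types:

```cpp
#include <map>
#include <memory>
#include <string>

struct Channel {};  // placeholder, not the real channel type

// Simplified stand-in for channels_: key -> weak_ptr to a channel.
std::map<std::string, std::weak_ptr<Channel>> channels_;

void sweep(int max_checking_size) {
  auto itor = channels_.begin();
  int n = 0;
  while (itor != channels_.end() && n++ < max_checking_size) {
    // One lock() per iteration, but each iteration looks at a
    // different entry, so the shared_ptr is for a different channel.
    if (auto c = itor->second.lock()) {
      ++itor;  // still referenced somewhere: move on
    } else {
      // Released entry: erase() returns the next iterator, which is
      // the other way the loop advances.
      itor = channels_.erase(itor);
    }
  }
}
```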
oh I misread -- I thought this function only cleans the channel for the specified addr. It seems it will clean up up to FLAGS_channel_max_checking_size channels? In the common case I think that may be wasted work, assuming connections are healthy? Not introduced by this PR though, so just to understand the original motivation.
If a connection is healthy, the client (thrift_router, etc.) of thrift_client_pool will cache the returned shared_ptr so that it can be reused. In that case, the weak_ptr will be able to lock() into a shared_ptr, so this function won't do anything to it.
yeah, but still a busy-loop for maximally FLAGS_channel_max_checking_size times. Not a big deal though, because this only happens when getClient is called, which only happens when thrift_router tries to create a client for the first time or tries to fix a bad client.
also, created #565 as an alternative to this approach, i.e. since we already know when a channel is bad, we can close it right before creating a new one for a destination address, wdyt?
I think I understand more now. I tried #565: though it closes the connection promptly when it becomes bad, it's not helping much, because the channels may be destroyed anyway once all shared_ptrs to them are released. What's more, as you mentioned in another conversation below, the problem is mostly the idle threads holding on to the "bad" channels. And the channels in the idle threads became "bad" because the server side closes idle connections after idle_timeout.

I also tested this PR in a private build. Though it doesn't fully eliminate all the CLOSE_WAIT, it keeps the count much smaller (typically <50 per process on the realpin service I tried, compared to hundreds or even a couple of thousands before), so I think this PR is worth it. Feel free to land.
// the channel is bad
if (!itor->second.second->is_good.load()) {
  c->closeNow(); // close the connection to avoid accumulating CLOSE_WAIT
Is it not necessary to remove the channel from the map in this case?
I wanted to minimize behavior change to reduce the risk.
If we remove it here, then we will more aggressively establish new connections to the destination.
When the connection was released (weak_ptr.lock() fails), we know that it's been a while since the connection was established.
When the connection is not released but !good() (here), we are not sure about that.
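To make the two cases concrete, here's how I read the combined policy from the hunks above; Channel, Status, and the map layout are simplified assumptions, not the pool's real definitions:

```cpp
#include <atomic>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Channel {
  void closeNow() {}  // placeholder for the real close call
};
struct Status {
  std::atomic<bool> is_good{true};
};

// Assumed shape of channels_: value = (weak_ptr<Channel>, shared_ptr<Status>).
std::map<std::string,
         std::pair<std::weak_ptr<Channel>, std::shared_ptr<Status>>> channels_;

void cleanup(int max_checking_size) {
  auto itor = channels_.begin();
  int n = 0;
  while (itor != channels_.end() && n++ < max_checking_size) {
    auto c = itor->second.first.lock();
    if (!c) {
      // Released: every shared_ptr is gone and the connection has been
      // closed for a while, so dropping the entry is safe.
      itor = channels_.erase(itor);
      continue;
    }
    if (!itor->second.second->is_good.load()) {
      // Bad but still referenced: close the socket to stop the CLOSE_WAIT
      // buildup, but keep the entry to avoid aggressive reconnects.
      c->closeNow();
    }
    ++itor;
  }
}
```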
Sounds good, thanks for the explanation.
while (itor != channels_.end() &&
       n++ < FLAGS_channel_max_checking_size) {
  auto c = itor->second.first.lock();
  // the channel has been released
Q: under what circumstances will the channel be released? (Or, put another way, clarify what "release" means here; it's less clear than "closed".)
Released means all shared_ptrs that are associated with the weak_ptr have been destructed, so the underlying channel object has been destructed and its connection has been closed.
There are two layers: thrift_client_pool and thrift_router.
In the initial design of thrift_client_pool, the assumption was that its client (thrift_router, etc.) would release (destruct) the returned client shared_ptr object shortly after the connection becomes bad (such as being closed by the remote peer). This assumption turned out to be not always true. thrift_router stores the shared_ptr client objects from thrift_client_pool in a thread-local data structure. Though active threads will release (destruct) the shared_ptr objects shortly after a connection becomes bad, idle threads won't. (Some services have hundreds of worker threads, and they are scheduled in a LIFO way, i.e., the worker pool always tries to use the idle threads that were most recently used.)
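A minimal illustration of that failure mode; the names are placeholders, but it shows why a shared_ptr cached in an idle thread keeps the entry from ever looking released to the pool:

```cpp
#include <cassert>
#include <memory>

struct Channel {};  // placeholder for the pooled channel type

int main() {
  // The pool keeps only a weak_ptr; thrift_router caches the shared_ptr
  // in thread-local storage.
  auto cached_client = std::make_shared<Channel>();
  std::weak_ptr<Channel> pool_entry = cached_client;

  // Even if the remote peer already closed the connection, the channel
  // object stays alive (and in CLOSE_WAIT) while an idle thread's cache
  // still holds the shared_ptr.
  assert(!pool_entry.expired());

  cached_client.reset();  // idle thread finally releases its cache
  // Only now is the channel "released": lock() fails and the cleanup
  // loop can erase the entry.
  assert(pool_entry.expired());
}
```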
thanks for the detailed explanation!
initial summary seems truncated
oh PR title is truncated as well