[common/thrift_client_pool] close connections that are in bad state to avoid accumulating CLOSE_WAIT #563

Conversation
int n = 0;
while (itor != channels_.end() &&
       n++ < FLAGS_channel_max_checking_size) {
  auto c = itor->second.first.lock();
Is the lock here to ensure that the weak ptr is not released while its status is checked in the next few lines?
Oh lock() converts the weak pointer to a shared pointer: https://en.cppreference.com/w/cpp/memory/weak_ptr/lock
And we need to convert to a shared pointer to be able to access the underlying object.
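For reference, a minimal standalone sketch of the lock() semantics -- nothing pool-specific here, just standard std::weak_ptr behavior:

```cpp
#include <cassert>
#include <memory>

int main() {
  auto sp = std::make_shared<int>(42);
  std::weak_ptr<int> wp = sp;  // non-owning observer

  // lock() atomically promotes the weak_ptr to a shared_ptr; it returns
  // a non-empty pointer only while the object is still alive.
  if (auto locked = wp.lock()) {
    assert(*locked == 42);  // safe to dereference through the shared_ptr
  }

  sp.reset();  // last owner gone, the int is destroyed
  assert(wp.lock() == nullptr);  // lock() now yields an empty shared_ptr
}
```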
this creates a shared_ptr every time, for FLAGS_channel_max_checking_size number of times, is that intended?
Yes, it's intended. itor will be advanced either via erase() or ++ below. So it creates a shared_ptr for a different channel each time.
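A sketch of that erase-or-advance idiom, in case it helps; Channel and the flattened weak_ptr map value are simplified stand-ins for the pool's actual types:

```cpp
#include <map>
#include <memory>
#include <string>

struct Channel {};  // placeholder, not the real channel type

// Simplified stand-in for channels_: key -> weak_ptr to a channel.
std::map<std::string, std::weak_ptr<Channel>> channels_;

void sweep(int max_checking_size) {
  auto itor = channels_.begin();
  int n = 0;
  while (itor != channels_.end() && n++ < max_checking_size) {
    // One lock() per iteration, but each iteration looks at a
    // different entry, so the shared_ptr is for a different channel.
    if (auto c = itor->second.lock()) {
      ++itor;  // still referenced somewhere: move on
    } else {
      // Released entry: erase() returns the next iterator, which is
      // the other way the loop advances.
      itor = channels_.erase(itor);
    }
  }
}
```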
oh I misread -- I thought this function only cleans the channel for the specified addr. It seems it will clean up up to FLAGS_channel_max_checking_size channels? In the common case I think that may be wasted work, assuming connections are healthy? Not introduced by this PR though, so just to understand the original motivation.
If a connection is healthy, the client (thrift_router, etc.) of thrift_client_pool will cache the returned shared_ptr so that it can be reused. In that case, the weak_ptr will be able to lock() into a shared_ptr, so this function won't do anything to it.
yeah, but still a busy-loop for maximally FLAGS_channel_max_checking_size times. Not a big deal though, because this only happens when getClient is called, which only happens when thrift_router tries to create a client for the first time or tries to fix a bad client.
also, created #565 as an alternative to this approach, i.e. since we already know when a channel is bad, we can close it right before creating a new one for a destination address, wdyt?
I think I understand more now. I tried #565: though it closes the connection promptly when it becomes bad, it's not helping much, because the channels may be destroyed anyway once all shared_ptrs to them are released. What's more, as you mentioned in another conversation below, the problem is mostly the idle threads holding on to the "bad" channels. And the channels in the idle threads became "bad" because the server side closes idle connections after idle_timeout.

I also tested this PR in a private build. Though it doesn't fully eliminate all the CLOSE_WAIT, it keeps the count much smaller (typically <50 per process on the realpin service I tried, compared to hundreds or even a couple of thousands before), so I think this PR is worth it. Feel free to land.
// the channel is bad
if (!itor->second.second->is_good.load()) {
  c->closeNow(); // close the connection to avoid accumulating CLOSE_WAIT
Is it not necessary to remove the channel from the map in this case?
I wanted to minimize behavior change to reduce the risk.
If we remove it here, then we will more aggressively establish new connections to the destination.
When the connection was released (weak_ptr.lock() fails), we know that it's been a while since the connection was established.
When the connection is not released but !good() (here), we are not sure about that.
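To make the two cases concrete, here's how I read the combined policy from the hunks above; Channel, Status, and the map layout are simplified assumptions, not the pool's real definitions:

```cpp
#include <atomic>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Channel {
  void closeNow() {}  // placeholder for the real close call
};
struct Status {
  std::atomic<bool> is_good{true};
};

// Assumed shape of channels_: value = (weak_ptr<Channel>, shared_ptr<Status>).
std::map<std::string,
         std::pair<std::weak_ptr<Channel>, std::shared_ptr<Status>>> channels_;

void cleanup(int max_checking_size) {
  auto itor = channels_.begin();
  int n = 0;
  while (itor != channels_.end() && n++ < max_checking_size) {
    auto c = itor->second.first.lock();
    if (!c) {
      // Released: every shared_ptr is gone and the connection has been
      // closed for a while, so dropping the entry is safe.
      itor = channels_.erase(itor);
      continue;
    }
    if (!itor->second.second->is_good.load()) {
      // Bad but still referenced: close the socket to stop the CLOSE_WAIT
      // buildup, but keep the entry to avoid aggressive reconnects.
      c->closeNow();
    }
    ++itor;
  }
}
```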
Sounds good, thanks for the explanation.
while (itor != channels_.end() &&
       n++ < FLAGS_channel_max_checking_size) {
  auto c = itor->second.first.lock();
  // the channel has been released
Q: under what circumstances will the channel be released? (Or, put another way, clarify what "release" means here; it's less clear than "closed".)
Released means all shared_ptrs that are associated with the weak_ptr have been destructed, so the underlying channel object has been destructed and its connection has been closed.
There are two layers: thrift_client_pool and thrift_router.
In the initial design of thrift_client_pool, the assumption was that its client (thrift_router, etc.) would release (destruct) the returned client shared_ptr object shortly after the connection becomes bad (such as being closed by the remote peer). This assumption turned out to be not always true. thrift_router stores the shared_ptr client objects from thrift_client_pool in a thread-local data structure. Though active threads will release (destruct) the shared_ptr objects shortly after a connection becomes bad, idle threads won't. (Some services have hundreds of worker threads, and they are scheduled in a LIFO way, i.e., the worker pool always tries to use the idle threads that were most recently used.)
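A minimal illustration of that failure mode; the names are placeholders, but it shows why a shared_ptr cached in an idle thread keeps the entry from ever looking released to the pool:

```cpp
#include <cassert>
#include <memory>

struct Channel {};  // placeholder for the pooled channel type

int main() {
  // The pool keeps only a weak_ptr; thrift_router caches the shared_ptr
  // in thread-local storage.
  auto cached_client = std::make_shared<Channel>();
  std::weak_ptr<Channel> pool_entry = cached_client;

  // Even if the remote peer already closed the connection, the channel
  // object stays alive (and in CLOSE_WAIT) while an idle thread's cache
  // still holds the shared_ptr.
  assert(!pool_entry.expired());

  cached_client.reset();  // idle thread finally releases its cache
  // Only now is the channel "released": lock() fails and the cleanup
  // loop can erase the entry.
  assert(pool_entry.expired());
}
```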
thanks for the detailed explanation!
initial summary seems truncated
oh PR title is truncated as well