Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[common/thrift_client_pool] close connections that are in bad state t… #563

Merged
merged 1 commit into from
Dec 14, 2021
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 23 additions & 17 deletions common/thrift_client_pool.h
Original file line number Diff line number Diff line change
Expand Up @@ -307,24 +307,30 @@ class ThriftClientPool {
}

void cleanupStaleChannels(const folly::SocketAddress& addr) {
// cleanup stale entries if it hasn't been done for a period of time.
auto now = time(nullptr);
if (last_cleanup_time_ + FLAGS_channel_cleanup_min_interval_seconds
< now) {
last_cleanup_time_ = now;
auto itor = channels_.find(addr);
int n = 0;
while (itor != channels_.end() &&
n++ < FLAGS_channel_max_checking_size) {
// We don't cleanup !good() live channels here. Otherwise we
// will need to upgrade it to a shared_ptr. We expect clients
// won't keep a !good() channels for a long period of time.
if (itor->second.first.use_count() == 0) {
itor = channels_.erase(itor);
} else {
++itor;
}
}
// skip cleanup if it was done recently
if (now < last_cleanup_time_ + FLAGS_channel_cleanup_min_interval_seconds) {
return;
}

last_cleanup_time_ = now;
auto itor = channels_.find(addr);
int n = 0;
while (itor != channels_.end() &&
n++ < FLAGS_channel_max_checking_size) {
auto c = itor->second.first.lock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the lock here to ensure that the weak ptr is not released while it's status is checked in the next few lines?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh lock() converts the weak pointer to a shared pointer: https://en.cppreference.com/w/cpp/memory/weak_ptr/lock
And we need to convert to a shared pointer to be able to access the underlying object.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this creates a shared_ptr every time, for FLAGS_channel_max_checking_size number of times, is that indented?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's intended. itor will be advanced either via erase() or ++ below. So it creates a shared_ptr for different channel each time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I misread -- I thought this function only cleans the channel for the specified addr. It seems it will clean-up up to FLAGS_channel_max_checking_size channels? in the common case I think that may be wasted work assuming connections are healthy?

Not introduce by this PR though, so just to understand the original motivation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a connection is health, the client (thrift_router, etc) of thrift_client_pool will cache the returned shared_ptr so that it can be reused. In this case, the weak_ptr will be able to lock() into a shared_ptr, so this function won't do anything to it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, but still a busy-loop for maximally FLAGS_channel_max_checking_size times. Not a big deal though, because this only happens when getClient is called, which only happens when thrift_router tries to create a client for the first time or try to fix a bad client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, created #565 as an alternative to this approach, i.e. since we already know when a channel is bad, we can close it right before creating a new one for a destination address, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to try it out. I can abandon this one if #565 works.
One subtle difference is that #565 only close connections that are being requested. This PR will clean up bad connections to any destinations which may not be requested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand more now, so I tried #565, though it closes the connection promptly when it's bad, it's not helping much, they may be destroyed anyway when all share_ptr to the channel are released. What's more, as you mentioned in another conversation below, it's mostly due to the idle thread holding on the "bad" channels.

And the channels in the idle threads became "bad" because the server side closing the idle connection after idle_timeout.

I also tested this PR in a private build, though it doesn't fully eliminate all the CLOSE_WAIT, it keeps the count to be much smaller (typically <50 per process on the realpin service I tried, compared to 100s or even a couple 1000s before), so I think this PR seems worth it. Feel free to land.

// the channel has been released
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: under what circumstances will the channel be released? (or put it another way, clarify what "release" means here, which is less clear than "closed")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Released means all shared_ptrs that are associated with the weak_ptr has been destructed. So the underlying channel object has been destructed and its connection has been closed.

There are two layers: thrift_client_pool and thrift_router.

In the initial design of thrift_client_pool, the assumption was that its client (thrift_router, etc) will release (destruct) the returned client shared_ptr object shortly after the connection becomes bad (such as closed by the remote peer). This assumption turned out to be not always true. thrift_router stores the shared_ptr client objects from thrift_client_pool in a thread local data structure. Though active threads will release (destruct) the shard_ptr objects shortly after a connection becomes bad, idle thread won't. (Some services have hundreds of worker threads, and they are scheduled in a LIFO way, i.e., the worker pool always tries to use the idle threads that were most recently used).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the detailed explanation!

if (!c) {
itor = channels_.erase(itor);
continue;
}

// the channel is bad
if (!itor->second.second->is_good.load()) {
c->closeNow(); // close the connection to avoid accumulating CLOSE_WAIT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not necessary to remove the channel from the map in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to minimize behavior change to reduce the risk.

If we remove it here, then we will more aggressively establish new connections to the the destination.

When the connection was released (weak_ptr.lock() fails), we know that it's been a while since the connection was established.

When the connection is not released but !good() (here), we are not sure about that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, thanks for the explanation.

}

++itor;
}
}

Expand Down