
Incarnations sometimes lead to nodes not seeing other nodes during mass restarts #311

Open
vitalif opened this issue Oct 21, 2024 · 0 comments

Hi.

I'm testing memberlist in a cluster of roughly 4500 nodes, and I seem to have found a nontrivial issue: during mass node restarts, some nodes fail to reset their "incarnation" numbers to sufficiently high values, which leaves some nodes believing that other, perfectly healthy nodes are dead for a long time.

As I understand, the scenario is the following:

  • You begin to restart the whole cluster.
  • During the restart, node 1 marks node 2 dead at a large incarnation number (say 100) and gossips that (deadNode 2, incarnation=100).
  • Node 2 restarts and syncs with node 1, but node 1 has also just restarted and doesn't see node 2 at all, so node 2 leaves its incarnation at 1.
  • Node 3 restarts and happens to first receive the new metadata of the restarted node 2 with incarnation 1, and only AFTER that the deadNode gossip message for node 2 with incarnation 100.
  • Node 3 tries to re-gossip this deadNode 2/100 message, but again, it happens to resend it to nodes 4, 5, and 6, which are either restarting at that very moment (and so don't forward the gossip), or were just restarted and don't know about node 2 at all (and thus ignore the deadNode 2 message).
  • Either way, the deadNode 2/100 message never reaches the newly restarted node 2, so node 2 never refutes it and never increases its incarnation number to 101.
  • We end up with node 3 thinking that node 2 is dead, while it is not.
  • Memberlist doesn't recheck dead nodes in any way, so it stays in this state indefinitely, at least until a full TCP sync with some other node fixes it - but that's an ugly workaround, and full TCP sync has its own problem: Full TCP sync in a large-scale cluster may "revive" a gone node for a long time #312

I can't guarantee that this is exactly the case I observed (due to the lack of logging in this library), but I'm almost sure it's at least very close.

The general point is that the "incarnation" mechanism provides no guarantee during mass restarts.

My ideas about how it could be fixed:

  1. Add retries/rechecks of dead nodes; don't rely on a single "deadNode" message being gossiped and then refuted.
  2. Use system time instead of the incarnation number.
  3. Mark a node as alive when it pings you.
  4. Request an update from the given node if you receive an "aliveNode" message from it with a smaller incarnation number than the one you already know.
  5. Re-broadcast deadNode messages even if we don't know about the given node at all.
  6. At least add logs for all "deadNode" / "suspectNode" messages and for all .
  7. A combination of all of the above. :-)

Generally, it seems there's plenty of room for improvement in this library, but of course that requires active maintenance. Do you still maintain the library?
