
Incarnations sometimes lead to nodes not seeing other nodes during mass restarts #311

Open
vitalif opened this issue Oct 21, 2024 · 0 comments

Hi.

I'm testing memberlist in a cluster of roughly 4500 nodes, and I seem to have found a nontrivial issue: during mass node restarts, some nodes fail to reset their "incarnation" numbers to sufficiently high values, which leaves some nodes believing that other, perfectly healthy nodes are dead for a long time.

As I understand, the scenario is the following:

  • You begin to restart the whole cluster.
  • During the restart, node 1 marks node 2 dead at a large incarnation number (say 100) and gossips that (deadNode 2, incarnation=100).
  • Node 2 restarts and syncs with node 1, but node 1 has also just restarted and doesn't see node 2 at all, so node 2 leaves its incarnation at 1.
  • Node 3 restarts and happens to first receive the new metadata of the restarted node 2 with incarnation 1, and only AFTER that the deadNode gossip message for node 2 with incarnation 100.
  • Node 3 tries to re-gossip this deadNode 2/100 message, but again, it happens to resend it to nodes 4, 5, and 6, which are either restarting at that very moment (and so don't forward the gossip), or were just restarted and don't know about node 2 at all (and thus ignore the deadNode 2 message).
  • Either way, the deadNode 2/100 message never reaches the newly restarted node 2, so node 2 never refutes it and never increases its incarnation number to 101.
  • We end up with node 3 thinking that node 2 is dead, while it is not.
  • Memberlist doesn't recheck dead nodes in any way, so it stays in this state indefinitely, at least until a full TCP sync with some other node fixes it - but that's an ugly workaround, and full TCP sync has its own problem: Full TCP sync in a large-scale cluster may "revive" a gone node for a long time #312

I can't guarantee that this is exactly the case I observed (due to the lack of logging in this library), but I'm almost sure it's at least very close.

The general point is that the "incarnation" mechanism provides no guarantee during mass restarts.

My ideas about how it could be fixed:

  1. Add retries/rechecks of dead nodes; don't rely on a single "deadNode" message being gossiped and then refuted.
  2. Use system time instead of the incarnation number.
  3. Mark a node as alive when it pings you.
  4. Request an update from the given node if you receive an "aliveNode" message from it with a smaller incarnation number than the one you already know.
  5. Re-broadcast deadNode messages even if we don't know about the given node at all.
  6. At least add logs for all "deadNode" / "suspectNode" messages and for all .
  7. A combination of all of the above. :-)

Generally, it seems there's plenty of room for improvement in this library, but of course that requires active maintenance. Do you still maintain the library?
