Improve timely shutdown of directory partitions when snapshot transfer has been abandoned #9197

Merged

ReubenBond commented Oct 22, 2024

This fixes a bug that, in some cases, prevented timely shutdown of silos using the experimental directory replacement implemented in #9103.
Specifically, when leaving the cluster, replicas snapshot their directory partitions and wait for the new owners to collect and acknowledge the snapshots. If a new owner decides to perform recovery instead of hand-off, the snapshot transfer is abandoned. Until this PR, there was no way to signal that a transfer had been abandoned, so a departing replica could keep waiting for an acknowledgement that would never arrive. It is always safe to abandon a transfer: recovery is performed instead, at some added expense.
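
To illustrate the shape of the problem, here is a minimal sketch of a sender-side wait that completes on either an acknowledgement or an abandonment signal. It is written in Python with `asyncio` purely for illustration and is not the Orleans implementation; the names `wait_for_handoff`, `acknowledged`, and `abandoned` are hypothetical.

```python
import asyncio


async def wait_for_handoff(acknowledged: asyncio.Event, abandoned: asyncio.Event) -> str:
    """Return once the snapshot is either acknowledged or abandoned (hypothetical helper)."""
    waiters = [
        asyncio.ensure_future(acknowledged.wait()),
        asyncio.ensure_future(abandoned.wait()),
    ]
    # Whichever signal arrives first ends the wait; without an abandonment
    # signal, the sender could only ever wait on the acknowledgement.
    _, pending = await asyncio.wait(waiters, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return "acknowledged" if acknowledged.is_set() else "abandoned"


async def demo() -> str:
    acknowledged = asyncio.Event()
    abandoned = asyncio.Event()
    abandoned.set()  # the would-be receiver chose recovery over hand-off
    return await wait_for_handoff(acknowledged, abandoned)
```

Running `asyncio.run(demo())` completes immediately rather than blocking on an acknowledgement that will not come.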

This PR tells the snapshot sender to abandon the snapshot in two scenarios:

  1. It detects a non-contiguous membership version change, that is, membership versions jump directly from N to N + k where k > 1. In this scenario, the snapshot sender cannot know with certainty who should receive the snapshot, so it abandons the request, forcing recovery to be performed. This most commonly occurs during fast scale-out and scale-in, where multiple silos are added or removed in quick succession.
  2. It sees a recovery operation from a would-be snapshot receiver. This implies that the snapshot attempt has been abandoned by the receiver and therefore should be abandoned by the sender, too.
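
The two conditions above can be sketched as a single predicate. This is a hedged illustration in Python, not the code in this PR: `PendingSnapshot`, `expected_receiver`, `should_abandon`, and the string silo identifiers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingSnapshot:
    # Hypothetical: the silo expected to collect this snapshot.
    expected_receiver: str


def is_contiguous(previous_version: int, current_version: int) -> bool:
    """True when membership advanced by at most one version (N -> N + 1)."""
    return current_version - previous_version <= 1


def should_abandon(
    snapshot: PendingSnapshot,
    previous_version: int,
    current_version: int,
    recovering_silo: Optional[str] = None,
) -> bool:
    """Decide whether the sender should give up on a pending snapshot transfer."""
    # Scenario 1: membership jumped from N to N + k with k > 1, so the
    # sender can no longer be sure who the receiver should be.
    if not is_contiguous(previous_version, current_version):
        return True
    # Scenario 2: the would-be receiver has started recovery, which
    # implies it has already abandoned the hand-off.
    if recovering_silo is not None and recovering_silo == snapshot.expected_receiver:
        return True
    return False
```

For example, with `snapshot = PendingSnapshot(expected_receiver="silo-b")`, a jump from version 5 to 7 triggers abandonment, as does a recovery operation from `"silo-b"` at a contiguous version; a recovery operation from an unrelated silo does not.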

ReubenBond enabled auto-merge (squash) October 22, 2024 02:52
ReubenBond merged commit ae5515a into dotnet:main Oct 22, 2024
22 checks passed
ReubenBond deleted the fix/distributed-directory-shutdown branch October 22, 2024 14:39