
Experiment: Improve concurrent merge performance by weakly owning branch updates #8268

Open · wants to merge 14 commits into master

Conversation

arielshaqed (Contributor)

What

When enabled, this feature improves performance of concurrent merges.

No seriously, what?

lakeFS retains branch heads on KV. An operation that updates the branch head - typically a commit or a merge - performs a 3-phase commit protocol (a sketch in Go follows the list):

  • Get the branch head from KV
  • Create the metarange on block storage based on that branch head
  • Create a new commit with that metarange whose parent is that branch head, IF the branch head has not moved
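
A minimal sketch of this protocol in Go, as referenced above. The types, the BranchStore interface, and the SetIf semantics here are illustrative assumptions, not the actual lakeFS graveler code:

```go
package sketch

import (
	"context"
	"fmt"
)

// CommitID identifies a commit; BranchStore is a hypothetical KV-backed
// branch-head store with a compare-and-set update.
type CommitID string

type BranchStore interface {
	// Get returns the current head plus an opaque predicate for SetIf.
	Get(ctx context.Context, branch string) (head CommitID, pred any, err error)
	// SetIf updates the head only if the stored value still matches pred.
	SetIf(ctx context.Context, branch string, head CommitID, pred any) error
}

func updateBranch(ctx context.Context, s BranchStore, branch string,
	makeCommit func(parent CommitID) (CommitID, error)) error {
	// Phase 1: get the branch head from KV.
	parent, pred, err := s.Get(ctx, branch)
	if err != nil {
		return err
	}
	// Phase 2: write the metarange and commit object to block storage,
	// based on that head.
	newHead, err := makeCommit(parent)
	if err != nil {
		return err
	}
	// Phase 3: advance the head only if it has not moved. Under heavy
	// concurrency this compare-and-set is exactly the step that fails.
	if err := s.SetIf(ctx, branch, newHead, pred); err != nil {
		return fmt.Errorf("branch %s head moved, retry: %w", branch, err)
	}
	return nil
}
```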

When the third phase fails, the operation fails. We retry automatically on the server, but under load retries will obviously still fail; indeed, we see this happen to some users who have heavy merge loads. Each retry is also more expensive than the one before, because the gap between the source commit and the destination branch tends to grow with every successive commit to the destination. Failure is pretty much guaranteed once sustained load crosses some critical threshold.

How

This PR is an experiment in using "weak ownership" to avoid having to retry. Broadly speaking, every branch update operation takes ownership of its branch, which excludes all other operations from updating that branch. This exclusion does not reduce throughput: without ownership only one of the concurrent branch operations could ever succeed anyway, while with ownership almost all of them can succeed. In fact it increases performance, because it stops wasting resources on concurrent operations that are doomed to fail.

What about CAP?

Distributed locking is impossible by the CAP theorem. That's why we use weak ownership: it gives up on C (consistency) and some P (partition tolerance) in favour of keeping A (availability). Specifically, a thread takes ownership of a branch by writing an ownership record to KV. This ownership has an expiry time; the thread refreshes its ownership until it is done, at which point it deletes the record.

This ownership is weak: usually a single thread owns the branch, but sometimes 2 threads can end up owning the same branch! If a thread fails to refresh on time (due to a partition, slowness, or poor clock sync, for instance) then another thread will also acquire ownership. When that happens we have concurrent branch updates. Because those are correct in lakeFS, we only lose performance and may need to retry - we never give incorrect answers.
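
A minimal sketch of this lease mechanism in Go. The names, the LeaseStore interface, and its semantics are assumptions for illustration; the PR's actual WeakOwner type differs:

```go
package sketch

import (
	"context"
	"time"
)

// Lease is the ownership record kept in KV.
type Lease struct {
	Owner   string
	Expires time.Time
}

// LeaseStore is a hypothetical KV wrapper for ownership records.
type LeaseStore interface {
	// SetIfAbsent writes l only if no unexpired lease exists; it reports
	// whether the write happened.
	SetIfAbsent(ctx context.Context, key string, l Lease) (bool, error)
	// Refresh extends our own lease; it fails if another owner took over.
	Refresh(ctx context.Context, key string, l Lease) error
	// Release deletes the record only if we still own it.
	Release(ctx context.Context, key, owner string) error
}

// Own acquires weak ownership of key and refreshes it in the background
// until the returned release function is called. If a refresh is too late
// (GC pause, partition, clock skew), another thread may also become owner;
// that costs only performance, never correctness.
func Own(ctx context.Context, s LeaseStore, key, owner string, ttl time.Duration) (func(), error) {
	for {
		ok, err := s.SetIfAbsent(ctx, key, Lease{Owner: owner, Expires: time.Now().Add(ttl)})
		if err != nil {
			return nil, err
		}
		if ok {
			break
		}
		time.Sleep(ttl / 4) // current owner is working; wait for it to finish or expire
	}
	refreshCtx, stop := context.WithCancel(ctx)
	go func() {
		ticker := time.NewTicker(ttl / 2) // refresh well before expiry
		defer ticker.Stop()
		for {
			select {
			case <-refreshCtx.Done():
				return
			case <-ticker.C:
				_ = s.Refresh(refreshCtx, key, Lease{Owner: owner, Expires: time.Now().Add(ttl)})
			}
		}
	}()
	return func() {
		stop()
		_ = s.Release(context.Background(), key, owner) // delete only our own record
	}, nil
}
```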

Results

This PR also adds a lakectl abuse merge command to allow us to measure. b0cfca3 has some numbers from running locally: with sustained concurrency 50, we run faster and get 0 failures instead of 6.6% failures. More details there on why this is even better than it sounds. A graph is available here; here's a spoiler:
[graph showing speedup when enabling weak ownership]

@arielshaqed added the area/cataloger (Improvements or additions to the cataloger), performance, and include-changelog (PR description should be included in next release changelog) labels on Oct 7, 2024

github-actions bot commented Oct 7, 2024

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed


github-actions bot commented Oct 7, 2024

E2E Test Results - Quickstart

11 passed

yonipeleg33 (Contributor)

@arielshaqed looks like amazing work, thank you!
I haven't reviewed it yet, but a few general notes:

  1. Do we have an experiment plan for this feature? Which customers/users, for how long, etc.
  2. The term "weak" is a bit inaccurate here IMO:
    The term "weak" as defined in places like CPP's std::weak_ptr or Rust's std::sync::Weak means a "non-owning reference", which is different than how this mechanism of branch locking works - each worker owns the branch exclusively, but only for a short period, and with a graceful mechanism for giving up exclusive ownership.
    The term "lease-based ownership" suites better here IMO.

arielshaqed (Contributor, Author)

I should probably mention that if we go in this direction, we can deploy this and then improve it. The next step is to add fairness: earlier merge attempts should have a better chance of being next. Otherwise, when there are multiple concurrent merges, some of the earlier merges will repeatedly lose and time out. If everything is fair then we expect fewer timeouts, and in general the tail latency will shorten.
One way to make things fairer is to manage an actual queue of tasks waiting to own the branch. Then either their thread gets priority to pick up the task (it is allowed to pick up tasks sooner after their expiry; for instance, task $n$ in the queue waits until $t_{expiry} + n\cdot dt_{check}$), or the thread that owns the branch just processes all merges in the queue.
Anyway, we don't need that to run this experiment.
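
A tiny sketch of that staggered-wakeup idea, with invented names, following the formula above:

```go
import "time"

// takeoverAt computes when the n-th queued waiter may first try to take
// the lease: the current expiry plus n check intervals, so earlier
// arrivals get the first chance. Purely illustrative.
func takeoverAt(expiry time.Time, queuePos int, checkInterval time.Duration) time.Time {
	return expiry.Add(time.Duration(queuePos) * checkInterval)
}
```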

"now": now,
}).Trace("Try to take ownership")
if errors.Is(err, kv.ErrNotFound) {
err = kv.SetMsg(ctx, w.Store, weakOwnerPartition, prefixedKey, &ownership)
Contributor

This is wrong, no? Multiple routines can succeed at the same time.

Contributor Author

Ouch, yeah, this was supposed to be a SetIf with the "not present" predicate - and I forgot to find out the correct value for that predicate.
Thanks for catching this!
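
For the record, a sketch of the intended fix. This assumes kv exposes SetMsgIf with a trailing predicate and that a nil predicate asserts the key does not yet exist (consistent with a later thread in this review):

```go
if errors.Is(err, kv.ErrNotFound) {
	// Create the ownership record only if it still does not exist;
	// the nil predicate is assumed to assert absence.
	err = kv.SetMsgIf(ctx, w.Store, weakOwnerPartition, prefixedKey, &ownership, nil)
	if errors.Is(err, kv.ErrPredicateFailed) {
		// Another routine created the record first and owns the branch.
	}
}
```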

@arielshaqed added the minor-change (Used for PRs that don't require issue attached) label on Oct 8, 2024
@arielshaqed (Contributor, Author) left a comment

Thanks, that one's embarrassing.

"now": now,
}).Trace("Try to take ownership")
if errors.Is(err, kv.ErrNotFound) {
err = kv.SetMsg(ctx, w.Store, weakOwnerPartition, prefixedKey, &ownership)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ouch, yeah, this was supposed to be a SetIf with the "not present" predicate - and I forgot to find out the correct value for that predicate.
Thanks for catching this!

Weak ownership is a best-effort lock that occasionally fails.  This one
can fail when a goroutine is delayed for a long interval.

This is fine if the calling code relies on ownership for performance but not
for correctness, e.g. merges and commits.  Only one concurrent `BranchUpdate`
operation can succeed, so unless many such long-lived operations can fail
there is little point in running multiple concurrent updates.
Performs multiple small merges in parallel.
This shows results, even on local!

When I run lakefs (by default weak ownership is OFF) I get 6.6% errors with
concurrency 50.  Rate is <50/s.  Also the long tail is _extremely_ long.

When I switch weak ownership ON, using the default parameters, I get **0**
errors with concurrency 50.  Rate is about the same, except that the
tail (when load drops) is _short_.

See the difference [here][merge-abuse-speed-chart]: it's faster _and_
returns 0 errors.  The distribution of actual successful merge times is
somewhat slower - possibly because of the time to lock, possibly because
errors in the really slow cases cause those slow cases to be dropped.

Finally, note that because we do not queue, some merges take a *long* time
under sustained load.  We could improve weak ownership to hold an actual
queue of work.  This would make merges _fair_: merges will occur roughly in
order of request arrival.

==== Weak ownership OFF ====

```sh
❯ go run ./cmd/lakectl abuse merge --amount 1000 --parallelism 50 lakefs://abuse/main
Source branch: lakefs://abuse/main
merge - completed: 34, errors: 0, current rate: 33.81 done/second
merge - completed: 80, errors: 0, current rate: 45.98 done/second
merge - completed: 128, errors: 0, current rate: 48.02 done/second
merge - completed: 177, errors: 0, current rate: 49.03 done/second
merge - completed: 222, errors: 0, current rate: 44.97 done/second
merge - completed: 265, errors: 3, current rate: 43.03 done/second
merge - completed: 308, errors: 9, current rate: 42.97 done/second
merge - completed: 357, errors: 15, current rate: 49.01 done/second
merge - completed: 406, errors: 21, current rate: 49.03 done/second
merge - completed: 451, errors: 22, current rate: 44.97 done/second
merge - completed: 499, errors: 29, current rate: 48.01 done/second
merge - completed: 545, errors: 30, current rate: 46.01 done/second
merge - completed: 585, errors: 31, current rate: 39.97 done/second
merge - completed: 632, errors: 33, current rate: 47.04 done/second
merge - completed: 679, errors: 37, current rate: 47.00 done/second
merge - completed: 728, errors: 46, current rate: 48.96 done/second
merge - completed: 768, errors: 49, current rate: 40.04 done/second
merge - completed: 808, errors: 53, current rate: 39.98 done/second
merge - completed: 854, errors: 57, current rate: 45.99 done/second
merge - completed: 891, errors: 58, current rate: 37.00 done/second
merge - completed: 935, errors: 64, current rate: 44.00 done/second
merge - completed: 972, errors: 66, current rate: 36.98 done/second
merge - completed: 990, errors: 66, current rate: 18.00 done/second
merge - completed: 995, errors: 66, current rate: 5.00 done/second
merge - completed: 996, errors: 66, current rate: 1.00 done/second
merge - completed: 998, errors: 66, current rate: 2.00 done/second
merge - completed: 999, errors: 66, current rate: 1.00 done/second
merge - completed: 999, errors: 66, current rate: 0.00 done/second
merge - completed: 999, errors: 66, current rate: 0.00 done/second
completed: 1000, errors: 66, current rate: 5.27 done/second

Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      0
25      0
50      0
75      601
100     671
250     672
350     672
500     696
750     740
1000    765
5000    896
min     54
max     12022
total   934
```

==== Weak ownership ON ====

```sh
❯ go run ./cmd/lakectl abuse merge --amount 1000 --parallelism 50 lakefs://abuse/main
Source branch: lakefs://abuse/main
merge - completed: 36, errors: 0, current rate: 35.23 done/second
merge - completed: 86, errors: 0, current rate: 49.98 done/second
merge - completed: 136, errors: 0, current rate: 50.03 done/second
merge - completed: 185, errors: 0, current rate: 48.99 done/second
merge - completed: 236, errors: 0, current rate: 51.02 done/second
merge - completed: 286, errors: 0, current rate: 49.99 done/second
merge - completed: 337, errors: 0, current rate: 50.97 done/second
merge - completed: 390, errors: 0, current rate: 53.03 done/second
merge - completed: 438, errors: 0, current rate: 48.01 done/second
merge - completed: 487, errors: 0, current rate: 49.00 done/second
merge - completed: 534, errors: 0, current rate: 46.98 done/second
merge - completed: 581, errors: 0, current rate: 46.99 done/second
merge - completed: 632, errors: 0, current rate: 51.00 done/second
merge - completed: 680, errors: 0, current rate: 48.04 done/second
merge - completed: 725, errors: 0, current rate: 44.98 done/second
merge - completed: 771, errors: 0, current rate: 45.99 done/second
merge - completed: 815, errors: 0, current rate: 44.02 done/second
merge - completed: 861, errors: 0, current rate: 46.01 done/second
merge - completed: 905, errors: 0, current rate: 43.98 done/second
merge - completed: 947, errors: 0, current rate: 42.00 done/second
merge - completed: 977, errors: 0, current rate: 30.01 done/second
merge - completed: 997, errors: 0, current rate: 19.99 done/second
completed: 1000, errors: 0, current rate: 4.92 done/second

Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      0
25      0
50      0
75      457
100     464
250     468
350     468
500     642
750     647
1000    729
5000    952
min     54
max     13744
total   1000
```
- Add some jitter when acquiring ownership on a branch
- Refresh _for_ refresh interval, twice _every_ refresh interval
- nolint unjustified warnings
arielshaqed (Contributor, Author)

@arielshaqed looks like amazing work, thank you! I haven't reviewed it yet, but a few general notes:

Thanks!

  1. Do we have an experiment plan for this feature? Which customers/users, for how long, etc.

Yes. Firstly, this PR adds a lakectl abuse merge command to measure. After we pull I will deploy to Cloud staging, switch it on for a single private installation, and measure. Also as you probably know, we have a user for whom I plan this. I will communicate with them to measure on their system; obviously timing for that depends on the customer timing and how comfortable they feel switching it on.

  1. The term "weak" is a bit inaccurate here IMO:
    The term "weak" as defined in places like CPP's std::weak_ptr or Rust's std::sync::Weak means a "non-owning reference", which is different than how this mechanism of branch locking works - each worker owns the branch exclusively, but only for a short period, and with a graceful mechanism for giving up exclusive ownership.
    The term "lease-based ownership" suites better here IMO.

In the C++ and Rust standard libraries, "weak" in this case refers to the pointer. (In other cases in the C++ standard library, "weak" can refer to any variant which is logically weaker; for instance, we also have class weak_ordering.) In this PR, "weak" is supposed to weaken "ownership".

What we have in this PR indeed uses a lease-based implementation. But it's not ownership, it just behaves like ownership in almost all reasonable cases.

A better term might be "mostly correct ownership"[1]. In fact, I believe that such an unwieldy name would be a good choice! If you agree I will change to use it.

Footnotes

  1. The technical term "mostly" here is intended in the same sense that Miracle Max uses it.

@itaiad200 (Contributor) left a comment

Reviewed everything except for the tests. Some really good things here!

//
// So it *cannot* guarantee correctness. However it usually works, and if
// it does work, the owning goroutine wins all races by default.
type WeakOwner struct {
Contributor

It seems to me that this shouldn't be in the kv package. It deserves a standalone ownership/owner/anything-else package. The kv is an implementation detail of the ownership, so this struct should be called KVWeakOwner.

Contributor Author

Sure. Let's do this after we agree on a name? @yonipeleg33 requested a name change, I'd rather do a single name change towards the end of the review.

@@ -3031,6 +3031,25 @@ lakectl abuse list <source ref URI> [flags]



### lakectl abuse merge

Merge nonconflicting objects to the source branch in parallel
Contributor

Suggested change
Merge nonconflicting objects to the source branch in parallel
Merge non-conflicting objects to the source branch in parallel

Contributor Author

Will auto-generate this :-)

logging.FromContext(ctx).
WithFields(logging.Fields{}).
WithError(err).
Warn("Failed to get ownership on branch; continuing but may be slow")
Contributor

Shouldn't you check the error value? For instance, I don't think we should continue the update flow for context cancellation.

Contributor Author

Why? An error like that will keep happening, and the very next operation on the context will fail. I prefer not to complicate this code with handling of context cancellation (which should really examine either err or `if ctx.Done() { ...ctx.Err()... }`). All this code does is emit a log

... error=start owning key... context cancelled... message=Failed to get ownership on branch; continuing but may be slow

which will be followed by a log somewhat like

... error=get item: context cancelled... message=...

I don't think it's worth adding confusion here.

requestID = *requestIDPtr
}
if m.branchOwnership != nil {
own, err := m.branchOwnership.Own(ctx, requestID, string(branchID))
Contributor

Suggested change
own, err := m.branchOwnership.Own(ctx, requestID, string(branchID))
ownRelease, err := m.branchOwnership.Own(ctx, requestID, string(branchID))

or

Suggested change
own, err := m.branchOwnership.Own(ctx, requestID, string(branchID))
ownClose, err := m.branchOwnership.Own(ctx, requestID, string(branchID))

Contributor Author

Nice! Changed to release(), which I hope is even nicer.

Comment on lines 17 to 18
//nolint:stylecheck
var finished = errors.New("finished")
Contributor

Fix linting

Contributor Author

This was a sentinel, not an actual error - see for instance io.EOF. It is used only in sleepFor to indicate that the context expired with no error. But it turns out that there are special cases in context.WithCancelCause, so we don't need it.

Removed, thanks!
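
A sketch of what that replacement can look like. This is an assumed shape for sleepFor, not the PR's exact code; context.WithCancelCause is available since Go 1.20:

```go
package sketch

import (
	"context"
	"time"
)

// The creating side uses a cause-aware cancel:
//   ctx, cancel := context.WithCancelCause(parent)
//   defer cancel(nil) // context.Cause then reports context.Canceled
//
// sleepFor waits for d or until ctx is cancelled; context.Cause lets the
// caller distinguish why cancellation happened, so no sentinel is needed.
func sleepFor(ctx context.Context, d time.Duration) error {
	t := time.NewTimer(d)
	defer t.Stop()
	select {
	case <-t.C:
		return nil // slept the full duration
	case <-ctx.Done():
		return context.Cause(ctx)
	}
}
```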

// Do NOT attempt to delete, to avoid
// knock-on effects destroying the new
// owner.
break
Contributor

Shouldn't you check the error and exit the function?

Contributor Author

I reported the error, and I do not think there is any useful action to take.

Aborting the action here, in particular, will probably livelock in some situations!

This goroutine has nothing waiting for it, and there really is nothing much to do with broken ownership. "All" that happens is that performance will suffer! This goroutine for the former owner thinks that another action stole its lease. So it makes sense to continue - it is likely to win the race to update, because it has already performed some work. Meanwhile the other action thinks that the former owner has stopped and will never finish - it should continue to avoid zombielock (a name I just invented for livelock against a stuck thread).

Which one is right? Well, if this goroutine is still running then the other action will most likely suffer. If it isn't running, it cannot do anything useful anyway.

log.WithFields(logging.Fields{
"new_owner": ownership.Owner,
}).Info("Lost ownership race")
break
Contributor

Add a comment explaining why the code continues to try to own the key. Assuming that it should, I worry that this, across multiple routines, may have a cascading effect of KV calls.

Contributor Author

Adding a comment to explain. But note that "all" that happens is that we fall back to the previous behaviour. Frankly I believe that the common cause for A losing a race to B will be that the clock on B is a bit too fast; this does not cause chains of lost races. The other common cause for A losing a race to B will be that A was stuck; again, that does not give it or any other thread an advantage against B. The final case is when all are badly stuck - for instance if you configure very short acquire and refresh intervals. In that case you do get what you would expect: a stuck system.


var abuseMergeCmd = &cobra.Command{
Use: "merge <branch URI>",
Short: "Merge nonconflicting objects to the source branch in parallel",
Contributor

Suggested change
Short: "Merge nonconflicting objects to the source branch in parallel",
Short: "Merge non-conflicting objects to the source branch in parallel",

Contributor Author

Done.

amount := Must(cmd.Flags().GetInt("amount"))
parallelism := Must(cmd.Flags().GetInt("parallelism"))

fmt.Println("Source branch:", u)
Contributor

Suggested change
fmt.Println("Source branch:", u)
fmt.Println("Source branch: ", u)

@arielshaqed (Contributor, Author) left a comment

Thanks for the thorough review!

PTAL...



// branch IF a previous worker crashed while owning that branch. It
// has no effect when there are no crashes. Reducing it increases
// write load on the branch ownership record when concurrent
// operations occur. This.
Contributor Author

Done.

Comment on lines 440 to 444
requestIDPtr := httputil.RequestIDFromContext(ctx)
requestID := "<unknown>"
if requestIDPtr != nil {
requestID = *requestIDPtr
}
Contributor Author

Done. Either way will have strange performance characteristics, of course.

// owning key.
func (w *WeakOwner) Own(ctx context.Context, owner, key string) (func(), error) {
prefixedKey := []byte(fmt.Sprintf("%s/%s", w.Prefix, key))
err := w.startOwningKey(context.Background(), owner, prefixedKey)
Contributor Author

Confusion about how cancellation would work. Restored, thanks!

return nil, fmt.Errorf("start owning %s for %s: %w", owner, key, err)
}
refreshCtx, refreshCancel := context.WithCancel(ctx)
go w.refreshKey(refreshCtx, owner, prefixedKey)
Contributor Author

Not done, I am unsure how to extract the same logic into both funcs. In particular, they have different error conditions and different handling for each error.

stopOwning := func() {
defer refreshCancel()
// Use the original context - in case cancelled twice.
err := w.Store.Delete(ctx, []byte(weakOwnerPartition), prefixedKey)
Contributor Author

Not sure why not: this key is on my partition, with my prefix, and I set it starting at l. 245. It serves to indicate ownership, and its value holds only ownership. I think I can delete it.

message WeakOwnership {
string owner = 1;
google.protobuf.Timestamp expires = 2;
string comment = 3;
Contributor Author

Suppose a system is behaving really strangely. I can look at its ownership partition and determine which operation owns it -- that's what I put here when I took ownership. I call it "comment" rather than something else because I don't want to commit to any format of what exactly goes in there.

Contributor Author

Added!

This does not depend on KV implementation, so just add a matrix to one of
the AWS/S3 flavours.
Otherwise no way to tell Esti _really_ ran it with ownership.
itaiad200 (Contributor)

@arielshaqed could you open an issue and link to it? This is hardly a minor-change :)

@itaiad200 (Contributor) left a comment

Review wip

Contributor

👍

Comment on lines 215 to 217
if errors.Is(err, kv.ErrNotFound) {
predicate = nil
}
Contributor

I really hope kv contract tests cover this

stopOwning := func() {
defer refreshCancel()
// Use the original context - in case cancelled twice.
err := w.Store.Delete(ctx, []byte(weakOwnerPartition), prefixedKey)
Contributor

It holds ownership, but it's not guaranteed to be your routine's ownership (unless I'm missing something). Some other thread might successfully set itself as owner, and your cleanup code breaks it.

arielshaqed (Contributor, Author)

@arielshaqed could you open an issue and link to it? This is hardly a minor-change :)

#8288 :-)

arielshaqed (Contributor, Author)

I really hope kv contract tests cover this

That line no longer occurs directly. We do still have it implied. I believe that it is tested by kvtest here and here.

If ownership has already been lost to another thread, do NOT delete
ownership when released.

- KV does not provide a DeleteIf operation.  Instead, use SetIf with an
  always-expired timestamp.
- Along the way, ensure "owner" string is truly unique by stringing a nanoid
  onto it.  Currently owner is the request ID, which should be unique - but
  adding randomness ensures it will always be unique regardless of the
  calling system.
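
A sketch of that "DeleteIf" emulation. The kv.SetMsgIf signature, kv.Predicate, weakOwnerPartition, and the WeakOwnership message are assumed here to match the PR, and timestamppb is google.golang.org/protobuf/types/known/timestamppb:

```go
// releaseOwnership "deletes" our record by overwriting it, via SetIf, with
// an already-expired lease. If the predicate fails, another routine took
// ownership in the meantime and its record is left untouched.
// Assumes imports: errors, time, kv, timestamppb.
func releaseOwnership(ctx context.Context, store kv.Store, key []byte,
	owner string, pred kv.Predicate) error {
	expired := &WeakOwnership{
		Owner:   owner,
		Expires: timestamppb.New(time.Unix(0, 0)), // always in the past
	}
	err := kv.SetMsgIf(ctx, store, weakOwnerPartition, key, expired, pred)
	if errors.Is(err, kv.ErrPredicateFailed) {
		return nil // we lost ownership earlier; nothing to clean up
	}
	return err
}
```
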
@arielshaqed (Contributor, Author) left a comment

Thanks!

PTAaaAAaaAAL...

stopOwning := func() {
defer refreshCancel()
// Use the original context - in case cancelled twice.
err := w.Store.Delete(ctx, []byte(weakOwnerPartition), prefixedKey)
Contributor Author

A DeleteIf operation sure would be nice here. Since we didn't spec one out, I'll replace it with a sentinel value instead.

itaiad200 (Contributor)

Sorry for the late feedback, I am reviewing it very cautiously.

I wonder if wrapping BranchUpdate is the appropriate location to "Own" a branch. Every major branch-changing operation, like commit & merge, has 2 BranchUpdate calls: one for changing the staging token and one for the actual operation, and these 2 calls don't necessarily happen one right after the other.
The current ownership model only wraps a single BranchUpdate call, thus releasing the lock prematurely.

arielshaqed (Contributor, Author)

arielshaqed commented Oct 14, 2024

Sorry for the late feedback, I am reviewing it very cautiously.

I wonder if wrapping BranchUpdate is the appropriate location to "Own" a branch. Every major branch-changing operation, like commit & merge, has 2 BranchUpdate calls: one for changing the staging token and one for the actual operation, and these 2 calls don't necessarily happen one right after the other. The current ownership model only wraps a single BranchUpdate call, thus releasing the lock prematurely.

I'm not sure this matters. There is no connection between the two update operations: my commit can succeed even if you were the last to seal staging tokens. And of course sealing is cheap. So I would not expect this to affect performance.

Of course the easy way to be sure will be to perform the experiment.

Labels: area/cataloger · include-changelog · minor-change · performance

Successfully merging this pull request may close these issues:
Experiment: Reduce contention on branch HEADs to improve merge performance