Don't spin on the main mutex while waiting for new work #8433

Open
wants to merge 5 commits into main

Conversation

@abadams (Member) commented on Oct 8, 2024

This is one solution to an issue identified by Marcos, opened for discussion. Here's the full description of the issue:

Once they run out of work to do, Halide worker threads spin for a bit checking if new work has been enqueued before calling cond_wait, which puts them to sleep until signaled. Job owners also spin waiting for their job to complete before going to sleep on a different condition variable. I hate this, but all previous attempts I have made at removing or reducing the spinning have made things slower.

One problem with this approach is that spinning is done by releasing the work queue lock, yielding, reacquiring the work queue lock, and doing the somewhat involved check to see whether there's something useful for this thread to do: new work was enqueued, the last item on a job completed, or a semaphore was released. This hammering of the lock by idle worker threads can starve the thread that actually completed the last task, delaying its ability to tell the job owner the job is done, and can also starve the job owner itself, causing it to take extra time to realize the job is done and return to Halide code. So this adds some wasted time at the end of every parallel for loop.
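
For reference, here is a minimal sketch of the spinning pattern described above. It is illustrative only: the real loop lives in thread_pool_common.h and its work check is more involved, and the names and the spin count of 40 are assumptions made for the example.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the runtime's shared state.
std::mutex work_queue_mutex;
std::atomic<int> num_enqueued_tasks{0};

// Placeholder for the "somewhat involved check" done under the lock.
bool something_useful_to_do() {
    return num_enqueued_tasks.load(std::memory_order_relaxed) > 0;
}

// Each idle iteration reacquires the shared lock just to peek at the work
// queue, which is what contends with the threads doing useful work.
void idle_spin_on_main_mutex() {
    for (int i = 0; i < 40; i++) {
        {
            std::lock_guard<std::mutex> guard(work_queue_mutex);
            if (something_useful_to_do()) {
                return;  // go run a task / notify the job owner
            }
        }
        std::this_thread::yield();
    }
    // Spin budget exhausted: fall through to cond_wait and sleep.
}
```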

This PR gets these idle threads to spin off the main mutex. I did this by adding a counter to each condition variable. Any time they are signaled, the counter is atomically incremented. Before they first release the lock, the idlers atomically capture the value of this counter. Then in cond_wait they spin for a bit doing atomic loads of the counter in between yields until it changes, in which case they grab the lock and return, or until they reach the spin count limit, in which case they go to sleep. This improved performance quite a bit over main for the blur app, which is a fast pipeline (~100us) with fine-grained parallelism. The speed-up was 1.2x! Not much effect on the more complex apps.
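
A minimal sketch of the counter-based wait described above, as I understand the description. The names, memory orderings, and spin count of 40 are assumptions, and the fallback re-checks the counter under the lock before sleeping; this is not the actual runtime code.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

struct counted_cond_var {
    std::condition_variable cv;
    std::atomic<std::uint64_t> signal_count{0};

    // Intended to be called with the associated mutex held, so the increment
    // serializes with waiters capturing the counter before they release it.
    void broadcast() {
        signal_count.fetch_add(1, std::memory_order_release);
        cv.notify_all();
    }

    void wait(std::unique_lock<std::mutex> &lock) {
        // Capture the counter before first releasing the lock.
        const std::uint64_t captured = signal_count.load(std::memory_order_acquire);
        lock.unlock();
        // Spin off the mutex: atomic loads of the counter between yields.
        for (int i = 0; i < 40; i++) {
            if (signal_count.load(std::memory_order_acquire) != captured) {
                lock.lock();
                return;  // a broadcast arrived while we were spinning
            }
            std::this_thread::yield();
        }
        // Spin budget exhausted: go to sleep. Re-checking the counter under
        // the lock avoids missing a broadcast that landed between the last
        // spin and the sleep.
        lock.lock();
        cv.wait(lock, [&] {
            return signal_count.load(std::memory_order_acquire) != captured;
        });
    }
};
```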

I'm not entirely sure it's correct: I think the counter has to be incremented with the lock held so that it serializes correctly with the idlers capturing the value of the counter before releasing the lock, but cond_signal/broadcast may legally be called without holding the mutex (though we don't currently do that). It also has the unfortunate effect of waking up all spinning threads when you signal, instead of just one of them; however, we never actually call signal, just broadcast. It also increases the size of a cond var, which might be considered a breaking change in the Halide runtime API.

Alternatives:

  • Continue to spin in the thread pool instead of in cond_wait, but on one of these counters, not the main lock
  • Somehow make some mutex lock attempts higher priority than others, so that the lock acquisitions made while spinning don't starve the lock acquisitions needed to do something useful
  • Figure out how to remove the spinning for new work entirely without hurting performance. I haven't been able to do this so far. Right now spinning helps pipeline latency on uncontended systems (unsurprising), but also helps total system throughput on Linux on contended systems (surprising).

@abadams added the dev_meeting (Topic to be discussed at the next dev meeting) label on Oct 8, 2024
@abadams (Member, Author) commented on Oct 8, 2024

Oh, another problem with the current behavior is that workers spin 40 times waiting for new work, and each spin grabs the mutex, which may itself spin 40 times to acquire it, so the upper limit on yield calls before sleeping is 40 x 40 = 1600 (!). With this change the upper limit is 80 yields before going to sleep: 40 spins on the counter, and then 40 spins to acquire the mutex again to do the rest of cond_wait.

@abadams (Member, Author) commented on Oct 8, 2024

uh oh, looks like a deadlock on the bots

@mcourteaux (Contributor) commented:
> One problem with this approach is that spinning is done by releasing the work queue lock, yielding, reacquiring the work queue lock, and doing the somewhat involved check to see if there's something useful for this thread to do, either because new work was enqueued, the last item on a job completed, or a semaphore was released.

To me it sounds like you could get away with checking the bit without locking the mutex. You only lock the mutex when you actually want to synchronize; as long as no work is available (and the bit reflects that), reading the bit without locking sounds fine to me. Once an idle worker finds an indication that there is more work, it takes the more expensive code path with locking. Then idle workers would no longer compete for the lock while they are idle, giving the worker that's actually doing something a chance to signal that it finished.

Perhaps I misunderstood.
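
A rough sketch of this suggestion, with hypothetical names: the flag, the spin count, and the overall structure are assumptions for illustration, not existing runtime code.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical shared state: enqueuers would set the flag under the lock.
std::atomic<bool> work_available{false};
std::mutex work_queue_mutex;

void idle_worker_poll() {
    for (int i = 0; i < 40; i++) {
        // Cheap check: no lock is taken while the flag says there is nothing to do.
        if (work_available.load(std::memory_order_acquire)) {
            std::lock_guard<std::mutex> guard(work_queue_mutex);
            // The flag is only a hint; re-check the real queue state under
            // the lock before dequeuing, since another worker may have won.
            // ... dequeue and run work if still present ...
            return;
        }
        std::this_thread::yield();
    }
    // Nothing showed up: fall back to the usual cond_wait path.
}
```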

@zvookin (Member) commented on Oct 11, 2024

Unfortunately I can't make the meeting today. I have not convinced myself this provides a correct implementation of a condition variable, specifically with multiple threads calling wait and signal. Though I think it strictly just increases the number of false returns from wait, which are allowed arbitrarily, so it likely does not break the contract. That, however, is a pretty weak design to be standing on. Such returns are allowed, but a good implementation is supposed to minimize them.

One thought is that if this is really improving things, a conditional critical section, which moves the actual test of whether there is more work to do or not inside the synchronization primitive, might help. But mostly my feeling is we need to get rid of the spinning entirely, at least on Linux.

The other issue here is the degree of testing required, both for correctness and, more importantly, for performance. It is very easy to improve things in one place and worsen them elsewhere. Toward that end, it would be good to collect the performance data behind the change at least minimally, including the platform info it came from, and maybe some baselines for other workloads to make sure there are no regressions.

@abadams (Member, Author) commented on Oct 11, 2024

It definitely wakes too many threads on signal - all the spinners plus one sleeping thread. We don't currently use signal. Viewed as a whole (i.e. including the spinning in thread_pool_common.h that this PR removes), if a spurious wake-up is a lock grab for no good reason, this design is a huge improvement over main (max of 1 vs 40).

There are a few alternative designs that amount to the same thing for current usage, but perhaps won't sabotage any future uses of the existing cond var. One is just moving the raw counter into the thread pool and spinning on it before calling cond_wait; the places where we signal would have to both signal and increment the counter. Another is making a "halide_spin_cond_var", where there's no signal method, and wait returns after at most 40 spins even if not signaled. It would return a bool saying whether it was signaled or timed out. This halide_spin_cond_var would be waited on before waiting on the real cond var.
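
A hypothetical sketch of the halide_spin_cond_var idea from the comment above; the API shape, spin count, and memory orderings are assumptions, not an actual Halide runtime API.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

struct halide_spin_cond_var {
    std::atomic<std::uint64_t> counter{0};

    // No signal(): broadcast is the only wake-up, matching current usage.
    void broadcast() {
        counter.fetch_add(1, std::memory_order_release);
    }

    // Returns true if a broadcast arrived while spinning, false on timeout.
    // Callers would fall back to waiting on the real cond var after a
    // false return.
    bool wait() {
        const std::uint64_t captured = counter.load(std::memory_order_acquire);
        for (int i = 0; i < 40; i++) {
            if (counter.load(std::memory_order_acquire) != captured) {
                return true;
            }
            std::this_thread::yield();
        }
        return false;
    }
};
```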

One could also imagine a design where the spin_cond_var tracks the number of spinners. New waiters spin until the counter is the initial value plus the number of spinners (including itself). Broadcast increments the counter by the number of spinners, and signal increments it by one. The number of spinners is guarded by the mutex. I think this is basically a spin semaphore? (EDIT: Actually no I think this still has spurious wake-ups if there are new waiters and new wake-ups while a spinning thread is yielding)

We're trying to get some data on our production workloads now. If that's positive, I'll also instrument the bots to record something useful on the open source apps on various platforms.
