Don't spin on the main mutex while waiting for new work #8433

This is one solution to an issue identified by Marcos, opened for discussion. Here's the full description of the issue: Once they run out of work to do, Halide worker threads spin for a bit checking if new work has been enqueued before calling cond_wait, which puts them to sleep until signaled. Job owners also spin waiting for their job to complete before going to sleep on a different condition variable. I hate this, but all previous attempts I have made at removing or reducing the spinning have made things slower. One problem with this approach is that spinning is done by releasing the work queue lock, yielding, reacquiring the work queue lock, and doing the somewhat involved check to see if there's something useful for this thread to do, either because new work was enqueued, the last item on a job completed, or a semaphore was released. This hammering of the lock by idle worker threads can starve the thread that actually completed the last task, delaying its ability to tell the job owner the job is done, and can also starve the job owner, causing it to take extra time to realize the job is all done and return back into Halide code. So this adds some wasted time at the end of every parallel for loop. This PR gets these idle threads to spin off the main mutex. I did this by adding a counter to each condition variable. Any time they are signaled, the counter is atomically incremented. Before they first release the lock, the idlers atomically capture the value of this counter. Then in cond_wait they spin for a bit doing atomic loads of the counter in between yields until it changes, in which case they grab the lock and return, or until they reach the spin count limit, in which case they go to sleep. This improved performance quite a bit over main for the blur app, which is a fast pipeline (~100us) with fine-grained parallelism. The speed-up was 1.2x! Not much effect on the more complex apps. I'm not entirely sure it's correct, because I think the counter has to be incremented with the lock held, so that it serializes correctly with the idlers capturing the value of the counter before releasing the lock, and you can call cond_signal/broadcast without holding the mutex (though we don't do that currently). It also has the unfortunate effect of waking up all spinning threads when you signal, instead of just one of them. However we never actually call signal, just broadcast. It also increases the size of a cond var, which might be considered a breaking change in the Halide runtime API.

Commits on Oct 10, 2024

I think this load can be relaxed

abadams committed Oct 10, 2024

Configuration menu

View commit details

Copy full SHA for 945e889

Browse repository at this point

Copy the full SHA

945e889 View commit details

Browse the repository at this point in the history

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't spin on the main mutex while waiting for new work #8433

Don't spin on the main mutex while waiting for new work #8433

Commits on Oct 8, 2024

Commits on Oct 9, 2024

Commits on Oct 10, 2024

Don't spin on the main mutex while waiting for new work #8433

Are you sure you want to change the base?

Don't spin on the main mutex while waiting for new work #8433

Commits on Oct 8, 2024

Commits on Oct 9, 2024

Commits on Oct 10, 2024