
Think You're Reducing Atomic SSBO Op Contention. Think Again


**NOTE: All of the below applies to a GeForce 1060 running driver 387.34 on Ubuntu, YMMV!** Although we do have testing data from other cards (scroll to the end). Update 20.07.2018: there is an error in the Intel GPU testing methodology.

What we tried to do

We tried to implement occlusion queries in a modern way with SSBO stores and atomics; essentially, different objects write to associated uints in an SSBO.

NOTE: Please stop reading if the address you are atomically operating on is DATA DEPENDENT, the following applies only to atomic operations performed on DYNAMICALLY UNIFORM addresses with DYNAMICALLY UNIFORM operands!

This actually covers most of the Append/Consume buffer scenarios.

We did what seemed reasonable and is recommended in every CUDA presentation: make sure that the separate threads/invocations in a warp/subgroup perform their atomic operations on different memory addresses (yet within the same cache line).

However, we noticed a problem when upgrading from the naive implementation to one that used the hardware-specific NV_shader_thread_group extension to make sure that not only does each separate thread in a warp write to a different address, but also that the addresses match the ordering of the threads in the warp AND differ between SMs (Compute Units).
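
A rough sketch of what that zero-contention addressing looked like; the `counts` buffer and `objectIndex` input are hypothetical placeholders, not the actual benchmark code:

```glsl
#version 450
#extension GL_NV_shader_thread_group : require

// hypothetical SSBO: one uint slot per (object, SM, thread-in-warp) triple
layout(binding = 0, std430) buffer Counters { uint counts[]; };

layout(location = 0) flat in uint objectIndex; // hypothetical per-object ID

void main()
{
    // every thread of every warp on every SM gets its own slot, so no two
    // active invocations ever contend on the same address
    uint slot = (objectIndex*gl_SMCountNV + gl_SMIDNV)*gl_WarpSizeNV + gl_ThreadInWarpNV;
    atomicAdd(counts[slot], 1u);
    // the per-slot partial counts need to be summed in a second pass
}
```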

The problem was that suddenly the whole code was 10x to 30x slower! So naturally a benchmark was constructed to investigate the different methods of incrementing that counter for a large number of pixels on screen.

NOTE: Intel GPUs actually benefit from atomic operations on different addresses, to the point of being faster than stores to the same address (most probably because Intel has more levels of GPU cache).

TL;DR

In the case of an atomicAdd to a dynamically uniform address (possibly only an atomicAdd whose return value is unused and which always adds a constant like 1u) in fragment shaders on NVidia GPUs, the shader compiler optimizes enormously and uses the black magic of thread/warp awareness to turn your atomicAdd into some sort of intra-warp parallel reduction in the background.

Definition of a Dynamically Uniform Expression: introduced in GL 3.0+, it meant that all invocations launched by the same API call would evaluate the expression to the same value. It was later relaxed to allow us to treat gl_InstanceID and gl_DrawID as dynamically uniform, and with Shader Model 6.0 it means that all invocations in the same sub-group/wave/warp produce the same value.

So in that specific case don't try to be clever and reduce atomic contention.

The most important thing to note is that other atomic operations, even with dynamically uniform addresses and operands, WILL NOT benefit from the same speed-up as atomicAdd. Even other NVidia GPUs/drivers do not perform this automagical optimization! So you NEED to do this optimization manually with ARB_shader_ballot or NV_shader_thread_group.

How does this actually happen?

Option 1: Hidden Usage of shared memory! (Unlikely)

Even though the DX, GL and Vulkan specs give us no access to shared memory during the fragment shader stage, it could possibly be used by Nvidia's Streaming Multiprocessors during the execution of the fragment shader for their own purposes.

That way, most atomics for a whole warp which operate on a dynamically uniform expression could take place in shared memory first (even without shared-mem atomics) before being issued to global memory, essentially becoming a parallel reduction followed by a single global atomic.
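
For illustration only, here is roughly what such a pre-reduction would look like if we could express it ourselves in a compute shader (the fragment stage exposes no shared memory, so this is purely hypothetical and not something the benchmark contains):

```glsl
#version 450
layout(local_size_x = 32) in; // one warp-sized workgroup

layout(binding = 0, std430) buffer Counters { uint counts[]; };

shared uint partialSum[32]; // the 128 bytes of shared memory mentioned below

void main()
{
    // every invocation deposits its contribution (a constant 1u here)
    partialSum[gl_LocalInvocationIndex] = 1u;
    barrier();
    // classic tree reduction in shared memory, no atomics required
    for (uint stride = 16u; stride > 0u; stride >>= 1u)
    {
        if (gl_LocalInvocationIndex < stride)
            partialSum[gl_LocalInvocationIndex] += partialSum[gl_LocalInvocationIndex + stride];
        barrier();
    }
    // a single global atomic per warp instead of 32 of them
    if (gl_LocalInvocationIndex == 0u)
        atomicAdd(counts[0], partialSum[0]);
}
```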

This mechanism, however, is unlikely for multiple reasons:

  1. This would reduce occupancy for both fragment shader warps and async compute (needing to reserve 128 bytes of shared mem for each warp)
  2. The logic and code to implement and test this for the shader compiler would be enormous and have many corner cases (especially in the case when the return value is being used)

The test data seems to confirm that this is not the approach used, as atomicMin and atomicMax operations under the same circumstances do not benefit at all.

One could test and deduce more by checking whether non-constant operands affect performance; if they do, it would mean that the shared-mem approach is indeed not the one being used.

Option 2: Clever implicit usage of ARB_shader_ballot/NV_shader_thread_group

This is most likely what's happening, as I could emulate it by hand and get similar if not exactly the same performance.

Basically, we can do what AMD provides with mbcnt: through a combination of ballotARB(!gl_HelperInvocation) and bitCount we can actually see how many invocations/threads in a sub-group/warp wanted to execute the atomicAdd, and elect only one invocation/thread in the sub-group/warp to perform the atomic.

Note that the memory address and the atomic operation operand need to be the same for the whole sub-group/warp, or the implementation of this becomes very convoluted, difficult, and even slower than the naive approach.
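
Here is a minimal sketch of that hand-made emulation with ARB_shader_ballot; the `Counters` buffer and `objectIndex` input are hypothetical names standing in for whatever your occlusion-query output actually is:

```glsl
#version 450
#extension GL_ARB_shader_ballot : require
#extension GL_ARB_gpu_shader_int64 : require

layout(binding = 0, std430) coherent buffer Counters { uint counts[]; };

layout(location = 0) flat in uint objectIndex; // dynamically uniform per draw

void main()
{
    // ballot of all non-helper invocations in the sub-group that want the atomic
    uint64_t ballot = ballotARB(!gl_HelperInvocation);
    // count how many of them there are
    uvec2 halves = unpackUint2x32(ballot);
    uint wanted = uint(bitCount(halves.x) + bitCount(halves.y));
    // elect the lowest-indexed non-helper invocation to issue a single atomicAdd
    if (!gl_HelperInvocation && (ballot & gl_SubGroupLtMaskARB) == 0ul)
        atomicAdd(counts[objectIndex], wanted);
}
```

Note how both the address (counts[objectIndex]) and the summed operand stay dynamically uniform across the sub-group, which is exactly the precondition stated above.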

Option 3: Hitting the old Atomic Counter Object / Occlusion Query / Transform Feedback fast-path

Intel has already stated that it implements ACBs (Atomic Counter Buffers) with the same instructions and hardware as more general atomic operations on SSBOs. CUDA and OpenCL do not have a concept of an object resembling an ACB, so atomic operations can be performed on ANY global memory address.

This leads me to believe that there are fast-paths necessary for performant ACBs (internally used for Transform Feedback from Geometry Shaders) and Occlusion Queries (the old variant giving precise pixel counts) which are optimized for the specific pseudocode being executed per-thread/per-invocation:

```glsl
if (someDynamicCondition) atomicAdd(output[dynamicallyUniformExpression], 1u);
```

And that this optimization is generally applied in the shader compiler.

First Round of Benchmarking

At first, a complete 8-case (actually 7 without the control) benchmark was constructed: Appropriate Old Revision With the Old Benchmark in the Repository

The cases examined were (with an assumption of 4 Compute Units/SMs in the GPU):

  1. Naive: all invocations atomically increment the same address
  2. Using gl_FragCoord to try to deduce gl_ThreadInWarpNV for NVIDIA (which actually matched up with gl_ThreadInWarpNV)
  3. Using gl_FragCoord to try to deduce gl_SubGroupInvocationARB for Intel GPUs, assuming the 4x2 SIMD configuration is always used for the fragment shader and rasterization (this was actually deducing the group ID, not the invocation ID)
  4. Same as method 3 but actually using gl_SubGroupInvocationARB to provide the first log2(gl_SubGroupSizeARB) bits of the address for ideal sub-group-to-cache-line correspondence and ordering
  5. Using the current invocation's thread-in-warp ID and active SM ID (explicitly given by NV_shader_thread_group) to ensure absolutely zero atomic contention amongst all active invocations on the whole GPU
  6. Just using gl_SubGroupInvocationARB and not caring about inter-CU global contention
  7. Using gl_SMIDNV to eliminate contention between CU/SMs but not in the warp.
  8. A control where all invocations write "1" to the same address in an SSBO (with and without the coherent qualifier, which gave identical results)

And the results of this old benchmark speak for themselves

It was actually the surprising difference in performance between cases 6 and 7 which hinted to me that the compiler/GPU or something else is doing an "intra-warp" reduction before actually firing off a single atomicAdd in place of one op per thread!

However, these results came from a single GPU (a mobile GeForce 1060) and we could not rely on all GPUs handling and optimizing the naive case so swimmingly!

Second Round

Armed with the rather odd benchmark results, we pondered how we could leverage the warp/sub-group information to improve on the naive case or at least ensure the same performance without blind faith in the shader compiler or some exotic built-in instructions.

Revision With the Benchmark in the Repository

First, the SSBO was made to be 1MB regardless of the method used, and an additional offset (dependent on a uniform variable) was added to the address in the SSBO on which the atomic is performed. This was done to account for caching, prefetches, etc. It made no difference.

The first and last case were left the same to allow for comparisons, but the rest were changed.

  1. Naive: all invocations atomically increment the same address
  2. Same as previously, but with the result shifted down by 5 bits so that each whole NVidia warp would get the same address
  3. ~~This time only each whole sub-group gets assigned a different address depending on gl_FragCoord, and the heuristic is vastly improved with a compile-time preprocessor macro defining different rasterization micro-tile dimensions for INTEL and non-INTEL GPUs (AMD's, as Nvidia always has NV_shader_thread_group available)~~ This was actually deducing the group ID, not the invocation ID
  4. Same as method 3 but using ballotARB to perform a pseudo-reduction and to elect only 1 invocation in a sub-group to write the result to memory
  5. Using NV_shader_thread_group's gl_SMIDNV to give each CU a different global address
  6. Only using ballotARB to perform a pseudo-reduction and electing only one invocation in a sub-group to perform atomicAdd, BUT ON A SINGLE GLOBAL ADDRESS FOR ALL CUs
  7. Only using activeThreadsNV to perform a pseudo-reduction and electing only one thread in a warp to perform atomicAdd, BUT ON A SINGLE GLOBAL ADDRESS FOR ALL SMs
  8. A control where all invocations write "1" to the same address in an SSBO (with and without the coherent qualifier, which gave identical results)

The results were very informative and helped us to confirm a number of theories.

**UPDATE 20.07.2018:** Actually, what we measured on the Intel card (method 3) was invalid; what should have been measured was using the lowest significant bits of gl_FragCoord to get unique addresses per-pixel, and not per-block of pixels. With that taken into account, the results were not 850ms but somewhere around 130ms, which strangely enough is also faster than a simple store to the same address.
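
In other words (a sketch assuming Intel's 4x2 fragment tiling and the same hypothetical counts/objectIndex names as earlier), the corrected method derives the slot from the low bits of gl_FragCoord so that every pixel of a tile lands on its own address:

```glsl
// low 2 bits of x and low 1 bit of y select one of the 8 slots of a 4x2 tile
uint lane = (uint(gl_FragCoord.x) & 3u) | ((uint(gl_FragCoord.y) & 1u) << 2u);
atomicAdd(counts[objectIndex*8u + lane], 1u);
```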

Third Round (Unpublished)

We were curious as to why the atomicAdd approach was 100% faster than the control case. Surely a global atomic lock had to be used, and if the compiler could optimize for constant values being written to the same address, it would surely do so here.

This would mean that people implementing their SSBO occlusion queries naively with atomic counters would actually be 100% faster than the famous OpenGL 4.4 occlusion query technique promoted by NVidia.

We thought of reasons for why this anomaly was happening:

  1. In the old CUDA presentations it is mentioned that global memory writes, as opposed to reads, are not cached (which is why scatter was/is slower than gather), but atomic operands must live in L2
  2. The coherent or writeonly qualifiers make a difference
  3. Somehow the naive case does a local pre-reduction and performs fewer global memory accesses than the plain write

As the 1st reason was impossible to verify with testing, we proceeded to test the 2nd possible reason and found that neither the presence nor the lack of coherent and/or writeonly made a difference.

It was only after implementing software optimizations using ballotARB that we thought to do the same optimization on the control case, i.e. elect only one invocation in a group to do the global store. And that's when the control case ended up being 2-10% faster than the naive atomicAdd first benchmark case.
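
For reference, the ballot-elected store is just the earlier ARB_shader_ballot sketch with the atomic swapped for a plain write:

```glsl
// elect the lowest-indexed non-helper invocation of the sub-group...
uint64_t ballot = ballotARB(!gl_HelperInvocation);
if (!gl_HelperInvocation && (ballot & gl_SubGroupLtMaskARB) == 0ul)
    counts[objectIndex] = 1u; // ...and let only it perform the single store
```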

Later we also tried to implement the store with an atomic op, without any ARB_shader_ballot optimizations, using atomicMax, atomicOr, atomicExchange and atomicCompSwap, simply to see if we would get the same speed as atomicAdd. Obviously none of the return values were used and the operand was always a constant 1u.

Hilariously, everything was 40x slower than atomicAdd; despite the fact that the exact same optimization with ballotARB could have been applied to all these cases, it seems the compiler only cares about atomicAdd.

This is a very important thing to note for people who are using these other operations and counting on performance comparable to atomicAdd: you NEED to do this optimization manually with ARB_shader_ballot or NV_shader_thread_group.

Benchmark and Results (on other GPU and driver configurations)

Benchmark Example available as source, a compiled Linux binary, and soon a compiled Windows EXE.

Benchmark Results

Tested so far:

  • Linux NVidia 1060 Mobile
  • Windows NVidia 660 Desktop (follows expectations more closely)

Conclusion

Because the automagic compile-time optimization cannot be guaranteed across vendors, nor even across GPUs and drivers from the same vendor, one must manually implement the balloting optimization.

This is especially important given that the older but capable (having both NV_shader_thread_group and ARB_shader_ballot extensions) NVidia GPUs such as the 660 do not exhibit this peculiar behaviour, which means you cannot count on the shader compiler to do this magic for you.

How to implement it:

  1. If you have NV_shader_thread_group, use it, but don't try to use gl_SMIDNV to ease inter-SM contention* (see the sketch below)
  2. Else, if you only have ARB_shader_ballot, use it, but don't try to deduce the CU to ease inter-CU contention
  3. Otherwise, unless you have an INTEL GPU on Windows, just issue atomics on the same global address! Assume the GPU is too old and too dumb to notice that the atomic ops are happening on unique addresses across the invocation sub-group, and that unique addresses will actually serialize your atomics even further as the GPU will have to deal with many more locks than for a single address.
  4. Intel on Windows just has to be a special snowflake; as such, it does not expose ARB_shader_ballot on any recent GPUs, while the Mesa (Linux) version of the driver does. In that case, if point 2 does not apply, use the gl_FragCoord approach (benchmark method 3): pick a unique address with tile dimensions specifically tuned to match Intel's SIMD8 (4x2) or SIMD16 (8x2) mode. It should win you about 100% speed over the naive case.

So basically all approaches except 4 call for zero duplication of counters and no 2nd-pass merge in the SSBO.

  • Using gl_SMIDNV will buy you a couple % perf on older cards but it will lose you a couple % on newer cards, although this may be only applicable to the atomicAdd(dynamicallyUniformAddress, dynamicallyUniformConstant) case
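
The sketch promised in point 1, using only NV_shader_thread_group built-ins (again with the hypothetical counts/objectIndex names) and deliberately not splitting the address by gl_SMIDNV:

```glsl
#version 450
#extension GL_NV_shader_thread_group : require

layout(binding = 0, std430) coherent buffer Counters { uint counts[]; };

layout(location = 0) flat in uint objectIndex; // dynamically uniform per draw

void main()
{
    // mask of the non-helper threads in this warp that want the atomic
    uint vote = ballotThreadNV(!gl_HelperThreadNV);
    // elect the lowest-indexed one of them to issue a single warp-wide atomicAdd
    if (!gl_HelperThreadNV && (vote & gl_ThreadLtMaskNV) == 0u)
        atomicAdd(counts[objectIndex], uint(bitCount(vote)));
}
```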

Open Question?

Do the same awesome optimizations happen when we feed NVidia GPUs SPIR-V shaders in Vulkan? Is AMD just as clever?