
Best IGPUBuffer Types for UBO data sourcing

AnastaZIuk edited this page Oct 26, 2020 · 1 revision

Benchmark cases

  1. New buffer created each frame and destroyed at the end of the frame (impossible on Vulkan, where a staging buffer is needed anyway)
  2. One buffer updated each frame
  3. One buffer kept mapped for the whole runtime, updated (and bound) each frame in a 4-way round-robin manner
  4. Same as case 3, but with the coherent flag
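
The 4-way round robin of cases 3 and 4 can be sketched on the CPU side roughly as follows. This is a minimal illustration, not the benchmark's code: the class `RoundRobinUBO` and its members are hypothetical names, and a plain `std::vector` stands in for the pointer a persistently mapped buffer would provide. The OpenGL calls are only named in comments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: models one buffer split into 4 slices, one per frame in flight.
// In real OpenGL code the storage would be the pointer returned by glMapBufferRange
// on a buffer created with glBufferStorage and GL_MAP_PERSISTENT_BIT.
class RoundRobinUBO
{
public:
    static constexpr std::size_t SliceCount = 4; // 4-way round robin

    // Assumption: sliceSize is already a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT.
    explicit RoundRobinUBO(std::size_t sliceSize)
        : m_sliceSize(sliceSize), m_storage(sliceSize * SliceCount) {}

    // Byte offset of the slice that a given frame writes to and binds.
    std::size_t offsetForFrame(std::uint64_t frame) const
    {
        return static_cast<std::size_t>(frame % SliceCount) * m_sliceSize;
    }

    // Copies this frame's data into its slice and returns the offset to bind.
    std::size_t update(std::uint64_t frame, const void* data, std::size_t size)
    {
        assert(size <= m_sliceSize);
        const std::size_t offset = offsetForFrame(frame);
        std::memcpy(m_storage.data() + offset, data, size);
        // Case 3 (non-coherent mapping): an explicit flush or barrier would go here.
        // Case 4 (GL_MAP_COHERENT_BIT): no explicit flush is needed.
        // Either way the slice would then be bound, e.g. with glBindBufferRange.
        return offset;
    }

    const unsigned char* data() const { return m_storage.data(); }

private:
    std::size_t m_sliceSize;
    std::vector<unsigned char> m_storage;
};
```

Frames 0, 1, 2, 3 write to slices 0 to 3, and frame 4 wraps back to slice 0. A real implementation would also need some synchronization (for example a fence per slice) so that a slice is not overwritten while the GPU still reads it; that part is omitted here.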

Each of the above was tested in all 16 combinations of 4 conditions:

  • IN_CPU_MEM - buffer storage in client memory
  • RASTERIZER_DISCARD - rasterizer off
  • DONT_UPDATE_BUF - not updating buffer each frame
  • ATTRIB_DIVISOR - value of the attribute divisor (0 or 1); if equal to 1, then the vertex buffer is 40 bytes (10 floats; max instance count is 10)
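
One way to enumerate the 16 combinations is to treat each condition as one bit of a 4-bit mask. The sketch below assumes that encoding; the flag names come from the list above, while the bit positions and the `describe` helper are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>

// Flag names are taken from the list above; the bit assignment is an assumption.
enum ConditionFlags : std::uint32_t
{
    IN_CPU_MEM         = 1u << 0,
    RASTERIZER_DISCARD = 1u << 1,
    DONT_UPDATE_BUF    = 1u << 2,
    ATTRIB_DIVISOR     = 1u << 3,
};

constexpr std::uint32_t CombinationCount = 1u << 4; // 2^4 = 16 combinations

// Hypothetical helper: returns a readable label for one combination.
std::string describe(std::uint32_t mask)
{
    assert(mask < CombinationCount);
    static const std::array<const char*, 4> names = {
        "IN_CPU_MEM", "RASTERIZER_DISCARD", "DONT_UPDATE_BUF", "ATTRIB_DIVISOR"};
    std::string out;
    for (std::uint32_t bit = 0; bit < names.size(); ++bit)
    {
        if (mask & (1u << bit))
        {
            if (!out.empty())
                out += '|';
            out += names[bit];
        }
    }
    return out.empty() ? "NONE" : out;
}
```

Looping `mask` from 0 to `CombinationCount - 1` then visits every combination exactly once for each benchmark case.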

Benchmark structure

100 unique meshes are rendered (randomly 1-10 instances, 3k-12k vertices each) using 16 different vertex shaders. Everything rendered sums up to about 4.8M vertices (1.6M triangles). Meshes are rendered in 16 "batches", i.e. one shader after another. Each batch is about 100k triangles. Rendering targets a 64x64 depth-only framebuffer, which reduces the fragment shader and rasterizer workload.
Data for the vertex shader is taken only from the uniform buffer; however, one attribute buffer is present (along with an active attribute array), since glDrawArrays* does nothing if no attributes are active. Every shader interprets the UBO data as a slightly different structure, and every element of the UBO divided this way is read by some VS invocation.
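
As a quick consistency check of the figures above, the following sketch assumes non-indexed triangle lists (3 vertices per triangle, which matches the 4.8M / 1.6M ratio). The function names are hypothetical.

```cpp
#include <cstdint>

// Assumption: non-indexed triangle lists, so every 3 vertices form one triangle.
constexpr std::uint64_t trianglesFromVertices(std::uint64_t vertices)
{
    return vertices / 3;
}

// Average triangle count per batch when the work is split evenly across batches.
constexpr std::uint64_t trianglesPerBatch(std::uint64_t vertices, std::uint64_t batches)
{
    return trianglesFromVertices(vertices) / batches;
}

static_assert(trianglesFromVertices(4'800'000) == 1'600'000, "4.8M vertices -> 1.6M triangles");
static_assert(trianglesPerBatch(4'800'000, 16) == 100'000, "16 batches -> 100k triangles each");
```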

Alternative structure

100 dispatches of 16 different compute shaders, each dispatch with a random work group count (about 15-30) and a work group size of 1024. All dispatches sum up to about 3M invocations. As in the primary benchmark structure, dispatches are performed in 16 batches and each compute shader interprets the UBO data as a different structure. Each UBO element is read by some CS invocation. To prevent the compiler from optimizing away the reads because the CS outputs nothing, there is a conditional write to an SSBO (the condition never succeeds, but the compiler should not be able to tell).
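
The total invocation count follows from the per-dispatch work group counts and the work group size. A minimal sketch, with `totalInvocations` as a hypothetical helper name:

```cpp
#include <cstdint>
#include <vector>

// Sums workGroupCount * workGroupSize over all dispatches.
// Assumes one-dimensional dispatches, i.e. a single group count per dispatch.
std::uint64_t totalInvocations(const std::vector<std::uint32_t>& workGroupCounts,
                               std::uint32_t workGroupSize)
{
    std::uint64_t total = 0;
    for (const std::uint32_t count : workGroupCounts)
        total += static_cast<std::uint64_t>(count) * workGroupSize;
    return total;
}
```

With 100 dispatches of exactly 15 groups each this gives 1,536,000 invocations, and with exactly 30 groups each it gives 3,072,000, so the quoted total sits near the upper end of that range.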

Results

Note that there are multiple sheets with results for various GPUs.

Source code

The source code is divided into two programs, available here (all the basic cases) and here (staging buffer).