Advanced API Performance: Descriptors

By using descriptor types, you can bind resources to shaders and specify how those resources are accessed. This creates efficient communication between the CPU and GPU and enables shaders to access the necessary data during rendering.

Recommended

  • Prefer a “bindless” design.
    • Use unbounded array descriptors pointing to big descriptor tables or sets with all known textures, buffers, and acceleration structures needed for the frame.
    • Upload as much data upfront as possible (textures, per-draw constants, and per-frame constants) and make them accessible through these descriptor arrays.
    • This design also makes it easier to implement ray tracing, which needs access to every texture and buffer from each shader.
    • Cache descriptors on GPU-visible descriptor heaps (DirectX 12) or sets (Vulkan) with a known offset. This lowers the CPU overhead and virtually eliminates the need for copying descriptors.
    • Use multiple copies of the heap to handle descriptor changes gracefully, such as streaming textures and buffers. However, do not exceed the limits of 1M active descriptors and 2K samplers. For more information, see the Not Recommended section later in this post.
  • Use root (DirectX 12) or push (Vulkan) constants. They are the fastest way to transfer per-draw varying constants.
  • On Pascal: Prefer CBVs over SRVs for constant data.
    • Generally, SRV buffers are slower than CBV buffers on Pascal and earlier GPUs.
    • Performance is equivalent on Volta and up.
    • Better yet, try using root constants.
      • They can be faster, even for infrequently changing data (for example, material data, pass data, and per-frame data).
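
The advice to cache descriptors "with a known offset" can be sketched as CPU-side bookkeeping. The following is a minimal illustration, not a definitive implementation: the class name DescriptorSlotAllocator and its methods are hypothetical, and it makes no graphics API calls. It assumes that each slot index would map to an offset in a GPU-visible descriptor heap (DirectX 12) or to an array element in a descriptor set (Vulkan), and that a slot is only reused after the GPU has finished the last frame that referenced it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side bookkeeping for a bindless descriptor array.
// Each resource keeps one slot for its whole lifetime, so shaders can index
// it by a stable offset and no per-frame descriptor copies are needed.
class DescriptorSlotAllocator {
public:
    explicit DescriptorSlotAllocator(std::uint32_t capacity) : capacity_(capacity) {}

    // Returns a stable slot index, reusing collected slots first.
    std::uint32_t allocate() {
        if (!free_.empty()) {
            std::uint32_t slot = free_.back();
            free_.pop_back();
            return slot;
        }
        assert(next_ < capacity_ && "descriptor array is full");
        return next_++;
    }

    // Marks a slot for reuse once the GPU has finished lastUsedFrame.
    void release(std::uint32_t slot, std::uint64_t lastUsedFrame) {
        pending_.push_back({slot, lastUsedFrame});
    }

    // Call once per frame with the newest frame index the GPU has completed.
    void collect(std::uint64_t completedFrame) {
        std::size_t kept = 0;
        for (const Pending& p : pending_) {
            if (p.frame <= completedFrame) free_.push_back(p.slot);
            else pending_[kept++] = p;
        }
        pending_.resize(kept);
    }

    // Slots that are allocated or still waiting for the GPU.
    std::uint32_t slotsInUse() const {
        return next_ - static_cast<std::uint32_t>(free_.size());
    }

private:
    struct Pending { std::uint32_t slot; std::uint64_t frame; };
    std::uint32_t capacity_;
    std::uint32_t next_ = 0;
    std::vector<std::uint32_t> free_;
    std::vector<Pending> pending_;
};
```

In this sketch, a streamed-out texture would call release with the current frame index, and its slot becomes available again only after collect sees that frame complete. Deferred reuse is one possible way to handle descriptor changes; keeping multiple copies of the heap, as suggested above, is another.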

DirectX 12

  • Feel free to use all 64 DWORDs available in the root signature.
  • Performance ranking on both GPU and CPU:
    1. Root constants are the fastest with no indirections, and they are directly indexable.
    2. Root CBV/SRV/UAV are the second fastest, with single indirection and no bounds checking.
    3. Descriptor tables are the slowest, with two indirections and bounds checking.
  • Use dynamic resource binding, which is available in HLSL SM 6.6.
    • This enables you to omit some descriptor tables from the root signature, for more space for root constants and other data. For more information, see In the works: HLSL Shader Model 6.6.
  • Switching root signatures is a fast operation.
    • The usage of multiple root signatures to improve binding efficiency could be a valid strategy.
    • This especially holds true for a non-bindless design.
    • However, switching root signatures causes existing bindings to be lost, so it could be inefficient if a lot of data has to be rebound unnecessarily.
  • Use Root Signature 1.1 to get slightly more performance in some cases.
    • In particular, using DATA_STATIC_WHILE_SET_AT_EXECUTE where possible enables the driver to inline some data early.
    • This is not a high priority; only use it whenever it is convenient to do so.
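
To make the 64 DWORD budget concrete, here is a small sketch that adds up the cost of a planned root signature. It relies on the documented Direct3D 12 costs: each root constant takes 1 DWORD, each root descriptor (CBV, SRV, or UAV) takes 2 DWORDs, and each descriptor table takes 1 DWORD. The types and function names (RootParam, rootSignatureCost, and so on) are hypothetical and do not come from the D3D12 headers.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the three kinds of root parameters.
enum class RootParamKind { Constants, Descriptor, DescriptorTable };

struct RootParam {
    RootParamKind kind;
    std::uint32_t num32BitValues;  // only meaningful for Constants
};

constexpr std::uint32_t kMaxRootSignatureDwords = 64;

// Cost of one root parameter in DWORDs.
inline std::uint32_t rootParamCost(const RootParam& p) {
    switch (p.kind) {
        case RootParamKind::Constants:       return p.num32BitValues;  // 1 DWORD each
        case RootParamKind::Descriptor:      return 2;  // 64-bit GPU virtual address
        case RootParamKind::DescriptorTable: return 1;
    }
    return 0;
}

// Total cost of a planned root signature in DWORDs.
inline std::uint32_t rootSignatureCost(const std::vector<RootParam>& params) {
    std::uint32_t total = 0;
    for (const RootParam& p : params) total += rootParamCost(p);
    return total;
}

inline bool fitsInRootSignature(const std::vector<RootParam>& params) {
    return rootSignatureCost(params) <= kMaxRootSignatureDwords;
}

// DWORDs still free for more root constants. Requires a layout that fits.
inline std::uint32_t remainingRootDwords(const std::vector<RootParam>& params) {
    std::uint32_t cost = rootSignatureCost(params);
    assert(cost <= kMaxRootSignatureDwords);
    return kMaxRootSignatureDwords - cost;
}
```

For example, a layout with 16 root constants, two root CBVs, and one descriptor table costs 16 + 2 + 2 + 1 = 21 DWORDs, which leaves 43 DWORDs for additional root constants.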

Vulkan

  • Try to keep the number of descriptor sets in pipeline layouts as low as possible.
  • Use dynamic uniform and storage buffers for per-draw call changes.
  • Prefer using combined image and sampler descriptors.
  • Vulkan 1.2 enables passing device addresses of storage buffers as 64-bit values to shaders. This enables pointer-like workflows (such as casting) that are not available in DirectX or HLSL. GLSL exposes this through GL_EXT_buffer_reference and GL_EXT_buffer_reference2, and uses SPV_EXT_physical_storage_buffer. Try to make good use of the buffer_reference_align information, because the hardware can use wider memory load operations accordingly.
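
When using dynamic uniform buffers for per-draw changes, each dynamic offset must be a multiple of the device's minUniformBufferOffsetAlignment limit, which is a power of two. The sketch below shows one way to lay out per-draw constants in a single buffer under that rule. It contains no Vulkan calls, and the names alignUp, PerDrawLayout, planPerDrawBuffer, and dynamicOffsetForDraw are hypothetical helpers. The alignment value is assumed to have been queried from VkPhysicalDeviceLimits.

```cpp
#include <cassert>
#include <cstdint>

// Rounds size up to the next multiple of alignment (must be a power of two).
inline std::uint64_t alignUp(std::uint64_t size, std::uint64_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

// Hypothetical layout of one buffer holding constants for many draws.
struct PerDrawLayout {
    std::uint64_t stride;     // bytes between consecutive draws
    std::uint64_t totalSize;  // bytes to allocate for the whole buffer
};

// minOffsetAlignment is assumed to be minUniformBufferOffsetAlignment.
inline PerDrawLayout planPerDrawBuffer(std::uint64_t constantsSize,
                                       std::uint64_t minOffsetAlignment,
                                       std::uint32_t drawCount) {
    std::uint64_t stride = alignUp(constantsSize, minOffsetAlignment);
    return PerDrawLayout{stride, stride * drawCount};
}

// Value to pass as the dynamic offset when binding the set for a draw.
inline std::uint32_t dynamicOffsetForDraw(const PerDrawLayout& layout,
                                          std::uint32_t drawIndex) {
    std::uint64_t offset = layout.stride * drawIndex;
    assert(offset < layout.totalSize);
    return static_cast<std::uint32_t>(offset);
}
```

With 80 bytes of constants per draw and an alignment of 256, the stride becomes 256 bytes and the third draw (index 3) uses a dynamic offset of 768. The descriptor set itself stays the same across draws; only the offset changes.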

Not Recommended

  • Do not exceed 1M active descriptors and 2K samplers in total for the whole application (GPU-visible).
    • Otherwise, pipeline stalls across the whole GPU could occur when switching descriptor heaps (DirectX 12).
    • Exceeding the limits also reduces the asynchronous execution efficiency of command lists.
    • On Vulkan, the deduplication of descriptors is automatically performed by the driver. The limits mentioned earlier only count towards unique variations.
      • In general, try to keep under the thresholds described in VkPhysicalDeviceLimits.
  • Avoid typed UAV loads or stores where possible.

DirectX 12

  • Prevent the excessive creation or copying of descriptors during the frame.
    • Keep descriptors around persistently instead of re-allocating or copying them in each frame.
    • Use root CBVs instead of CBVs in a descriptor table.
      • There’s no need to call CreateConstantBufferView with a root CBV.
    • The careful selection of smaller descriptor tables could also improve the situation.
  • Reduce duplicate descriptors to the same resources as much as possible.
    • Example: Texture 0 should not be referenced in Descriptor 0, 10, 20, 30, 40, 50, and so on.
    • Instead, try changing the layout of the descriptor tables so that the same descriptor can be reused multiple times.
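
One way to avoid duplicate descriptors is to look up an existing descriptor before creating a new one. The following is a minimal sketch under stated assumptions: DescriptorCache and ResourceId are hypothetical names, a resource is identified by a 64-bit handle, and the returned index stands for a position in a descriptor table. Actual descriptor creation is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical handle that uniquely identifies a texture or buffer.
using ResourceId = std::uint64_t;

// Hands out at most one descriptor index per resource, so that the same
// texture is never referenced by several descriptors.
class DescriptorCache {
public:
    explicit DescriptorCache(std::uint32_t capacity) : capacity_(capacity) {}

    // Returns the existing index for a known resource, or assigns the next one.
    std::uint32_t indexFor(ResourceId id) {
        auto it = indexByResource_.find(id);
        if (it != indexByResource_.end()) return it->second;

        std::uint32_t index = static_cast<std::uint32_t>(indexByResource_.size());
        assert(index < capacity_ && "descriptor table is full");
        // A real renderer would create the descriptor at this index here.
        indexByResource_.emplace(id, index);
        return index;
    }

    std::size_t descriptorCount() const { return indexByResource_.size(); }

private:
    std::uint32_t capacity_;
    std::unordered_map<ResourceId, std::uint32_t> indexByResource_;
};
```

Requesting the same resource twice returns the same index, so draw calls that share a texture can pass that index to the shader instead of each owning a separate descriptor.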

Vulkan

  • Do not have excessively sparse binding offsets in a single descriptor set.
    • Keep bindings as tightly packed as possible.
    • Unused binding indices waste memory and reduce cache efficiency.
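
Tight packing can be checked or enforced with a simple remapping step, for example in a shader build tool. The sketch below is only an illustration of the idea: compactBindings and unusedBindingSlots are hypothetical helpers that operate on plain binding numbers, and applying the remap to shaders and to the descriptor set layout is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Maps each used binding number to a dense index 0..N-1, preserving order.
inline std::map<std::uint32_t, std::uint32_t>
compactBindings(std::vector<std::uint32_t> used) {
    std::sort(used.begin(), used.end());
    used.erase(std::unique(used.begin(), used.end()), used.end());

    std::map<std::uint32_t, std::uint32_t> remap;
    for (std::uint32_t i = 0; i < used.size(); ++i) remap[used[i]] = i;
    assert(remap.size() == used.size());
    return remap;
}

// Number of binding indices left unused at or below the highest binding.
inline std::uint32_t unusedBindingSlots(const std::vector<std::uint32_t>& used) {
    if (used.empty()) return 0;
    std::uint32_t highest = *std::max_element(used.begin(), used.end());
    std::uint32_t distinct =
        static_cast<std::uint32_t>(compactBindings(used).size());
    return (highest + 1) - distinct;
}
```

For a set that uses bindings 0, 7, and 32, this reports 30 unused indices and suggests remapping them to 0, 1, and 2.
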