Advanced API Performance: Shaders

This post covers best practices when working with shaders on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

Shaders play a critical role in graphics programming by enabling you to control various aspects of the rendering process. They run on the GPU and are responsible for manipulating vertices, pixels, and other data.

General shaders
Compute shaders
Pixel shaders
Vertex shaders
Geometry, domain, and hull shaders

General shaders

These tips apply to all types of shaders.

Recommended

Avoid warp-divergent constant buffer view (CBV) and immediate constant buffer (ICB) reads.
- Constant buffer reads are most effective when threads in a warp access data uniformly. If you need divergent reads, use shader resource views (SRVs).
- Typical cases where SRVs should be preferred over CBVs include the following:
  - Bones or skinning data
  - Lookup tables, like precomputed random numbers
To optimize buffers and group shared memory, use manual bit packing. When creating structures for packing data, consider the range of values a field can hold and choose the smallest datatype that can encompass this range.
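As a rough illustration of the bit-packing tip, here is a CPU-side C++ sketch; the same shifts and masks carry over to shader code. The `Particle` record, its field names, and the value ranges are hypothetical, chosen only so that three fields fit in one 32-bit word.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical record (assumed ranges, for illustration only):
//   materialId : 0..1023 -> 10 bits
//   lifetime   : 0..4095 -> 12 bits
//   flags      : 0..1023 -> 10 bits
constexpr std::uint32_t kMaterialBits = 10;
constexpr std::uint32_t kLifetimeBits = 12;
constexpr std::uint32_t kFlagBits = 10;

// Mask with the low `bits` bits set.
constexpr std::uint32_t mask(std::uint32_t bits) { return (1u << bits) - 1u; }

struct Particle {
    std::uint32_t materialId;
    std::uint32_t lifetime;
    std::uint32_t flags;
};

// Pack three fields into a single 32-bit word instead of three.
inline std::uint32_t packParticle(const Particle& p) {
    assert(p.materialId <= mask(kMaterialBits));
    assert(p.lifetime <= mask(kLifetimeBits));
    assert(p.flags <= mask(kFlagBits));
    return p.materialId
         | (p.lifetime << kMaterialBits)
         | (p.flags << (kMaterialBits + kLifetimeBits));
}

// Recover the fields with the matching shifts and masks.
inline Particle unpackParticle(std::uint32_t word) {
    Particle p;
    p.materialId = word & mask(kMaterialBits);
    p.lifetime = (word >> kMaterialBits) & mask(kLifetimeBits);
    p.flags = (word >> (kMaterialBits + kLifetimeBits)) & mask(kFlagBits);
    return p;
}
```

In this sketch the packed form takes 4 bytes per element instead of 12. Whether the extra unpacking math pays off depends on how bandwidth-bound the shader is, so it is worth profiling.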
Optimize control flow by providing hints of the expected runtime behavior.
- If possible, enable the compile flag -all-resources-bound for DXC (or D3DCOMPILE_ALL_RESOURCES_BOUND in FXC). This enables a larger set of driver-side optimizations.
- Consider using the [FLATTEN] and [BRANCH] keywords where appropriate.
  - A conditional branch may prevent the compiler from hoisting long-latency instructions, such as texture fetches.
  - The [FLATTEN] keyword hints that the compiler is free to hoist and start the load operations before the statement has been evaluated.
Use Root Signature 1.1 to specify static data and descriptors so that the driver can apply the best available shader optimizations.
Keep register use to a minimum. High register allocation can limit occupancy and may force the driver to spill registers to memory.
Prefer the use of gather instructions when loading single channel texture quads.
- This will cut down the expected latency by almost 4x compared to the equivalent operation constructed from consecutive sample instructions.
Prefer structured buffers over raw buffers.
- Structured buffers have stricter alignment requirements, which enables the driver to schedule more efficient load instructions.
Consider using numerical approximations or precomputed lookup tables of transcendental functions (exp, log, sin, cos, sqrt) in math-intensive shaders, for instance, physics simulations and denoisers.
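A precomputed lookup table could look roughly like the following CPU-side C++ sketch, which approximates sin with 256 entries and linear interpolation. The table size, function names, and interpolation scheme are assumptions for illustration. With step h = 2π/256, the linear interpolation error is bounded by h²/8, which is below 1e-4.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kTableSize = 256;  // assumed size; tune for accuracy
constexpr float kTwoPi = 6.28318530717958647692f;

using SinTable = std::array<float, kTableSize + 1>;

// Build the table once. The extra final entry lets the last interval
// interpolate without wrapping the index.
inline SinTable buildSinTable() {
    SinTable table{};
    for (std::size_t i = 0; i <= kTableSize; ++i) {
        table[i] = std::sin(kTwoPi * static_cast<float>(i) /
                            static_cast<float>(kTableSize));
    }
    return table;
}

// Approximate sin(x) by linear interpolation between table entries.
inline float lutSin(const SinTable& table, float x) {
    assert(std::isfinite(x));
    float turns = x / kTwoPi;
    turns -= std::floor(turns);  // wrap into [0, 1]
    const float pos = turns * static_cast<float>(kTableSize);
    std::size_t i = static_cast<std::size_t>(pos);
    if (i >= kTableSize) i = kTableSize - 1;  // guard against rounding to 1.0
    const float frac = pos - static_cast<float>(i);
    return table[i] + frac * (table[i + 1] - table[i]);
}
```

On a GPU the table would live in a buffer or texture, and the lookup only wins when the fetch is cheaper than the math it replaces, so compare both variants in a profiler.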
To promote a fast path in the TEX unit, with up to 2x speedup, use point filtering in certain circumstances:
- Low-resolution textures where point filtering is already an accurate representation.
- Textures that are being accessed at their native resolution.

Not recommended

Don’t assume that half-precision floats are always faster than full-precision floats, or the reverse.
- On NVIDIA Ampere GPUs, it’s just as efficient to execute FP32 as FP16 instructions. The overhead of converting between precision formats may result in a net loss.
- NVIDIA Turing GPUs may benefit from using FP16 math, as FP16 can be issued at twice the rate of FP32.

Compute shaders

Compute shaders are used for general-purpose computations, from data processing and simulations to machine learning.

Recommended

Consider using wave intrinsics over group shared memory when possible for communication across threads.
- Wave intrinsics don’t require explicit thread synchronization.
- Starting from SM 6.0, HLSL supports warp-wide wave intrinsics natively without the need for vendor-specific HLSL extensions. Consider using vendor-specific APIs only when the expected functionality is missing. For more information, see Unlocking GPU Intrinsics in HLSL.
- To increase atomic throughput, use wave instructions to coalesce atomic operations across a warp.
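To see why coalescing helps atomic throughput, the following CPU-side C++ sketch models a warp as 32 consecutive lanes and counts how many atomic operations reach a shared counter. It is a simulation under stated assumptions, not shader code: the inner loop stands in for what a single wave intrinsic such as `WaveActiveCountBits` would compute, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kWarpSize = 32;  // lanes per warp on NVIDIA GPUs

struct AppendResult {
    std::uint32_t total;      // final counter value
    std::uint32_t atomicOps;  // atomics issued to the shared counter
};

// Naive: every lane that passes the predicate issues its own atomic add.
inline AppendResult appendNaive(const std::vector<bool>& pass) {
    std::atomic<std::uint32_t> counter{0};
    std::uint32_t ops = 0;
    for (bool p : pass) {
        if (p) {
            counter.fetch_add(1);
            ++ops;
        }
    }
    return {counter.load(), ops};
}

// Coalesced: each warp counts its passing lanes and issues one atomic add
// for the whole warp (as a single elected lane would do in a shader).
inline AppendResult appendCoalesced(const std::vector<bool>& pass) {
    std::atomic<std::uint32_t> counter{0};
    std::uint32_t ops = 0;
    for (std::size_t base = 0; base < pass.size(); base += kWarpSize) {
        const std::size_t end = std::min(base + kWarpSize, pass.size());
        std::uint32_t laneCount = 0;  // stands in for a wave-wide bit count
        for (std::size_t lane = base; lane < end; ++lane) {
            laneCount += pass[lane] ? 1u : 0u;
        }
        assert(laneCount <= kWarpSize);
        if (laneCount > 0) {
            counter.fetch_add(laneCount);
            ++ops;
        }
    }
    return {counter.load(), ops};
}
```

For 128 lanes that all pass, the naive version issues 128 atomics and the coalesced version issues 4, with the same final counter value.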
To maximize cache locality and to improve L1 and L2 hit rate, try thread group ID swizzling for full-screen compute passes.
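One possible remapping is sketched below in C++ on the CPU for illustration. Consecutive linear group IDs fill a vertical strip that is `tileW` group columns wide before moving to the next strip, so groups launched close together touch neighboring screen regions. The function name, the strip layout, and the `tileW` parameter are assumptions, not a prescribed scheme; in a shader the same integer math would be applied to the group ID.

```cpp
#include <cassert>
#include <cstdint>

struct GroupId {
    std::uint32_t x;
    std::uint32_t y;
};

// Map a linear dispatch index to a swizzled 2D group ID.
// gridW x gridH is the dispatch size in thread groups.
inline GroupId swizzleGroupId(std::uint32_t linearId, std::uint32_t gridW,
                              std::uint32_t gridH, std::uint32_t tileW) {
    assert(tileW > 0);
    assert(linearId < gridW * gridH);
    const std::uint32_t fullTiles = gridW / tileW;      // complete strips
    const std::uint32_t groupsPerTile = tileW * gridH;  // groups per strip
    const std::uint32_t tile = linearId / groupsPerTile;
    if (tile < fullTiles) {
        // Inside a full strip: walk row by row, tileW groups per row.
        const std::uint32_t j = linearId % groupsPerTile;
        return {tile * tileW + j % tileW, j / tileW};
    }
    // Leftover strip when gridW is not a multiple of tileW.
    const std::uint32_t remW = gridW - fullTiles * tileW;
    const std::uint32_t j = linearId - fullTiles * groupsPerTile;
    return {fullTiles * tileW + j % remW, j / remW};
}
```

The mapping visits every group exactly once, so the dispatch size stays the same. A good strip width depends on the cache size and the access pattern of the pass, so treat `tileW` as a tuning parameter.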
A good starting point is to target a thread group size corresponding to between two and eight warps. For instance, thread group size 8x8x1 or 16x16x1 for full-screen passes. Make sure to profile your shader and tune the dimensions based on profiling results.
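The arithmetic behind those two sizes can be checked with a small C++ helper, assuming 32 threads per warp as on NVIDIA GPUs. The helper name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kWarpSize = 32;  // threads per warp on NVIDIA GPUs

// Warps needed for one thread group of x * y * z threads (rounds up).
inline std::uint32_t warpsPerGroup(std::uint32_t x, std::uint32_t y,
                                   std::uint32_t z) {
    assert(x > 0 && y > 0 && z > 0);
    return (x * y * z + kWarpSize - 1) / kWarpSize;
}

// warpsPerGroup(8, 8, 1)   == 2  (64 threads)
// warpsPerGroup(16, 16, 1) == 8  (256 threads)
```

Sizes that are not a multiple of 32 threads round up and leave lanes idle in the last warp.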

Not recommended

Do not make your thread group size difficult to scale per platform and GPU architecture.
- Specialization constants can be used in Vulkan to set the dimensions at pipeline creation time whereas HLSL requires the thread group size to be known at shader compile time.
Don’t ignore thread group launch latency.
- If your compute shader has early-out conditions that are expected to trigger in most cases, it might be better to choose larger thread group dimensions and cut down on the total number of thread groups launched.
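To get a feel for how group dimensions affect the number of launches, here is a small C++ sketch that counts thread groups for a full-screen pass. The 1920x1080 resolution is only an example, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Thread groups needed to cover `extent` pixels along one axis with
// `groupDim` threads per group along that axis (rounds up).
inline std::uint32_t groupsAlongAxis(std::uint32_t extent,
                                     std::uint32_t groupDim) {
    assert(groupDim > 0);
    return (extent + groupDim - 1) / groupDim;
}

// Total thread groups launched for a width x height full-screen pass.
inline std::uint32_t groupsLaunched(std::uint32_t width, std::uint32_t height,
                                    std::uint32_t groupDimX,
                                    std::uint32_t groupDimY) {
    return groupsAlongAxis(width, groupDimX) *
           groupsAlongAxis(height, groupDimY);
}

// groupsLaunched(1920, 1080, 8, 8)   == 32400
// groupsLaunched(1920, 1080, 16, 16) == 8160
```

Going from 8x8 to 16x16 cuts the number of launched groups by roughly 4x in this example, which is the effect the tip relies on when most groups exit early.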

Pixel shaders

Pixel shaders, also known as fragment shaders, are used to calculate effects on a per-pixel basis.

Recommended

Prefer the use of depth bounds test or stencil and depth testing over manual depth tests in pixel shaders.
Depth and stencil tests may discard entire 16×16 raster tiles down to individual pixels. Make sure that Early-Z is enabled.
Be mindful of the use patterns that may force the driver to disable Early-Z testing:
- Conditional z-writes such as clip and discard
  - As an alternative, consider using null blend ops.
- Pixel shader depth write
- Writing to UAV resources
Consider converting your full-screen pass to a compute shader if there’s a large difference in latency between warps.

Not recommended

Don’t use raster order view (ROV) techniques pervasively.
- Guaranteeing order doesn’t come for free.
- Always compare with alternative approaches like advanced blending ops and atomics.

Vertex shaders

Vertex shaders are used to calculate effects on a per-vertex basis.

Geometry, domain, and hull shaders

Geometry, domain, and hull shaders are used to control, evaluate, and generate geometry, enabling tessellation to generate surfaces and objects dynamically.

Recommended

Replace the geometry, domain, and hull shaders with the mesh shading capabilities introduced in NVIDIA Turing.
Enable the fast geometry path with the following configuration:
- Fixed topology: Neither an expansion nor a reduction in the number of vertices.
- Fixed primitive type: The input primitive type is equal to the output primitive type.
- Immutable per-vertex attributes: The application cannot change the vertex attributes and can only copy them from the input to the output.
- Mutable per-primitive attributes: The application can compute a single value for the whole primitive, which then is passed to the fragment shader stage. For example, it can compute the area of the triangle.

Acknowledgments

Thanks to Ryan Prescott, Ana Mihut, Katherine Sun, and Ivan Fedorov.
