Advanced API Performance: Variable Rate Shading

This post covers best practices for variable rate shading on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

Variable rate shading (VRS) is a graphics feature allowing applications to control the frequency of pixel shader invocations independent of the resolution of the render target. It is available in both D3D12 and Vulkan.

Defining the VRS rate

Depending on your API of choice, you may have up to three options for defining the VRS rate:

A per-draw call rate
A per-primitive rate
A lookup from a screen-space shading rate image that can define different rates in different regions of the screen

Per-draw call

This option is the simplest to implement: it requires only a few extra API calls in the command stream and has no additional dependencies. It is also the coarsest granularity, so think of it as the “broadest brush.”

Per-primitive

This option requires augmenting geometric assets, and thus will likely need changes in how the assets are generated in the art pipeline or loaded and prepared by the application. Use the knowledge of what you are drawing to finely tailor the shading rate to your needs.

Screen-space lookup

This option requires rendering pipeline changes, but generally no asset or art changes.

The most difficult, and most interesting, question is how to generate the shading rate image:

NVIDIA Adaptive Shading (NAS), for example, uses motion vector data plus color variance data to determine what parts of the image need less detail.
To be useful, the performance cost of the algorithm to generate the shading rate image must be less than the performance savings of VRS.
Try to use inexpensive algorithms on data that your rendering engine already generates (for example, motion vectors or previous frame results).
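
As a rough illustration of the last point, here is a minimal CPU-side sketch that derives one shading rate per tile from a motion vector buffer the engine is assumed to already produce. This is not the NAS algorithm. The thresholds, the `MotionVector` layout, and the `BuildShadingRateImage` name are hypothetical, and the tile size is a parameter because the supported value is reported by the API and hardware. The numeric rate values mirror the `D3D12_SHADING_RATE` enum. A real implementation would more likely run as a compute pass on the GPU.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Values mirror D3D12_SHADING_RATE_1X1 / _2X2 / _4X4.
enum ShadingRate : std::uint8_t {
    Rate1x1 = 0x0,
    Rate2x2 = 0x5,
    Rate4x4 = 0xA,
};

// Assumed layout: screen-space motion in pixels per frame.
struct MotionVector { float x, y; };

// Hypothetical thresholds in pixels per frame; tune them per title.
constexpr float kMediumMotion = 4.0f;
constexpr float kFastMotion = 16.0f;

// Returns one shading rate per tile, in row-major tile order.
std::vector<std::uint8_t> BuildShadingRateImage(
    const std::vector<MotionVector>& motion, int width, int height, int tileSize)
{
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::uint8_t> rates(
        static_cast<std::size_t>(tilesX) * tilesY, Rate1x1);

    for (int ty = 0; ty < tilesY; ++ty) {
        for (int tx = 0; tx < tilesX; ++tx) {
            // Track the slowest pixel in the tile: coarsen only when
            // everything in the tile is moving fast (conservative choice).
            float minSpeed = std::numeric_limits<float>::max();
            const int yEnd = std::min((ty + 1) * tileSize, height);
            const int xEnd = std::min((tx + 1) * tileSize, width);
            for (int y = ty * tileSize; y < yEnd; ++y) {
                for (int x = tx * tileSize; x < xEnd; ++x) {
                    const MotionVector& mv =
                        motion[static_cast<std::size_t>(y) * width + x];
                    minSpeed = std::min(minSpeed, std::hypot(mv.x, mv.y));
                }
            }

            std::uint8_t rate = Rate1x1;
            if (minSpeed >= kFastMotion) {
                rate = Rate4x4;
            } else if (minSpeed >= kMediumMotion) {
                rate = Rate2x2;
            }
            rates[static_cast<std::size_t>(ty) * tilesX + tx] = rate;
        }
    }
    return rates;
}
```

Taking the minimum speed per tile is a deliberately cautious design choice: a tile that contains any slow-moving detail stays at full rate. Whether a pass like this pays for itself still has to be measured against the savings it produces.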

Recommended

Continue using existing techniques for graphics optimization. VRS does not fundamentally change how the graphics pipeline is used, so existing advice about graphics rendering continues to apply. The main difference under VRS is that pixel shader work may make up a smaller share of the frame than when rendering normally.
Look to cross-apply techniques and considerations from working with multisample anti-aliasing (MSAA). VRS operates on similar principles to MSAA, and as such, it has limited utility in deferred renderers. VRS cannot help with improving the performance of compute passes.

Not recommended

Avoid 4×4 mode in cases where warp occupancy is the limiting factor.
- Generally speaking, pixel shading workloads scale linearly with the variable shading mode. For example, 1×2/2×1 mode has ½ as many PS invocations, 2×2 mode has ¼ as many, and so on.
- The one exception is 4×4 mode, which would ideally have 1/16 as many pixel shader invocations. However, in 4×4 mode, the rasterizer cannot span the whole pixel range needed to generate a full 32-thread warp all at once. As a result, warps in 4×4 mode are only half active (16 threads instead of 32).
- If warp occupancy is a limiting factor, this means that 4×4 mode may not have any performance benefit over 4×2/2×4 mode, since the total number of warps is the same.
Use caution and check performance when using VRS with blend-heavy techniques.
- Although VRS may decrease the number of running pixel shaders, blending still runs at full rate, that is, once for every individual sample in the render target. Semi-transparent volumetric effects are a potential candidate for VRS due to often being visually low-frequency, but if the workload is already ROP-limited (blending), using VRS does not change that limitation and thus may not result in any appreciable performance improvement.
Use care with centroid sampling! Centroid sample selection across a “coarse pixel” may work in counter-intuitive ways! Make sure that you are familiar with your API’s specifications.
Do not modify the output depth value from the pixel shader. If the pixel shader modifies depth, VRS is automatically disabled.
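
The warp-occupancy point about 4×4 mode can be checked with simple arithmetic. The sketch below is an idealized model and not a description of the hardware scheduler: it assumes full coverage, invocation counts that scale exactly with the coarse pixel area, and the half-active 4×4 warps described above. The `VrsMode` struct and the function names are hypothetical.

```cpp
#include <cstdint>

// Coarse pixel size in render-target pixels, e.g. {2, 2} for 2x2 mode.
struct VrsMode {
    int width;
    int height;
};

// Ideal pixel shader invocation count for a fully covered region.
constexpr std::int64_t Invocations(std::int64_t pixels, VrsMode m) {
    return pixels / (static_cast<std::int64_t>(m.width) * m.height);
}

// Active threads per 32-thread warp: 4x4 mode only fills half the warp.
constexpr int ActiveThreadsPerWarp(VrsMode m) {
    return (m.width == 4 && m.height == 4) ? 16 : 32;
}

// Number of warps needed to run all invocations (rounded up).
constexpr std::int64_t Warps(std::int64_t pixels, VrsMode m) {
    const std::int64_t active = ActiveThreadsPerWarp(m);
    return (Invocations(pixels, m) + active - 1) / active;
}
```

For a 1920×1080 target (2,073,600 pixels), this model gives 259,200 invocations in 8,100 warps for 4×2 mode, and 129,600 invocations in 8,100 warps for 4×4 mode. The invocation count halves, but the warp count does not, which is why 4×4 may show no gain when warp occupancy is the bottleneck.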