
Powerful Shader Insights: Using Shader Debug Info with NVIDIA Nsight Graphics

As ray tracing becomes the predominant rendering technique in modern game engines, a single GPU RayGen shader can now perform most of the light simulation of a frame. To manage this level of complexity, it becomes necessary to observe a decomposition of shader performance at the HLSL or GLSL source-code level. As a result, shader profilers are now a must-have tool for optimizing ray tracing.

In this post, I’ll show you how to use the NVIDIA Nsight Graphics GPU Trace Profiler to analyze low-level shader performance, and why it’s important to enable the DirectX Shader Compiler (DXC) debugging information option (-Zi) during shader compilation. Throughout the post, I’ll use the Path Tracing SDK sample as the example application.

GPU Trace without shader debugging information

The Path Tracing SDK sample compiles all of its shaders with embedded shader debugging information (-Zi) as a DXC command-line option within the CMake file. To simulate what would happen without using -Zi, I removed -Zi in the CMake file and did a rebuild in the Visual Studio solution.

I also generated a GPU Trace from the modified Path Tracing SDK demo app, with the default settings (the Real-Time Shader Profiler enabled and locked GPU clocks). Figure 1 shows how this looks in the tool with the PathTrace marker selected.

Screenshot of the Nsight Graphics GPU Trace Profiler, with metric graphs on top and the real-time shader profiler panel on the bottom right.
Figure 1. The NVIDIA Nsight Graphics GPU Trace default view without shader debugging information

The hot spots show only DXIL links, with no HLSL. When you click the DXIL link for the top hot spot, the tool jumps to a line of DXIL and shows the Sample% and stall-reason percentages for that line. It’s hard to tell which line of HLSL this DXIL line maps to (Figure 2).

Screenshot of GPU Trace without shader debugging information.
Figure 2. Without shader debugging information, only DXIL is available in the Shader Source section

Looking at the Shader Pipelines tab in the bottom panel (Figure 3), notice that the Correlation column contains the warning, “Shader bytecode doesn’t contain debug info, recompile the shader with debug info enabled to solve the issue.”

Screenshot with tooltip that reads, “Shader bytecode does not contain debugging information. Recompile the shader with debug info enabled to solve this issue.”
Figure 3. Alerts about missing shader debugging information in the Shader Pipelines section

The GPU Trace Profiler gathers its shader profiling data at the SASS level, which is the low-level instruction set that executes on the GPU. It can always correlate the SASS-level profiling data to the DXIL (or SPIR-V) intermediate language. 

However, for the GPU Trace Profiler to correlate the profiling data to the high-level language (HLSL or GLSL), the DXIL must contain shader debugging information. Specifically, it must contain HLSL source code and line tables that correlate HLSL lines to DXIL lines. See the Shader Compilation section of the NVIDIA Nsight Graphics User Guide for how to enable shader debugging information depending on how the HLSL is compiled. 
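To make the role of the line tables concrete, here is a minimal Python sketch of the idea. It is not how Nsight Graphics is implemented; the function name, the sample counts, and the file and line numbers are all hypothetical. It only shows that per-instruction samples can be rolled up to source lines when a line table exists, and that they cannot be attributed to any source line when it does not.

```python
from collections import defaultdict

def samples_per_source_line(il_samples, line_table):
    """Roll intermediate-language sample counts up to source lines.

    il_samples: {il_line_number: sample_count}
    line_table: {il_line_number: (source_file, source_line)}, empty when
                the shader was compiled without debug info (hypothetical model).
    """
    totals = defaultdict(int)
    for il_line, count in il_samples.items():
        # None means "no source location known for this IL line".
        source_location = line_table.get(il_line)
        totals[source_location] += count
    return dict(totals)

# Hypothetical data: three DXIL lines, two of which come from the same HLSL line.
il_samples = {10: 40, 11: 25, 12: 35}
line_table = {
    10: ("NoiseAndSequences.hlsli", 120),
    11: ("NoiseAndSequences.hlsli", 120),
    12: ("PathTracer.hlsl", 57),
}

with_debug_info = samples_per_source_line(il_samples, line_table)
without_debug_info = samples_per_source_line(il_samples, {})
```

With the line table, the samples group under two HLSL locations. Without it, every sample lands under `None`, which mirrors the DXIL-only view in Figure 2.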

During development, I recommend embedding the shader debugging information into the DXIL blobs. In other words, use -Zi -Qembed_debug instead of shader PDBs, so that your GPU Trace files are standalone. For details, see microsoft/DirectXShaderCompiler on GitHub.

With this approach, you don’t need to manually keep track of which shader PDBs go with which trace. If embedding the shader debugging information makes the DXIL blobs too large or causes issues, you can also configure DXC to write the shader debugging information as separate shader PDB files into a folder using the -Zi and -Fd DXC options. In this case, you’ll need to configure the search path for Separate Shader Debug Info in the Nsight Graphics Options before loading the GPU Trace file.
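The two configurations described above can be sketched as a small Python helper that assembles the DXC argument list for a build script. This is an illustration only: the helper name, the file names, and the lib_6_3 target profile are assumptions, and the Path Tracing SDK sets its flags in CMake rather than through a script like this. The sketch also assumes the behavior described in the DXC help text, where a -Fd path ending in a backslash is treated as a directory.

```python
def dxc_args(source, output, profile="lib_6_3", embed_debug=True, pdb_dir=None):
    """Build a DXC argument list with shader debug info enabled (hypothetical helper)."""
    args = ["dxc", "-T", profile, "-Fo", output, "-Zi"]
    if embed_debug:
        # Keep the debug info inside the DXIL blob so the trace is standalone.
        args.append("-Qembed_debug")
    else:
        if pdb_dir is None:
            raise ValueError("pdb_dir is required when debug info is not embedded")
        # Assumption: a trailing backslash asks DXC to auto-name the PDB in this folder.
        args += ["-Fd", pdb_dir.rstrip("\\/") + "\\"]
    args.append(source)
    return args

# Hypothetical file names, for illustration only.
embedded = dxc_args("PathTracer.hlsl", "PathTracer.dxil")
separate = dxc_args("PathTracer.hlsl", "PathTracer.dxil",
                    embed_debug=False, pdb_dir="ShaderPDBs")
```

The resulting lists could be passed to `subprocess.run`. In the separate-PDB case, the `ShaderPDBs` folder is what you would add to the Separate Shader Debug Info search path.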

GPU Trace with shader debugging information

Figure 4 shows the Shader Pipelines section after I rebuilt the Visual Studio solution of the demo app with the original DXC command-line arguments, -Zi -Qembed_debug, for the ray-tracing shaders, and captured a new trace with the GPU Trace Profiler.

Screenshot showing that the Correlation column now contains HLSL/DXIL green icons for all shaders, except for the unknown samples.
Figure 4. All shaders have HLSL correlation in the Shader Pipelines section after enabling -Zi

By clicking on the Bottom-Up Calls tab, you can now easily see HLSL functions that are expensive, what type of GPU work they do, and what is limiting them at a high level (Figure 5). Note that the Bottom-Up Calls tab is absent when no debugging symbols are loaded, along with the Top-Down Calls tab.

Screenshot showing HLSL function names along with their corresponding filenames and line numbers, and their percentage of total samples and issue stall reasons.
Figure 5. The Bottom-Up Calls at the HLSL level

In this example, the third HLSL function in the Bottom-Up Calls section is a random number generator, bhos_sobol, which accounts for 4.5% of the PathTrace marker. Its top issue stall reason is NOINST (typically, waiting on an instruction cache miss), at 60% of the HLSL function.

In Figure 6, the HLSL view shows that the body of the unrolled loop is where all of the latency is produced, and the top SM issue stall reason is NOINST. This can be explained by the presence of an unrolled loop with many iterations, which produces many hardware instructions and puts pressure on the instruction caches of the GPU.

Screenshot of the HLSL implementation of the bhos_sobol function.
Figure 6. The profiled HLSL for the bhos_sobol function

As the instruction cache is limited, one possible optimization strategy for this function is to precompute its output values in a structured buffer. SDK release v1.2.0 already supports this path, although it is not enabled by default.

// In Config.h
#define  USE_PRECOMPUTED_SOBOL_BUFFER       1               // see NoiseAndSequences.hlsli - still experimental, faster but lower quality and more RAM - not a clear win

// In NoiseAndSequences.hlsli
uint bhos_sobol(uint index, uniform uint dimension)
{
    return SOBOL_PRECOMPUTED_BUFFER[ (index % SOBOL_PRECOMPUTED_INDEX_COUNT) + SOBOL_PRECOMPUTED_INDEX_COUNT * dimension ];
}
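To illustrate the lookup layout used by the precomputed path, here is a minimal host-side sketch in Python. It is not the SDK's code: `build_precomputed_table`, `lookup`, and `stand_in_sequence` are hypothetical names, and the stand-in generator is a simple integer hash rather than the real bhos_sobol sequence. Only the indexing expression is taken from the HLSL snippet above.

```python
def stand_in_sequence(index, dimension):
    """Placeholder generator (NOT Sobol); stands in for the real bhos_sobol."""
    return (index * 2654435761 + dimension * 40503) & 0xFFFFFFFF

def build_precomputed_table(sample_fn, index_count, dimension_count):
    """Fill a flat table laid out as one block of index_count values per dimension."""
    table = [0] * (index_count * dimension_count)
    for dimension in range(dimension_count):
        for index in range(index_count):
            table[index + index_count * dimension] = sample_fn(index, dimension)
    return table

def lookup(table, index, dimension, index_count):
    """Same indexing as the HLSL snippet: wrap the index, then offset by dimension."""
    return table[(index % index_count) + index_count * dimension]

INDEX_COUNT = 8       # hypothetical sizes, far smaller than a real table
DIMENSION_COUNT = 4
table = build_precomputed_table(stand_in_sequence, INDEX_COUNT, DIMENSION_COUNT)

in_range = lookup(table, 5, 2, INDEX_COUNT)
wrapped = lookup(table, 13, 2, INDEX_COUNT)  # 13 % 8 == 5, so this repeats index 5
```

The lookup replaces the unrolled loop with a single buffer read, at the cost of storing `index_count * dimension_count` values. Because of the modulo, samples repeat once the index exceeds the table size.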

Conclusion

When using the Real-Time Shader Profiler in NVIDIA Nsight Graphics GPU Trace, I recommend verifying that all of your shaders have HLSL Correlation enabled in the Shader Pipelines section of the bottom-right panel. If you see shaders with significant Samples% missing HLSL correlation, first recompile these shaders with debugging information enabled, as documented in the NVIDIA Nsight Graphics User Guide. Then capture a new GPU Trace. 

Continuing to work with shader profiler data that lacks HLSL correlation wastes time compared to re-running with HLSL correlation enabled. Shader debugging information is also important for other tools, such as the Nsight Graphics Shader Debugger, where it enables you to properly debug your shader code, and NVIDIA Nsight Aftermath.

Acknowledgments

For their contributions to this post, I’d like to thank Aurelio Reis, Avinash Baliga, Axel Mamode, Robert Jensen, Iain Cantlay, Filip Strugar, Ivan Fedorov, and Juha Sjoholm.
