Advanced API Performance: Debugging

NVIDIA offers a large suite of tools for graphics debugging, including NVIDIA Nsight Systems for CPU debugging and Nsight Graphics for GPU debugging. Nsight Aftermath is useful for analyzing crash dumps.

Recommended

  • Always check the validation layers and make sure they don’t output any errors.
  • Use Nsight Aftermath for detailed DirectX 12 or Vulkan GPU exception debugging.
    • Using the Nsight Aftermath Monitor is an easy way to get started without the need for code integration. 
    • For even more control over the GPU crash dump functionality, consider using the Aftermath SDK to integrate the crash dump capabilities into your own code.
      • Crash dump generation can always be enabled since there is no associated runtime cost. As noted later, debug checkpoints can have measurable CPU overhead and should not be used in shipping applications.
      • You can use the callbacks provided in the API to save the crash dumps to the local disk or push them to the cloud.
    • For more information, see the Nsight Aftermath SDK product page.
  • Isolate problems using debug checkpoints.
    • These APIs enable inserting checkpoints in the GPU command stream, making it possible to narrow down crashes to certain subsections of the command stream.
    • Use the API supported by Nsight Aftermath. For more information and samples, see the /NVIDIA/nsight-aftermath-samples GitHub repo.
    • Alternatively, use the DirectX 12 cross-vendor solution:
      • Use ID3D12GraphicsCommandList2::WriteBufferImmediate or DRED.
      • This approach isn’t supported in conjunction with Nsight Aftermath, so avoid mixing the two.
    • Add a runtime flag in the engine to toggle this functionality.
      • These markers are far from free and have a runtime cost associated with them (serializing GPU, CPU call overhead, and the time to capture call-stacks).
      • Keeping them enabled by default could have a severe performance cost.
      • See disadvantages in the Not recommended section.
  • Build shaders with debug info.
    • Compile with /Zi to embed debug info into the shader binary.
    • This is helpful when using debugging tools like NVIDIA Nsight Graphics.
    • Nsight Aftermath can also use this information to give you source-level GPU crash info.
  • Use Nsight Aftermath crash dumps to identify the type of error that occurs during a crash.
    • Device hung:
      • These can occur when a single command list takes longer than a few seconds to execute.
      • Microsoft Windows resets the driver after a few seconds of no apparent response from the driver and GPU (timeout detection and recovery, or TDR).
      • This can also happen under extreme workloads (massive pixel overdraw, or degraded ray tracing acceleration structures).
    • Page faults:
      • These are caused by invalid memory accesses: either an out-of-bounds read/write or a resource that is not valid anymore.
    • For more information, see How to Set Up and Inspect GPU Crash Dumps.
  • Generic debugging advice for graphics and compute-related problems:
    • Check whether all referenced memory is valid and correct at all times when the GPU accesses the data.
    • Check whether descriptors point to the right resources, which are fully allocated and initialized.
    • Check that data reads and writes stay within bounds.
    • Use debug checkpoints and GPU crash dump functionality from Nsight Aftermath to narrow down the location of the crash.
    • DirectX 12: The debug layer could help here, but for these problems, GPU-based validation must be enabled, which generally makes the application run extremely slow with complex scenes. It could still be useful for unit or regression testing.
  • Generic debugging advice for NVIDIA RTX-related problems:
    • Check whether the input vertex or index data are all valid.
      • Invalid indices could crash the GPU builder kernel. 
      • Invalid vertices could affect the acceleration structures and make performance extremely slow.
    • Degenerate triangles, or tricks to disconnect triangles that work in a rasterizer, do not work as intended in an acceleration structure and can cause big problems. Check that no such tricks are being employed, for example, to disconnect or delete geometry. Use valid geometry exclusively.
    • Check whether memory is all still valid at the moment data is being referenced by builder or ray tracing kernels.
    • Check whether all textures and buffers used by ray tracing kernels are all valid.
    • Check whether descriptors are correct and the shader binding tables are valid.
    • Debug checkpoints are not useful for debugging ray tracing workloads because the indirections inside the RT kernels can touch thousands of shader permutations. At most, they can tell you whether the crash happened in the builder or the ray tracing kernel.
    • Instead, use Nsight Aftermath crash debugging to get an approximate idea of the crash:
      • Page fault: Memory-related issue, usually an out-of-bounds read/write or an attempt to access a resource that was removed or not copied to the GPU yet.
      • GPU hang (TDR): Infinite loops, too complex shading, or too many rays.
    • Having a way to simplify code and binding requirements could be useful, like disabling textures or reducing shader permutations.
      • For example, have a debug view showing barycentrics only (no shader binding table requirements).
        • Check for broken-looking geometry or spots where performance gets severely degraded due to corrupted triangle data.
        • Visually verify output from dynamic sources, such as deformed geometry or skinned meshes in a ray tracing-only view.
      • Being able to fully disable dynamic geometry can help to isolate these kinds of issues as well.
  • Simplify debugging by adding flags in the application to the following:
    • Serialize the GPU/CPU at the queue level.
    • Serialize the GPU/CPU at the command list level.
    • Disable async compute.
    • Disable async copies.
    • Add full barriers between compute, dispatch, and copy calls in the command lists (NULL UAV/memory barrier).
    • Do anything else you can to remove parallelism. It’s much harder to debug and pinpoint where a problem comes from when the GPU is running multiple workloads at the same time.
    • Don’t keep any of these suggestions enabled by default. They should be strictly debug-only flags. Reducing parallelism significantly degrades performance.
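
The debug-only switches listed above could be gathered into one settings structure that defaults to full parallelism. The following is a minimal sketch under stated assumptions, not engine or SDK code: the `DebugFlags` struct, the `parseDebugFlags` function, and the switch names such as `--serialize-queues` are all hypothetical names chosen for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical debug-only switches. Every switch defaults to off so that
// normal runs keep full GPU/CPU parallelism.
struct DebugFlags {
    bool serializeQueues = false;       // sync GPU/CPU at the queue level
    bool serializeCommandLists = false; // sync GPU/CPU at the command list level
    bool disableAsyncCompute = false;   // do not overlap async compute work
    bool disableAsyncCopies = false;    // do not overlap async copy work
    bool fullBarriers = false;          // full barrier between GPU calls

    // True if any switch that reduces parallelism is active.
    bool anyEnabled() const {
        return serializeQueues || serializeCommandLists ||
               disableAsyncCompute || disableAsyncCopies || fullBarriers;
    }
};

// Parses the hypothetical command-line switches. Unknown arguments are
// ignored, so the result stays at its all-off defaults unless a switch
// is passed explicitly.
inline DebugFlags parseDebugFlags(const std::vector<std::string>& args) {
    DebugFlags flags;
    for (const std::string& arg : args) {
        if (arg == "--serialize-queues") {
            flags.serializeQueues = true;
        } else if (arg == "--serialize-command-lists") {
            flags.serializeCommandLists = true;
        } else if (arg == "--disable-async-compute") {
            flags.disableAsyncCompute = true;
        } else if (arg == "--disable-async-copies") {
            flags.disableAsyncCopies = true;
        } else if (arg == "--full-barriers") {
            flags.fullBarriers = true;
        }
    }
    return flags;
}
```

An application could build the argument list with `std::vector<std::string>(argv + 1, argv + argc)` and log a warning whenever `anyEnabled()` returns true, which makes it harder to leave one of these switches on by accident.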

Not recommended

  • Using excessive debug checkpoints.
    • They have a non-negligible CPU and GPU performance cost.
    • Use them sparingly. Aim for ~100 per frame, preferably less.
    • It is best not to enable them for end users at all (keep them for developer or QA builds), or to enable them only after a GPU hang has been detected. You could also expose them as an option that the end user can toggle.
  • Assuming that a CPU call stack will tell you anything about a GPU problem.
    • Crashes with a call stack pointing to the driver usually manifest as a random graphics API call failing due to an internal device lost event.
    • Use the Nsight Aftermath crash dump or debug checkpoints to pinpoint where the fault occurs.
  • Testing on a single machine only, which cannot rule out the effect of bad hardware.
    • Corrupted memory (CPU or GPU), overclocking, and bad cooling can all contribute to random faults. Nsight Aftermath has no way of differentiating these from valid errors.
    • A telltale sign is crashes that happen randomly, without any pattern across the GPU, on one machine but not on another with similar specifications.
    • Try to validate results on more than one machine with similar hardware, software, and driver versions.
  • Permitting users to run with extremely outdated NVIDIA drivers.
    • Outdated drivers can have unexpected behaviors and are harder to get reliable crash dumps from.
    • Find a driver version that works reliably. Show a popup that says the driver is out of date when it is earlier than that version. Don’t stop users from running the application or game, but discourage them from doing so as it can cause system instability.
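
The driver check described in the last item can be sketched as a comparison against a minimum known-good version that warns but never blocks. This is an illustrative sketch only: `DriverVersion`, `parseDriverVersion`, `DriverCheck`, and `checkDriver` are hypothetical names, the "major.minor" string format is an assumption, and how the installed version string is obtained from the system is left out.

```cpp
#include <sstream>
#include <string>
#include <tuple>

// Hypothetical representation of a driver version written as "major.minor".
struct DriverVersion {
    int majorNumber;
    int minorNumber;
};

inline bool operator<(const DriverVersion& a, const DriverVersion& b) {
    return std::tie(a.majorNumber, a.minorNumber) <
           std::tie(b.majorNumber, b.minorNumber);
}

// Parses "major.minor" into out. Returns false on malformed input,
// negative numbers, or trailing characters.
inline bool parseDriverVersion(const std::string& text, DriverVersion& out) {
    std::istringstream in(text);
    char dot = 0;
    DriverVersion v{0, 0};
    if (!(in >> v.majorNumber >> dot >> v.minorNumber)) return false;
    if (dot != '.' || v.majorNumber < 0 || v.minorNumber < 0) return false;
    if (in.peek() != std::char_traits<char>::eof()) return false;
    out = v;
    return true;
}

// Result of the check. No value means "refuse to launch": the caller is
// expected to show a warning at most and then continue.
enum class DriverCheck { Ok, WarnOutdated, Unknown };

inline DriverCheck checkDriver(const std::string& installed,
                               const DriverVersion& minimumKnownGood) {
    DriverVersion v{0, 0};
    if (!parseDriverVersion(installed, v)) return DriverCheck::Unknown;
    return (v < minimumKnownGood) ? DriverCheck::WarnOutdated : DriverCheck::Ok;
}
```

On `WarnOutdated`, the application would show the out-of-date popup and let the user continue. Treating an unparseable version as `Unknown` rather than as outdated avoids warning users whose driver simply reports its version in a different format.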

Acknowledgments

Thanks to Patrick Neill, Jeffrey Kiel, Justin Kim, Andrew Allan, and Louis Bavoil for their help with this post.
