Advanced Kernel Profiling with the Latest Nsight Compute

NVIDIA Nsight Compute is an interactive kernel profiler for CUDA applications. It provides detailed performance metrics and API debugging through a user interface and a command-line tool. Nsight Compute 2022.1 updates the data collection modes, enabling new use cases and options for performance profiling.

What’s New

Range Replay

This release of Nsight Compute extends the existing replay modes with the highly requested Range Replay feature. Range Replay captures and replays complete ranges of CUDA API calls and kernel launches within the profiled application. Metrics are associated with the entire range rather than with individual kernels. This allows the tool to execute kernels without serialization and to profile kernels that must run concurrently for correctness or performance reasons. A range consists of a start marker and an end marker, and includes all CUDA API calls and kernels launched between these markers from any CPU thread.

Range markers can be defined using either:

  • Profiler Start/Stop API
  • NVTX Ranges

For complete details, see the “Replay” section in Nsight Compute’s Kernel Profiling Guide.

This is a flowchart diagram of how the Range Replay feature in Nsight Compute captures a range of CUDA API calls and collects performance information to display for the profiled application.
Figure 1. Visualization of Range Replay: After capturing the range, each pass collects performance information for the entire range.
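
As a rough sketch of how a Range Replay run might be launched, the helper below assembles an `ncu` command line with `--replay-mode range`. It assumes the application already marks its range, either by calling `cudaProfilerStart()` and `cudaProfilerStop()` or with an NVTX range. The helper name, the application path `./my_app`, the report name, and the NVTX expression `my_range` are all hypothetical placeholders. The exact NVTX filter-expression syntax should be taken from the Kernel Profiling Guide rather than from this example.

```python
import shlex


def build_range_replay_command(app_args, report_name="range_report", nvtx_include=None):
    """Assemble an ncu command line that profiles with Range Replay.

    app_args     -- the application and its arguments (hypothetical here)
    report_name  -- output report name passed to -o (hypothetical default)
    nvtx_include -- optional NVTX filter expression, passed through unchanged;
                    when None, ranges are assumed to be defined in the
                    application with cudaProfilerStart()/cudaProfilerStop()
    """
    if not app_args:
        raise ValueError("app_args must name the application to profile")

    cmd = ["ncu", "--replay-mode", "range", "-o", report_name]
    if nvtx_include is not None:
        # Enable NVTX support and restrict profiling to the given range.
        cmd += ["--nvtx", "--nvtx-include", nvtx_include]
    return cmd + list(app_args)


if __name__ == "__main__":
    # Range defined with the Profiler Start/Stop API inside the application.
    print(shlex.join(build_range_replay_command(["./my_app"])))
    # Range defined with an NVTX range (placeholder expression).
    print(shlex.join(build_range_replay_command(["./my_app"], nvtx_include="my_range")))
```

The resulting list can be passed to `subprocess.run` on a machine where Nsight Compute is installed. Keeping the command as a list avoids shell-quoting problems when the application takes its own arguments.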

Memory Analysis

When profiling on A100, a new L2 Cache Eviction Policies table in the Memory Analysis section helps you understand the number of accesses and achieved hit rates by the various cache eviction policies. In the same section, the L2 Cache table now has a new ECC row to show traffic created from enabling hardware Error Correction Code on the GPU.

This is a screenshot of Nsight Compute showing tables with ECC and L2 cache eviction policy information using the Memory Workload Analysis feature.
Figure 2. Improvements to the Memory Workload Analysis tables: ECC and L2 cache eviction policy information.

Guided Analysis

Nsight Compute now makes it easier to select initial analysis targets in multiresult collection by dynamically selecting between the Summary and Details pages when opening a report. Rules were extended to detect non-fused floating-point instructions as an optimization opportunity. Last, but not least, when the Uncoalesced Memory Access rules are triggered, they show a table of the five most valuable instances, making it easier to inspect and resolve them on the Source page.

This figure is a screen capture of Nsight Compute showing a summary page of multiresult reports, with the ability to sort and optimize the order of presentation.
Figure 3. Opening multiresult reports now shows the Summary page, allowing you to sort results and decide on the optimization order.
This is a screen capture of Nsight Compute displaying how the two Uncoalesced Memory Access rules are better presented in a concise and sorted format.
Figure 4. Both Uncoalesced Memory Access rules present results in a more concise and sorted format.
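
To give a feel for what an uncoalesced access costs, here is a deliberately simplified model. It is not how Nsight Compute computes its rule output. It counts how many 32-byte memory sectors a single warp touches when thread `t` reads element `t * stride`. The assumptions are a warp of 32 threads, 32-byte sectors, naturally aligned elements, and a sector-aligned base address. The function names are made up for this illustration.

```python
SECTOR_BYTES = 32  # assumed sector size in bytes
WARP_SIZE = 32     # assumed number of threads per warp


def sectors_touched(stride_elems, elem_bytes=4, base_addr=0):
    """Count distinct sectors one warp touches for a strided access.

    Thread t accesses the element at index t * stride_elems. Elements are
    assumed to be naturally aligned, so each element lies in one sector.
    """
    sectors = {
        (base_addr + t * stride_elems * elem_bytes) // SECTOR_BYTES
        for t in range(WARP_SIZE)
    }
    return len(sectors)


def ideal_sectors(elem_bytes=4):
    """Minimum number of sectors needed if the warp's data were contiguous."""
    total_bytes = WARP_SIZE * elem_bytes
    return (total_bytes + SECTOR_BYTES - 1) // SECTOR_BYTES


if __name__ == "__main__":
    for stride in (1, 2, 8, 32):
        print(f"stride {stride}: {sectors_touched(stride)} sectors, "
              f"ideal {ideal_sectors()}")
```

In this model, a unit stride with 4-byte elements touches 4 sectors, which matches the ideal, while a stride of 8 elements or more touches 32 sectors for the same amount of useful data. Large gaps of that kind between touched and ideal sectors are the type of pattern worth inspecting on the Source page.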

Additional Improvements

Further improvements include an Occupancy Calculator auto-update, a new ‘Thread Instructions Executed’ metric, register name tooltips for the Register Dependency columns on the Source page, and NVLink updates.

At GTC in November of 2021, we released insightful assets showcasing Nsight tools capabilities: