PerfWorks has now been replaced with an updated C++ API and a new name: NVIDIA® Nsight™ Perf SDK. You can find more information about it and download it from here. The new Nsight Perf SDK supports NVIDIA GPU architectures from Volta through the latest Ampere chips. The API is easy to use and integrate into your application.


PerfWorks is a C++ API for GPU performance analysis on NVIDIA GPUs. It lets developers instrument an application and access low-level performance metrics. With these metrics, developers can quickly recognize the top performance limiters and change the application to remove the associated bottlenecks. The metrics can be collected over user ranges, draw calls or dispatches. There are four metric categories: cumulative work, timing, activity and throughput.

  • Cumulative work includes work achieved over time, such as the number of shaded pixels.
  • Timing refers to time duration or average clock rate calculations that make it easier to understand a scenario.
  • Activity tells you where the GPU is stalled and where the GPU is active.
  • Throughput refers to the rate of operations, e.g. the number of instructions executed per unit of time.

PerfWorks is the successor to NVIDIA’s PerfKit. PerfWorks adds range-based profiling and supports next-generation APIs that feature multi-threaded GPU work submission. GPU generations supported by PerfWorks include Maxwell, Pascal and future generations when available. PerfWorks is used by NVIDIA internal tools including Tegra Graphics Debugger, Nsight Visual Studio Edition and other future products.

Key Features

  • Support for collecting Graphics Metrics. See Figure 1 below.
    Figure 1

  • Support for collecting Compute Metrics. See Figure 2 below.
    • These Compute Metrics can indicate whether a workload is compute-bound, memory-bound or latency-bound.

    Figure 2

  • Range Based Profiling
    • Other tools profile one kernel or draw call at a time. With PerfWorks, a developer can profile a group of them as a single range, which preserves the parallelism inherent in their execution.
  • Multi-Pass Profiling
    • The hardware has a limited number of physical counters. To collect more counters than that limit allows, PerfWorks requires the application to deterministically replay the GPU work multiple times. During each replay, the application must make the same GPU calls with the same range delimiters, while PerfWorks collects a different set of counters.
  • D3D12 Support
    • Supports multithreaded GPU work submission. D3D12 uses a different programming model than CUDA or OpenGL.
  • Support for nvperf
    • nvperf is a command-line tool for offline querying of PerfWorks metrics.
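
The multi-pass idea described under Key Features can be illustrated with a small, self-contained C++ sketch. It does not use the PerfWorks API: every name in it (CounterName, Results, ReplayFn, planPasses, collectAll) is hypothetical, and the slots parameter merely stands in for the number of physical hardware counters. The sketch splits the requested counters into passes, asks the application to replay the same work once per pass, and merges the per-range values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration only; none of these names come from PerfWorks.
using CounterName = std::string;
using RangeName = std::string;
// results[range][counter] = value measured for that counter over that range
using Results = std::map<RangeName, std::map<CounterName, double>>;

// The application's deterministic replay: it must issue the same GPU work with
// the same range delimiters every time, and report values for the counters
// that are active in the current pass.
using ReplayFn =
    std::function<void(const std::vector<CounterName>& active, Results& out)>;

// Split the requested counters into passes of at most `slots` counters each.
// `slots` is an assumed stand-in for the physical counter limit.
std::vector<std::vector<CounterName>> planPasses(
    const std::vector<CounterName>& requested, std::size_t slots) {
    assert(slots > 0);
    std::vector<std::vector<CounterName>> passes;
    for (std::size_t i = 0; i < requested.size(); i += slots) {
        const std::size_t end = std::min(i + slots, requested.size());
        passes.emplace_back(
            requested.begin() + static_cast<std::ptrdiff_t>(i),
            requested.begin() + static_cast<std::ptrdiff_t>(end));
    }
    return passes;
}

// Replay the work once per pass and merge the per-range counter values.
Results collectAll(const std::vector<CounterName>& requested,
                   std::size_t slots, const ReplayFn& replay) {
    Results merged;
    const auto passes = planPasses(requested, slots);
    for (const auto& pass : passes) {
        Results passResults;
        replay(pass, passResults);  // same calls, same ranges, new counter set
        for (const auto& rangeEntry : passResults) {
            for (const auto& counterEntry : rangeEntry.second) {
                merged[rangeEntry.first][counterEntry.first] = counterEntry.second;
            }
        }
    }
    return merged;
}
```

With five requested counters and two slots, planPasses produces three passes, so the replay callback runs three times. The merge step only makes sense if each replay is deterministic, which is why the same GPU calls and range delimiters are required on every pass.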