NVIDIA CUDA Profiling Tools Interface (CUPTI) - CUDA Toolkit

The NVIDIA CUDA Profiling Tools Interface (CUPTI) is a library that enables the creation of profiling and tracing tools that target CUDA applications. CUPTI provides a set of APIs targeted at ISVs creating profilers and other performance optimization tools:

the Activity API,
the Callback API,
the Host Profiling API,
the Range Profiling API,
the PC Sampling API,
the SASS Metric API,
the PM Sampling API,
the Checkpoint API,
the Profiling API,
the Event API,
the Metric API, and
the Python API (available separately)

Using these CUPTI APIs, independent software developers can create profiling tools that add low, deterministic overhead on the target system while giving insight into the CPU and GPU behavior of CUDA applications. CUPTI is normally packaged with the CUDA Toolkit; NVIDIA occasionally uses this page to provide CUPTI improvements and bug fixes between toolkit releases.

There is currently no CUPTI update to the CUDA Toolkit 13.0. You may obtain the latest version of CUPTI by downloading the CUDA Toolkit 13.0.0.

Download the CUDA Toolkit 13.0 Now
Download the CUPTI Python API 12.8.0 Now

Revision History

Key Features

Trace CUDA API by registering callbacks for API calls of interest

Full support for entry and exit points in the CUDA C Runtime (CUDART) and CUDA Driver

GPU workload trace for the activities happening on the GPU, which includes kernel executions, memory operations (e.g., Host-to-Device memory copies) and memset operations.
CUDA Unified Memory trace for host-to-device, device-to-host, and device-to-device transfers, as well as page faults on the CPU and GPU.
Normalized timestamps for CPU and GPU trace
Profile hardware and software event counters, including:

Utilization metrics for various hardware units
Instruction count and throughput
Memory load/store events and throughput
Cache hits/misses
Branches and divergent branches
Many more

Enables automated bottleneck identification based on metrics such as instruction throughput, memory throughput, and more
Range profiling to enable metric collection over concurrent kernel launches within a range
Metric attribution to the high-level source code and to the executed assembly instructions
Device-wide sampling of the program counter (PC). PC Sampling reports the number of samples for each source line and assembly line, broken down by stall reason.

See the CUPTI User Guide for a complete listing of hardware and software event counters available for performance analysis tools.

Updates in CUDA Toolkit 13.0

New Features

Added support for NVLOG, a configurable logging and error-reporting library, enabling detailed logging with various severity levels (Info, Warning, Error, Fatal) and output options including files and stdout. For more details, refer to the section Logging Using NVLOG.
Introduced a new subscriber API, cuptiSubscribe_v2, which includes additional parameters for the names of the current and previous subscribers (if applicable). This makes it easier to identify the previous subscriber when the current subscription fails.
Added two new callbacks: CUPTI_CBID_RESOURCE_GRAPH_NODE_UPDATED for updates to a CUDA Graph node, and CUPTI_CBID_RESOURCE_GRAPH_NODE_SET_PARAMS for when graph parameters are set via the cuGraphExecNodeSetParams() API.
A new field, isDeviceLaunched, has been added to the activity records for kernel, memcpy, and memset operations to indicate whether the operation is part of a device-launched graph. To accommodate this change, the activity record CUpti_ActivityKernel9 is deprecated and replaced by the new activity record CUpti_ActivityKernel10. With this change, collection of records for device-launched graphs is enabled by default if HES trace is enabled.
A new field, isManagedPool, has been added to the memory pool record to indicate whether the pool uses managed memory or pinned memory allocation. The CUpti_ActivityMemoryPool2 activity record is deprecated and replaced by the new CUpti_ActivityMemoryPool3 activity record.
Added initial support for parsing and decoding NVTX extended payloads via activity records, which enables tools to extract extended payload information from NVTX activity records. NVTX annotations from libraries such as NCCL, which emit them using extended payloads, can now be decoded via CUPTI. For more information, refer to the section NVIDIA Tools Extension (NVTX) Support.
Added the API cuptiActivityEnableCudaEventDeviceTimestamps to control collection of device timestamps for the CUPTI_ACTIVITY_KIND_CUDA_EVENT record. By default, the collection of CUDA event device timestamps is disabled.
Starting with CUDA Toolkit 13.0, CUPTI API versions follow the format xxyyzz, where:

xx : Major CUDA Toolkit version
yy : Minor CUDA Toolkit version
zz : CUPTI-specific patch or update version (Note: This does not necessarily align with CUDA Toolkit update versions)

For example, version 130000 indicates CUDA Toolkit 13.0.
Ported the following samples from Profiling APIs to the new Range Profiler APIs: cupti_metric_properties, profiling_injection, callback_profiling and concurrent_profiling.
Added a new error code, CUPTI_ERROR_INVALID_CHIP_NAME, reported when an invalid chip name is passed to the Profiler Host APIs.
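To illustrate the xxyyzz version format described above, the following minimal sketch splits a version number into its three components. The names decode_cupti_api_version and CuptiApiVersion are hypothetical and not part of CUPTI, and the sketch assumes the value already follows the format introduced with CUDA Toolkit 13.0.

```python
from typing import NamedTuple


class CuptiApiVersion(NamedTuple):
    """Hypothetical container for the three parts of an xxyyzz version."""
    major: int  # xx: major CUDA Toolkit version
    minor: int  # yy: minor CUDA Toolkit version
    patch: int  # zz: CUPTI-specific patch or update version


def decode_cupti_api_version(version: int) -> CuptiApiVersion:
    """Split an xxyyzz-style version number into (major, minor, patch).

    Assumption: the input uses the format introduced with CUDA Toolkit 13.0.
    Version numbers from earlier releases are not handled here.
    """
    if version < 0:
        raise ValueError("version must be non-negative")
    major, rest = divmod(version, 10000)  # strip the lowest four digits (yyzz)
    minor, patch = divmod(rest, 100)      # split yyzz into yy and zz
    return CuptiApiVersion(major, minor, patch)


if __name__ == "__main__":
    print(decode_cupti_api_version(130000))
```

Because zz does not necessarily align with CUDA Toolkit update versions, a tool comparing versions would probably compare only the major and minor parts when checking toolkit compatibility.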

Deprecated and dropped features

The CUPTI Profiling API from the header cupti_profiler_target.h and the Perfworks Metric API from the header nvperf_host.h are deprecated in the CUDA 13.0 release and will be removed in a future CUDA release. It is recommended to use the CUPTI Range Profiling API as an alternative. For more information, refer to the section CUPTI Range Profiling API.
The source/SASS level metrics from the header cupti_activity.h are dropped. It is recommended to move to the SASS Metric API from the header cupti_sass_metrics.h.
Removed support for the PowerPC (ppc64le) architecture.
The PC Sampling Activity API from the header cupti_activity.h is dropped. It is recommended to move to the PC Sampling API from the header cupti_pcsampling.h.
By default, CUPTI requests activity buffers separately for each thread that generates activity records. This improves the runtime performance of the application because it avoids contention when multiple threads generate activity records concurrently.
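The idea behind per-thread buffers can be sketched in a few lines. The example below is only an illustration of the general pattern and is not how CUPTI is implemented; the class PerThreadBuffers and its methods are hypothetical. Each thread appends to its own buffer without taking a lock on the hot path, and a lock is needed only once per thread, when its buffer is first registered.

```python
import threading


class PerThreadBuffers:
    """Hypothetical illustration: one record buffer per producing thread."""

    def __init__(self):
        self._local = threading.local()        # holds each thread's own buffer
        self._all_buffers = []                 # registry used when draining
        self._registry_lock = threading.Lock()

    def _buffer(self):
        buf = getattr(self._local, "buf", None)
        if buf is None:
            # First record from this thread: create and register its buffer.
            buf = []
            self._local.buf = buf
            with self._registry_lock:
                self._all_buffers.append(buf)
        return buf

    def record(self, item):
        # Hot path: no shared lock, so producer threads do not contend.
        self._buffer().append(item)

    def drain(self):
        # Collect records from every thread's buffer. Intended to be called
        # after the producer threads have finished.
        with self._registry_lock:
            buffers = list(self._all_buffers)
        return [item for buf in buffers for item in buf]


def run_workers(buffers, thread_count, records_per_thread):
    """Start worker threads that each write their own records, then join."""
    def work(worker_id):
        for i in range(records_per_thread):
            buffers.record((worker_id, i))

    threads = [threading.Thread(target=work, args=(n,)) for n in range(thread_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A single shared buffer would instead require every record call to take the same lock, which is the kind of contention the per-thread default is described as avoiding.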

Resolved Issues

Reduced the trace collection overhead for graph launches under HES trace.
Fixed an issue where kernel activity records showed stale information after the parameters were updated through cuGraphExecKernelNodeSetParams().
Fixed a hang that could occur when tracing cooperative kernels under MPS. The fix requires NVIDIA display driver version 580 or higher.
Fixed an issue where some public API symbols were set to hidden visibility in the static build of the CUPTI library.
On WSL2 systems, CUPTI uses CLOCK_MONOTONIC_RAW instead of CLOCK_REALTIME, because the latter can jump backward when the system time is adjusted.
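The property that motivates the clock change above can be observed from any process. The sketch below is an illustration in Python rather than CUPTI code, and the helper names raw_monotonic_ns and elapsed_ns are hypothetical. It reads CLOCK_MONOTONIC_RAW through the standard time module and, as an assumption for platforms where that clock is not exposed, falls back to Python's generic monotonic clock.

```python
import time


def raw_monotonic_ns() -> int:
    """Return a timestamp in nanoseconds from CLOCK_MONOTONIC_RAW."""
    clock_id = getattr(time, "CLOCK_MONOTONIC_RAW", None)
    if clock_id is None:
        # Assumption: on platforms without CLOCK_MONOTONIC_RAW, use Python's
        # generic monotonic clock, which also never goes backward.
        return time.monotonic_ns()
    return time.clock_gettime_ns(clock_id)


def elapsed_ns(work) -> int:
    """Measure how long work() takes using the monotonic clock."""
    start = raw_monotonic_ns()
    work()
    return raw_monotonic_ns() - start


if __name__ == "__main__":
    # Successive readings never decrease, so the interval is never negative.
    print(elapsed_ns(lambda: sum(range(100_000))) >= 0)
```

An interval computed from two CLOCK_REALTIME readings has no such guarantee: if the system time is stepped backward between the readings, the difference can come out negative.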