NVIDIA CUDA Profiling Tools Interface (CUPTI) - CUDA Toolkit
The NVIDIA CUDA Profiling Tools Interface (CUPTI) is a library that enables the creation of profiling and tracing tools that target CUDA applications. CUPTI provides a set of APIs targeted at ISVs creating profilers and other performance optimization tools:
- the Activity API,
- the Callback API,
- the Host Profiling API,
- the Range Profiling API,
- the PC Sampling API,
- the SASS Metric API,
- the PM Sampling API,
- the Checkpoint API,
- the Profiling API,
- the Event API,
- the Metric API, and
- the Python API (available separately)
Using these CUPTI APIs, independent software developers can create profiling tools with low, deterministic overhead on the target system that give insight into the CPU and GPU behavior of CUDA applications. CUPTI is normally packaged with the CUDA Toolkit; NVIDIA occasionally uses this page to provide CUPTI improvements and bug fixes between toolkit releases.
Revision History
Key Features
- Trace CUDA API by registering callbacks for API calls of interest
- Full support for entry and exit points in the CUDA C Runtime (CUDART) and CUDA Driver
- GPU workload trace for the activities happening on the GPU, which includes kernel executions, memory operations (e.g., Host-to-Device memory copies) and memset operations.
- CUDA Unified Memory trace for host-to-device, device-to-host, and device-to-device transfers, as well as page faults on the CPU and GPU.
- Normalized timestamps for CPU and GPU trace
- Profile hardware and software event counters, including:
  - Utilization metrics for various hardware units
  - Instruction count and throughput
  - Memory load/store events and throughput
  - Cache hits/misses
  - Branches and divergent branches
  - Many more
- Enables automated bottleneck identification based on metrics such as instruction throughput, memory throughput, and more
- Range profiling to enable metric collection over concurrent kernel launches within a range
- Metrics attribution at the high-level source code and the executed assembly instructions.
- Device-wide sampling of the program counter (PC). PC sampling reports the number of samples for each source line and assembly instruction, broken down by stall reason.
Updates in CUDA Toolkit 13.0
New Features
- Added support for NVLOG, a configurable logging and error-reporting library, enabling detailed logging with various severity levels (Info, Warning, Error, Fatal) and output options including files and stdout. For more details, refer to the section Logging Using NVLOG.
- Introduced a new subscriber API, cuptiSubscribe_v2, which includes additional parameters for the current and previous subscriber's names (if applicable). This enhancement allows for easier identification of the previous subscriber in cases where the current subscription fails.
- Added new callbacks CUPTI_CBID_RESOURCE_GRAPH_NODE_UPDATED for updates to a CUDA Graph node and CUPTI_CBID_RESOURCE_GRAPH_NODE_SET_PARAMS for when Graph parameters are set via the cuGraphExecNodeSetParams() API.
- Added a new field, isDeviceLaunched, to the activity records for kernel, memcpy and memset operations to indicate whether the operation is part of a device-launched graph. To accommodate this change, the activity record CUpti_ActivityKernel9 is deprecated and replaced by the new activity record CUpti_ActivityKernel10. With this change, collection of records for device-launched graphs is enabled by default if HES trace is enabled.
- Added a new field, isManagedPool, to the memory pool record to indicate whether the pool uses managed memory or pinned memory allocation. The CUpti_ActivityMemoryPool2 activity record is deprecated and replaced by the new CUpti_ActivityMemoryPool3 activity record.
- Added initial support for parsing and decoding NVTX extended payloads via activity records. This enables tools to extract extended payload information from NVTX activity records. Libraries such as NCCL, which emit NVTX annotations using extended payloads, are now decodable via CUPTI. For more information, refer to the section NVIDIA Tools Extension (NVTX) Support.
- Added an API, cuptiActivityEnableCudaEventDeviceTimestamps, to control collection of the device timestamp for the CUPTI_ACTIVITY_KIND_CUDA_EVENT record. By default, the collection of CUDA event device timestamps is disabled.
- Starting with CUDA Toolkit 13.0, CUPTI API versions follow the format xxyyzz, where:
  - xx : Major CUDA Toolkit version
  - yy : Minor CUDA Toolkit version
  - zz : CUPTI-specific patch or update version (Note: This does not necessarily align with CUDA Toolkit update versions)
  - For example, version 130000 indicates CUDA Toolkit 13.0.
- Ported the following samples from the Profiling APIs to the new Range Profiler APIs: cupti_metric_properties, profiling_injection, callback_profiling and concurrent_profiling.
- Added a new error code, CUPTI_ERROR_INVALID_CHIP_NAME, for an invalid chip name passed to the Profiler Host APIs.
Deprecated and dropped features
- The CUPTI Profiling API from the header cupti_profiler_target.h and the Perfworks Metric API from the header nvperf_host.h are deprecated in the CUDA 13.0 release and will be removed in a future CUDA release. It is recommended to use the CUPTI Range Profiling API as an alternative. For more information, refer to the section CUPTI Range Profiling API.
- The source/SASS level metrics from the header cupti_activity.h are dropped. It is recommended to move to the SASS Metric API from the header cupti_sass_metrics.h.
- Removed support for the PowerPC (ppc64le) architecture.
- The PC Sampling Activity API from the header cupti_activity.h is dropped. It is recommended to move to the PC Sampling API from the header cupti_pcsampling.h.
Resolved Issues
- By default, CUPTI now requests activity buffers separately for each thread that generates activity records. This improves the runtime performance of the application because it avoids contention when multiple threads generate activities concurrently.
- Reduced the tracing collection overhead for graph launches under HES trace.
- Fixed an issue where kernel activity records showed stale information after the parameters were updated through cuGraphExecKernelNodeSetParams().
- Fixed a hang that can occur when tracing cooperative kernels under MPS. This requires an NVIDIA display driver version of 580 or higher.
- Fixed an issue where some public API symbols were set to hidden visibility in the static build of the CUPTI library.
- On WSL2 systems, CUPTI now uses CLOCK_MONOTONIC_RAW instead of CLOCK_REALTIME because the latter can jump backward due to system time adjustments.
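The xxyyzz version format described above can be unpacked with simple integer arithmetic. The following is a minimal sketch in Python, not part of CUPTI: the names CuptiApiVersion and decode_cupti_version are hypothetical, and the sketch assumes the input already follows the format introduced with CUDA Toolkit 13.0 (earlier versions are not covered).

```python
from typing import NamedTuple


class CuptiApiVersion(NamedTuple):
    """Fields of a CUPTI API version in the xxyyzz format (hypothetical helper type)."""
    toolkit_major: int  # xx: major CUDA Toolkit version
    toolkit_minor: int  # yy: minor CUDA Toolkit version
    cupti_patch: int    # zz: CUPTI-specific patch or update version


def decode_cupti_version(version: int) -> CuptiApiVersion:
    """Split an xxyyzz version number into its three two-digit fields."""
    if version < 0:
        raise ValueError("version must be non-negative")
    # Everything above the lowest four digits is the major version.
    major, remainder = divmod(version, 10_000)
    # The remaining four digits are yy followed by zz.
    minor, patch = divmod(remainder, 100)
    return CuptiApiVersion(major, minor, patch)


if __name__ == "__main__":
    v = decode_cupti_version(130000)
    print(f"CUDA Toolkit {v.toolkit_major}.{v.toolkit_minor}, CUPTI patch {v.cupti_patch}")
```

Because zz does not necessarily align with CUDA Toolkit update versions, a tool comparing versions would likely want to compare the (major, minor) pair for toolkit compatibility and treat the patch field separately.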
Requirements
Supported platforms
- Linux x86_64[1]
- Windows x86_64[1]
- Linux aarch64 SBSA[1]
- DRIVE OS QNX aarch64[2]
- DRIVE OS Linux aarch64[2]
[2] available in the Embedded or Drive toolkits only
Supported NVIDIA GPU architectures
- Activity and Callback APIs
  - All architectures supported by CUDA Toolkit
- Profiling and PC Sampling APIs
  - Blackwell: B100, GB10x, GB11x
  - Hopper: GH100
  - Ada: AD10x
  - Ampere: A100 with Multi-Instance GPU, GA10x
  - Turing
CUDA Toolkit
- CUPTI can be found in the CUDA Toolkit 13.0 production release
Drivers
- Please use the following drivers:
  - 580.88 (Windows) available at the NVIDIA Driver Download page.
  - 580.65.06 (Linux) provided with CUDA Toolkit 13.0 production release.
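A tool that depends on this release might want to check an installed driver version string against the versions listed above. The sketch below is illustrative only: the helper names are hypothetical, how the installed version string is obtained is left out, and treating the listed versions as a lower bound is an assumption (the page only says to use these drivers).

```python
from typing import Dict, Tuple

# Driver versions listed for CUDA Toolkit 13.0 on this page.
LISTED_DRIVERS: Dict[str, str] = {
    "windows": "580.88",
    "linux": "580.65.06",
}


def parse_driver_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '580.65.06' into a tuple of integers."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_listed_driver(installed: str, platform: str) -> bool:
    """Return True if `installed` is at least the version listed for `platform`.

    Assumption: the listed version is treated as a minimum.
    """
    listed = parse_driver_version(LISTED_DRIVERS[platform])
    # Tuples compare field by field, so (580, 88) >= (580, 65, 6) is True.
    return parse_driver_version(installed) >= listed


if __name__ == "__main__":
    print(meets_listed_driver("580.65.06", "linux"))
    print(meets_listed_driver("575.57.08", "linux"))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "580.9" sorting after "580.88" lexicographically.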
Documentation
Support
To provide feedback, request additional features, or report issues, please use the Developer Forums.
Installation Overview
When installing CUDA Toolkit 13.0 and specifying options, be sure to select CUDA > Development > Tools > CUPTI.