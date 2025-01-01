NVIDIA CUDA Profiling Tools Interface (CUPTI) - CUDA Toolkit 13.0

The NVIDIA CUDA Profiling Tools Interface (CUPTI) is a library that enables the creation of profiling and tracing tools that target CUDA applications. CUPTI provides a set of APIs targeted at ISVs creating profilers and other performance optimization tools:

the Activity API,

the Callback API,

the Host Profiling API,

the Range Profiling API,

the PC Sampling API,

the SASS Metric API,

the PM Sampling API,

the Checkpoint API,

the Profiling API,

the Python API (available separately)

Using these CUPTI APIs, independent software developers can create profiling tools that provide low and deterministic profiling overhead on the target system, while giving insight into the CPU and GPU behavior of CUDA applications. CUPTI is normally packaged with the CUDA Toolkit; NVIDIA occasionally uses this page to provide CUPTI improvements and bug fixes between toolkit releases.

There is currently no CUPTI update to the CUDA Toolkit 13.0 Update 1. You may obtain the latest version of CUPTI by downloading the CUDA Toolkit 13.0.1.

Key Features

Trace CUDA API by registering callbacks for API calls of interest

Full support for entry and exit points in the CUDA C Runtime (CUDART) and CUDA Driver

GPU workload trace for the activities happening on the GPU, which includes kernel executions, memory operations (e.g., Host-to-Device memory copies) and memset operations.

CUDA Unified Memory trace for transfers from host to device, device to host, and device to device, as well as page faults on the CPU and GPU.

Normalized timestamps for CPU and GPU trace

Profile hardware and software event counters, including:

Utilization metrics for various hardware units



Instruction count and throughput



Memory load/store events and throughput



Cache hits/misses



Branches and divergent branches



Many more

Enables automated bottleneck identification based on metrics such as instruction throughput, memory throughput, and more

Range profiling to enable metric collection over concurrent kernel launches within a range

Metrics attribution at the high-level source code and the executed assembly instructions.

Device-wide sampling of the program counter (PC). PC Sampling reports the number of samples for each source and assembly line, along with the associated stall reasons.

Updates in CUDA Toolkit 13.0 Update 1

Resolved Issues

Fixed the issue where synchronization records were not generated for the CUDA APIs cudaDeviceSynchronize() and cuCtxSynchronize_v2().

Fixed a potential deadlock that could occur while disabling the activity kind CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL using the API cuptiActivityDisable.

Fixed a potential crash in the API cuptiActivityFlushAll.

Discarded Unified Memory counter records that could not be completed before disabling Unified Memory profiling.



Fixed the crash when NVTX API is called before the CUPTI API. This issue was introduced in the CUDA 13.0 GA release.



Fixed a segmentation fault that occurred while initializing the device buffer for storing profiling data. The fault affected long-running applications using CUDA graphs when the attribute CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED is set to 0.

Updates in CUDA Toolkit 13.0

New Features

Added support for NVLOG, a configurable logging and error-reporting library, enabling detailed logging with various severity levels (Info, Warning, Error, Fatal) and output options including files and stdout. For more details, refer to the section Logging Using NVLOG.



Introduced a new subscriber API, cuptiSubscribe_v2, which includes additional parameters for the current and previous subscriber's names (if applicable). This enhancement allows for easier identification of the previous subscriber in cases where the current subscription fails.

Added new callbacks CUPTI_CBID_RESOURCE_GRAPH_NODE_UPDATED for updates to CUDA Graph node and CUPTI_CBID_RESOURCE_GRAPH_NODE_SET_PARAMS for when Graph parameters are set via the cuGraphExecNodeSetParams() API.

A new field, isDeviceLaunched, has been added to the activity records for kernel, memcpy, and memset operations to indicate whether the operation is part of a device-launched graph. To accommodate this change, the activity record CUpti_ActivityKernel9 is deprecated and replaced by the new activity record CUpti_ActivityKernel10. With this change, collection of records for device-launched graphs is enabled by default when HES trace is enabled.

A new field, isManagedPool, has been added to the memory pool record to indicate whether the pool uses managed memory or pinned memory allocation. The CUpti_ActivityMemoryPool2 activity record is deprecated and replaced by the new CUpti_ActivityMemoryPool3 activity record.

Added initial support for parsing and decoding NVTX extended payloads via activity records. This enables tools to extract extended payload information from NVTX activity records. Annotations from libraries such as NCCL, which emit NVTX annotations using extended payloads, can now be decoded via CUPTI. For more information, refer to the section NVIDIA Tools Extension (NVTX) Support.



Added an API, cuptiActivityEnableCudaEventDeviceTimestamps, to control collection of device timestamps for the CUPTI_ACTIVITY_KIND_CUDA_EVENT record. By default, the collection of CUDA event device timestamps is disabled.

Starting with CUDA Toolkit 13.0, CUPTI API versions follow the format xxyyzz, where:



xx : Major CUDA Toolkit version





yy : Minor CUDA Toolkit version





zz : CUPTI-specific patch or update version (Note: This does not necessarily align with CUDA Toolkit update versions)



For example, version 130000 indicates CUDA Toolkit 13.0.



Ported the following samples from Profiling APIs to the new Range Profiler APIs: cupti_metric_properties, profiling_injection, callback_profiling and concurrent_profiling.



Added a new error code CUPTI_ERROR_INVALID_CHIP_NAME for invalid chip name passed in the Profiler Host APIs.
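The xxyyzz version format described above can be unpacked with plain integer arithmetic. The following is a minimal sketch and not part of CUPTI: the function name decode_cupti_version is hypothetical, and the version integer is assumed to have been obtained elsewhere (for example from cuptiGetVersion in the C API). Values below 130000 are rejected on the assumption that they predate this format.

```python
def decode_cupti_version(version):
    """Split an xxyyzz CUPTI API version into (major, minor, patch).

    Hypothetical helper for illustration; it is not a CUPTI function.
    """
    # Assumption: the xxyyzz scheme starts at 130000 (CUDA Toolkit 13.0),
    # so smaller values are treated as not following this format.
    if version < 130000:
        raise ValueError(f"{version} does not use the xxyyzz format")
    major, rest = divmod(version, 10000)  # xx: major CUDA Toolkit version
    minor, patch = divmod(rest, 100)      # yy: minor version, zz: CUPTI patch
    return major, minor, patch
```

For the example value 130000, the function returns (13, 0, 0). Because zz is a CUPTI-specific patch number, the third element should not be read as a CUDA Toolkit update number.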

Deprecated and dropped features

The CUPTI Profiling API from the header cupti_profiler_target.h and the Perfworks Metric API from the header nvperf_host.h are deprecated in the CUDA 13.0 release and will be removed in a future CUDA release. It is recommended to use the CUPTI Range Profiling API as an alternative. For more information, refer to the section Evolution of the profiling APIs. Note that not all APIs from the header cupti_profiler_target.h are deprecated; a few supported APIs are used by other profiling features like Range Profiling and PM Sampling.

The CUPTI Event API from the header cupti_events.h and the CUPTI Metric API from the header cupti_metrics.h are dropped. Calling any Event or Metric API will return the error code CUPTI_ERROR_LEGACY_PROFILER_NOT_SUPPORTED. It is recommended to use the CUPTI Range Profiling API as an alternative.

The source/SASS level metrics from the header cupti_activity.h are dropped. It is recommended to move to the SASS Metric API from the header cupti_sass_metrics.h.

Removed support for the PowerPC (ppc64le) architecture.



The PC Sampling Activity API from the header cupti_activity.h is dropped. It is recommended to move to the PC Sampling API from the header cupti_pcsampling.h.

By default, CUPTI will request activity buffers separately for each thread that generates activity records. This will improve the runtime performance of the application as it avoids contention when multiple threads are generating activities concurrently.
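As a migration aid, the header changes listed above can be turned into a simple include scan over a tool's C or C++ sources. This is a hypothetical sketch: the table LEGACY_HEADERS and the function find_legacy_includes are names invented for illustration, and the table covers only the four headers that the notes above describe as dropped or deprecated. cupti_activity.h is deliberately left out, since only some of its records were dropped and the header itself remains in use.

```python
import re

# Header -> (status in CUDA 13.0, suggested replacement), as stated in the
# notes above. The structure itself is an illustration, not a CUPTI artifact.
LEGACY_HEADERS = {
    "cupti_events.h": ("dropped", "CUPTI Range Profiling API"),
    "cupti_metrics.h": ("dropped", "CUPTI Range Profiling API"),
    "cupti_profiler_target.h": ("partly deprecated", "CUPTI Range Profiling API"),
    "nvperf_host.h": ("deprecated", "CUPTI Range Profiling API"),
}

# Matches lines such as: #include <cupti_events.h> or #include "nvperf_host.h"
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*[<"]([^>"]+)[>"]')


def find_legacy_includes(source_text):
    """Return (line_number, header, status, replacement) for each legacy include."""
    findings = []
    for lineno, line in enumerate(source_text.splitlines(), start=1):
        match = INCLUDE_RE.match(line)
        if not match:
            continue
        # Compare on the file name only, so "extras/cupti_events.h" also matches.
        header = match.group(1).split("/")[-1]
        if header in LEGACY_HEADERS:
            status, replacement = LEGACY_HEADERS[header]
            findings.append((lineno, header, status, replacement))
    return findings
```

A hit on cupti_profiler_target.h does not by itself mean the code must change, because some APIs in that header remain supported; the scan only marks places worth reviewing.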

Resolved Issues

Reduced the tracing collection overhead for graph launches under HES trace.



Fixed an issue where kernel activity records showed stale information after the parameters were updated through cuGraphExecKernelNodeSetParams().

Fixed a hang that could occur when tracing cooperative kernels under MPS. This fix requires NVIDIA display driver version 580 or higher.



Fixed an issue where some public API symbols were set to hidden visibility in the static build of the CUPTI library.



On WSL2 systems, CUPTI now uses CLOCK_MONOTONIC_RAW instead of CLOCK_REALTIME, because the latter can jump backward due to system time adjustments.

Updates in CUPTI Python Profiling APIs 13.0

Overview

CUPTI Python provides Python APIs for creating profiling and tracing tools that target CUDA Python applications. It supports a subset of the CUPTI C Activity and Callback APIs on Linux x86_64.



cupti-python is available separately from the CUDA Toolkit.

Please refer to the CUPTI Python 13.0.0 documentation and download page.

New Features

New Activity Kind cupti.cupti.ActivityKind.ROTATION and its corresponding activity class cupti.cupti.ActivityConfidentialComputeRotation have been added.

Support for Activity Kind cupti.cupti.ActivityKind.RUNTIME and Callback Domain cupti.CallbackDomain.RUNTIME_API has been added.

New activity classes cupti.cupti.ActivityKernel10 and cupti.cupti.ActivityMemoryPool3 have been introduced.

Support for APIs cupti.cupti.subscribe_v2, cupti.cupti.activity_enable_cuda_event_device_timestamps and cupti.cupti.activity_enable_all_sync_records has been added.

Added cupti.pyi stub file to improve IDE support, including type hints and auto-completion.

Support for Python 3.13 has been added.



Support for Linux (aarch64 sbsa) architecture has been added.

Deprecated and dropped features

Support for activity classes cupti.cupti.ActivityKernel9 and cupti.cupti.ActivityMemoryPool2 has been removed.

Requirements

Supported platforms

Linux x86_64 [1]



Windows x86_64 [1]



Linux aarch64 SBSA [1]



DRIVE OS QNX aarch64 [2]



DRIVE OS Linux aarch64 [2]

[1] available in the CUDA Desktop Toolkit only

[2] available in the Embedded or Drive toolkits only

Supported NVIDIA GPU architectures

Activity and Callback APIs



All architectures supported by CUDA Toolkit



Profiling and PC Sampling APIs



Blackwell: B100, GB10x, GB11x





Hopper: GH100





Ada: AD10x





Ampere: A100 with Multi-Instance GPU, GA10x





Turing

CUDA Toolkit

CUPTI can be found in the CUDA Toolkit 13.0 Update 1 production release.

Drivers

Please use the following drivers:

581.15 (Windows) available at the NVIDIA Driver Download page.





580.82.07 (Linux) provided with CUDA Toolkit 13.0 Update 1 production release.
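A tool that depends on this CUPTI release may want to check the installed driver against the versions listed above before starting a profiling session. The sketch below is one possible way to do that comparison. It treats the listed versions as minimums, which is an assumption, and the names LISTED_DRIVERS, parse_driver_version, and driver_is_new_enough are hypothetical. The installed version string is assumed to come from elsewhere, for example from the output of nvidia-smi.

```python
# Driver versions listed above for CUDA Toolkit 13.0 Update 1.
# Assumption: they are treated here as minimum required versions.
LISTED_DRIVERS = {"windows": "581.15", "linux": "580.82.07"}


def parse_driver_version(text):
    """Turn a dotted driver version such as '580.82.07' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed, platform):
    """Return True if `installed` is at least the version listed for `platform`."""
    required = parse_driver_version(LISTED_DRIVERS[platform])
    # Tuples compare element by element, so (580, 82, 7) < (581, 15).
    return parse_driver_version(installed) >= required
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as "580.9" sorting after "580.82" as text.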

Documentation

Support

See the CUPTI User Guide for a complete listing of hardware and software event counters available for performance analysis tools.

To provide feedback, request additional features, or report issues, please use the Developer Forums.

Installation Overview

When installing CUDA Toolkit 13.0 Update 1 and specifying options, be sure to select CUDA > Development > Tools > CUPTI.