NVIDIA CUDA 13.1 introduces the largest and most comprehensive update to the CUDA platform since its introduction nearly two decades ago.
In this release, you’ll find new features and updates for improving performance and driving accelerated computing, including:
- The launch of NVIDIA CUDA Tile, our tile-based programming model for abstracting away specialized hardware, including tensor cores.
- Runtime API exposure of green contexts.
- Emulation for double and single precisions in NVIDIA cuBLAS.
- A completely rewritten CUDA programming guide, designed for both novice and advanced CUDA programmers.
CUDA Tile programming
To help create software for current and future GPUs, NVIDIA CUDA 13.1 is launching CUDA Tile, which enables you to write GPU kernels at a layer above SIMT. Currently, in SIMT programming, you specify kernels by partitioning data and defining each thread’s path of execution. Using CUDA Tile, you can bring your code up a layer and specify chunks of data called tiles. You specify the mathematical operations to be executed on those tiles, and the compiler and runtime determine the best way to launch that work onto individual threads. The tile model abstracts away the details of using specialized hardware such as tensor cores, and your tile code will be compatible with future GPU architectures.
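For contrast, the following minimal sketch shows the per-thread SIMT style that the tile model sits above: a hypothetical vector-add kernel in which you partition the data by hand and each thread computes its own index.

#include <cuda_runtime.h>

// Classic SIMT kernel: the programmer maps data to threads explicitly.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks one element
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Launch: the programmer also chooses the block/grid decomposition.
// vectorAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);

With CUDA Tile, the corresponding kernel operates on whole tiles of the arrays and leaves the mapping of that work onto threads, and onto hardware such as tensor cores, to the compiler and runtime.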
CUDA 13.1 is releasing two components for tile programming.
- CUDA Tile IR: A new virtual instruction set architecture (ISA) for programming NVIDIA GPUs.
- cuTile Python: A new domain-specific language (DSL) for authoring array and tile-based kernels in Python.
In this first version of the software:
- CUDA Tile is supported on NVIDIA Blackwell (compute capability 10.x and 12.x) products only. Future versions of CUDA will add support for more architectures.
- We’ve focused our development efforts on tile programming for AI algorithms. In future releases of CUDA, we’ll continue to add more features, functionality, and performance.
- In an upcoming CUDA release, we plan to introduce an implementation in C++.
Check out more information about CUDA Tile IR and cuTile Python.
CUDA software updates
Here are some other important software updates included in this release of CUDA.
Runtime exposure of green contexts
Green contexts in CUDA are a lightweight alternative to traditional CUDA contexts, designed to provide developers with a mechanism for finer-grained spatial partitioning and resource provisioning on the GPU. They’ve been available in the driver API since CUDA 12.4, and starting with CUDA 13.1, they’re also available in the runtime API.
Green contexts enable you to define and manage distinct partitions of GPU resources, primarily Streaming Multiprocessors (SMs), and dedicate a specific set of SMs to a particular context. You can then launch CUDA kernels and manage streams that run only within the resources provisioned for that green context. A typical example is when your application has latency-sensitive code with priority over all other GPU work. By allocating SM resources to a dedicated green context for this code, and the rest to another green context for other code, you guarantee available SMs ready for this computation.
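The general pattern is to split the device’s SM resource, generate a descriptor from the piece you want, and create a green context plus a stream bound to it. The sketch below uses the driver-API entry points available since CUDA 12.4 (the new runtime API follows the same flow); the 16-SM count is illustrative and error checking is omitted.

#include <cuda.h>

// Carve out a small SM partition for latency-sensitive work (illustrative count).
void makeLatencyPartition(CUdevice dev) {
    // Query the device's full SM resource.
    CUdevResource smResource;
    cuDeviceGetDevResource(dev, &smResource, CU_DEV_RESOURCE_TYPE_SM);

    // Split off one group of at least 16 SMs; the remainder can back a second
    // green context that handles all other work.
    CUdevResource latencyGroup, remaining;
    unsigned int nbGroups = 1;
    cuDevSmResourceSplitByCount(&latencyGroup, &nbGroups, &smResource,
                                &remaining, 0 /*useFlags*/, 16 /*minCount*/);

    // Turn the resource group into a descriptor and create the green context.
    CUdevResourceDesc desc;
    cuDevResourceGenerateDesc(&desc, &latencyGroup, 1);
    CUgreenCtx gctx;
    cuGreenCtxCreate(&gctx, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    // Kernels launched into this stream run only on the provisioned SMs.
    CUstream stream;
    cuGreenCtxStreamCreate(&stream, gctx, CU_STREAM_NON_BLOCKING, 0 /*priority*/);
}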
CUDA 13.1 also introduces a more customizable split() API. Developers can build SM partitions that previously required multiple API calls, and they can configure work queues to minimize false dependencies between work submitted in different green contexts.
The CUDA programming guide has more about these features and the runtime exposure of green contexts.
CUDA Multi-Process Service updates
CUDA 13.1 brings new features and functionality to Multi-Process Service (MPS). For complete information on these new features, please see the MPS documentation. A few of the highlights include:
Memory locality optimization partition
Memory locality optimization partition (MLOPart) is a feature on some NVIDIA Blackwell (compute capability 10.0 and 10.3) and newer GPUs that lets users create specialized CUDA devices optimized for memory locality. MLOPart devices are derived from a single underlying GPU but present as multiple devices, each with fewer compute resources and less available memory. Compute capability 10.0 and 10.3 GPUs each have two partitions.
When using MLOPart on supported GPUs, each partition appears as a distinct CUDA device, with associated compute and memory resources. Currently, MLOPart is only supported on NVIDIA B200 and NVIDIA B300 products, and a future release of CUDA will support the NVIDIA GB200 and NVIDIA GB300 products.
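Because each MLOPart partition enumerates as an ordinary CUDA device, existing device-selection code continues to work. As a minimal sketch (assuming MLOPart has already been enabled through MPS as described in the MPS documentation), listing the devices shows each partition with its reduced SM count and memory:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // With MLOPart enabled, each memory-locality partition of the underlying GPU
    // appears as its own CUDA device with its own SM and memory budget.
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d SMs, %.1f GiB\n", i, prop.name,
               prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}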
Static streaming multiprocessor partitioning
Static streaming multiprocessor (SM) partitioning is an alternative to the dynamic execution resource provisioning currently available in MPS. Available on NVIDIA Ampere architecture (compute capability 8.0) and newer GPUs, it provides a way to create exclusive SM partitions for MPS clients.
This mode is enabled by launching the MPS control daemon with the -S or --static-partitioning flag, and its main purpose is to deliver deterministic resource allocation and improved isolation between MPS clients. The fundamental unit of partitioning is a “chunk,” which varies in size based on the GPU architecture—for example, 8 SMs on Hopper (compute capability 9.0) and newer discrete GPUs.
Emulation for double and single precisions in cuBLAS
While not strictly a CUDA 13.1 update, the cuBLAS update in NVIDIA CUDA Toolkit 13.0 introduced new APIs and implementations for boosting the performance of double-precision (FP64) matrix multiplications (matmuls). This is achieved through floating-point (FP) emulation on Tensor Cores found in GPU architectures such as NVIDIA GB200 NVL72 and NVIDIA RTX PRO 6000 Blackwell Server Edition. For comprehensive information on GPU compatibility for both FP32 and FP64 emulation, refer to the cuBLAS documentation.
Developer tools
Developer tools are a crucial part of the CUDA platform. This release delivers several innovations and feature enhancements, including:
CUDA Tile kernel profiling
NVIDIA Nsight Compute 2025.4 adds support for profiling CUDA Tile kernels. Updates include a new “Result Type” column on the summary page for denoting Tile vs. SIMT kernels. A new “Tile Statistics” section on the details page summarizes Tile dimensions and utilization of important pipelines. The source page also supports mapping metrics to the high-level cuTile kernel source.

This Nsight Compute release also adds support for profiling CUDA graph nodes from device-launched graphs, along with source page navigation improvements: clickable label links for both compiler-generated and user-generated labels.
Compile-time patching
NVIDIA Compute Sanitizer 2025.4 adds support for NVIDIA CUDA Compiler (NVCC) compile-time patching through the -fdevice-sanitize=memcheck compiler flag. This patching enhances memory error detection and improves compute sanitizer performance.
Compile-time instrumentation integrates error detection directly into NVCC for faster runs while catching more subtle memory issues, such as illegal accesses between adjacent allocations, through advanced base-and-bounds analysis. This means you can debug memory problems without sacrificing speed, run more tests, and maintain productivity. Right now, only memcheck is supported.
To use this new feature, compile your code with the NVCC flag as follows:
nvcc -fdevice-sanitize=memcheck -o myapp myapp.cu
Then run your application with compute-sanitizer using the memcheck tool:
compute-sanitizer --tool memcheck myapp
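For illustration, here is a hypothetical kernel containing the kind of adjacent-allocation overflow this mode is designed to catch. Building it with -fdevice-sanitize=memcheck and running it under compute-sanitizer as above should report the out-of-bounds write; the names and sizes are illustrative.

#include <cuda_runtime.h>

// Writes one element past the end of its buffer. Because a and b are often
// adjacent allocations, the stray write can land silently in b at runtime;
// compile-time patching adds the bounds information memcheck needs to flag it.
__global__ void offByOne(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n) {  // bug: should be i < n
        a[i] = 1.0f;
    }
}

int main() {
    const int n = 256;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    offByOne<<<1, n + 1>>>(a, n);  // one extra thread triggers the overflow
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    return 0;
}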
For complete information on compile-time patching, refer to the compute-sanitizer documentation.
NVIDIA Nsight Systems
NVIDIA Nsight Systems 2025.6.1 releases concurrently with CUDA Toolkit 13.1, with several new tracing features, including:
- System-wide CUDA trace: --cuda-trace-scope enables tracing across process trees or the entire system.
- CUDA host function trace: Added trace support for CUDA Graph host function nodes and cudaLaunchHostFunc(), which executes on the host and blocks the stream (see the sketch after this list).
- CUDA hardware trace: Hardware-based tracing is now the default when supported; use --trace=cuda-sw to revert to software mode.
- Green context timeline rows now show SM allocation in tooltips to help users understand GPU resource utilization.
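To make the host function trace concrete, here is a minimal sketch of cudaLaunchHostFunc(): the callback runs on a CPU thread once all prior work in the stream has completed, and later work in the stream waits for it to return. The function and message names are illustrative.

#include <cstdio>
#include <cuda_runtime.h>

// Host callback: runs on a CPU thread after all prior work in the stream finishes;
// subsequent work in the same stream does not start until it returns.
void CUDART_CB logCompletion(void *userData) {
    printf("Stream reached host function: %s\n", static_cast<const char *>(userData));
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    static const char msg[] = "batch 0 done";
    // ... enqueue kernels or copies on stream here ...
    cudaLaunchHostFunc(stream, logCompletion, const_cast<char *>(msg));

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}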
Math libraries
New features across our core CUDA Toolkit math libraries include:
- NVIDIA cuBLAS: A new experimental Grouped GEMM API supports FP8 and BF16/FP16 on Blackwell GPUs. With CUDA Graph support and device-side shapes, these grouped GEMMs provide a host-synchronization-free implementation that delivers up to a 4x speedup over a multi-stream GEMM implementation in the MoE use case.
- NVIDIA cuSPARSE: A new sparse matrix-vector multiplication (SpMVOp) API with improved performance compared to the CsrMV API. This API supports the CSR format, 32-bit indices, double precision, and user-defined epilogues.
- NVIDIA cuFFT: A new set of APIs, called the cuFFT device API, provides host functions for querying or generating device function code and database metadata in a C++ header file. Designed for the cuFFTDx library, it generates cuFFTDx code blocks by querying cuFFT; these blocks can then be linked with the cuFFTDx application to improve performance.
Performance improvements on the new Blackwell architectures are also available. Selected updates for key APIs, along with their performance, follow.
cuBLAS Blackwell performance
CUDA Toolkit 12.9 introduced block-scaled FP4 and FP8 matmuls on NVIDIA Blackwell. CUDA 13.1 adds performance improvements for these data types and BF16. Speedups on NVIDIA Blackwell and Hopper are shown in Figure 2.

cuSOLVER Blackwell performance
CUDA 13.1 continues to improve the batched SYEVD and GEEV APIs for eigendecomposition, delivering further performance enhancements.
Batched SYEV (cusolverDnXsyevBatched) is a uniform batched version of cuSOLVER’s SYEV routine, computing eigenvalues and eigenvectors for symmetric/Hermitian matrices, ideal for parallel solving of many small matrices.
Figure 3 shows tests on a batch size of 5,000 (24-256 rows) with about a 2x speedup on the NVIDIA Blackwell RTX Pro 6000 Server Edition compared to the NVIDIA L40S, correlating with expected memory throughput increases.

Figure 4 shows the performance speed-up of cusolverDnXgeev (GEEV), which computes eigenvalues and eigenvectors of a general (non-symmetric) dense matrix. GEEV is a hybrid CPU/GPU algorithm. A single CPU thread manages aggressive early deflation in the QR algorithm, while the GPU handles the rest. Relative performance speed-ups for matrix sizes from 1,024 to 32,768 are shown.

NVIDIA CUDA Core Compute Libraries
NVIDIA CUDA Core Compute Libraries (CCCL) features several innovations and enhancements for CUB.
Deterministic floating-point reductions
Due to the non-associativity of floating point addition, cub::DeviceReduce historically only guaranteed bitwise-identical results run-to-run on the same GPU. This was implemented as a two-pass algorithm.
NVIDIA CCCL 3.1, part of CUDA 13.1, provides two additional floating-point determinism options for you to make trade-offs between determinism and performance.
- Not-guaranteed: Single-pass reduction using atomics. This isn’t guaranteed to provide bitwise-identical results.
- GPU-to-GPU: Based on the reproducible reduction described in Kate Clark’s NVIDIA GTC 2024 talk. Results are always bitwise-identical.
The determinism option can be set through a flag, as shown in the following code.
// Pick your desired trade-off of performance and determinism
// auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);
// auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);
// auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
cub::DeviceReduce::Sum(..., env);

More convenient single-phase CUB APIs
Nearly every CUB algorithm requires temporary storage for intermediate scratch space. Historically, users had to query and allocate the necessary temporary storage through a two-phase call pattern that is cumbersome and error-prone if the arguments aren’t passed identically between the two invocations.
CCCL 3.1 adds new overloads to some CUB algorithms that accept a memory resource, so you skip the temp-storage query/allocate/free pattern.
Before (two-phase)
// Determine temporary storage size (d_temp_storage is still nullptr here)
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, nullptr, ...);
// Allocate the required temporary storage
cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
// Run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_input, ...);
// Free the temporary storage
cudaFreeAsync(d_temp_storage, stream);
After (single-phase)
// Pool mr uses cudaMallocAsync under the hood
cuda::device_memory_pool mr{cuda::devices[0]};
// Single call. Temp storage is handled by the pool.
cub::DeviceScan::ExclusiveSum(d_input, ..., mr);
Learn more
The release of CUDA 13.1 brings many new features and ushers in a new era of GPU programming with CUDA Tile. Check out CUDA Tile resources, download CUDA Toolkit 13.1, and get started today.
Acknowledgements
Thanks to the following NVIDIA contributors: Jake Hemstad, Becca Zandstein, Jackson Marusarz, Kyrylo Perelygin, and Myrto Papadopoulou.