NVIDIA CUDA 13.1 introduces the largest and most comprehensive update to the CUDA platform since its introduction nearly two decades ago.
In this release, you’ll find new features and updates for improving performance and driving accelerated computing, including:
- The launch of NVIDIA CUDA Tile, our tile-based programming model for abstracting away specialized hardware, including tensor cores.
- Runtime API exposure of green contexts.
- Emulation for double and single precisions in NVIDIA cuBLAS.
- A completely rewritten CUDA programming guide, designed for both novice and advanced CUDA programmers.
CUDA Tile programming
To help create software for current and future GPUs, NVIDIA CUDA 13.1 is launching CUDA Tile, which enables you to write GPU kernels at a layer above SIMT. Currently, in SIMT programming, you specify kernels by partitioning data and defining each thread’s path of execution. Using CUDA Tile, you can bring your code up a layer and specify chunks of data called tiles. You specify the mathematical operations to be executed on those tiles, and the compiler and runtime determine the best way to launch that work onto individual threads. The tile model abstracts away the details of using specialized hardware such as tensor cores, and your tile code will be compatible with future GPU architectures.
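For contrast, the following minimal sketch shows the per-thread SIMT style that the tile model sits above: a hypothetical vector-add kernel in which you partition the data by hand and each thread computes its own index.

#include <cuda_runtime.h>

// Classic SIMT kernel: the programmer maps data to threads explicitly.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // each thread picks one element
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Launch: the programmer also chooses the block/grid decomposition.
// vectorAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);

With CUDA Tile, the corresponding kernel operates on whole tiles of the arrays and leaves the mapping of that work onto threads, and onto hardware such as tensor cores, to the compiler and runtime.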
CUDA 13.1 is releasing two components for tile programming.
- CUDA Tile IR: A new virtual instruction set architecture (ISA) for programming NVIDIA GPUs.
- cuTile Python: A new domain-specific language (DSL) for authoring array and tile-based kernels in Python.
In this first version of the software:
- CUDA Tile is supported on NVIDIA Blackwell (compute capability 10.x and 12.x) products only. Future versions of CUDA will add support for more architectures.
- We’ve focused our development efforts on tile programming for AI algorithms. In future releases of CUDA, we’ll continue to add more features, functionality, and performance.
- In an upcoming CUDA release, we plan to introduce an implementation in C++.
Check out more information about CUDA Tile IR and cuTile Python.
CUDA software updates
Here are some other important software updates included in this release of CUDA.
Runtime exposure of green contexts
Green contexts in CUDA are a lightweight alternative to traditional CUDA contexts, designed to provide developers with a mechanism for finer-grained spatial partitioning and resource provisioning on the GPU. They’ve been available in the driver API since CUDA 12.4, and starting with CUDA 13.1, they’re also available in the runtime API.
Green contexts enable you to define and manage distinct partitions of GPU resources, primarily Streaming Multiprocessors (SMs), and dedicate a specific set of SMs to a particular context. You can then launch CUDA kernels and manage streams that run only within the resources provisioned for that green context. A typical example is when your application has latency-sensitive code with priority over all other GPU work. By allocating SM resources to a dedicated green context for this code, and the rest to another green context for other code, you guarantee available SMs ready for this computation.
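The general pattern is to split the device’s SM resource, generate a descriptor from the piece you want, and create a green context plus a stream bound to it. The sketch below uses the driver-API entry points available since CUDA 12.4 (the new runtime API follows the same flow); the 16-SM count is illustrative and error checking is omitted.

#include <cuda.h>

// Carve out a small SM partition for latency-sensitive work (illustrative count).
void makeLatencyPartition(CUdevice dev) {
    // Query the device's full SM resource.
    CUdevResource smResource;
    cuDeviceGetDevResource(dev, &smResource, CU_DEV_RESOURCE_TYPE_SM);

    // Split off one group of at least 16 SMs; the remainder can back a second
    // green context that handles all other work.
    CUdevResource latencyGroup, remaining;
    unsigned int nbGroups = 1;
    cuDevSmResourceSplitByCount(&latencyGroup, &nbGroups, &smResource,
                                &remaining, 0 /*useFlags*/, 16 /*minCount*/);

    // Turn the resource group into a descriptor and create the green context.
    CUdevResourceDesc desc;
    cuDevResourceGenerateDesc(&desc, &latencyGroup, 1);
    CUgreenCtx gctx;
    cuGreenCtxCreate(&gctx, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    // Kernels launched into this stream run only on the provisioned SMs.
    CUstream stream;
    cuGreenCtxStreamCreate(&stream, gctx, CU_STREAM_NON_BLOCKING, 0 /*priority*/);
}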
CUDA 13.1 also introduces a more customizable split() API. Developers can build SM partitions that previously required multiple API calls, and they can configure work queues to minimize false dependencies between work submitted in different green contexts.
The CUDA programming guide has more about these features and the runtime exposure of green contexts.
CUDA Multi-Process Service updates
CUDA 13.1 brings new features and functionality to Multi-Process Service (MPS). For complete information on these new features, please see the MPS documentation. A few of the highlights include:
Memory locality optimization partition
Memory locality optimization partition (MLOPart) is a feature on some NVIDIA Blackwell (compute capability 10.0 and 10.3) and newer GPUs that lets users create specialized CUDA devices optimized for memory locality. MLOPart devices are derived from a single underlying GPU but present as multiple devices, each with fewer compute resources and less available memory. Compute capability 10.0 and 10.3 GPUs each have two partitions.
When using MLOPart on supported GPUs, each partition appears as a distinct CUDA device, with associated compute and memory resources. Currently, MLOPart is only supported on NVIDIA B200 and NVIDIA B300 products, and a future release of CUDA will support the NVIDIA GB200 and NVIDIA GB300 products.
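Because each MLOPart partition enumerates as an ordinary CUDA device, existing device-selection code continues to work. As a minimal sketch (assuming MLOPart has already been enabled through MPS as described in the MPS documentation), listing the devices shows each partition with its reduced SM count and memory:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // With MLOPart enabled, each memory-locality partition of the underlying GPU
    // appears as its own CUDA device with its own SM and memory budget.
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop{};
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, %d SMs, %.1f GiB\n", i, prop.name,
               prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}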
Static streaming multiprocessor partitioning
Static streaming multiprocessor (SM) partitioning is an alternative to the dynamic execution resource provisioning currently available in MPS. Available on NVIDIA Ampere architecture (compute capability 8.0) and newer GPUs, it provides a way to create exclusive SM partitions for MPS clients.
This mode is enabled by launching the MPS control daemon with the -S or --static-partitioning flag, and its main purpose is to deliver deterministic resource allocation and improved isolation between MPS clients. The fundamental unit of partitioning is a “chunk,” which varies in size based on the GPU architecture—for example, 8 SMs on Hopper (compute capability 9.0) and newer discrete GPUs.
Emulation for double and single precisions in cuBLAS
While not strictly a CUDA 13.1 update, the cuBLAS update in NVIDIA CUDA Toolkit 13.0 introduced new APIs and implementations for boosting the performance of double-precision (FP64) matrix multiplications (matmuls). This is achieved through floating-point (FP) emulation on Tensor Cores found in GPU architectures such as NVIDIA GB200 NVL72 and NVIDIA RTX PRO 6000 Blackwell Server Edition. For comprehensive information on GPU compatibility for both FP32 and FP64 emulation, refer to the cuBLAS documentation.
Developer tools
Developer tools are a crucial part of the CUDA platform. This release delivers several innovations and feature enhancements, including:
CUDA Tile kernel profiling
NVIDIA Nsight Compute 2025.4 adds support for profiling CUDA Tile kernels. Updates include a new “Result Type” column on the summary page for denoting Tile vs. SIMT kernels. A new “Tile Statistics” section on the details page summarizes Tile dimensions and utilization of important pipelines. The source page also supports mapping metrics to the high-level cuTile kernel source.

This Nsight Compute release also adds support for profiling CUDA graph nodes from device-launched graphs, along with source page navigation improvements: clickable label links for both compiler-generated and user-generated labels.
Compile-time patching
NVIDIA Compute Sanitizer 2025.4 adds support for NVIDIA CUDA Compiler (NVCC) compile-time patching through the -fdevice-sanitize=memcheck compiler flag. This patching enhances memory error detection and improves compute sanitizer performance.
Compile-time instrumentation integrates error detection directly into NVCC for faster runs while catching more subtle memory issues, such as illegal accesses between adjacent allocations, through advanced base-and-bounds analysis. This means you can debug memory problems without sacrificing speed, run more tests, and maintain productivity. Right now, only memcheck is supported.
To use this new feature, compile your code with the NVCC flag as follows:
nvcc -fdevice-sanitize=memcheck -o myapp myapp.cu
Then run your application with compute-sanitizer using the memcheck tool:
compute-sanitizer --tool memcheck myapp
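For illustration, here is a hypothetical kernel containing the kind of adjacent-allocation overflow this mode is designed to catch. Building it with -fdevice-sanitize=memcheck and running it under compute-sanitizer as above should report the out-of-bounds write; the names and sizes are illustrative.

#include <cuda_runtime.h>

// Writes one element past the end of its buffer. Because a and b are often
// adjacent allocations, the stray write can land silently in b at runtime;
// compile-time patching adds the bounds information memcheck needs to flag it.
__global__ void offByOne(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i <= n) {  // bug: should be i < n
        a[i] = 1.0f;
    }
}

int main() {
    const int n = 256;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    offByOne<<<1, n + 1>>>(a, n);  // one extra thread triggers the overflow
    cudaDeviceSynchronize();
    cudaFree(a);
    cudaFree(b);
    return 0;
}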
For complete information on compile-time patching, refer to the compute-sanitizer documentation.
NVIDIA Nsight Systems
NVIDIA Nsight Systems 2025.6.1 releases concurrently with CUDA Toolkit 13.1, with several new tracing features, including:
- System-wide CUDA trace: --cuda-trace-scope enables tracing across process trees or the entire system.
- CUDA host function trace: Added trace support for CUDA Graph host function nodes and cudaLaunchHostFunc(), which executes on the host and blocks the stream (see the sketch after this list).
- CUDA hardware trace: Hardware-based tracing is now the default when supported; use --trace=cuda-sw to revert to software mode.
- Green context timeline rows now show SM allocation in tooltips to help users understand GPU resource utilization.
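To make the host function trace concrete, here is a minimal sketch of cudaLaunchHostFunc(): the callback runs on a CPU thread once all prior work in the stream has completed, and later work in the stream waits for it to return. The function and message names are illustrative.

#include <cstdio>
#include <cuda_runtime.h>

// Host callback: runs on a CPU thread after all prior work in the stream finishes;
// subsequent work in the same stream does not start until it returns.
void CUDART_CB logCompletion(void *userData) {
    printf("Stream reached host function: %s\n", static_cast<const char *>(userData));
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    static const char msg[] = "batch 0 done";
    // ... enqueue kernels or copies on stream here ...
    cudaLaunchHostFunc(stream, logCompletion, const_cast<char *>(msg));

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}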
Math libraries
New features across our core CUDA Toolkit math libraries include:
- NVIDIA cuBLAS: A new experimental Grouped GEMM API supports FP8 and BF16/FP16 on Blackwell GPUs. With CUDA Graph support and device-side shapes, these grouped GEMMs provide a host-synchronization-free implementation that delivers up to a 4x speedup over a multi-stream GEMM implementation in the MoE use case.
- NVIDIA cuSPARSE: A new sparse matrix-vector multiplication (SpMVOp) API with improved performance compared to the CsrMV API. This API supports the CSR format, 32-bit indices, double precision, and user-defined epilogues.
- NVIDIA cuFFT: A new set of APIs, called the cuFFT device API, provides host functions for querying or generating device function code and database metadata in a C++ header file. Designed for the cuFFTDx library, it generates cuFFTDx code blocks by querying cuFFT; these blocks can then be linked with the cuFFTDx application to improve performance.
Performance improvements on the new Blackwell architectures are also available. Selected updates for key APIs, along with their performance, follow.
cuBLAS Blackwell performance
CUDA Toolkit 12.9 introduced block-scaled FP4 and FP8 matmuls on NVIDIA Blackwell. CUDA 13.1 adds performance improvements for these data types and BF16. Speedups on NVIDIA Blackwell and Hopper are shown in Figure 2.

cuSOLVER Blackwell performance
CUDA 13.1 continues to improve the batched SYEVD and GEEV APIs for eigendecomposition, delivering further performance enhancements.
Batched SYEV (cusolverDnXsyevBatched) is a uniform batched version of cuSOLVER’s SYEV routine, computing eigenvalues and eigenvectors for symmetric/Hermitian matrices, ideal for parallel solving of many small matrices.
Figure 3 shows tests on a batch size of 5,000 (24-256 rows) with about a 2x speedup on the NVIDIA Blackwell RTX Pro 6000 Server Edition compared to the NVIDIA L40S, correlating with expected memory throughput increases.

Figure 4 shows the performance speed-up of cusolverDnXgeev (GEEV), which computes eigenvalues and eigenvectors of a general (non-symmetric) dense matrix. GEEV is a hybrid CPU/GPU algorithm. A single CPU thread manages aggressive early deflation in the QR algorithm, while the GPU handles the rest. Relative performance speed-ups for matrix sizes from 1,024 to 32,768 are shown.

NVIDIA CUDA Core Compute Libraries
NVIDIA CUDA Core Compute Libraries (CCCL) features several innovations and enhancements for CUB.
Deterministic floating-point reductions
Due to the non-associativity of floating point addition, cub::DeviceReduce historically only guaranteed bitwise-identical results run-to-run on the same GPU. This was implemented as a two-pass algorithm.
NVIDIA CCCL 3.1, part of CUDA 13.1, provides two additional floating-point determinism options for you to make trade-offs between determinism and performance.
- Not-guaranteed: Single-pass reduction using atomics. This isn’t guaranteed to provide bitwise-identical results.
- GPU-to-GPU: Based on the reproducible reduction described in Kate Clark’s NVIDIA GTC 2024 talk. Results are always bitwise-identical.
The determinism option can be set through a flag, as shown in the following code.
// Pick your desired trade-off of performance and determinism
// auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);
// auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);
// auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
cub::DeviceReduce::Sum(..., env);

More convenient single-phase CUB APIs
Nearly every CUB algorithm requires temporary storage for intermediate scratch space. Historically, users had to query and allocate the necessary temporary storage through a two-phase call pattern that is cumbersome and error-prone if the arguments aren’t passed identically between the two invocations.
CCCL 3.1 adds new overloads to some CUB algorithms that accept a memory resource, so you skip the temp-storage query/allocate/free pattern.
Before (two-phase)
// Determine temporary storage size (d_temp_storage is still nullptr here)
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, nullptr, ...);
// Allocate the required temporary storage
cudaMallocAsync(&d_temp_storage, temp_storage_bytes, stream);
// Run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_input, ...);
// Free the temporary storage
cudaFreeAsync(d_temp_storage, stream);
After (single-phase)
// Pool mr uses cudaMallocAsync under the hood
cuda::device_memory_pool mr{cuda::devices[0]};
// Single call. Temp storage is handled by the pool.
cub::DeviceScan::ExclusiveSum(d_input, ..., mr);
Learn more
The release of CUDA 13.1 brings many new features and ushers in a new era of GPU programming with CUDA Tile. Check out CUDA Tile resources, download CUDA Toolkit 13.1, and get started today.
Acknowledgements
Thanks to the following NVIDIA contributors: Jake Hemstad, Becca Zandstein, Jackson Marusarz, Kyrylo Perelygin, and Myrto Papadopoulou.