The newest update to the CUDA Toolkit, version 13.0, features advancements to accelerate computing on the latest NVIDIA CPUs and GPUs. As a major release, it lays the foundation for all future developments coming to the full CUDA 13.X software lineup. You can access these new features now.
This post highlights some of the new features and enhancements included with this release:
- Building the foundation for tile-based programming in CUDA
- Unification of the developer experience on Arm platforms, especially DGX Spark
- Updated OS and platform support, including Red Hat Enterprise Linux 10
- NVIDIA Nsight Developer Tools updates
- Math library updates for linear algebra and FFT
- NVCC compiler updates, including an improved fatbin compression scheme and support for GCC 15 and Clang 20
- Accelerated Python cuda.core release and developer-friendly packaging
- Feature-complete architectures
- Updated vector types with 32-byte alignment for increased performance on Blackwell
- Support for Jetson Thor
Blackwell GPUs supported by CUDA 13.0
The Blackwell architecture, first supported in CUDA Toolkit 12.8, continues to improve in performance and capability. CUDA 13.0 supports the latest Blackwell GPUs, including:
- B200 and GB200
- B300 and GB300
- RTX PRO Blackwell series
- GeForce RTX 50 series
- Jetson Thor
- DGX Spark
What’s in CUDA 13.0 and beyond
Each new CUDA release delivers performance gains and improves programmability across the entire stack. From the beginning, CUDA has embraced a thread-parallel model using Single Instruction, Multiple Threads (SIMT). Now, with CUDA 13.0, we’re laying the foundation for a second, complementary model: tile-based programming.
Tile (or array) programming models are already common in many high-level languages, with Python being a prime example. When working with NumPy, you can apply simple, expressive commands to entire arrays or matrices, and the system handles the low-level execution. This abstraction boosts productivity by letting you focus on the what, not the how—designing performant algorithms without managing thread-level detail.
At GTC 2025, NVIDIA announced plans to bring this tile programming model to CUDA. This is a major step forward for developer productivity and hardware efficiency.

In the tile programming model, you define tiles of data and specify operations over those tiles. The compiler and runtime take care of distributing work across threads and optimizing hardware usage. This higher-level abstraction frees you from managing low-level thread behavior while still unlocking full GPU performance.
Crucially, the tile model maps naturally onto Tensor Cores. The compiler handles tile memory management and operation mapping, enabling programs written today to take advantage of current and future Tensor Core architectures. This ensures forward compatibility: Write once, run fast—now and on GPUs to come.
The tile programming model will be available at two levels:
- High-level APIs and Domain-Specific Languages (DSLs) – Programmers can use tiles directly in Python, C++, and other languages.
- Intermediate Representation (IR) – Compiler and tool developers can target a new CUDA Tile IR backend, allowing them to take advantage of the tile model’s performance and hardware features.
CUDA 13.0, as a major release, introduces low-level infrastructure changes necessary to support this model. While most changes are invisible to end users, they lay the groundwork for a new way of programming GPUs: one that combines ease-of-use with maximum performance and long-term portability.
Unifying CUDA for Arm: Build once, deploy anywhere
With CUDA 13.0, NVIDIA is streamlining development for Arm platforms by unifying the CUDA toolkit across server-class and embedded devices. Going forward, you will no longer need to maintain separate installations or toolchains for Server Base System Architecture (SBSA)-compliant servers and next-generation embedded systems like Thor. Instead, a single CUDA install will support all Arm targets—with the exception of Orin (sm_87), which will continue on its current path for now.
This change unlocks a major productivity win. You can now build a robotics or AI application once, simulate it on high-performance systems like DGX Spark, and deploy the exact same binary—without any code changes—directly onto embedded targets like Thor. The barriers between simulation and deployment are gone.
The old way: two worlds, two toolchains
Previously, developing for both Arm servers and NVIDIA’s embedded platforms meant juggling parallel ecosystems. Developers targeting SBSA platforms, like Grace-based servers or Arm workstations, used the standard aarch64 CUDA Toolkit. This came with its own sysroots, libraries, and container images, among other items.
Meanwhile, development for embedded platforms like Jetson relied on JetPack and the L4T software stack, which also included its own customized CUDA components, header layouts, and board support packages—and often required cross-compilation from x86 or Grace systems. Many teams had to maintain separate build scripts, Continuous Integration (CI) jobs, and container registries, just to support what was often the same application logic. Moving a project from development to simulation to deployment involved redundant work, custom glue code, and inevitable mismatches between versions and configurations.
The new way: one toolkit, all targets
CUDA 13.0 marks a turning point. With the unified toolkit, developers can target both SBSA servers and future embedded platforms from a single CUDA install. The same compiler, headers, and libraries now apply across the board, and switching targets is simply a matter of building for the right compute architecture (e.g., sm_XX), not swapping SDKs.
This also extends to containers. NVIDIA is consolidating its image ecosystem so that simulation, testing, and deployment workflows can rely on a shared container lineage. That means fewer rebuilds, less CI overhead, and a smoother path from code to hardware.
Importantly, all of this comes with zero sacrifice in performance or flexibility. The compiler and runtime still generate optimized code for the target GPU architecture; you simply don’t have to manage two toolchains to get there.
Why this matters
For developers, this unification is a major boost in productivity. It reduces duplication in CI pipelines, simplifies container management, and eliminates the subtle bugs and inconsistencies that come from juggling different SDKs. Teams can now focus on what they care about—algorithms, performance, and deployment—not toolchain wiring.
For organizations, the value is even broader. A single source of truth for builds across simulation and edge platforms means less engineering time spent on infrastructure and more focus on innovation. The portability built into this new model future-proofs applications across evolving GPU generations and platforms.
OS and platform support
CUDA Toolkit 13.0 has been qualified and tested against the following new operating systems. Note that this is not the complete support matrix. Please refer to the release notes or the CUDA Installation Guides for Windows or Linux for the complete list.
- Red Hat Enterprise Linux 10.0 and 9.6
- Debian 12.10
- Fedora 42
- Rocky Linux 10.0 and 9.6
Developer tools
NVIDIA Nsight Compute 2025.3 adds the Instruction Mix and Scoreboard Dependency tables to the source view. This allows users to pinpoint source lines impacted by long dependency stalls and identify the input and output dependency locations more efficiently. These tables also break down the instruction types within a source line (Floating Point, Integer, Data Movement, etc.) to provide more details on the root cause of a dependency stall.

Additionally, there is a new “Throughput Breakdown” section in the Metric Details window that shows the individual unit throughput metrics. Any of these units could be a limiting factor of throughput, and this window helps users understand how each of these is performing.
Math libraries
We’ve added new features to the CUDA Toolkit math libraries including:
- cuBLAS: Improved performance for BLAS L3 (non-GEMM) kernels (SYRK, HERK, TRMM, and SYMM) with FP32 and CF32 precisions on NVIDIA Blackwell GPUs (see the sketch after this list)
- cuSPARSE: Now supports 64-bit index matrices in SpGEMM computations
- cuSOLVER: A new math mode leverages the improved performance of emulated FP32 arithmetic on Blackwell GPUs. Control this feature using the new APIs: cusolverDnSetMathMode(), cusolverDnGetMathMode(), cusolverDnSetEmulationStrategy(), and cusolverDnGetEmulationStrategy(). Additionally, cusolverDnXsyevBatched now offers performance boosts for matrices up to size n<=32 on Blackwell GPUs, thanks to an internal algorithm switch. You can revert to the old algorithm for all problem sizes using cusolverDnSetAdvOptions. See the cusolverDnXsyevBatched documentation for more details.
- cuFFT: Enhanced performance for single-precision C2C multi-dimensional FFTs and certain large power-of-2 FFTs
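To illustrate the cuBLAS item above, here is a minimal sketch of an FP32 rank-k update through the long-standing cublasSsyrk API. The sizes and data are placeholders and error checking is omitted; the point is that existing Level-3 calls like this pick up the Blackwell improvements in CUDA 13.0 without any code changes.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 1024, k = 512;
    const float alpha = 1.0f, beta = 0.0f;

    // Allocate and initialize device matrices (placeholder data).
    float *dA = nullptr, *dC = nullptr;
    cudaMalloc(&dA, sizeof(float) * n * k);
    cudaMalloc(&dC, sizeof(float) * n * n);
    std::vector<float> hA(static_cast<size_t>(n) * k, 1.0f);
    cudaMemcpy(dA, hA.data(), sizeof(float) * n * k, cudaMemcpyHostToDevice);
    cudaMemset(dC, 0, sizeof(float) * n * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // FP32 rank-k update: C = alpha * A * A^T + beta * C.
    // On Blackwell GPUs, CUDA 13.0 speeds up this and other Level-3
    // kernels (HERK, TRMM, SYMM) transparently.
    cublasSsyrk(handle, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n, k, &alpha, dA, n, &beta, dC, n);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dC);
    return 0;
}
```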
Updates to NVCC
New compiler updates include the following:
- The CUDA Toolkit 13.0 release introduces support for GCC 15 and Clang 20 as host compilers, and removes support for ICC and MSVC 2017.
- Two significant changes to the NVIDIA CUDA Compiler Driver (NVCC) will impact ELF visibility and linkage for __global__ functions and device variables. More details are available in this detailed technical blog.
- In 13.0, we have introduced a new language feature to the NVIDIA CUDA Compiler Driver (NVCC) and the NVIDIA Parallel Thread Execution ISA to enhance separate compilation. These changes allow programmers to specify a performant custom ABI (Application Binary Interface) on any __device__ function that is compiled separately.
CUDA 13.0 improves default compression with ZStandard (ZStd)
CUDA Toolkit 13.0 changes the default compression scheme for fatbins to rely on Zstandard, a state-of-the-art compression algorithm that yields better compression ratios than our old compression based on LZ4. Fatbins, otherwise known as NVIDIA device code fat binaries, are containers that store multiple versions of code for different architectures. Here at NVIDIA, we use them to bundle GPU code for multiple architectures, such as sm_89 and sm_100, to maximize compatibility and runtime performance by ensuring optimized versions of code are available for each architecture we target.
Every time you create an executable file with CUDA—whether through NVCC or our HPC compiler, NVC++—the device code gets packaged inside of a fatbin. Because we have multiple copies of the same code, we compress each copy during compilation and only decompress the relevant entry at runtime, saving binary size.
We recognize that binary size is an important issue. In CUDA Toolkit 12.8, we introduced the --compress-mode option to NVCC (and similar options in fatbinary and nvFatbin), which allowed you to choose various modes of compression to help deal with this. However, because of when it was developed, we could not guarantee that all options would work across all versions of the drivers compatible with the CUDA 12.X Toolkits.
With 13.0, we can now guarantee that these options are compatible with all drivers compatible with 13.X toolkits. More importantly, that gave us the opportunity to switch the default compression mode to match the balance compress mode, which relies on Zstandard. This improves compression ratios with negligible slowdown in execution time.

In some libraries, we noticed little difference between using ZStd and LZ4, even when the size mode was used. We believe these libraries are mostly dominated by data that compression does not affect, such as host code. In other cases, we saw more significant reductions in size: for example, the new default removes 17% of the size of the CUDA Math APIs. Trying the size compress mode, which also uses ZStd, can yield even more dramatic improvements, such as a 71% size reduction for the CUDA Math APIs.
We saw no overall size regressions at the library level, and similarly, we also saw no significant geomean regression in execution time in our testing.
We know that in some applications, decompression time remains a critical issue. For these cases, we are still maintaining the speed option for --compress-mode. It keeps the compression scheme used before 13.0, based on LZ4, which targets decompression time over compression ratio. In addition, we know the compression ratio could still be improved further for some applications. For those, we are still maintaining the size option for --compress-mode, which uses an even more aggressive compression scheme, also based on Zstandard. If you don’t want any compression, we continue to provide --compress-mode=none and --no-compress.
Changes to these defaults are reflected in NVCC, NVC++, fatbinary, and nvFatbin, among others. To start benefitting from these improvements, simply start using CUDA Toolkit 13.0 today.
Accelerated Python
Early release of core object model for CUDA Python
cuda.core is a component of the cuda-python project, which provides Pythonic access to NVIDIA’s CUDA platform. It specifically focuses on offering intuitive Python APIs for core CUDA functionalities, including runtime control, compiler, and linker features. This allows Python developers to leverage the performance benefits of GPU acceleration without needing to write extensive CUDA C++ code.
Key aspects of cuda.core:
- Pythonic API: It aims to provide a user-friendly and Python-idiomatic interface to CUDA’s core features.
- Core CUDA Functionality: It provides access to essential CUDA runtime and development tools from within Python.
- Interoperability: It is designed to work seamlessly with other Python GPU libraries and frameworks within the CUDA ecosystem, such as Numba and CuPy.
Wheel package changes to CUDA 13.0
Recently, we identified areas for improvement in the consumption of CUDA Toolkit wheel packages. These changes cover how wheels unpack Toolkit components into the file directory structure, the removal of CUDA version tags from package names, and a new metapackage for installing only parts of the toolkit.
Wheels are the only platform where each component of the CUDA Toolkit is shipped in its own folder, e.g., site-packages/nvidia/cublas for cuBLAS and site-packages/nvidia/cuda_cccl for CCCL. This makes it difficult to link and ship many CUDA Toolkit wheels, because you have to search for each component separately when creating Python packages that depend on CUDA Toolkit wheels. Starting with CUDA Toolkit 13.0 wheels, we will ship all of the CUDA Toolkit artifacts in the same folder to make the wheel file layout closer to the file layouts on other platforms. For example, site-packages/nvidia/cu13/include will contain headers for both cuBLAS and cuFFT (if installed).
The versioned CUDA Toolkit subdirectory (which we use on other platforms) also makes it easier to ensure that components are not mismatched between CUDA Toolkit versions. Some non-CUDA-Toolkit packages from NVIDIA, which depend on the CUDA Toolkit, will also start using the new package layout to increase user convenience.
To ensure users were able to download the correct CUDA version of a toolkit component, we tagged package names with a CUDA suffix, such as “cu12” in nvidia-cuda-cccl-cu12 for installing CCCL. Going forward, CUDA Toolkit package names will drop the CUDA tag suffix; i.e., the CCCL package will be named nvidia-cuda-cccl. This allows upstream libraries to use the version of the package to select the CUDA version. Non-CUDA-Toolkit packages will continue to use the suffix so they can be built for multiple CUDA versions.
A new cuda-toolkit meta-package can be used to install all or part of the CUDA Toolkit for a given version. For example, pip install cuda-toolkit[cublas,cudart]==13.0 will install nvidia-cublas and nvidia-cuda-runtime for the 13.0 CUDA Toolkit release, and pip install cuda-toolkit[all] will install all components of the CUDA Toolkit. Versions of the cuda-toolkit meta-package have also been released for previous CUDA versions to make it easier to migrate to the meta-package.
CUDA Core Compute Library (CCCL)
CUDA Toolkit 13.0 includes version 3.0 of the CUDA Core Compute Library (CCCL). CCCL is the unification of the formerly separate Thrust, CUB, and libcudacxx libraries. It provides modern abstractions that simplify CUDA programming.
Key changes in this release:
CCCL headers have moved in CUDA 13.0
Starting with 13.0, the CCCL headers bundled with the CUDA Toolkit installation have moved to new top-level directories under ${CTK_ROOT}/include/cccl/:
| Before CUDA 13.0 | After CUDA 13.0 |
| --- | --- |
| ${CTK_ROOT}/include/cuda/ | ${CTK_ROOT}/include/cccl/cuda/ |
| ${CTK_ROOT}/include/cub/ | ${CTK_ROOT}/include/cccl/cub/ |
| ${CTK_ROOT}/include/thrust/ | ${CTK_ROOT}/include/cccl/thrust/ |
As a result, you may see errors about missing headers such as <thrust/device_vector.h> or <cub/cub.cuh>. To ensure a smooth transition:
- ❌ Do not write #include <cccl/...> — this will break.
- If including CCCL headers only in files compiled with nvcc:
  - ✅ No action needed. (This is the common case for most users; a minimal example appears below.)
- If using CCCL headers in files compiled exclusively with a host compiler (e.g., GCC, Clang, MSVC):
  - Using CMake and linking CCCL::CCCL:
    - ✅ No action needed. (This is the recommended path. See example.)
  - Other build systems:
    - ⚠️ Add ${CTK_ROOT}/include/cccl to your compiler’s include search path (e.g., with -I).
These changes prevent issues when mixing CCCL headers bundled with the CUDA Toolkit and those from external package managers. For more detail, see the CCCL 2.x to 3.0 Migration Guide.
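For the common nvcc-only case referenced above, here is a self-contained sketch of a .cu file whose CCCL includes are unchanged from CUDA 12.x; compiling it with nvcc from CUDA Toolkit 13.0 needs no extra include flags. The file name and values are illustrative.

```cpp
// saxpy_reduce.cu -- compile with: nvcc saxpy_reduce.cu
// Includes are unchanged from CUDA 12.x; do NOT prefix them with <cccl/...>.
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
    using namespace thrust::placeholders;
    const int n = 1 << 20;
    thrust::device_vector<float> x(n, 1.0f);
    thrust::device_vector<float> y(n, 2.0f);

    // y = 2*x + y, expressed with a Thrust placeholder expression.
    thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), 2.0f * _1 + _2);

    // Reduce on the device and print the sum (expected: 4 * n).
    float sum = thrust::reduce(y.begin(), y.end(), 0.0f, thrust::plus<float>());
    std::printf("sum = %f\n", sum);
    return 0;
}
```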
Breaking changes
CCCL 3.0 includes a number of breaking changes, but most clean up internal details or remove functionality that now has a superior replacement. There should be minimal impact on most users. See our migration guide and don’t hesitate to reach out on GitHub if you run into issues.
Updated requirements
- CCCL 3.0 now requires C++17 or newer
- Supported host compilers: GCC 7+, Clang 14+, MSVC 2019+
Jetson
CUDA 13 introduces support for the open source GPU driver on Jetson, migrating from the traditional mobile-RM driver to the open source GPU driver starting with the Thor SoC. This unification will enable future concurrent usage of integrated GPUs (iGPU) and discrete GPUs (dGPU) on Jetson and IGX platforms, providing a seamless and efficient computing experience.
Starting with CUDA 13.0, Thor-based Jetson platforms will support Unified Virtual Memory (UVM) and full coherence. This also enables the device to access pageable host memory via the host’s page tables. The GPU access to this CPU-cached memory is also cached on GPU, with full coherence managed by the hardware interconnect. This means that system-allocated memory via mmap() or malloc() can be used directly on the GPU.
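As a sketch of what this enables, the following kernel operates directly on memory obtained from plain malloc(). It assumes a device that reports pageable memory access (as Thor does with CUDA 13.0 and full coherence) and guards on that attribute; it is not meant as a portable pattern for GPUs without this capability.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Simple kernel that increments each element of a host-allocated buffer.
__global__ void increment(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    // Check that the device can access pageable host memory directly.
    int pageable = 0;
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);
    if (!pageable) {
        std::printf("Pageable memory access not supported on this device.\n");
        return 0;
    }

    const int n = 1 << 20;
    // Ordinary system allocation -- no cudaMalloc or cudaMallocManaged needed.
    int *data = static_cast<int *>(std::malloc(n * sizeof(int)));
    for (int i = 0; i < n; ++i) data[i] = i;

    // The GPU dereferences the malloc'd pointer directly; on platforms like
    // Thor, the hardware interconnect keeps CPU and GPU caches coherent.
    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    std::printf("data[42] = %d (expected 43)\n", data[42]);
    std::free(data);
    return 0;
}
```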
CUDA 13.0 brings Green contexts to Jetson. Green contexts are lightweight contexts that improve determinism by pre-assigning GPU resources to contexts. This feature provides a degree of resource isolation, allowing each context to run without interference from others.
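Green contexts reuse the CUDA driver API that first appeared in the 12.x series. The outline below carves a group of SMs out of device 0 and creates a green context plus a stream bound to it; the SM count is illustrative, error checks are omitted, and the exact signatures and Thor-specific behavior should be confirmed against the CUDA driver API documentation.

```cpp
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // Query the SM resources of the device.
    CUdevResource sm_resource;
    cuDeviceGetDevResource(dev, &sm_resource, CU_DEV_RESOURCE_TYPE_SM);

    // Split off one group with at least 8 SMs; the rest stays in 'remaining'.
    CUdevResource group, remaining;
    unsigned int nb_groups = 1;
    cuDevSmResourceSplitByCount(&group, &nb_groups, &sm_resource, &remaining,
                                0 /*useFlags*/, 8 /*minCount*/);

    // Generate a resource descriptor and create the green context over it.
    CUdevResourceDesc desc;
    cuDevResourceGenerateDesc(&desc, &group, 1);

    CUgreenCtx gctx;
    cuGreenCtxCreate(&gctx, desc, dev, CU_GREEN_CTX_DEFAULT_STREAM);

    // Work submitted to a stream created from this green context runs only
    // on the pre-assigned SMs, isolating it from other contexts.
    CUstream stream;
    cuGreenCtxStreamCreate(&stream, gctx, CU_STREAM_NON_BLOCKING, 0);

    // ... launch work on 'stream' ...

    cuStreamDestroy(stream);
    cuGreenCtxDestroy(gctx);
    return 0;
}
```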
CUDA 13 also brings enhancements to developer tools for Jetson, including support for NVML and the nvidia-smi utility. These tools provide developers with better insights and control over GPU resources, enabling more efficient and effective development.
All of these features and more will be available on the JetPack 7.0 release on Thor. Stay tuned for the upcoming “CUDA on JetPack 7” blog post for more details.
Feature-complete architectures
GPU architectures prior to Turing (compute capability 7.5) are considered feature-complete, and as such, CUDA 13.0 removes support for offline compilation targeting GPUs with compute capability lower than 7.5. Additionally, the R580 branch of the NVIDIA Driver will be the last driver branch to support these architectures.
This means that application developers looking to support pre-7.5 architectures will need to use CUDA 12.9 or earlier to build the applications, and users of these applications will need to stay on the 580 driver branch. All previously released CUDA Toolkits can be obtained from our CUDA download archive page. Note that the 580 branch is a long-term support branch that will be maintained and supported for three years. We provide a more detailed discussion of these changes in a blog post.
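Applications that need to detect these architectures at runtime (for example, to direct users to a CUDA 12.9-built binary and the 580 driver branch) can query the compute capability with standard runtime attributes, as in this minimal sketch:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    // Query the compute capability of the current device.
    int major = 0, minor = 0;
    cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
    cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);

    // Architectures below compute capability 7.5 (pre-Turing) are
    // feature-complete: binaries for them must be built with CUDA 12.9 or
    // earlier and run on the R580 long-term support driver branch.
    if (major < 7 || (major == 7 && minor < 5)) {
        std::printf("Compute capability %d.%d: use a CUDA 12.9-built binary "
                    "and the 580 driver branch.\n", major, minor);
    } else {
        std::printf("Compute capability %d.%d: supported by CUDA 13.0.\n",
                    major, minor);
    }
    return 0;
}
```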
Updated vector types
double4, long4, ulong4, longlong4, and ulonglong4 are vector data types originally introduced with 16-byte alignment. The Blackwell architecture introduces new support for 256-bit loads/stores. Using 32-byte alignment for these types can provide performance improvements for memory-bound applications.
To facilitate seamless use of these vector types with 32-byte aligned boundaries, we are adding additional vector types that explicitly specify the alignment. If you wish to use these vector types with 32-byte alignment, replace double4 with double4_32a, for example. And if you wish to continue using these vector types with 16-byte alignment, replace double4 with double4_16a, for instance.
Starting with CUDA 13.0, the use of the vector types double4, long4, ulong4, longlong4, and ulonglong4 will trigger a deprecation warning encouraging you to replace these types with the _16a or _32a variants, depending on your alignment requirements.
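As a brief sketch, the new explicitly aligned types can be dropped in where the legacy names were used. The static_asserts document the alignment contract described above, and the memory-bound copy kernel is the kind of code where Blackwell's wider 256-bit accesses can pay off; sizes are illustrative and error checking is omitted.

```cpp
#include <cuda_runtime.h>

// The _32a variant is 32-byte aligned, enabling 256-bit loads/stores on
// Blackwell; the _16a variant preserves the pre-13.0 16-byte alignment.
static_assert(alignof(double4_32a) == 32, "expected 32-byte alignment");
static_assert(alignof(double4_16a) == 16, "expected 16-byte alignment");

// Copy kernel operating on 32-byte-aligned double4 elements; a memory-bound
// kernel like this is where the wider accesses can help.
__global__ void copy(const double4_32a *__restrict__ in,
                     double4_32a *__restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const int n = 1 << 20;
    double4_32a *in = nullptr, *out = nullptr;
    // cudaMalloc returns suitably aligned pointers, satisfying the
    // 32-byte alignment requirement of double4_32a.
    cudaMalloc(&in, n * sizeof(double4_32a));
    cudaMalloc(&out, n * sizeof(double4_32a));
    copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```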
Changes to cudaDeviceProp
CUDA 13.0 introduces changes to the members of the cudaDeviceProp struct. This struct is often used by developers to query the capabilities of the GPU and then make runtime decisions. The table below indicates the removed members of the struct and the replacement API or field that should be used instead, if it exists. For more information, see the release notes and the cudaDeviceProp information in the CUDA Runtime API.
| Removed Field | Replacement API |
| --- | --- |
| clockRate | cudaDeviceGetAttribute(cudaDevAttrClockRate) |
| deviceOverlap | Use the asyncEngineCount field |
| kernelExecTimeoutEnabled | cudaDeviceGetAttribute(cudaDevAttrKernelExecTimeout) |
| computeMode | cudaDeviceGetAttribute(cudaDevAttrComputeMode) |
| maxTexture1DLinear | cudaDeviceGetTexture1DLinearMaxWidth() |
| memoryClockRate | cudaDeviceGetAttribute(cudaDevAttrMemoryClockRate) |
| singleToDoublePrecisionPerfRatio | cudaDeviceGetAttribute(cudaDevAttrSingleToDoublePrecisionPerfRatio) |
| cooperativeMultiDeviceLaunch | No replacement available |
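For example, code that previously read the removed clockRate and computeMode fields can switch to the corresponding attribute queries, as in this minimal sketch; the same pattern applies to the other rows of the table.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int device = 0;
    cudaGetDevice(&device);

    // Before CUDA 13.0: cudaDeviceProp prop; cudaGetDeviceProperties(&prop, device);
    //                   int khz = prop.clockRate;
    // From CUDA 13.0 on, query the attribute directly instead:
    int clock_khz = 0;
    cudaDeviceGetAttribute(&clock_khz, cudaDevAttrClockRate, device);

    int compute_mode = 0;
    cudaDeviceGetAttribute(&compute_mode, cudaDevAttrComputeMode, device);

    std::printf("GPU clock: %d kHz, compute mode: %d\n", clock_khz, compute_mode);
    return 0;
}
```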
Summary
As a major release, CUDA Toolkit 13.0 lays the foundation for a new tile-based programming model that will enhance programmer productivity and performance on our latest (and future) hardware. And CUDA 13.0 continues to provide enhanced support for the newest NVIDIA GPUs with accelerated libraries, compilers, and developer tools.
Want more information? Check out the CUDA documentation, browse the latest NVIDIA Deep Learning Institute (DLI) offerings, and visit the NGC catalog. Ask questions and join the conversation in the CUDA Developer Forums.
Acknowledgments
Thanks to the following NVIDIA contributors: Andy Terrel, Jake Hemstad, Becca Zandstein, Mridula Prakash, Jackson Marusarz, Emma Smith, and Rekha Mukund.