The latest release of the CUDA Toolkit, version 12.8, continues to push accelerated computing performance in data science, AI, scientific computing, and computer graphics and simulation, using the latest NVIDIA CPUs and GPUs. This post highlights some of the new features and enhancements included with this release:
- NVIDIA Blackwell architecture support
- CUDA Graphs conditional nodes enhancements
- Blackwell CUTLASS kernels for large language models (LLMs)
- NVIDIA Nsight Developer Tools updates
- Math libraries updates
- cudaStreamGetDevice
- Compiler updates
- Accelerated Python updates
- Feature-complete architectures
NVIDIA Blackwell architecture support
CUDA Toolkit 12.8 is the first version of the Toolkit to support the NVIDIA Blackwell architecture across the entire suite of Developer Tools including performance tools and profilers, libraries, and compilers. Built with 208 billion transistors—more than 2.5x the number of transistors in NVIDIA Hopper GPUs—Blackwell is the largest GPU ever built.
Key Blackwell capabilities supported include:
- Second-generation Transformer Engine through custom Tensor Core technology: Accelerates inference and training for LLMs and mixture-of-experts (MoE) models.
- Decompression: Accelerates performance on data analytics and data science pipelines using the latest compression formats such as LZ4, Snappy, and Deflate.
- Network interconnect: NVLink and NVLink Switches accelerate inter-GPU communications performance for trillion-parameter and multitrillion-parameter AI models.
To learn more about the leading innovations in Blackwell, see the NVIDIA Blackwell Architecture Technical Brief.
2x faster CUDA Graphs with runtime kernel selection for lower latency inference
With Blackwell, CUDA Graphs APIs continue to be the most efficient way to launch repeated invocations of sequences of GPU operations. CUDA Toolkit 12.8 introduces more enhancements to CUDA Graphs, including additional conditional node types.
In many applications, having dynamic control over the execution of work in CUDA Graphs can increase performance and flexibility of graph launches. For example, an algorithm that involves iterating over a series of operations many times until the result converges below a certain threshold can now run wholly on the GPU without needing CPU control management, reducing overhead by as much as 2x. CUDA Toolkit 12.8 improves APIs for runtime control of conditional graph nodes.
Conditional nodes contain segments of a graph that can execute, or be skipped, based on a condition to evaluate as the graph is running. Such segments can be evaluated once (an IF node), or repeatedly in a loop (a WHILE node). CUDA 12.8 adds support for two new types of conditional graph nodes: IF/ELSE combined nodes and SWITCH nodes.
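As a minimal sketch (assuming the CUDA 12.8 runtime graph APIs for conditional nodes; error checking omitted), an IF/ELSE combined node is created by requesting two body graphs: the first executes when the condition value is nonzero, the second when it is zero.

```cpp
#include <cuda_runtime.h>

int main() {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Handle through which device code later assigns the condition value.
    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph, /*defaultLaunchValue=*/0,
                                     cudaGraphCondAssignDefault);

    // An IF node with size 2 is the new IF/ELSE combined node.
    cudaGraphNodeParams params = {};
    params.type = cudaGraphNodeTypeConditional;
    params.conditional.handle = handle;
    params.conditional.type = cudaGraphCondTypeIf;
    params.conditional.size = 2;

    cudaGraphNode_t node;
    cudaGraphAddNode(&node, graph, /*pDependencies=*/nullptr, 0, &params);

    cudaGraph_t ifBody   = params.conditional.phGraph_out[0];  // condition != 0
    cudaGraph_t elseBody = params.conditional.phGraph_out[1];  // condition == 0
    // ... populate ifBody and elseBody with the work for each branch ...
    (void)ifBody; (void)elseBody;

    cudaGraphDestroy(graph);
    return 0;
}
```

A SWITCH node is built the same way, using cudaGraphCondTypeSwitch with size set to the number of branches.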
With the Blackwell architecture, we’ve improved LLM performance to benefit all reasoning models, including DeepSeek-R1. The enhanced SWITCH and IF/ELSE support in CUDA Graphs delivers 2x better performance for runtime kernel selection compared with returning to the CPU for launch decision-making.
- Training: By reducing CPU dependency for kernel selection, training workloads sustain even more GPU Tensor Core throughput, resulting in higher Model FLOPs Utilization (MFU). This improves performance using the same GPU infrastructure, reducing time and cost to train.
- Inference: For next-generation reasoning models that make use of test-time compute, a high token generation rate is critical, as each inference request can generate a vast number of tokens per query. The new CUDA 12.8 stream APIs enable fewer calls back to the host CPU, reducing the time between one kernel finishing and the next one starting and thereby increasing the token generation rate (see the sketch after this list). The result is more tokens generated in a fixed time budget, helping models reason more and increasing intelligence.
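The runtime-selection pattern referenced above can be sketched as follows (a hedged illustration rather than a definitive recipe; the kernel and buffer names are hypothetical): a small device kernel writes a branch index through the conditional handle, so a SWITCH node dispatches the chosen kernel variant without a round trip to the CPU.

```cpp
#include <cuda_runtime.h>

// Runs inside the graph, upstream of the SWITCH node: picks which branch
// (kernel variant) should execute based on data computed on the GPU.
__global__ void chooseKernel(cudaGraphConditionalHandle handle, const int* d_choice) {
    // A value of 0..size-1 selects that body graph; per the conditional-node
    // documentation, an out-of-range value executes no branch.
    cudaGraphSetConditional(handle, *d_choice);
}

// Adds a three-way SWITCH node that depends on the selection kernel's node.
void addSwitchNode(cudaGraph_t graph, cudaGraphConditionalHandle handle,
                   cudaGraphNode_t selectionNode) {
    cudaGraphNodeParams params = {};
    params.type = cudaGraphNodeTypeConditional;
    params.conditional.handle = handle;
    params.conditional.type = cudaGraphCondTypeSwitch;  // new in CUDA 12.8
    params.conditional.size = 3;  // one body graph per kernel variant

    cudaGraphNode_t switchNode;
    cudaGraphAddNode(&switchNode, graph, &selectionNode, 1, &params);
    // params.conditional.phGraph_out[0..2] now hold the per-branch body
    // graphs; populate each with one kernel variant.
}
```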
To learn more, see Dynamic Control Flow in CUDA Graphs with Conditional Nodes.
Blackwell CUTLASS kernels for LLMs
CUTLASS, since its 2017 debut, has been instrumental for researchers and developers implementing high-performance CUDA kernels on NVIDIA GPUs. By providing developers with comprehensive tools to design custom operations, such as GEMMs and Convolutions, CUTLASS has been critical for the development of hardware-aware algorithms, powering breakthroughs like FlashAttention that helped spark modern AI.
With the release of CUTLASS 3.8—which supports CUDA 12.8—NVIDIA is extending support to the Blackwell architecture, enabling developers to harness next-generation Tensor Cores with support for all new data types. This includes new narrow precision MX formats and the NVIDIA-developed FP4 format, which increase compute throughput. Figure 1 shows CUTLASS can achieve up to 98% relative peak performance for Tensor Core operations.

For DeepSeek-V3 and DeepSeek-R1, grouped GEMMs make up a large portion of the MoE compute required during inference. These operations enable different matrix sizes, scaling factors, and fusions to be grouped and parallelized in a single persistent-kernel launch. With CUTLASS on Blackwell with FP4, grouped GEMM kernel performance increases by up to 5x over H200 with FP16.
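To make the pattern concrete, here is a toy CUDA kernel illustrating the grouped-GEMM idea only; it is not the CUTLASS implementation (which uses Tensor Cores, narrow-precision data, and sophisticated tiling), just a sketch of how one persistent launch can cover many independently sized problems.

```cpp
#include <cuda_runtime.h>

// One entry per GEMM in the group; shapes may differ between entries.
struct GemmProblem {
    const float* A;  // M x K, row-major
    const float* B;  // K x N, row-major
    float*       C;  // M x N, row-major
    int M, N, K;
};

__global__ void groupedGemmNaive(const GemmProblem* problems, int numProblems) {
    // Persistent-style scheduling: the grid strides over the problem list,
    // so a single launch covers every group regardless of its shape.
    for (int p = blockIdx.x; p < numProblems; p += gridDim.x) {
        GemmProblem prob = problems[p];
        for (int idx = threadIdx.x; idx < prob.M * prob.N; idx += blockDim.x) {
            int i = idx / prob.N, j = idx % prob.N;
            float acc = 0.0f;
            for (int k = 0; k < prob.K; ++k)
                acc += prob.A[i * prob.K + k] * prob.B[k * prob.N + j];
            prob.C[i * prob.N + j] = acc;
        }
    }
}
```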

NVIDIA Nsight Developer Tools
NVIDIA Nsight Compute 2025.1 is the first official release with support for the Blackwell architecture. Updates include visualization of Blackwell Hardware Tensor Memory in the memory chart as well as Tensor Core performance data.

It also comes with several improvements to the increasingly popular range profiling feature. Users can now collect source-level metrics, including Instructions Executed and memory access information, inside profiled ranges. This update also enables Guided Analysis rules evaluation for ranges. This built-in expertise for identifying performance issues is a key component of NVIDIA Nsight Compute. This release reports kernel stack sizes and adds custom tooltips to help users understand their workload performance.
This release of Compute Sanitizer, an automatic correctness checking tool, adds support for Python call stacks to accurately locate kernel correctness issues when kernels are launched through Python applications. Additionally, new Tensor Core MMA guardrails for Blackwell can report errors related to Tensor Core programming. These are enabled by adding the PTXAS flag -g-tmem-access-check when compiling programs. Examples of common errors include access to unallocated tensor memory, invalid addresses, and invalid allocator usage.
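Assuming the flag is forwarded to ptxas through nvcc's -Xptxas option (the source file name and the sm_100a Blackwell target below are placeholders), enabling the guardrails might look like:

```
nvcc -arch=sm_100a -Xptxas -g-tmem-access-check app.cu -o app
```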
Math libraries updates
With CUDA Toolkit 12.8, we have several new library enhancements that leverage the new Blackwell architecture and help accelerate applications in AI, data science, graphics and simulation, and high-performance scientific computing.
New features
- cuBLAS
  - APIs were extended to support microscaled 4-bit and 8-bit floating point mixed-precision tensor core accelerated matrix multiplication for compute capability 10.0 (Blackwell) and higher.
  - Introduced initial support for CUDA in Graphics (CIG) on Windows x64 for the NVIDIA Ampere GPU architecture and Blackwell GeForce-class GPUs. CIG contexts are now autodetected, and cuBLAS selects kernels that comply with CIG shared memory usage limits.
- cuSOLVER now supports zsytrf/zsytrs, a complex symmetric direct solver without pivoting.
- nvJPEG now provides support for the Tegra architecture.
- NPP now provides support for the DRIVE Thor architecture.
cudaStreamGetDevice
Applications often use CUDA streams to provide ordered access to GPU resources. An instance of a CUDA stream is associated with a fixed CUDA device. In applications that address multiple devices, there are scenarios where getting a handle to the underlying device for a given stream is useful to tailor the application to device characteristics.
Previously, the CUDA API did not provide a mechanism for retrieving the device associated with a CUDA stream; developers had to track this themselves. The new cudaStreamGetDevice CUDA API retrieves the device associated with a given stream, which can simplify such applications.
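A minimal usage sketch (error checking omitted): retrieve the device behind a stream, then query its properties to tailor subsequent work.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int device = -1;
    cudaStreamGetDevice(stream, &device);  // new in CUDA 12.8

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    std::printf("Stream is bound to device %d: %s (%d SMs)\n",
                device, prop.name, prop.multiProcessorCount);

    cudaStreamDestroy(stream);
    return 0;
}
```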
Compiler updates
New compiler updates include the following:
- The CUDA Toolkit 12.8 release introduces support for GCC 14 as a host-side compiler.
- The default high-level optimizer is now based on LLVM 18 for the Blackwell architecture.
- nvdisasm now supports emitting JSON formatted SASS disassembly.
Accelerated Python updates
The following two beta releases are now available for Python users:
- CUDA Python has released an early prototype of a new idiomatic object model called cuda.core and moved the CUDA binding to a submodule, cuda.bindings. For more information, see the documentation in the NVIDIA/cuda-python GitHub repo.
- CUDA Core Compute Libraries (CCCL) has released early prototypes of Python for parallel and cooperative algorithms, enabling you to use thread-level parallelism with user-defined types and functions from pure Python code. Learn more about CCCL.
Additionally, the CuPy team is releasing a new version with Blackwell patches validated for general availability.
Feature-complete architectures
With the CUDA Toolkit 12.8 release, we now consider the Maxwell, Pascal, and Volta architectures to be feature-complete and support for them will be frozen in an upcoming release.
This means that, in future releases, no new features will be added to the driver to enable new CUDA Toolkit functionality supporting Maxwell, Pascal, and Volta architectures. End users will be able to run existing software stacks and applications on Maxwell, Pascal, and Volta architectures using the supported upcoming LTS driver branch through its lifecycle.
Starting with release 12.8, the offline compilers nvcc, nvrtc, and nvjitlink will output a warning message when targeting these architectures.
In the next major CUDA Toolkit release, offline compilation support for the Maxwell, Pascal, and Volta architectures will be removed from the compilers. The upcoming LTS driver for production application execution and JIT compilation of Maxwell, Pascal, and Volta applications will be supported for the normal 3-year LTS support window.
For more details, read the CUDA Toolkit 12.8 Release Notes.
Summary
The CUDA Toolkit 12.8 release provides full feature support for the NVIDIA Blackwell architecture. This release continues to provide enhanced support for the newest NVIDIA GPUs, accelerated libraries, compilers, and Developer Tools, whether you’re developing applications in C++ or Python.
Want more information? Check out the CUDA documentation, browse the latest NVIDIA Deep Learning Institute (DLI) offerings, and visit the NGC catalog. Ask questions and join the conversation in the CUDA Developer Forums.
Acknowledgments
Thanks to the following NVIDIA contributors: Stephen Jones, Jackson Marusarz, Becca Zandstein, Andy Terrel, Ashraf Eassa, Matt Nicely, and Mridula Prakash.