
CUDA Toolkit 11.8 New Features Revealed

NVIDIA announces the newest CUDA Toolkit software release, 11.8. This release focuses on enhancing the programming model and accelerating CUDA applications through new hardware capabilities.

New architecture-specific features in NVIDIA Hopper and Ada Lovelace are initially being exposed through libraries and framework enhancements. The full programming model enhancements for the NVIDIA Hopper architecture will be released starting with the CUDA Toolkit 12 family.

CUDA 11.8 has several important features. This post offers an overview of the key capabilities.

NVIDIA Hopper and NVIDIA Ada architecture support

CUDA applications can immediately benefit from increased streaming multiprocessor (SM) counts, higher memory bandwidth, and higher clock rates in new GPU families.

CUDA and CUDA libraries expose new performance optimizations based on GPU hardware architecture enhancements.

Lazy module loading

Building on the lazy kernel loading feature in 11.7, NVIDIA has extended lazy loading to CPU-side modules. As a result, functions and libraries load faster on the CPU, sometimes with substantial reductions in memory footprint. The tradeoff is a minimal amount of latency at the point in the application where a function is first loaded, which is still lower overall than the total latency incurred without lazy loading.

To be eligible for lazy loading, all libraries used by the application must be built with CUDA 11.7 or later.

Lazy loading is not enabled in the CUDA stack by default in this release. To evaluate it for your application, run with the environment variable CUDA_MODULE_LOADING=LAZY set.

Improved MPS signal handling

You can now terminate any application running in an MPS environment with SIGINT or SIGKILL without affecting the other running processes. While this is not true error isolation, the enhancement enables more fine-grained application control, especially in bare-metal data center environments.

FP8 support in math libraries for H100 GPUs

cuBLASLt exposes mixed-precision multiplication operations with the new FP8 data types. These operations also support BF16 and FP16 bias fusions, as well as FP16 bias with GELU activation fusions for GEMMs with FP8 input and output data types. The CUDA Math API provides FP8 conversions to facilitate the use of the new FP8 matrix multiplication operations.

NVIDIA JetPack installation simplification

NVIDIA JetPack provides a full development environment for hardware-accelerated AI-at-the-edge on Jetson platforms. Starting with CUDA Toolkit 11.8, Jetson users on NVIDIA JetPack 5.0 and later can upgrade to the latest CUDA versions, staying on par with CUDA desktop releases, without updating the NVIDIA JetPack version or Jetson Linux BSP (board support package).

For more information, see Simplifying CUDA Upgrades for NVIDIA Jetson Developers.

CUDA developer tool updates

Compute developer tools are designed in lockstep with the CUDA ecosystem to help you identify and correct performance issues.

Nsight Compute

Nsight Compute exposes low-level performance metrics, API debugging, and workload visualization to help you optimize CUDA kernels. CUDA 11.8 introduces new compute features to aid performance tuning activity on the NVIDIA Hopper architecture.

You can now profile and debug NVIDIA Hopper thread block clusters, which provide performance boosts and increased control over the GPU. Cluster tuning is being released in combination with profiling support for the Tensor Memory Accelerator (TMA), the NVIDIA Hopper rapid data transfer system between global and shared memory.

A new sample is included in Nsight Compute for CUDA 11.8 as well. The sample provides source code and precollected results that walk you through an entire workflow to identify and fix an uncoalesced memory access problem. Explore more CUDA samples to equip yourself with the knowledge to use toolkit features and solve similar cases in your own application.

Nsight Systems

Profiling with Nsight Systems can provide insight into issues such as GPU starvation, unnecessary GPU synchronization, insufficient CPU parallelizing, and expensive algorithms across the CPUs and GPUs. Understanding these behaviors and the load of deep learning frameworks, such as PyTorch and TensorFlow, helps you tune your models and parameters to increase overall single or multi-GPU utilization.

Other tools

Also included in the CUDA Toolkit, both CUDA-GDB, for CPU and GPU thread debugging, and Compute Sanitizer, for functional correctness checking, now support the NVIDIA Hopper architecture.


In summary, the CUDA Toolkit 11.8 release includes the following features:

  • First release supporting NVIDIA Hopper and NVIDIA Ada Lovelace GPUs
  • Lazy module loading extended to support lazy loading of CPU-side modules in addition to device-side kernels
  • Improved MPS signal handling for interrupting and terminating applications
  • NVIDIA JetPack installation simplification
  • CUDA developer tool updates
