CUDA Toolkit 11.8 New Features Revealed

Oct 04, 2022

By Rob Armstrong, Rob Nertney, Rekha Mukund and Fred Oh

NVIDIA announces the newest CUDA Toolkit software release, 11.8. This release is focused on enhancing the programming model and CUDA application speedup through new hardware capabilities.

New architecture-specific features in NVIDIA Hopper and Ada Lovelace are initially being exposed through libraries and framework enhancements. The full programming model enhancements for the NVIDIA Hopper architecture will be released starting with the CUDA Toolkit 12 family.

CUDA 11.8 has several important features. This post offers an overview of the key capabilities.

NVIDIA Hopper and NVIDIA Ada architecture support

CUDA applications can immediately benefit from increased streaming multiprocessor (SM) counts, higher memory bandwidth, and higher clock rates in new GPU families.

CUDA and CUDA libraries expose new performance optimizations based on GPU hardware architecture enhancements.

Lazy module loading

Building on the lazy kernel loading feature in CUDA 11.7, NVIDIA added lazy loading to the CPU module side. Functions and libraries now load faster on the CPU, sometimes with substantial reductions in memory footprint. The tradeoff is a small amount of latency at the point in the application where each function is first loaded, but the overall latency is still lower than it would be without lazy loading.

Libraries must be built with CUDA 11.7 or later to be eligible for lazy loading.

Lazy loading is not enabled in the CUDA stack by default in this release. To evaluate it for your application, run with the environment variable CUDA_MODULE_LOADING=LAZY set.
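One way to try this from a launcher script is to set the variable only for the child process, leaving the parent environment untouched. The following is a minimal Python sketch under stated assumptions: the helper names `lazy_loading_env` and `run_with_lazy_loading` are hypothetical, and a small Python child process stands in for a real CUDA application so that the example runs without a GPU.

```python
import os
import subprocess
import sys


def lazy_loading_env(base=None):
    """Return a copy of the environment with lazy module loading requested."""
    env = dict(os.environ if base is None else base)
    env["CUDA_MODULE_LOADING"] = "LAZY"  # variable name and value from the post
    return env


def run_with_lazy_loading(cmd):
    """Run cmd with CUDA_MODULE_LOADING=LAZY visible to the child only."""
    return subprocess.run(
        cmd, env=lazy_loading_env(), capture_output=True, text=True, check=True
    )


# Stand-in for a CUDA application: a child process that echoes the variable.
# Replace this command with your own executable to evaluate lazy loading.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_MODULE_LOADING'])"]
result = run_with_lazy_loading(child)
print(result.stdout.strip())
```

Scoping the variable to one child process makes it straightforward to compare startup time and memory footprint with and without lazy loading from the same script.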

Improved MPS signal handling

You can now terminate any application running in an MPS environment with SIGINT or SIGKILL without affecting other running processes. While not true error isolation, this enhancement enables more fine-grained application control, especially in bare-metal data center environments.

FP8 support in math libraries for H100 GPUs

cuBLASLt exposes mixed-precision multiplication operations with the new FP8 data types. These operations also support BF16 and FP16 bias fusions, as well as FP16 bias with GELU activation fusions for GEMMs with FP8 input and output data types. The CUDA Math API provides FP8 conversions to facilitate the use of the new FP8 matrix multiplication operations.
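To give a feel for what converting to FP8 costs in range and precision, here is a small Python sketch that decodes 8-bit patterns and rounds a value to the nearest representable one. It is an illustration only and does not use the CUDA Math API or cuBLASLt. It assumes the two encodings commonly called E4M3 and E5M2, which the post itself does not name. The `quantize` helper is hypothetical, saturates out-of-range values to the largest finite value, and breaks ties arbitrarily, which may differ from hardware rounding.

```python
import math
from typing import NamedTuple


class Fp8Format(NamedTuple):
    exp_bits: int
    man_bits: int
    bias: int
    ieee_specials: bool  # True: top exponent encodes inf/NaN, as in IEEE 754


# Assumed encodings (not spelled out in the post).
E4M3 = Fp8Format(exp_bits=4, man_bits=3, bias=7, ieee_specials=False)
E5M2 = Fp8Format(exp_bits=5, man_bits=2, bias=15, ieee_specials=True)


def decode(byte: int, fmt: Fp8Format) -> float:
    """Interpret one byte as sign | exponent | mantissa in the given format."""
    sign = -1.0 if byte & 0x80 else 1.0
    man_mask = (1 << fmt.man_bits) - 1
    exp_max = (1 << fmt.exp_bits) - 1
    exp = (byte >> fmt.man_bits) & exp_max
    man = byte & man_mask
    if exp == exp_max:
        if fmt.ieee_specials:
            return sign * math.inf if man == 0 else math.nan
        if man == man_mask:
            return math.nan  # E4M3 reserves only the all-ones pattern for NaN
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * man * 2.0 ** (1 - fmt.bias - fmt.man_bits)
    return sign * (1 + man / (1 << fmt.man_bits)) * 2.0 ** (exp - fmt.bias)


def quantize(x: float, fmt: Fp8Format) -> float:
    """Nearest finite representable value (brute force over all 256 codes)."""
    finite = [v for v in (decode(b, fmt) for b in range(256)) if math.isfinite(v)]
    return min(finite, key=lambda v: abs(v - x))


for name, fmt in (("E4M3", E4M3), ("E5M2", E5M2)):
    largest = max(v for v in (decode(b, fmt) for b in range(256)) if math.isfinite(v))
    print(name, "max finite:", largest, "| 1000.0 ->", quantize(1000.0, fmt))
```

Under these assumptions, E4M3 trades range for precision (its largest finite value is 448), while E5M2 reaches 57344 with one fewer mantissa bit. This is why scaling factors usually accompany FP8 matrix multiplications.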

NVIDIA JetPack installation simplification

NVIDIA JetPack provides a full development environment for hardware-accelerated AI-at-the-edge on Jetson platforms. Starting from CUDA Toolkit 11.8, Jetson users on NVIDIA JetPack 5.0 and later can upgrade to the latest CUDA versions without updating the NVIDIA JetPack version or Jetson Linux BSP (board support package) to stay on par with the CUDA desktop releases.

For more information, see Simplifying CUDA Upgrades for NVIDIA Jetson Developers.

CUDA developer tool updates

Compute developer tools are designed in lockstep with the CUDA ecosystem to help you identify and correct performance issues.

Nsight Compute

In Nsight Compute, you can expose low-level performance metrics, debug API calls, and visualize workloads to help optimize CUDA kernels. New compute features are being introduced in CUDA 11.8 to aid performance tuning activity on the NVIDIA Hopper architecture.

You can now profile and debug NVIDIA Hopper thread block clusters, which provide performance boosts and increased control over the GPU. Cluster tuning is being released in combination with profiling support for the Tensor Memory Accelerator (TMA), the NVIDIA Hopper rapid data transfer system between global and shared memory.

A new sample is included in Nsight Compute for CUDA 11.8 as well. The sample provides source code and precollected results that walk you through an entire workflow to identify and fix an uncoalesced memory access problem. Explore more CUDA samples to equip yourself with the knowledge to use toolkit features and solve similar cases in your own application.
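As background on why uncoalesced access matters, the following Python sketch counts how many memory sectors a single warp-wide load touches for different access strides. It is not the Nsight Compute sample, only a simplified model. It assumes 32 threads per warp, 32-byte memory sectors, and 4-byte elements, and the function name `sectors_touched` is hypothetical.

```python
def sectors_touched(stride_elems, elem_bytes=4, warp_size=32, sector_bytes=32):
    """Count distinct sectors hit when lane i reads element i * stride_elems."""
    addresses = (lane * stride_elems * elem_bytes for lane in range(warp_size))
    return len({addr // sector_bytes for addr in addresses})


# Stride 1 models neighboring threads reading neighboring elements (coalesced).
# Stride 32 models each thread reading down a column of a 32-wide row-major matrix.
for stride in (1, 2, 32):
    print(f"stride {stride:>2}: {sectors_touched(stride)} sectors for one warp load")
```

In this model, a unit-stride load needs 4 sectors to deliver 128 useful bytes, while a stride of 32 touches 32 sectors for the same amount of useful data. That eightfold increase in memory traffic is the kind of inefficiency a profiler helps you spot.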

Nsight Systems

Profiling with Nsight Systems can provide insight into issues such as GPU starvation, unnecessary GPU synchronization, insufficient CPU parallelization, and expensive algorithms across the CPUs and GPUs. Understanding these behaviors and the load of deep learning frameworks, such as PyTorch and TensorFlow, helps you tune your models and parameters to increase overall single-GPU or multi-GPU utilization.

Other tools

Also included in the CUDA Toolkit, CUDA-GDB for CPU and GPU thread debugging and Compute Sanitizer for functional correctness checking both support the NVIDIA Hopper architecture.

Summary

This release of the CUDA 11.8 Toolkit has the following features:

First release supporting NVIDIA Hopper and NVIDIA Ada Lovelace GPUs
Lazy module loading extended to support lazy loading of CPU-side modules in addition to device-side kernels
Improved MPS signal handling for interrupting and terminating applications
NVIDIA JetPack installation simplification
CUDA developer tool updates

For more information, see the following resources:

About the Authors

About Rob Armstrong
Rob Armstrong is a principal technical product manager for the CUDA toolkit. For over 20 years he has focused on accelerating software with heterogeneous hardware platforms, and has particular interest in computer architecture and hardware/software interaction.

About Rob Nertney
Rob Nertney is a senior software architect for confidential computing. He has spent nearly 15 years architecting the features and deployment of accelerator hardware into hyperscale environments for both internal and external use by developers. He has several patents in processor design relating to secure solutions that are in production today. In his spare time, he loves golfing when the weather is nice, and gaming (on RTX hardware of course!) when the weather isn’t.

About Rekha Mukund
Rekha Mukund is a senior product manager in the compute group at NVIDIA. She leads the CUDA Tegra product line for the NVIDIA Jetson, NVIDIA DRIVE, and NVIDIA Shield TV platforms, and also oversees the CUDA OS support and OpenCL initiatives. Before joining NVIDIA, Rekha worked with Cisco for eight years in the PayTV technology domain. She is a gold medalist in B.E. computer science from the University Visvesvaraya College of Engineering (UVCE) in India, a national-level table tennis player, and a seasoned globetrotter.

About Fred Oh
Fred is a senior product marketing manager for CUDA, CUDA on WSL, and CUDA Python. Fred has a B.S. in Computer Science and Math from UC Davis. He began his career as a UNIX software engineer porting kernel services and device drivers to x86 architectures. He loves Star Wars, Star Trek and the NBA Warriors.
