NVIDIA announces the newest release of the CUDA development environment, CUDA 11.6. This release is focused on enhancing the programming model and performance of your CUDA applications. CUDA continues to push the boundaries of GPU acceleration and lay the foundation for new applications in HPC, visualization, AI, ML and DL, and data science.
CUDA 11.6 has several important features. This post offers an overview of the key capabilities:
- GSP driver architecture now default on Turing and Ampere GPUs
- New API to allow disabling nodes in an instantiated graph
- Full support of 128-bit integer type
- Cooperative groups namespace update
- CUDA compiler update
- Nsight Compute 2022.1 release
CUDA 11.6 ships with the R510 driver, an update branch. CUDA 11.6 Toolkit is available to download.
GSP driver architecture
The GSP driver architecture is now the default driver mode for all listed Turing and Ampere GPUs. The older driver architecture is supported as a fallback. For more information, see the R510 Driver Readme.
Instantiated Graph Node API additions
We added a new API, cudaGraphNodeSetEnabled, to allow disabling nodes in an instantiated graph. Support is limited to kernel nodes in this release. A corresponding API, cudaGraphNodeGetEnabled, allows querying the enabled state of a node. We’ve also added the ability to disable NULL kernel graph node launches.
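As a minimal sketch of how these two calls might be used, the helper below builds a graph with a single kernel node, launches it once enabled, once disabled, and once re-enabled. The kernel `increment`, the helper `run_graph_toggle_demo`, and the `CHECK` macro are hypothetical names introduced for illustration only; cleanup on error paths is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: increments a single device counter.
__global__ void increment(int *counter) { *counter += 1; }

// Return -1 from the enclosing function if a CUDA runtime call fails.
#define CHECK(call) do { if ((call) != cudaSuccess) return -1; } while (0)

// Launches a one-node graph three times, with the node disabled for the
// second launch. Returns the final counter value, -1 on a CUDA error, or -2
// if the queried enabled state is not what was just set.
int run_graph_toggle_demo() {
    int *d_counter = nullptr;
    CHECK(cudaMalloc(&d_counter, sizeof(int)));
    CHECK(cudaMemset(d_counter, 0, sizeof(int)));

    cudaGraph_t graph;
    CHECK(cudaGraphCreate(&graph, 0));

    // Describe the single kernel node of the graph.
    void *kernel_args[] = { &d_counter };
    cudaKernelNodeParams params = {};
    params.func = (void *)increment;
    params.gridDim = dim3(1, 1, 1);
    params.blockDim = dim3(1, 1, 1);
    params.sharedMemBytes = 0;
    params.kernelParams = kernel_args;
    params.extra = nullptr;

    cudaGraphNode_t kernel_node;
    CHECK(cudaGraphAddKernelNode(&kernel_node, graph, nullptr, 0, &params));

    cudaGraphExec_t exec;
    CHECK(cudaGraphInstantiateWithFlags(&exec, graph, 0));

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // First launch: the node is enabled by default, so the kernel runs.
    CHECK(cudaGraphLaunch(exec, stream));

    // Disable the node in the instantiated graph, then launch again.
    CHECK(cudaGraphNodeSetEnabled(exec, kernel_node, 0));
    CHECK(cudaGraphLaunch(exec, stream));

    // Query the state that was just set.
    unsigned int is_enabled = 1;
    CHECK(cudaGraphNodeGetEnabled(exec, kernel_node, &is_enabled));
    if (is_enabled != 0) return -2;

    // Re-enable the node and launch a third time.
    CHECK(cudaGraphNodeSetEnabled(exec, kernel_node, 1));
    CHECK(cudaGraphLaunch(exec, stream));
    CHECK(cudaStreamSynchronize(stream));

    int h_counter = 0;
    CHECK(cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost));

    cudaStreamDestroy(stream);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaFree(d_counter);
    return h_counter;
}
```

Calling `run_graph_toggle_demo()` from `main` on a machine with a supported GPU should show that only the two enabled launches touched the counter. Toggling a node this way avoids re-instantiating the graph when a kernel only needs to be skipped for some launches.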
128-bit integer support
CUDA 11.6 includes the full release of the 128-bit integer (__int128) data type, including compiler and developer tools support. The host-side compiler must support the __int128 type to use this feature.
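The following sketch shows one way the type might be used in device code: a kernel computes the full 128-bit product of two 64-bit values and the host checks it against the same computation done with the host compiler's `__int128`. The names `mul_wide` and `check_mul_wide` are hypothetical, and the example assumes a host compiler such as GCC or Clang that provides `unsigned __int128`.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: 64x64 -> 128-bit multiply on the device.
__global__ void mul_wide(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    unsigned __int128 product = (unsigned __int128)a * b;
    *hi = (uint64_t)(product >> 64);  // upper 64 bits
    *lo = (uint64_t)product;          // lower 64 bits
}

// Returns 1 if the device result matches the host result, 0 if it does not,
// and -1 if any CUDA call fails (for example, when no GPU is present).
int check_mul_wide(uint64_t a, uint64_t b) {
    uint64_t *d_out = nullptr;  // d_out[0] = hi, d_out[1] = lo
    if (cudaMalloc(&d_out, 2 * sizeof(uint64_t)) != cudaSuccess) return -1;

    mul_wide<<<1, 1>>>(a, b, d_out, d_out + 1);

    uint64_t h_out[2] = {0, 0};
    // cudaMemcpy waits for the kernel and reports launch or execution errors.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    // Reference computation with the host compiler's 128-bit type.
    unsigned __int128 expected = (unsigned __int128)a * b;
    uint64_t expected_hi = (uint64_t)(expected >> 64);
    uint64_t expected_lo = (uint64_t)expected;

    return (h_out[0] == expected_hi && h_out[1] == expected_lo) ? 1 : 0;
}
```

For example, `check_mul_wide(0xFFFFFFFFFFFFFFFFULL, 3)` exercises a product that overflows 64 bits, so the upper half is nonzero.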
Cooperative groups namespace
The cooperative groups namespace has been updated with new functions to improve consistency in naming, function scope, and unit dimension and size.
| Implicit Group/Member | Threads | Blocks | 
| thread_block:: | dim_threads, num_threads, thread_rank, thread_index | (Not needed) | 
| grid_group:: | num_threads, thread_rank | dim_blocks, num_blocks, block_rank, block_index | 
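To illustrate how the thread-level and block-level accessors in the table relate, here is a small sketch in which every thread recomputes its grid-wide rank from block-level values and compares it with what `grid_group` reports. The names `check_ranks` and `count_rank_mismatches` are hypothetical. The sketch uses a one-dimensional launch and assumes the accessors can be called in a regular (non-cooperative) launch, since `grid.sync()` is never used.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical kernel for illustration: counts threads whose grid-wide rank
// or grid size disagrees with the value derived from block-level accessors.
__global__ void check_ranks(int *mismatches) {
    cg::thread_block block = cg::this_thread_block();
    cg::grid_group grid = cg::this_grid();

    // Rank of this thread in the grid, rebuilt from block rank and block size.
    unsigned long long expected_rank =
        grid.block_rank() * block.num_threads() + block.thread_rank();
    // Total threads in the grid, rebuilt from block count and block size.
    unsigned long long expected_total =
        grid.num_blocks() * block.num_threads();

    if (grid.thread_rank() != expected_rank ||
        grid.num_threads() != expected_total) {
        atomicAdd(mismatches, 1);
    }
}

// Launches check_ranks on a 1D grid. Returns the number of mismatching
// threads, or -1 if any CUDA call fails.
int count_rank_mismatches(unsigned int blocks, unsigned int threads_per_block) {
    int *d_mismatches = nullptr;
    if (cudaMalloc(&d_mismatches, sizeof(int)) != cudaSuccess) return -1;
    if (cudaMemset(d_mismatches, 0, sizeof(int)) != cudaSuccess) {
        cudaFree(d_mismatches);
        return -1;
    }

    check_ranks<<<blocks, threads_per_block>>>(d_mismatches);

    int h_mismatches = 0;
    // cudaMemcpy waits for the kernel and reports launch or execution errors.
    cudaError_t err = cudaMemcpy(&h_mismatches, d_mismatches, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_mismatches);
    return (err == cudaSuccess) ? h_mismatches : -1;
}
```

A call such as `count_rank_mismatches(4, 64)` checks 256 threads. The consistent `num_*` and `*_rank` naming is what makes the two levels easy to combine in one expression.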
CUDA compiler
- Added the -arch=native compilation option to target installed GPUs during compilation. This extends the existing -gencode=arch=compute_xx,code=sm_xx architecture specification.
- Added the ability to create PTX files from nvlink.
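Because -arch=native targets whatever GPUs are installed, it can help to know which compute capability that is on a given machine. The helper below is a minimal sketch that queries device 0 through the runtime API; the function name `installed_compute_capability` and the file name in the comment are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper for illustration. A program that calls it from main
// could be built for the local GPUs with, for example:
//   nvcc -arch=native query_arch.cu -o query_arch
// (query_arch.cu is an assumed file name.)
//
// Returns major * 10 + minor for device 0 (for example, 86 for compute
// capability 8.6), or -1 if no device is available or a call fails.
int installed_compute_capability() {
    int device_count = 0;
    if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
        return -1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        return -1;
    }
    return prop.major * 10 + prop.minor;
}
```

The returned value corresponds to the xx in the sm_xx notation used by the explicit -gencode form.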
Deprecated features
- The use of cudaDeviceSynchronize() for on-device fork and join parallelism is deprecated in preparation for a replacement programming model with higher performance. It continues to work in this release, but the tools emit a warning about the upcoming change.
- CentOS Linux 8 reached End-of-Life on Dec 31, 2021, and support for this OS is now deprecated in the CUDA Toolkit. CentOS Linux 8 support will be completely removed in a future release.
Additional resources
- GTC sessions:
  - CUDA New Features and Beyond, by Stephen Jones
  - Nearly Effortless CUDA Graphs, by Rob Van der Wijngaart and Jiajie Yao
  - A Deep Dive Into the Latest HPC Software, by Tim Costa
  - Multi-GPU Programming Models, by Jiri Kraus
 