What’s New in CUDA

CUDA 9.1

CUDA 9.1 brings new algorithms and optimizations that speed up AI and HPC apps on Volta GPUs. With this release you can:

  • Develop image augmentation algorithms for deep learning easily with new functions in NVIDIA Performance Primitives
  • Run batched neural machine translations and sequence modeling operations on Volta Tensor Cores using new APIs in cuBLAS
  • Solve large 2D and 3D FFT problems more efficiently on multi-GPU systems with new heuristics in cuFFT
  • Launch kernels up to 12x faster with new core optimizations
CUDA 9.1 also includes compiler optimizations, support for new developer tool versions, and bug fixes.


CUDA 9 is the most powerful software platform for GPU-accelerated applications. It has been built for Volta GPUs and includes faster GPU-accelerated libraries, a new programming model for flexible thread management, and improvements to the compiler and developer tools. With CUDA 9 you can speed up your applications while making them more scalable and robust.

Release Highlights


Key Features

Libraries
  • Speed up high performance computing (HPC) and deep learning apps with new GEMM kernels in cuBLAS
  • Execute image and signal processing apps faster with performance optimizations across multiple GPU configurations in cuFFT and NVIDIA Performance Primitives
  • Solve linear and graph analytics problems common in HPC with new algorithms in cuSOLVER and nvGRAPH
Cooperative Groups
  • Express rich parallel algorithms with threads from sub-tiles to warps, blocks and grids
  • Manage and reuse threads efficiently within an application with new APIs and function primitives
  • Replace warp-synchronous programming with a robust programming model on Kepler and later architectures
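
As a rough illustration of the points above, the sketch below sums an integer array by partitioning each thread block into 32-thread tiles with the Cooperative Groups API and reducing within each tile. The kernel name `sum_kernel`, the device helper `tile_sum`, and the host wrapper `sum_on_gpu` are hypothetical names chosen for this example. It assumes a non-empty input, a block size that is a multiple of 32, and omits error checking for brevity.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>
#include <vector>

namespace cg = cooperative_groups;

// Sum `val` across one 32-thread tile. After the loop, the thread with
// tile rank 0 holds the total for the whole tile.
__device__ int tile_sum(cg::thread_block_tile<32> tile, int val) {
    for (int offset = tile.size() / 2; offset > 0; offset /= 2) {
        val += tile.shfl_down(val, offset);
    }
    return val;
}

// Hypothetical kernel: each tile reduces its 32 values, then one thread
// per tile adds the partial sum to the global result.
__global__ void sum_kernel(const int* in, int n, int* out) {
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int val = (i < n) ? in[i] : 0;  // out-of-range threads contribute 0

    int total = tile_sum(tile, val);
    if (tile.thread_rank() == 0) {
        atomicAdd(out, total);
    }
}

// Hypothetical host wrapper (assumes host.size() > 0; no error checking).
int sum_on_gpu(const std::vector<int>& host) {
    int n = static_cast<int>(host.size());
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    int threads = 256;  // multiple of the tile size (32)
    int blocks = (n + threads - 1) / threads;
    sum_kernel<<<blocks, threads>>>(d_in, n, d_out);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Because the tile is an explicit object, the shuffle-based reduction no longer relies on implicit warp-synchronous behavior. The set of participating threads is stated in the code.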
Volta Architecture
  • Execute AI applications faster with Tensor Cores performing 5X faster than Pascal GPUs
  • Scale multi-GPU applications with next-generation NVLink delivering 2X the throughput of the prior generation
  • Increase GPU utilization with Volta Multi-Process Service (MPS)
Development Tools
  • Optimize and pre-fetch memory access by identifying source code causing page faults in unified memory
  • Profile NVLink efficiently by adding events to timeline and color coding connections
  • Inspect unified memory performance bottlenecks with new event filters based on virtual address, migration reason and page fault access type
See Release Notes for details.

CUDA 9 Features Revealed

Learn about new features in CUDA 9 including updates to the programming model, computing libraries and development tools.

Inside Volta

Learn about new technologies and features introduced in the NVIDIA Volta GPU architecture.

Cooperative Groups

Learn about the new CUDA parallel programming model for managing threads in scalable applications.

Optimizing Performance With CUDA 9

Learn about new profiling capabilities in CUDA 9 for Volta GPUs and technologies such as Unified Memory and NVLink.

Archived Releases

Pascal Architecture Support

  • Enhance performance out-of-the-box on Pascal GPUs
  • Simplify programming using Unified Memory including support for large datasets, concurrent data access and atomics
  • Optimize Unified Memory performance using new data migration APIs
  • Increase throughput using NVIDIA® NVLink™, a new high-speed interconnect
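
The Unified Memory data migration APIs mentioned above can be sketched roughly as follows. The example allocates managed memory, prefetches it to the GPU before a kernel runs, and prefetches it back to the CPU afterwards. The kernel `scale` and the wrapper `scale_with_prefetch` are hypothetical names for this example. It assumes the `cudaMemPrefetchAsync` signature that takes an integer device ID, as introduced with CUDA 8 (newer toolkits may differ), and a device that supports concurrent managed access. On other systems the prefetch calls may return an error, which this sketch ignores.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element by `factor`.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

// Hypothetical wrapper: fills a managed buffer with 1.0f, scales it on the
// GPU, and returns the first element. Error checking is omitted for brevity.
float scale_with_prefetch(int n, float factor) {
    size_t bytes = n * sizeof(float);
    float* data = nullptr;
    cudaMallocManaged(&data, bytes);

    for (int i = 0; i < n; ++i) {
        data[i] = 1.0f;  // first touch happens on the CPU
    }

    int device = 0;
    cudaGetDevice(&device);

    // Move the pages to the GPU up front instead of faulting them in
    // one by one during the kernel.
    cudaMemPrefetchAsync(data, bytes, device, 0);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(data, n, factor);

    // Move the pages back before the CPU reads the result.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();

    float first = data[0];
    cudaFree(data);
    return first;
}
```

The prefetches are hints rather than requirements: without them the code still produces the same result, with pages migrating on demand as they are touched.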

Development Tools

  • Identify latent system-level bottlenecks using critical path analysis
  • Improve productivity by up to 2x with faster NVCC compile times
  • Tune OpenACC applications and overall host code using new profiling extensions


Libraries

  • Accelerate graph analytics algorithms with nvGRAPH
  • Speed up deep learning applications using native support for FP16 and INT8 and support for batched operations in cuBLAS

See Release Notes for details.

Latest News

NVIDIA JetPack 3.2 Production Release Now Available

JetPack 3.2 with L4T R28.2 is the latest production software release for NVIDIA Jetson TX2, Jetson TX2i and Jetson TX1.

NVIDIA’s 2017 Open-Source Deep Learning Frameworks Contributions

Many may not know that NVIDIA is a significant contributor to the open-source deep learning community. How significant? Let’s reflect and explore the highlights and volume of activity from last year.

Using CUDA Warp-Level Primitives

NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Many CUDA programs achieve high performance by taking advantage of warp execution.

Hybridizer: High-Performance C# on GPUs

Hybridizer is a compiler from Altimesh that lets you program GPUs and other accelerators from C# code or .NET Assembly.

Blogs: Parallel ForAll

Introduction to NVIDIA RTX and DirectX Raytracing

“Ray tracing is the future, and it always will be!” has been the tongue-in-cheek phrase used by graphics developers for decades when asked whether real-time ray tracing will ever be feasible.

NVIDIA Highlights Unity Tutorial

NVIDIA Highlights enables players to capture in-game moments automatically based on events occurring during gameplay. Highlights represents a key feature in NVIDIA’s ShadowPlay automated screen capture software.

Solving SpaceNet Road Detection Challenge With Deep Learning

It’s that time again — SpaceNet raised the bar in their third challenge to detect road-networks in overhead imagery around the world.

The Peak-Performance Analysis Method for Optimizing Any GPU Workload

Figuring out how to reduce the GPU frame time of a rendering application on a PC can be a challenging task, even for the most experienced PC game developers.