What’s New in CUDA

CUDA 10.1

CUDA 10.1 includes a new lightweight GEMM library, new functionality and performance updates to existing libraries, and improvements to the CUDA Graphs APIs.

With CUDA 10.1, you get:

  • cuBLASLt, a new lightweight GEMM library with a flexible API and Tensor Core support for INT8 inputs and FP16 CGEMM split-complex matrix multiplication
  • New selective eigensolvers SYEVDX and SYGVDX in cuSOLVER, and performance improvements of up to 1.5X for full spectrum eigensolvers
  • New encoding and batched decoding functionalities in nvJPEG
  • Up to 4X faster performance for a broad set of random number generators in cuRAND
  • Improved performance and support for fork/join kernels in the CUDA Graphs APIs

Additionally, CUDA 10.1 includes bug fixes, support for new operating systems, and updates to the Nsight Systems and Nsight Compute developer tools. CUDA 10.1 is available for download today!

See Release Notes for additional details.

“Since CUDA 9.0, Red Hat has collaborated closely with NVIDIA to bring the power of the CUDA development platform across Red Hat’s portfolio of open hybrid cloud technologies. With the release of CUDA 10.1, we’re pleased to continue this collaboration as we work with NVIDIA on bringing additional ease-of-use, scale and choice to users building AI and HPC applications on enterprise-grade, open foundations, including Red Hat Enterprise Linux and Red Hat OpenShift Container Platform.”

Chris Wright, Vice President and Chief Technology Officer at Red Hat, Inc.

CUDA 10


CUDA 10 is the most powerful software development platform for building GPU-accelerated applications. It has been built for Turing GPUs and includes performance-optimized libraries, a new asynchronous task-graph programming model, enhanced CUDA and graphics API interoperability, and new developer tools. CUDA 10 also provides all the components needed to build applications for NVIDIA's most powerful server platforms for AI and high performance computing (HPC) workloads, both on premises (DGX-2) and in the cloud (HGX-2).

Key Features


  • New GPU architecture: Build and optimize applications for the next generation of Turing GPUs
  • Tensor Cores
  • NVSwitch Fabric


  • CUDA Graphs: A new asynchronous task-graph programming model that enables more efficient kernel launch and execution
  • CUDA/Graphics Interop: New interoperability between CUDA and graphics APIs, including Vulkan and DX12
  • Warp Matrix


  • nvJPEG: New library for hybrid JPEG processing that provides more than 2x speedup on single and batched image decoding
  • Performance-Optimized Libraries: Strong FFT performance scaling across 16-GPU systems, acceleration of dense linear algebra routines such as Eigensolver and Cholesky factorization, and Turing-optimized mixed-precision GEMM performance


  • New Developer Tools: New Nsight product family of tools for tracing, profiling, and debugging of CUDA applications (Nsight Systems and Nsight Compute)
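
The CUDA Graphs model listed above can be sketched with stream capture: work issued to streams is recorded into a graph, then launched as a single unit. The following is a minimal sketch, not NVIDIA's reference code. It assumes CUDA 10.1 or later (the two-argument `cudaStreamBeginCapture`), and the kernels `scale` and `add` and the helper `run_fork_join_graph` are hypothetical names introduced only for illustration. Two independent `scale` kernels form the forked branches, and `add` is the join.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",        \
                         cudaGetErrorString(err_), __LINE__);         \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernels, used only to give the graph some work.
__global__ void scale(float* x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

__global__ void add(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}

// Captures scale(a) and scale(b) as two parallel branches joined by add(),
// launches the resulting graph once, and returns out[0].
float run_fork_join_graph(int n) {
    const size_t bytes = n * sizeof(float);
    std::vector<float> host_a(n, 1.0f), host_b(n, 2.0f);
    float *a, *b, *out;
    CUDA_CHECK(cudaMalloc(&a, bytes));
    CUDA_CHECK(cudaMalloc(&b, bytes));
    CUDA_CHECK(cudaMalloc(&out, bytes));
    CUDA_CHECK(cudaMemcpy(a, host_a.data(), bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(b, host_b.data(), bytes, cudaMemcpyHostToDevice));

    cudaStream_t s1, s2;
    cudaEvent_t forked, joined;
    CUDA_CHECK(cudaStreamCreate(&s1));
    CUDA_CHECK(cudaStreamCreate(&s2));
    CUDA_CHECK(cudaEventCreateWithFlags(&forked, cudaEventDisableTiming));
    CUDA_CHECK(cudaEventCreateWithFlags(&joined, cudaEventDisableTiming));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Record the work into a graph instead of running it immediately.
    CUDA_CHECK(cudaStreamBeginCapture(s1, cudaStreamCaptureModeGlobal));
    CUDA_CHECK(cudaEventRecord(forked, s1));        // fork: s2 joins the capture
    CUDA_CHECK(cudaStreamWaitEvent(s2, forked, 0));
    scale<<<blocks, threads, 0, s1>>>(a, n, 2.0f);  // branch 1
    scale<<<blocks, threads, 0, s2>>>(b, n, 3.0f);  // branch 2
    CUDA_CHECK(cudaEventRecord(joined, s2));        // join: s1 waits for s2
    CUDA_CHECK(cudaStreamWaitEvent(s1, joined, 0));
    add<<<blocks, threads, 0, s1>>>(a, b, out, n);
    cudaGraph_t graph;
    CUDA_CHECK(cudaStreamEndCapture(s1, &graph));

    // The instantiate signature changed in CUDA 12; handle both forms.
    cudaGraphExec_t exec;
#if CUDART_VERSION >= 12000
    CUDA_CHECK(cudaGraphInstantiate(&exec, graph, 0));
#else
    CUDA_CHECK(cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0));
#endif
    CUDA_CHECK(cudaGraphLaunch(exec, s1));
    CUDA_CHECK(cudaStreamSynchronize(s1));

    float first = 0.0f;
    CUDA_CHECK(cudaMemcpy(&first, out, sizeof(float), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaGraphExecDestroy(exec));
    CUDA_CHECK(cudaGraphDestroy(graph));
    CUDA_CHECK(cudaEventDestroy(forked));
    CUDA_CHECK(cudaEventDestroy(joined));
    CUDA_CHECK(cudaStreamDestroy(s1));
    CUDA_CHECK(cudaStreamDestroy(s2));
    CUDA_CHECK(cudaFree(a));
    CUDA_CHECK(cudaFree(b));
    CUDA_CHECK(cudaFree(out));
    return first;
}
```

Calling `run_fork_join_graph(1024)` from `main` launches the graph once. In a real application the instantiated graph would typically be launched many times, since amortizing launch overhead across repeated launches is where the model is expected to pay off.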

Release Highlights

cuFFT 10.0 - Up to 17 TFLOPS for a 3D 1K FFT on 16 GPUs

cuBLAS 10.0 - Up to 90 TFLOPS of GEMM performance

cuSOLVER 10.0 - Up to 4x faster symmetric eigensolver

All library benchmarks use NVIDIA Tesla V100 GPUs (or P100 where specified) and Intel Xeon Gold 6140 (Skylake) 2.3 GHz processors.

Learn More

CUDA 10 Features Revealed

Learn about new features in CUDA 10 including updates to the programming model, computing libraries, and development tools.

Inside Turing

Learn about new technologies and features introduced in the NVIDIA Turing GPU architecture.

Nsight Systems

Learn more about the performance analysis tool designed to provide software optimization insights.

Nsight Compute

Learn more about the interactive CUDA API debugging and kernel profiling tool.

Archived Releases

Volta Architecture Support

  • Execute AI applications faster with Tensor Cores, which perform 5X faster than Pascal GPUs
  • Scale multi-GPU applications with next-generation NVLink, which delivers 2X the throughput of the prior generation
  • Increase GPU utilization with Volta Multi-Process Service (MPS)

Development Tools

  • Optimize and prefetch memory accesses by identifying the source code that causes page faults in unified memory
  • Profile NVLink efficiently by adding events to the timeline and color-coding connections
  • Inspect unified memory performance bottlenecks with new event filters based on virtual address, migration reason, and page fault access type

Libraries


  • Speed up high performance computing (HPC) and deep learning apps with new GEMM kernels in cuBLAS
  • Execute image and signal processing apps faster with performance optimizations across multiple GPU configurations in cuFFT and NVIDIA Performance Primitives
  • Solve linear and graph analytics problems common in HPC with new algorithms in cuSOLVER and nvGRAPH

Cooperative Groups

  • Express rich parallel algorithms with threads from sub-tiles to warps, blocks and grids
  • Manage and reuse threads efficiently within an application with new API and function primitives
  • Replace warp-synchronous programming with a robust programming model on the Kepler architecture and later
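
As an illustration of the sub-tile groups described above, here is a minimal sketch of a sum reduction that uses a 32-thread tile in place of implicit warp-synchronous code. The names `tile_sum`, `sum_kernel`, and `sum_on_gpu` are hypothetical, and the launch configuration (4 blocks of 256 threads) is an arbitrary assumption. The block size must be a multiple of 32 for the tile partition to be valid.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

// Abort with a message if a CUDA runtime call fails.
static void check(cudaError_t err) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Sums v across a 32-thread tile. After the loop, the thread with
// rank 0 in the tile holds the total for the whole tile.
__device__ float tile_sum(cg::thread_block_tile<32> tile, float v) {
    for (int offset = tile.size() / 2; offset > 0; offset /= 2) {
        v += tile.shfl_down(v, offset);
    }
    return v;
}

__global__ void sum_kernel(const float* in, float* total, int n) {
    // Explicit groups: the block, then a 32-thread tile carved out of it.
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<32> tile = cg::tiled_partition<32>(block);

    // Grid-stride loop: each thread accumulates a private partial sum.
    float v = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        v += in[i];
    }

    // Reduce within the tile, then one atomic add per tile.
    v = tile_sum(tile, v);
    if (tile.thread_rank() == 0) atomicAdd(total, v);
}

// Host helper: copies values to the device, runs the kernel, returns the sum.
float sum_on_gpu(const std::vector<float>& values) {
    const int n = static_cast<int>(values.size());
    float *in = nullptr, *total = nullptr;
    check(cudaMalloc(&in, n * sizeof(float)));
    check(cudaMalloc(&total, sizeof(float)));
    check(cudaMemcpy(in, values.data(), n * sizeof(float),
                     cudaMemcpyHostToDevice));
    check(cudaMemset(total, 0, sizeof(float)));

    sum_kernel<<<4, 256>>>(in, total, n);
    check(cudaGetLastError());

    float result = 0.0f;
    check(cudaMemcpy(&result, total, sizeof(float), cudaMemcpyDeviceToHost));
    check(cudaFree(in));
    check(cudaFree(total));
    return result;
}
```

Because the tile is passed as an argument, `tile_sum` states which threads take part in the shuffle rather than relying on the caller to know that a warp is converged. Floating-point addition order varies between runs because of the atomics, so results for non-integer inputs may differ in the last bits.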

See Release Notes archive for details.

Latest News

NVIDIA Webinars: Hello AI World and Learn with JetBot

We recently announced two exciting upcoming webinars about the new Jetson Nano. Each presentation will be followed by a live Q&A session where you can ask the NVIDIA Jetson team questions in real time. We look forward to you joining us!

Using MATLAB and TensorRT on NVIDIA GPUs

MathWorks recently released MATLAB R2018b, which integrates with NVIDIA TensorRT through GPU Coder. With this integration, scientists and engineers can achieve faster inference performance on GPUs from within MATLAB.

AWS Optimizes TensorFlow on NVIDIA Tensor Core GPUs

Earlier this week, Amazon announced new AWS Deep Learning AMIs tuned for high-performance training with NVIDIA Tensor Core GPUs on Amazon EC2 instances.


At GTC Silicon Valley in San Jose, NVIDIA released CUDA-X AI, a collection of NVIDIA’s GPU acceleration libraries that accelerate deep learning, machine learning, and data analysis.

Blogs: Parallel ForAll

Using VRWorks in the Cloud with Pixvana SPIN Studio

Pixvana’s cloud-based VR pipeline now incorporates the NVIDIA VRWorks 360 Video SDK. Pixvana strives to solve a number of challenges facing the VR creator by leveraging the power of cloud computing.

Machine Learning Acceleration in Vulkan with Cooperative Matrices

Machine learning harnesses computing power to solve a variety of ‘hard’ problems that seemed impossible to program using traditional languages and techniques. Machine learning avoids the need for a programmer to explicitly program the steps in sol

DGX-2 Server Virtualization Leverages NVSwitch for Faster GPU Enabled Virtual Machines

NVIDIA Kernel-based Virtual Machine (KVM) takes open source KVM and enhances it to support the unique capabilities of the NVIDIA DGX-2 server, creating a full virtualization solution for NVIDIA GPUs and NVIDIA NVSwitch devices with PCI passthrough

Tensor Core Programming Using CUDA Fortran

The CUDA Fortran compiler from PGI now supports programming Tensor Cores with NVIDIA’s Volta V100 and Turing GPUs. This enables scientific programmers using Fortran to take advantage of FP16 matrix operations accelerated by Tensor Cores.