What’s New in CUDA


CUDA 9 is the fastest software platform for GPU-accelerated applications. It is built for Volta GPUs and provides faster GPU-accelerated libraries, along with improvements to the programming model, compiler, and developer tools. With CUDA 9 you can speed up your applications while making them more scalable and robust.

Download the CUDA 9 Release Candidate (RC) today to try the latest release.

CUDA 9 Features Revealed

Learn about new features in CUDA 9 including updates to the programming model, computing libraries and development tools.

Inside Volta

Learn about new technologies and features introduced in the NVIDIA Volta GPU architecture.

Cooperative Groups

Learn about the new CUDA parallel programming model for managing threads in scalable applications.

Optimizing Application Performance With CUDA 9

Learn about new profiling capabilities in CUDA 9 for Volta GPUs and technologies such as Unified Memory and NVLink.


CUDA 8 presents major improvements to the memory model and profiling tools, along with new libraries. Using CUDA 8, you can improve performance, simplify memory usage, and profile and debug your applications more efficiently.

What’s New in CUDA 8 Webinar

Release Highlights

Perform 2X faster out of the box with Pascal GPUs

Solve larger problems with Unified Memory

Increase application throughput with FP16 and INT8 support

New GPU-accelerated nvGRAPH library for graph analytics

Key Features

Pascal Architecture Support
  • Enhance performance out-of-the-box on Pascal GPUs
  • Simplify programming using Unified Memory including support for large datasets, concurrent data access and atomics
  • Optimize Unified Memory performance using new data migration APIs
  • Increase throughput using NVIDIA® NVLink™, a new high-speed interconnect
Developer Tools
  • Identify latent system-level bottlenecks using critical path analysis
  • Improve productivity by up to 2x with faster NVCC compile times
  • Tune OpenACC applications and overall host code using new profiling extensions
  • Accelerate graph analytics algorithms with nvGRAPH
  • Speed up deep learning applications using native support for FP16 and INT8 and support for batched operations in cuBLAS
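
To make the INT8 support mentioned above more concrete, here is a minimal host-side sketch of the arithmetic involved: four signed 8-bit values are packed into one 32-bit word, and a 4-way dot product is accumulated into a 32-bit integer. The helper names `pack_int8x4` and `dot4_int8_reference` are hypothetical and are not part of the CUDA Toolkit. The sketch assumes this matches what the CUDA `__dp4a` intrinsic computes for signed operands on supporting Pascal GPUs, and it runs on the CPU only.

```cpp
#include <cstdint>

// Hypothetical helper (not a CUDA API): pack four signed 8-bit values into
// one 32-bit word, with lane 0 in the lowest byte.
inline std::uint32_t pack_int8x4(std::int8_t a0, std::int8_t a1,
                                 std::int8_t a2, std::int8_t a3) {
    return  static_cast<std::uint32_t>(static_cast<std::uint8_t>(a0))
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(a1)) << 8)
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(a2)) << 16)
         | (static_cast<std::uint32_t>(static_cast<std::uint8_t>(a3)) << 24);
}

// Hypothetical CPU reference: multiply the four int8 lanes of `a` and `b`
// pairwise and add the products to a 32-bit accumulator. Assumes the usual
// two's-complement conversion from uint8_t to int8_t.
inline std::int32_t dot4_int8_reference(std::uint32_t a, std::uint32_t b,
                                        std::int32_t acc) {
    for (int lane = 0; lane < 4; ++lane) {
        const auto ai = static_cast<std::int8_t>(
            static_cast<std::uint8_t>((a >> (8 * lane)) & 0xFFu));
        const auto bi = static_cast<std::int8_t>(
            static_cast<std::uint8_t>((b >> (8 * lane)) & 0xFFu));
        acc += static_cast<std::int32_t>(ai) * static_cast<std::int32_t>(bi);
    }
    return acc;
}
```

For example, `dot4_int8_reference(pack_int8x4(1, 2, 3, 4), pack_int8x4(5, 6, 7, 8), 10)` evaluates 1·5 + 2·6 + 3·7 + 4·8 + 10. Accumulating in 32 bits is the design point: individual products of 8-bit values overflow 8 bits almost immediately, so the wider accumulator is what keeps a long dot product exact.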

Customer Quotes

"That's a great work, guys! I like CUDA Toolkit more and more. Hope the bugs I submitted will be fixed by release. I used EA for checking performance of my applications and find the way to optimize them. I found, that having OpenACC marks is very useful. I tested both remote profiling and local one. I helped me. Other elements and counters helped me to find some other rooms to speedup my application. Thanks!"

Alexey Romanenko – Novosibirsk State University

"Shows lots of promise, looks like it is going to be a great library, few more tools and it will be great :-) Also more examples and bit more documentation. Looking forward keep using it"

Vicente Cuellar – Wave Crafters

Additional Resources

CUDA 8 and Beyond

Learn about new features in CUDA 8, NVIDIA’s vision for CUDA and challenges facing the future of parallel software development.

CUDA 8 Performance Overview

Learn how updates to the CUDA Toolkit improve the performance of GPU-accelerated applications.

Developer Tools in CUDA 8

Learn about new profiling capabilities in CUDA 8.

Debugging Tools in CUDA 8

Learn about the latest updates to debugging tools in CUDA 8.

Latest News

PGI 17.7 Delivers OpenACC and CUDA Fortran for Volta GPUs

PGI compilers & tools are used by scientists and engineers who develop applications for high-performance computing (HPC) systems.

Gradient Boosting, Decision Trees and XGBoost with CUDA

Gradient boosting is a powerful machine learning algorithm used to achieve state-of-the-art accuracy on a variety of tasks such as regression, classification and ranking.

Building Cross-Platform CUDA Applications with CMake

Cross-platform software development poses a number of challenges to your application’s build process. How do you target multiple platforms without maintaining multiple platform-specific build scripts, projects, or makefiles?

Developer Spotlight: Creating Photorealistic CGI Environments

Get to know Rense de Boer, a technical art director from Sweden, who is not only pushing the envelope of photorealistic CGI environments but also doing it all in a real-time engine!

Blogs: Parallel ForAll

Pro Tip: Linking OpenGL for Server-Side Rendering

Visualization is a great tool for understanding large amounts of data, but transferring the data from an HPC system or from the cloud to a local workstation for analysis can be a painful experience.

Scaling Keras Model Training to Multiple GPUs

Keras is a powerful deep learning meta-framework which sits on top of existing frameworks such as TensorFlow and Theano. Keras is highly productive for developers; it often requires 50% less code to define a model than native APIs of deep learning

Deep Learning Hyperparameter Optimization with Competing Objectives

In this post we’ll show how to use SigOpt’s Bayesian optimization platform to jointly optimize competing objectives in deep learning pipelines on NVIDIA GPUs more than ten times faster than traditional approaches like random search.