Programming Tensor Cores in CUDA 9

Features, Deep Learning, Linear Algebra, Mixed Precision, Tensor Cores, Volta

Nadeem Mohammad, posted Oct 17 2017

A defining feature of the new Volta GPU Architecture is its Tensor Cores, which give the Tesla V100 accelerator a peak throughput of 12 times the 32-bit floating-point throughput of the previous-generation Tesla P100.
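CUDA 9 exposes Tensor Cores through the warp matrix multiply-accumulate (WMMA) API in mma.h. The kernel below is a minimal sketch rather than code from the post: a single warp multiplies one pair of 16×16 half-precision tiles and accumulates the result in 32-bit floats.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Sketch: one 16x16x16 matrix multiply-accumulate on Tensor Cores.
// Assumes column-major 16x16 input tiles; launch with exactly one warp
// (32 threads) and compile with -arch=sm_70.
__global__ void wmma_16x16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // start from a zero accumulator
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // C += A * B on Tensor Cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_col_major);
}
```

In practice, larger matrices are tiled across many warps, and libraries such as cuBLAS and cuDNN can use Tensor Cores automatically for eligible mixed-precision operations.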

Read more

Register Cache: Caching for Warp-Centric CUDA Programs

Features, Cooperative Groups, CUDA, Optimization

Nadeem Mohammad, posted Oct 12 2017

In this post we introduce the “register cache”, an optimization technique that builds a virtual caching layer for the threads of a single warp. It is a software abstraction implemented on top of the NVIDIA GPU shuffle primitive.
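As a minimal illustration of the idea (not the post's own code), the hypothetical kernel below keeps one array element per lane in a register and serves neighbor reads with warp shuffles instead of shared memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each of the 32 threads in a warp caches one input element in a register;
// neighboring values are fetched with warp shuffles rather than shared memory.
__global__ void stencil_warp(const float *in, float *out) {
    const unsigned mask = 0xffffffffu;   // all 32 lanes participate
    int lane = threadIdx.x & 31;         // lane id within the warp

    float v = in[lane];                  // element held in the "register cache"

    // Reads from the register cache are shuffles from neighboring lanes.
    float left  = __shfl_up_sync(mask, v, 1);    // value held by lane - 1
    float right = __shfl_down_sync(mask, v, 1);  // value held by lane + 1

    // Edge lanes have no neighbor on one side; treat it as zero.
    if (lane == 0)  left  = 0.0f;
    if (lane == 31) right = 0.0f;

    out[lane] = left + v + right;        // simple 3-point stencil
}

int main() {
    float h_in[32], h_out[32];
    for (int i = 0; i < 32; ++i) h_in[i] = float(i);

    float *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    stencil_warp<<<1, 32>>>(d_in, d_out);   // a single warp
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    printf("out[1] = %g\n", h_out[1]);      // expect 0 + 1 + 2 = 3
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

Because the cached values never leave registers, accesses that would otherwise go through shared memory become intra-warp shuffles.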

Read more

Mixed-Precision Training of Deep Neural Networks

Features, Deep Learning, FP16, Mixed Precision, Tensor Cores, Volta

Nadeem Mohammad, posted Oct 11 2017

Deep Neural Networks (DNNs) have led to breakthroughs in a number of areas, including image processing and understanding, language modeling, language translation, speech processing, game playing, and many others.

Read more

Training AI for Self-Driving Vehicles: the Challenge of Scale

Features, Autonomous Vehicles, Deep Learning, DGX-1, Volta

Nadeem Mohammad, posted Oct 09 2017

Modern deep neural networks, such as those used in self-driving vehicles, require a mind-boggling amount of computational power.

Read more

Cooperative Groups: Flexible CUDA Thread Programming

Features, Algorithms, Cooperative Groups, CUDA, Parallel Programming

Nadeem Mohammad, posted Oct 04 2017

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize.
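CUDA 9's Cooperative Groups model makes those groups of cooperating threads, and their synchronization points, explicit objects in the program. The block-level reduction below is an illustrative sketch in the spirit of the post: a generic reduce_sum synchronizes on whatever thread_group it is handed.

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Reduce values across an arbitrary thread group; assumes the group size is a
// power of two and temp has at least g.size() ints of scratch space.
__device__ int reduce_sum(cg::thread_group g, int *temp, int val) {
    int lane = g.thread_rank();

    for (int i = g.size() / 2; i > 0; i /= 2) {
        temp[lane] = val;
        g.sync();                 // wait for every thread in the group to store
        if (lane < i) val += temp[lane + i];
        g.sync();                 // wait before temp is overwritten
    }
    return val;                   // only thread_rank() == 0 holds the full sum
}

__global__ void sum_kernel(int *out, const int *in, int n) {
    extern __shared__ int temp[];
    cg::thread_block block = cg::this_thread_block();

    // Grid-stride loop: each thread accumulates a partial sum in a register.
    int sum = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    // Cooperative block-wide reduction of the partial sums.
    sum = reduce_sum(block, temp, sum);
    if (block.thread_rank() == 0) atomicAdd(out, sum);
}
```

A launch needs dynamic shared memory for temp, e.g. sum_kernel<<<blocks, 256, 256 * sizeof(int)>>>(d_out, d_in, n), with *d_out initialized to zero beforehand.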

Read more