Figure 1: The Tesla V100 Accelerator with Volta GV100 GPU. SXM2 Form Factor.

Using CUDA Warp-Level Primitives

NVIDIA GPUs execute groups of threads known as warps in SIMT (Single Instruction, Multiple Thread) fashion. Many CUDA programs achieve high performance by…

Register Cache: Caching for Warp-Centric CUDA Programs

In this post we introduce the "register cache", an optimization technique that develops a virtual caching layer for threads in a single warp. It is a software…

Cooperative Groups: Flexible CUDA Thread Programming

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize.

CUDA 9 Features Revealed: Volta, Cooperative Groups and More

The CUDA 9 release includes support for Volta GPUs, Cooperative Groups programming model extensions, faster libraries, and improved developer tools.