Register Cache: Caching for Warp-Centric CUDA Programs

Nadeem Mohammad, posted Oct 12 2017

In this post we introduce the “register cache”, an optimization technique that builds a virtual caching layer for the threads of a single warp. It is a software abstraction implemented on top of the NVIDIA GPU shuffle primitive.

Cooperative Groups: Flexible CUDA Thread Programming

Nadeem Mohammad, posted Oct 04 2017

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize.

