Register Cache: Caching for Warp-Centric CUDA Programs

Features, Cooperative Groups, CUDA, Optimization

Nadeem Mohammad, posted Oct 12 2017

In this post we introduce the “register cache”, an optimization technique that builds a virtual caching layer for the threads of a single warp. It is a software abstraction implemented on top of the NVIDIA GPU shuffle primitive.


Cooperative Groups: Flexible CUDA Thread Programming

Features, Algorithms, Cooperative Groups, CUDA, Parallel Programming

Nadeem Mohammad, posted Oct 04 2017

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize.
