Features, Cooperative Groups, CUDA, Optimization
In this post we introduce the “register cache”, an optimization technique that builds a virtual caching layer for the threads of a single warp. It is a software abstraction implemented on top of the warp shuffle primitive of NVIDIA GPUs, which lets threads in a warp read each other's registers directly.
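To make the idea concrete, here is a minimal sketch of the pattern, not the post's actual implementation. It assumes a single warp of 32 threads and a 3-point sum stencil; the names `stencil3_warp` and `run_stencil_demo` are hypothetical. Each thread keeps one input element in a register and reads its neighbors' values with `__shfl_up_sync` and `__shfl_down_sync` instead of staging them in shared memory.

```cuda
#include <cuda_runtime.h>

constexpr unsigned FULL_MASK = 0xffffffffu;  // all 32 lanes participate
constexpr int WARP_SIZE = 32;

// Hypothetical kernel: launched with exactly one warp. Each lane caches one
// input element in a register; neighbors are fetched from other lanes'
// registers with warp shuffles rather than from shared memory.
__global__ void stencil3_warp(const int* in, int* out)
{
    int lane = threadIdx.x;   // single warp, so thread index equals lane id
    int mine = in[lane];      // this lane's slot of the register cache

    // Every lane executes both shuffles, so the full mask is valid.
    int left  = __shfl_up_sync(FULL_MASK, mine, 1);    // value held by lane - 1
    int right = __shfl_down_sync(FULL_MASK, mine, 1);  // value held by lane + 1

    // Edge lanes have no neighbor and get their own value back from the
    // shuffle, so treat the missing neighbor as zero.
    if (lane == 0)             left  = 0;
    if (lane == WARP_SIZE - 1) right = 0;

    out[lane] = left + mine + right;
}

// Hypothetical host helper: runs the kernel on 0..31 and checks the result
// against a CPU reference. Returns true on success.
bool run_stencil_demo()
{
    int h_in[WARP_SIZE], h_out[WARP_SIZE];
    for (int i = 0; i < WARP_SIZE; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    stencil3_warp<<<1, WARP_SIZE>>>(d_in, d_out);
    // This blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < WARP_SIZE; ++i) {
        int l = (i > 0) ? h_in[i - 1] : 0;
        int r = (i < WARP_SIZE - 1) ? h_in[i + 1] : 0;
        if (h_out[i] != l + h_in[i] + r) return false;
    }
    return true;
}
```

Calling `run_stencil_demo()` from a host `main` and compiling with `nvcc` should be enough to try it. A fuller register cache would typically hold several elements per thread and handle halos across warps, which this sketch leaves out.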
Features, Algorithms, Cooperative Groups, CUDA, Parallel Programming
In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize.
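As an illustration of why sharing data requires synchronization, the sketch below sums 256 integers within one thread block. It assumes the Cooperative Groups API suggested by the post's tags; the names `block_sum` and `run_block_sum_demo` are hypothetical. Each `block.sync()` (equivalent to `__syncthreads()` for a whole block) guarantees that values written to shared memory by one thread are visible before another thread reads them.

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

constexpr int BLOCK = 256;  // threads per block; must be a power of two here

// Hypothetical kernel: tree reduction over one block's worth of input.
__global__ void block_sum(const int* in, int* out)
{
    __shared__ int scratch[BLOCK];
    cg::thread_block block = cg::this_thread_block();
    unsigned t = block.thread_rank();

    scratch[t] = in[t];
    block.sync();  // all loads must land before any thread reads a neighbor

    for (unsigned stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (t < stride) scratch[t] += scratch[t + stride];
        block.sync();  // reached by every thread, outside the conditional
    }

    if (t == 0) *out = scratch[0];
}

// Hypothetical host helper: sums 0..255 on the GPU. Returns the sum, or -1
// if any CUDA call fails.
int run_block_sum_demo()
{
    int h_in[BLOCK];
    for (int i = 0; i < BLOCK; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    block_sum<<<1, BLOCK>>>(d_in, d_out);

    int h_out = 0;
    // This blocking copy also surfaces any error from the kernel launch.
    cudaError_t err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out : -1;
}
```

For the input 0..255 the helper should return 255 * 256 / 2 = 32640. Removing either `block.sync()` call would introduce a data race, because a thread could read a partial sum before its neighbor had written it.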