Technical Walkthrough 0

Cooperative Groups: Flexible CUDA Thread Programming

In efficient parallel algorithms, threads cooperate and share data to perform collective computations. To share data, the threads must synchronize. 16 MIN READ

Cutting Edge Parallel Algorithms Research with CUDA

Leyuan Wang, a Ph.D. student in the UC Davis Department of Computer Science, presented one of only two “Distinguished Papers” of the 51 accepted at Euro-Par… 14 MIN READ

Voting and Shuffling to Optimize Atomic Operations

Some years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to… 10 MIN READ

GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical… 9 MIN READ

CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

This post introduces warp-aggregated atomics, a useful technique to improve performance when many CUDA threads atomically update a single counter. 14 MIN READ

Faster Parallel Reductions on Kepler

Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for… 12 MIN READ