After clicking “Watch Now” you will be prompted to login or join.


Click “Watch Now” to login or join the NVIDIA Developer Program.


A Faster Radix Sort Implementation

Andrey Adinets, NVIDIA

GTC 2020

We'll present a faster implementation of the least-significant-digit radix sort. Decoupled look-back is used to reduce the number of memory operations per partition pass from 3N to 2N. A faster partition implementation inside the thread block is used. For 32-bit keys, we use four partition passes, with 8 bits per digit. On V100 sorting 64 million random UInt32 keys, our implementation achieves the speed of 16 Gkey/s, which is more than 2x faster than cub::DeviceRadixSort.

View More GTC 2020 Content