GTC 2020: A Faster Radix Sort Implementation
After clicking “Watch Now” you will be prompted to login or join.
Click “Watch Now” to login or join the NVIDIA Developer Program.
A Faster Radix Sort Implementation
Andrey Adinets, NVIDIA
We'll present a faster implementation of the least-significant-digit radix sort. Decoupled look-back is used to reduce the number of memory operations per partition pass from 3N to 2N. A faster partition implementation inside the thread block is used. For 32-bit keys, we use four partition passes, with 8 bits per digit. On V100 sorting 64 million random UInt32 keys, our implementation achieves the speed of 16 Gkey/s, which is more than 2x faster than cub::DeviceRadixSort.