# GTC 2020: A Faster Radix Sort Implementation



#### Andrey Adinets, NVIDIA

##### GTC 2020

We present a faster implementation of least-significant-digit (LSD) radix sort. Decoupled look-back reduces the number of memory operations per partition pass from 3N to 2N, and a faster partition implementation is used within each thread block. For 32-bit keys, we use four partition passes with 8 bits per digit. On a V100, sorting 64 million random UInt32 keys, our implementation achieves 16 Gkey/s, which is more than 2x faster than cub::DeviceRadixSort.