
Topping the GPU MODE Kernel Leaderboard with NVIDIA cuda.compute

The leaderboard scores how fast users’ custom GPU kernels solve a set of standard problems like vector addition, sorting, and matrix multiply.

Python dominates machine learning for its ergonomics, but writing truly fast GPU code has historically meant dropping into C++ to write custom kernels and to maintain bindings back to Python. For most Python developers and researchers, this is a significant barrier to entry.

Frameworks like PyTorch address this by implementing kernels in CUDA C++, either handwritten or built on libraries like the NVIDIA CUDA Core Compute Libraries (CCCL). Handwritten kernels are time-consuming and require deep, low-level architectural expertise. Using CUB, a C++ library within CCCL, is often the better option, since its primitives are highly optimized per architecture and rigorously tested. But exposing CUB to Python has traditionally meant building and maintaining bindings and pre-instantiating C++ templates with fixed types and operators, limiting flexibility on the Python side.

The NVIDIA cuda.compute library overcomes these limitations by offering a high-level, Pythonic API for device-wide CUB primitives. 

Using cuda.compute helped an NVIDIA CCCL team top the GPU MODE leaderboard. GPU MODE is an online community of more than 20,000 members focused on learning and improving GPU programming; it hosts kernel competitions to find the best implementations for a variety of tasks, from simple vector addition to more complex block matrix multiplications.

The NVIDIA CCCL team focuses on delivering “speed-of-light” (SOL) implementations of parallel primitives across GPU architectures through high-level abstractions. It achieved the most first-place finishes overall on the tested GPU architectures: NVIDIA B200, NVIDIA H100, NVIDIA A100, and NVIDIA L4. 

In this post, we’ll share more details about how we were able to place so high on the leaderboard.

CUDA Python: GPU performance meets productivity

CUB offers highly optimized CUDA kernels for common parallel operations, including those featured in the GPU MODE competition. These kernels are architecturally tuned and widely considered near speed-of-light implementations.

The cuda.compute library supports custom types and operators defined directly in Python. Under the hood, it just-in-time (JIT) compiles specialized kernels and applies link-time optimization to deliver near-SOL performance on par with CUDA C++. You stay in Python while getting the flexibility of templates and the performance of tuned CUDA kernels.
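
As a quick illustration, here’s a minimal sketch of specializing a transform with an operator written as an ordinary Python function. The names, shapes, and dtype are illustrative, and we’re assuming the builder accepts a Python callable in place of a built-in operator:

import torch

import cuda.compute

# Custom elementwise operator defined as a plain Python function;
# cuda.compute JIT-compiles a kernel specialized for it.
def scaled_add(a, b):
    return a + 2.0 * b

# Example tensors; these also specialize the callable's dtype and layout
x = torch.rand(1024, dtype=torch.float32, device="cuda")
y = torch.rand(1024, dtype=torch.float32, device="cuda")
z = torch.empty(1024, dtype=torch.float32, device="cuda")

# Assumption: a Python callable can be passed where an OpKind is accepted
scaled = cuda.compute.make_binary_transform(x, y, z, scaled_add)
scaled(x, y, z, x.numel())  # computes z[i] = x[i] + 2.0 * y[i]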

With cuda.compute you get:

  1. Fast, composable CUDA workflows in Python: Develop efficient and modular CUDA applications directly within Python.
  2. Custom data types and operators: Utilize custom data types and operators without the need for C++ bindings.
  3. Optimized performance: Achieve architecture-aware performance through proven CUB primitives.
  4. Rapid iteration: Accelerate development with JIT compilation, which gives you the flexibility and fast iteration cycles you need without compromising CUDA C++ levels of performance.

The leaderboard results

Using cuda.compute, we submitted entries across GPU MODE benchmarks for PrefixSum, VectorAdd, Histogram, Sort, and Grayscale (look for username Nader).

For algorithms like sort, the CUB implementation was two to four times faster than the next best submission. This is the CCCL promise in action: SOL-class algorithms that outperform custom kernels for standard primitives you’d otherwise spend months building.

Where we didn’t take first place, the gap typically came down to us not having a tuning policy for that specific GPU. In some instances, our implementation was a more general solution, while higher-ranked submissions were specialized to specific problem sizes. 

In other cases, the first place submission was already using CUB or cuda.compute under the hood. This underscores that these libraries already represent the performance ceiling for many standard GPU algorithms, and that their performance characteristics are now well understood and intentionally relied upon by leading submissions.

This isn’t about winning

Leaderboard results are a byproduct; the real objective is learning with the community, benchmarking transparently, and demonstrating the power of Python for high-performance GPU work.

Our goal isn’t to discourage hand-written CUDA kernels. There are plenty of valid cases for custom kernels—novel algorithms, tight fusion, or specialized memory access patterns—but for standard primitives (sort, scan, reduce, histogram, etc.), your first move should be a proven, high-performance implementation. With cuda.compute, those tuned CUB primitives are now accessible directly from native Python, allowing you to build high-quality, production-grade, GPU-accelerated Python libraries. 

This is great news for anyone building the next CuPy, RAPIDS component, or a custom GPU-accelerated Python library: faster iteration, fewer glue layers, and production-grade performance, all while staying in pure Python.

How cuda.compute looks in practice

One of the first examples anyone writes when learning GPU programming is vector addition. Using cuda.compute, we can solve it in pure Python by calling a device-wide primitive.

import torch

import cuda.compute
from cuda.compute import OpKind

# Build-time tensors (used to specialize the callable)
build_A = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_B = torch.empty(2, 2, dtype=torch.float16, device="cuda")
build_out = torch.empty(2, 2, dtype=torch.float16, device="cuda")

# JIT compiling the transform kernel
transform = cuda.compute.make_binary_transform(build_A, build_B, build_out, OpKind.PLUS)

# Defining custom_kernel is required to submit to the GPU MODE competition
def custom_kernel(data):
    # Invoking our transform operation on some input data
    A, B, out = data
    transform(A, B, out, A.numel())
    return out
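
Calling it then looks like the minimal usage sketch below, where the inputs simply match the build-time dtype and shape and the result is checked against PyTorch’s own addition:

# Inputs matching the dtype/shape used to specialize the transform
A = torch.rand(2, 2, dtype=torch.float16, device="cuda")
B = torch.rand(2, 2, dtype=torch.float16, device="cuda")
out = torch.empty(2, 2, dtype=torch.float16, device="cuda")

result = custom_kernel((A, B, out))
assert torch.allclose(result, A + B)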

You can find more cuda.compute examples on the GPU MODE Leaderboard. The pattern is consistent: simple code with speed-of-light performance, achieved by calling device-wide building blocks that are automatically optimized by CCCL for every GPU generation.

Other top-performing submissions for the VectorAdd category required dropping into C++ and inline PTX, resulting in code that is highly architecture-dependent.
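
If you want to see the difference on your own hardware, here’s a rough timing sketch that compares the transform callable built above against torch.add using CUDA events. It is not the GPU MODE harness, the problem size is arbitrary, and we’re assuming the specialized callable accepts contiguous float16 CUDA tensors of any size (if the specialization is shape-sensitive, rebuild it with representative tensors):

import torch

def time_ms(fn, iters=100):
    # Warm up, then time with CUDA events; returns average milliseconds per call
    for _ in range(10):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

N = 1 << 24
A = torch.rand(N, dtype=torch.float16, device="cuda")
B = torch.rand(N, dtype=torch.float16, device="cuda")
out = torch.empty_like(A)

print("cuda.compute transform:", time_ms(lambda: transform(A, B, out, A.numel())), "ms")
print("torch.add             :", time_ms(lambda: torch.add(A, B, out=out)), "ms")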

Try cuda.compute today

If you’re building Python GPU software, whether custom pipelines, library components, or performance-sensitive code, cuda.compute gives you the option to use CCCL CUB primitives directly in Python and leverage building blocks designed for architecture-aware speed-of-light performance.

To try cuda.compute today, you can install it via pip or conda:

pip install cuda-cccl[cu13] (or [cu12])

conda install -c conda-forge cccl-python cuda-version=12 (or 13)
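
To confirm the install, a quick smoke test is to import the module; if the command returns without error, you’re ready to go:

python -c "import cuda.compute"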

We’re building this with the community. Your feedback and benchmarks shape our roadmap, so don’t hesitate to reach out to us on GitHub or in the GPU MODE Discord.
