The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code.
This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1, which simplifies development by managing memory under the hood without sacrificing performance.
What is CUB?
If you need to run a standard algorithm (such as scan, histogram, or sort) on a GPU, CUB is likely the fastest way to do it. As a principal component of the NVIDIA CUDA Core Compute Libraries (CCCL), CUB is designed to abstract away the complexity of manual CUDA thread management without sacrificing performance.
While libraries like Thrust provide a high-level, “host-side” interface similar to the C++ Standard Template Library (STL) for quick prototyping, CUB provides a set of “device-side” primitives. This enables developers to integrate highly optimized algorithms directly into their own custom kernels. To learn how to use CUB, check out the NVIDIA DLI course Fundamentals of Accelerated Computing with Modern CUDA C++.
The existing CUB two-phase API
CUB is widely recommended for harnessing the full computational capabilities of NVIDIA GPUs. Nevertheless, its usage carries some intricacies that can feel non-trivial. This section takes a step back to put these underlying mechanisms in perspective.
Developers often assume a straightforward, single-pass execution flow, where a single call to a function primitive suffices to execute the underlying algorithm and retrieve the results right after. The function’s side effects, such as modifying a variable or returning a result, are expected to be immediately visible to the next statement.
The CUB execution model diverges from this familiar single-pass pattern. Invoking a CUB primitive is a two-step process: the first call calculates the required temporary device memory size, and the second call, after you explicitly allocate that storage, executes the kernel.
The following is a conventional CUB call:
// The required size is unknown up front; CUB computes it in the first call
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
// FIRST CALL: determine temporary storage size
cub::DeviceScan::ExclusiveSum(nullptr, temp_storage_bytes, d_input, d_output, num_items);
// Allocate the required temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// SECOND CALL: run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_input, d_output, num_items);
The CUB interface introduces a practical challenge. The primitives must be invoked twice: first to determine the amount of temporary memory needed, and then a second time to execute the actual algorithm with the allocated storage.
A significant drawback of the traditional two-phase API is the lack of clarity regarding which arguments must remain consistent between the estimation and execution steps. Taking the snippet above for reference, it’s not programmatically clear which parameters influence the internal state and which can change between the calls, because the function signatures for both phases are identical. For example, the d_input and d_output arguments are only actually used during the second call.
Despite its intricacies, the existing design serves a fundamental purpose: by keeping allocation separate from execution, you can allocate a chunk of memory once and reuse it multiple times, or even share it between different algorithms, as sketched below.
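For illustration, here is a sketch of that reuse pattern under the two-phase API. The d_keys_in and d_keys_out arrays are hypothetical placeholders, the usual headers are assumed, and error checking is omitted:
// Query the temporary storage requirements of two different algorithms
size_t scan_bytes = 0, sort_bytes = 0;
cub::DeviceScan::ExclusiveSum(nullptr, scan_bytes, d_input, d_output, num_items);
cub::DeviceRadixSort::SortKeys(nullptr, sort_bytes, d_keys_in, d_keys_out, num_items);
// Allocate a single buffer large enough for both and reuse it across calls
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = std::max(scan_bytes, sort_bytes);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceScan::ExclusiveSum(d_temp_storage, scan_bytes, d_input, d_output, num_items);
cub::DeviceRadixSort::SortKeys(d_temp_storage, sort_bytes, d_keys_in, d_keys_out, num_items);
cudaFree(d_temp_storage);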
While this flexibility matters to a non-negligible subset of users, only a relatively small share of the user base actually relies on it. That is why many users wrap their CUB calls to abstract away the two-step invocation required for every use. PyTorch is a case in point: it employs macros to wrap its CUB invocations into single calls and provide automatic memory management.
The following source code is from the pytorch/pytorch GitHub repo:
// handle the temporary storage and 'twice' calls for cub API
#define CUB_WRAPPER(func, ...) do {                                        \
  size_t temp_storage_bytes = 0;                                           \
  AT_CUDA_CHECK(func(nullptr, temp_storage_bytes, __VA_ARGS__));           \
  auto& caching_allocator = *::c10::cuda::CUDACachingAllocator::get();     \
  auto temp_storage = caching_allocator.allocate(temp_storage_bytes);      \
  AT_CUDA_CHECK(func(temp_storage.get(), temp_storage_bytes, __VA_ARGS__));\
} while (false)
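A call site then collapses to a single line, along the lines of the following hypothetical usage (stream stands in for whatever cudaStream_t the caller wants to run on):
// One macro invocation replaces the size query, the allocation, and the actual run
CUB_WRAPPER(cub::DeviceScan::ExclusiveSum, d_input, d_output, num_items, stream);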
The use of macros presents its own drawbacks, as they can obscure control flow and parameter passing, resulting in opaque code that is difficult to understand and significantly hinders debugging.
The new single-call CUB API
Given how widely such wrappers are used across production codebases, there is a clear need to extend CUB with a new single-call API:
// SINGLE CALL: allocation and execution in a single step
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items);
The example shows that no explicit memory allocation is required; note, however, that the allocation still happens under the hood. Figure 1 shows that the single-call interface, which covers temporary storage estimation, memory allocation, and invoking the algorithm, introduces zero overhead compared to the two-phase API.

Figure 1 compares the GPU runtime of the original two-phase ExclusiveSum call against the newly introduced single-call invocation. The x-axis represents multiple input sizes, while the y-axis shows the normalized execution time for each type of invocation. Two major conclusions can be drawn from this performance data:
- The new API introduces zero overhead
- Memory allocation still happens with the new API; it just occurs under the hood
The second point can be verified by peeking inside the implementation of the new API. Asynchronous allocation is embedded within the device primitive:
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, env = {}) {
  // ...
  // Temporary storage is allocated from the memory resource mr on the stream
  d_temp_storage = mr.allocate(stream, bytes);
  // ... the algorithm runs here, using d_temp_storage ...
  mr.deallocate(stream, d_temp_storage, bytes);
  // ...
}
The two-phase APIs have not been removed; they remain valid CUB calls. Rather, the single-call overloads are added on top of the existing APIs, and it’s expected that the majority of users will adopt them.
The environment and memory resources
Beyond resolving the problems previously mentioned, the new single-call CUB API also expands the execution configuration capabilities of the invoked primitive. It introduces an environment argument, which can either customize memory allocation using memory resources or just provide a stream to execute on (like the two-phase API).
Memory resources are a new utility for allocating and freeing device memory. The environment argument to single-call APIs can optionally contain a memory resource. When no memory resource is provided through the environment argument, the API uses a default memory resource supplied by CCCL. Alternatively, you can pass one of the other memory resources that ship with CCCL, or even your own custom memory resource.
// Use CCCL-provided memory resource type
cuda::device_memory_pool mr{cuda::devices[0]};
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, mr);
// Create and use your custom MR
my_memory_resource my_mr{cuda::experimental::devices[0]};
// Use it with CUB
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, my_mr);
With the new API, CUDA stream handling is not removed but rather encapsulated within the new env argument. A stream can still be passed explicitly as the last argument, as before, even though the temporary storage handling is gone. CCCL also provides cuda::stream_ref, which is type safe and should be preferred, and you can pass cuda::stream, which owns the underlying execution stream.
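For example, running on a specific stream might look like the following sketch, where raw_stream is a hypothetical pre-existing cudaStream_t:
// Wrap an existing cudaStream_t in a type-safe, non-owning reference
cuda::stream_ref stream{raw_stream};
// Execute on that stream; temporary storage is still handled internally
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, stream);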
Combining execution options
The single-call API enables more than just passing a memory resource or a stream as a last argument. Going forward, the environment argument will be the place for all execution-related knobs, including deterministic requirements, guarantees, user-defined tunings, and much more.
The single-call API unlocks a broad suite of execution configuration features, which raises a question: What is the best way to combine them all?
The solution lies in the new env argument. By leveraging cuda::std::execution, CUB provides a central endpoint that acts as a flexible “control panel” for your algorithm. Instead of rigidly defined function arguments, the environment enables you to create a combinatorial mix of any features you need. Whether you want to pair a custom stream with a specific memory pool, or combine strict deterministic requirements with a custom tuning policy, the env argument handles it all in a single, type-safe object.
// Create an owning stream on device 0
cuda::stream custom_stream{cuda::device_ref{0}};
// Attach the device's default memory pool as the memory resource property
auto memory_prop = cuda::std::execution::prop{cuda::mr::get_memory_resource,
                                              cuda::device_default_memory_pool(cuda::device_ref{0})};
// Combine the stream and the memory resource into a single environment
auto env = cuda::std::execution::env{custom_stream.get(), memory_prop};
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, env);
CUB currently provides the following algorithms that support the environment interface, with more to come:
- cub::DeviceReduce::Reduce
- cub::DeviceReduce::Sum
- cub::DeviceReduce::Min/Max/ArgMin/ArgMax
- cub::DeviceScan::ExclusiveSum
- cub::DeviceScan::ExclusiveScan
For up-to-date progress on the new environment-based overloads, see the CUB device primitives tracking issue in the NVIDIA/cccl GitHub repo.
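As an illustration, a reduction through the new interface follows the same pattern as the scan examples above. This is a sketch that assumes the single-call overload mirrors the two-phase argument order, with d_sum a hypothetical device pointer to a single output element:
// Single call: temporary storage is estimated, allocated, and released internally
cub::DeviceReduce::Sum(d_input, d_sum, num_items);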
Get started with CUB
By replacing the verbose two-phase pattern with a streamlined single-call interface, CUB offers a modern API that eliminates boilerplate without adding overhead. Its extensible env argument gives you a unified control panel for mixing memory resources, streams, and other execution facilities. Adopting the new API can simplify your codebase while fully harnessing the computational power of your GPU. Download CUDA 13.1 or later and start using these single-call APIs.
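To make the first steps concrete, here’s a minimal end-to-end sketch using the single-call overload. It assumes CUDA 13.1 with the new CCCL overloads available, and omits error checking for brevity:
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  constexpr int num_items = 8;
  const int h_input[num_items] = {1, 2, 3, 4, 5, 6, 7, 8};

  // Allocate and populate device input and output buffers
  int* d_input = nullptr;
  int* d_output = nullptr;
  cudaMalloc(&d_input, num_items * sizeof(int));
  cudaMalloc(&d_output, num_items * sizeof(int));
  cudaMemcpy(d_input, h_input, num_items * sizeof(int), cudaMemcpyHostToDevice);

  // Single call: size estimation, temporary allocation, and execution happen internally
  cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items);

  // Copy the result back and print it
  int h_output[num_items];
  cudaMemcpy(h_output, d_output, num_items * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < num_items; ++i) {
    printf("%d ", h_output[i]);  // Expected: 0 1 3 6 10 15 21 28
  }
  printf("\n");

  cudaFree(d_input);
  cudaFree(d_output);
  return 0;
}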