The C++ template library CUB is a go-to for high-performance GPU primitive algorithms, but its traditional “two-phase” API, which separates memory estimation from allocation, can be cumbersome. While this programming model offers flexibility, it often results in repetitive boilerplate code.
This post explains the shift from this API to the new CUB single-call API introduced in CUDA 13.1, which simplifies development by managing memory under the hood without sacrificing performance.
What is CUB?
If you need to run a standard algorithm (such as scan, histogram, or sort) on a GPU, CUB is likely the fastest way to do it. As a principal component of the NVIDIA CUDA Core Compute Libraries (CCCL), CUB is designed to abstract away the complexity of manual CUDA thread management without sacrificing performance.
While libraries like Thrust provide a high-level, “host-side” interface similar to the C++ Standard Template Library (STL) for quick prototyping, CUB provides a set of “device-side” primitives. This enables developers to integrate highly optimized algorithms directly into their own custom kernels. To learn how to use CUB, check out the NVIDIA DLI course Fundamentals of Accelerated Computing with Modern CUDA C++.
The existing CUB two-phase API
CUB is widely recommended for harnessing the full computational capabilities of NVIDIA GPUs. Nevertheless, its usage carries some intricacies that can feel non-trivial. This section takes a step back to put these underlying mechanisms in perspective.
Developers often assume a straightforward, single-pass execution flow, where a single call to a function primitive suffices to execute the underlying algorithm and retrieve the results right after. The function’s side effects, such as modifying a variable or returning a result, are expected to be immediately visible to the next statement.
The CUB execution model diverges from this familiar single-pass pattern. Invoking a CUB primitive is a two-step process: the first call calculates the required temporary device memory size, and the second call, after you explicitly allocate that storage, executes the kernel.
The following is a conventional CUB call:
// The required size is unknown up front; CUB computes it in the first call
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
// FIRST CALL: determine temporary storage size
cub::DeviceScan::ExclusiveSum(nullptr, temp_storage_bytes, d_input, d_output, num_items);
// Allocate the required temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// SECOND CALL: run the actual scan
cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes, d_input, d_output, num_items);
The CUB interface introduces a practical challenge. The primitives must be invoked twice: first to determine the amount of temporary memory needed, and then a second time to execute the actual algorithm with the allocated storage.
A significant drawback of the traditional two-phase API is the lack of clarity regarding which arguments must remain consistent between the estimation and execution steps. Taking the snippet above for reference, it’s not programmatically clear which parameters influence the internal state and which can change between the calls, because the function signatures for both phases are identical. For example, the d_input and d_output arguments are only actually used during the second call.
Despite its intricacies, the existing design serves a fundamental purpose: by keeping allocation separate from execution, you can allocate a chunk of memory once and reuse it multiple times, or even share it between different algorithms, as sketched below.
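For illustration, here is a sketch of that reuse pattern under the two-phase API. The d_keys_in and d_keys_out arrays are hypothetical placeholders, the usual headers are assumed, and error checking is omitted:
// Query the temporary storage requirements of two different algorithms
size_t scan_bytes = 0, sort_bytes = 0;
cub::DeviceScan::ExclusiveSum(nullptr, scan_bytes, d_input, d_output, num_items);
cub::DeviceRadixSort::SortKeys(nullptr, sort_bytes, d_keys_in, d_keys_out, num_items);
// Allocate a single buffer large enough for both and reuse it across calls
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = std::max(scan_bytes, sort_bytes);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
cub::DeviceScan::ExclusiveSum(d_temp_storage, scan_bytes, d_input, d_output, num_items);
cub::DeviceRadixSort::SortKeys(d_temp_storage, sort_bytes, d_keys_in, d_keys_out, num_items);
cudaFree(d_temp_storage);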
While this flexibility matters to a non-negligible subset of users, only a relatively small share of the user base actually relies on it. That is why many users wrap their CUB calls to abstract away the two-step invocation required for every use. PyTorch is a case in point: it employs macros to wrap its CUB invocations into single calls and provide automatic memory management.
The following source code is from the pytorch/pytorch GitHub repo:
// handle the temporary storage and 'twice' calls for cub API
#define CUB_WRAPPER(func, ...) do {                                        \
  size_t temp_storage_bytes = 0;                                           \
  AT_CUDA_CHECK(func(nullptr, temp_storage_bytes, __VA_ARGS__));           \
  auto& caching_allocator = *::c10::cuda::CUDACachingAllocator::get();     \
  auto temp_storage = caching_allocator.allocate(temp_storage_bytes);      \
  AT_CUDA_CHECK(func(temp_storage.get(), temp_storage_bytes, __VA_ARGS__));\
} while (false)
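A call site then collapses to a single line, along the lines of the following hypothetical usage (stream stands in for whatever cudaStream_t the caller wants to run on):
// One macro invocation replaces the size query, the allocation, and the actual run
CUB_WRAPPER(cub::DeviceScan::ExclusiveSum, d_input, d_output, num_items, stream);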
The use of macros presents its own drawbacks, as they can obscure control flow and parameter passing, resulting in opaque code that is difficult to understand and significantly hinders debugging.
The new single-call CUB API
Given how widely such wrappers are used across production codebases, there is a clear need to extend CUB with a new single-call API:
// SINGLE CALL: allocation and execution in a single step
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items);
The example shows that no explicit memory allocation is required; note, however, that the allocation still happens under the hood. Figure 1 shows that the single-call interface, which covers temporary storage estimation, memory allocation, and invoking the algorithm, introduces zero overhead compared to the two-phase API.

Figure 1 compares the GPU runtime of the original two-phase ExclusiveSum call against the newly introduced single-call invocation. The x-axis represents multiple input sizes, while the y-axis shows the normalized execution time for each type of invocation. Two major conclusions can be drawn from this performance data:
- The new API introduces zero overhead
- Memory allocation still happens with the new API; it just occurs under the hood
The second point can be verified by peeking inside the implementation of the new API. Asynchronous allocation is embedded within the device primitive:
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, env = {}) {
  // ...
  // Temporary storage is allocated from the memory resource mr on the stream
  d_temp_storage = mr.allocate(stream, bytes);
  // ... the algorithm runs here, using d_temp_storage ...
  mr.deallocate(stream, d_temp_storage, bytes);
  // ...
}
The two-phase APIs have not been removed; they remain valid CUB calls. Rather, the single-call overloads are added on top of the existing APIs, and it’s expected that the majority of users will adopt them.
The environment and memory resources
Beyond resolving the problems previously mentioned, the new single-call CUB API also expands the execution configuration capabilities of the invoked primitive. It introduces an environment argument, which can either customize memory allocation using memory resources or just provide a stream to execute on (like the two-phase API).
Memory resources are a new utility for allocating and freeing device memory. The environment argument to single-call APIs can optionally contain a memory resource. When no memory resource is provided through the environment argument, the API uses a default memory resource supplied by CCCL. Alternatively, you can pass one of the other memory resources that ship with CCCL, or even your own custom memory resource.
// Use CCCL-provided memory resource type
cuda::device_memory_pool mr{cuda::devices[0]};
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, mr);
// Create and use your custom MR
my_memory_resource my_mr{cuda::experimental::devices[0]};
// Use it with CUB
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, my_mr);
With the new API, CUDA stream handling is not removed but rather encapsulated within the new env argument. A stream can still be passed explicitly as the last argument, as before, even though the temporary storage handling is gone. CCCL also provides cuda::stream_ref, which is type safe and should be preferred, and you can pass cuda::stream, which owns the underlying execution stream.
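For example, running on a specific stream might look like the following sketch, where raw_stream is a hypothetical pre-existing cudaStream_t:
// Wrap an existing cudaStream_t in a type-safe, non-owning reference
cuda::stream_ref stream{raw_stream};
// Execute on that stream; temporary storage is still handled internally
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, stream);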
Combining execution options
The single-call API enables more than just passing a memory resource or a stream as a last argument. Going forward, the environment argument will be the place for all execution-related knobs, including deterministic requirements, guarantees, user-defined tunings, and much more.
The single-call API unlocks a broad suite of execution configuration features, which raises a question: What is the best way to combine them all?
The solution lies in the new env argument. By leveraging cuda::std::execution, CUB provides a central endpoint that acts as a flexible “control panel” for your algorithm. Instead of rigidly defined function arguments, the environment enables you to create a combinatorial mix of any features you need. Whether you want to pair a custom stream with a specific memory pool, or combine strict deterministic requirements with a custom tuning policy, the env argument handles it all in a single, type-safe object.
// Create an owning stream on device 0
cuda::stream custom_stream{cuda::device_ref{0}};
// Attach the device's default memory pool as the memory resource property
auto memory_prop = cuda::std::execution::prop{cuda::mr::get_memory_resource,
                                              cuda::device_default_memory_pool(cuda::device_ref{0})};
// Combine the stream and the memory resource into a single environment
auto env = cuda::std::execution::env{custom_stream.get(), memory_prop};
cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items, env);
CUB currently provides the following algorithms that support the environment interface, with more to come:
- cub::DeviceReduce::Reduce
- cub::DeviceReduce::Sum
- cub::DeviceReduce::Min/Max/ArgMin/ArgMax
- cub::DeviceScan::ExclusiveSum
- cub::DeviceScan::ExclusiveScan
For up-to-date progress on the new environment-based overloads, see the CUB device primitives tracking issue in the NVIDIA/cccl GitHub repo.
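As an illustration, a reduction through the new interface follows the same pattern as the scan examples above. This is a sketch that assumes the single-call overload mirrors the two-phase argument order, with d_sum a hypothetical device pointer to a single output element:
// Single call: temporary storage is estimated, allocated, and released internally
cub::DeviceReduce::Sum(d_input, d_sum, num_items);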
Get started with CUB
By replacing the verbose two-phase pattern with a streamlined single-call interface, CUB offers a modern API that eliminates boilerplate without adding overhead. Its extensible env argument gives you a unified control panel for mixing memory resources, streams, and other execution facilities. Adopting the new API can simplify your codebase while fully harnessing the computational power of your GPU. Download CUDA 13.1 or later and start using these single-call APIs.
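To make the first steps concrete, here’s a minimal end-to-end sketch using the single-call overload. It assumes CUDA 13.1 with the new CCCL overloads available, and omits error checking for brevity:
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  constexpr int num_items = 8;
  const int h_input[num_items] = {1, 2, 3, 4, 5, 6, 7, 8};

  // Allocate and populate device input and output buffers
  int* d_input = nullptr;
  int* d_output = nullptr;
  cudaMalloc(&d_input, num_items * sizeof(int));
  cudaMalloc(&d_output, num_items * sizeof(int));
  cudaMemcpy(d_input, h_input, num_items * sizeof(int), cudaMemcpyHostToDevice);

  // Single call: size estimation, temporary allocation, and execution happen internally
  cub::DeviceScan::ExclusiveSum(d_input, d_output, num_items);

  // Copy the result back and print it
  int h_output[num_items];
  cudaMemcpy(h_output, d_output, num_items * sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < num_items; ++i) {
    printf("%d ", h_output[i]);  // Expected: 0 1 3 6 10 15 21 28
  }
  printf("\n");

  cudaFree(d_input);
  cudaFree(d_output);
  return 0;
}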