By enabling CUDA kernels to be written in Python, much as they would be in C++, Numba bridges the gap between the Python ecosystem and the performance of CUDA.
However, CUDA C++ developers have access to many libraries that presently have no exposure in Python. These include the CUDA Core Compute Libraries (CCCL), cuRAND, and header-based implementations of numeric types like bfloat16, to name a few.
While each CUDA C++ library can introduce itself to Python in its own way, manually writing bindings for each library is laborious, repetitive, and prone to inconsistency. For example, the float16 and bfloat16 data types each define over 60 similar standalone functions, so near-duplicate bindings would have to be written for both types.
In addition, manually created bindings often fall out of sync when the underlying CUDA C++ library introduces new features.
Solution: Numbast
Numbast establishes an automated pipeline that converts CUDA C/C++ APIs into Numba bindings.
At a high level, top-level declarations are read from CUDA C++ header files, serialized, and passed to Python APIs. Numba binding generators then iterate through these declarations and generate Numba extensions for each of the APIs.
Demo: C++ declaration of a simple struct
To show Numbast in action, the following example demonstrates how Numba bindings are created for a demo myfloat16 type. These C++ declarations are inspired by those in the CUDA float16 header, simplified to demonstrate the binding generation in practice.
C++ declaration
The demo consists of the following C++ declarations:
// demo.cuh
struct __attribute__((aligned(2))) myfloat16
{
public:
    half data;

    __host__ __device__ myfloat16();
    __host__ __device__ myfloat16(double val);
    __host__ __device__ operator float() const;
};

__host__ __device__ myfloat16 operator+(const myfloat16 &lh, const myfloat16 &rh);
__host__ __device__ myfloat16 hsqrt(const myfloat16 a);
These declarations show the following elements in C++ syntax:
- A struct declaration, which has:
  - Device constructors
  - A few device methods, including conversion and arithmetic operators
- Two function declarations: an arithmetic operator overload and a square root function
For more information about the language features supported in Numbast, see Supported CUDA C++ declarations.
Setup script with Numbast
Numbast usage normally involves two steps:
- Parsing header files with AST_Canopy.
- Generating Numba bindings from the parsed headers.
The following code example sets up the Numba bindings by implementing these two steps:
import os
from ast_canopy import parse_declarations_from_source
from numbast import bind_cxx_struct, bind_cxx_function, MemoryShimWriter
from numba import types, cuda
from numba.core.datamodel.models import PrimitiveModel
import numpy as np
# Step 1:
# Use `AST_Canopy` to parse demo.cuh into an AST and read all its declarations.
source = os.path.join(os.path.dirname(__file__), "demo.cuh")
# Assume that you want to generate bindings for a machine with "sm_80"
# capability.
structs, functions, *_ = parse_declarations_from_source(source, [source], "sm_80")
shim_writer = MemoryShimWriter(f'#include "{source}"')
# Step 2:
# Make Numba bindings from the declarations.
# New type "myfloat16" is a Number type, data model is `PrimitiveModel`.
myfloat16 = bind_cxx_struct(shim_writer, structs[0], types.Number, PrimitiveModel)

# `bind_cxx_function` returns the generated binding to the C++ declaration.
# The first function binds to an operator; it is bound to `operator.add`,
# so you can directly use `myfloat16 + myfloat16` in kernels.
bind_cxx_function(shim_writer, functions[0])

# The second function is `hsqrt`, for which Numbast creates a new Python
# handle and returns it.
hsqrt = bind_cxx_function(shim_writer, functions[1])
Data models are the different ways that Numba represents the underlying data. The PrimitiveModel data model used for myfloat16 is well-suited for scalars. A StructModel (not used here) is useful for classes and structs. Other data models are less commonly used.
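For illustration, the following sketch shows how a data model is attached to a type with the standard Numba extension API. Numbast performs the equivalent registration for you; the Interval type here is hypothetical and is not part of Numbast:

from numba import types
from numba.core.extending import register_model
from numba.core.datamodel.models import StructModel

# A hypothetical front-end type, defined only to illustrate registration.
class Interval(types.Type):
    def __init__(self):
        super().__init__(name="Interval")

# StructModel lays the type out like a C struct with named members,
# which is why it suits classes and structs with multiple fields.
@register_model(Interval)
class IntervalModel(StructModel):
    def __init__(self, dmm, fe_type):
        members = [("lo", types.float64), ("hi", types.float64)]
        super().__init__(dmm, fe_type, members)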
Usage in the most natural way
In CUDA C++, you can construct a myfloat16 object and use it as follows:
__global__ void kernel()
{
    auto one = myfloat16(1.0);
    auto two = myfloat16(2.0);
    auto three = one + two;
    auto sqrt3 = hsqrt(three);
}
In a Numba kernel, you use the bindings the same way:
@cuda.jit(link=shim_writer.links())
def kernel():
    one = myfloat16(1.0)
    two = myfloat16(2.0)
    three = one + two
    sqrt3 = hsqrt(three)
Thanks to type inference in Numba, the code is even cleaner than the original C++.
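To run it, launch the kernel with a standard Numba grid configuration; a single thread is enough for this scalar demo:

# One block of one thread suffices; the kernel only manipulates scalars.
kernel[1, 1]()
cuda.synchronize()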
First supported bindings: bfloat16 data type
The first Numba binding supported through Numbast is a new bfloat16 data type. The bfloat16 data type can interoperate with PyTorch's torch.bfloat16 data type, so you can efficiently develop custom compute kernels with it.

The following code example shows how to use the new bfloat16 data type to develop a Numba kernel that performs computation on torch tensors. It passes PyTorch arrays of torch.bfloat16 type into a Numba compute kernel and performs math operations using CUDA intrinsics bound through Numbast.
from numba import float32
import numba.cuda as cuda
import torch
from numbast_extensions.bf16 import get_shims, hsin, nv_bfloat16

@cuda.jit(link=get_shims())
def torch_add_sin(a, b, out):
    i, j = cuda.grid(2)
    if i < out.shape[0] and j < out.shape[1]:
        # Arithmetic of bfloat16 type
        sum = a[i, j] + b[i, j]
        # bfloat16 native intrinsics
        sin_of_sum = hsin(sum)
        # bf16 to f32 upcast
        f32 = float32(sin_of_sum)
        # f32 to bf16 downcast
        bf16 = nv_bfloat16(f32)
        # Assignment to external array
        out[i, j] = bf16
a = torch.ones([2, 2], device=torch.device("cuda:0"), dtype=torch.bfloat16)
b = torch.ones([2, 2], device=torch.device("cuda:0"), dtype=torch.bfloat16)
expected = torch.sin(a + b)
out = torch.zeros([2, 2], device=torch.device("cuda:0"), dtype=torch.bfloat16)
threadsperblock = (16, 16)
blockspergrid = (1, 1)
torch_add_sin[blockspergrid, threadsperblock](a, b, out)
assert torch.equal(expected, out)
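The 2×2 tensors in this example fit comfortably in a single 16×16 block. For larger tensors, you would scale the grid with the array shape; a minimal sketch:

import math

# Scale the grid so every element of `out` is covered by one thread.
threadsperblock = (16, 16)
blockspergrid = (
    math.ceil(out.shape[0] / threadsperblock[0]),
    math.ceil(out.shape[1] / threadsperblock[1]),
)
torch_add_sin[blockspergrid, threadsperblock](a, b, out)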
You can download Numbast and the bfloat16 Numba bindings from conda-forge:
conda install -c nvidia -c rapidsai -c conda-forge ml_dtypes numbast-extensions
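After installation, a quick import check confirms that the bindings are available (these are the same imports used in the example above):

# Smoke test: these imports should succeed after the conda install above.
from numbast_extensions.bf16 import get_shims, hsin, nv_bfloat16
print(nv_bfloat16)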
Architecture
Numbast consists of two components:
- AST_Canopy: An underlying layer that parses and serializes C++ headers.
- Numbast: A user-facing layer that consumes the parsed results and builds Numba bindings dynamically.
AST_Canopy: declaration parser
In forest ecology, a canopy refers to the upper layer of a forest habitat zone. AST_Canopy is the package that inspects the top-level declarations in the forest of abstract syntax trees, extracting information from them and passing them to the Python layer. Here, top level refers to the public, user-facing APIs that a CUDA C++ library exposes to users.

Figure 1 shows the architecture of AST_Canopy, which consists of the following layers:
- clangTooling: A Clang library used to support writing standalone tools, such as Numbast.
- libastcanopy: Implements the declaration-parsing logic using clangTooling.
- pylibastcanopy: A binding that directly exposes libastcanopy APIs in Python.
- AST_Canopy: A layer on top of pylibastcanopy that provides a pleasantly Pythonic user experience.
In addition to header parsing and serialization, AST_Canopy also provides the following features:
- Environment detection at runtime: Auto-detects the libstdcxx and CUDA headers installed through Conda packages, and sets up the clang compilers accordingly.
- Flexibility in compute capability parsing: Enables configuring AST parsing based on different compute capabilities. Some headers conditionally expose code based on compute capability, and this feature supports cases where the header serialization and runtime environments differ (see the sketch after this list).
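As a sketch of that flexibility, the same header can be parsed for different targets by changing the final argument of parse_declarations_from_source, using the same signature as in the demo:

from ast_canopy import parse_declarations_from_source

# Parse the same header for two different compute capabilities. Headers
# that guard declarations with __CUDA_ARCH__ checks may yield different
# results for each target.
structs_80, functions_80, *_ = parse_declarations_from_source(
    "demo.cuh", ["demo.cuh"], "sm_80"
)
structs_90, functions_90, *_ = parse_declarations_from_source(
    "demo.cuh", ["demo.cuh"], "sm_90"
)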
Numbast: binding generator
Numbast sits downstream of AST_Canopy, consuming the declaration information and generating Numba bindings automatically. At a high level, Numbast provides a translation layer between C++ and Python syntax. As the demo suggests, most simple C++ syntax finds a natural counterpart in Python (Table 1).
| Operation | CUDA C++ | Numba |
| --- | --- | --- |
| Object construction | auto hpi = myfloat16(3.14); | hpi = myfloat16(3.14) |
| Attribute access | auto data = hpi.data; | data = hpi.data |
| Function invocation | auto r = hsqrt(hpi); | r = hsqrt(hpi) |
| Type casts | auto fpi = float(hpi); | fpi = types.float32(hpi) |
| Arithmetic operations | auto pi2 = hpi + hpi; | pi2 = hpi + hpi |

Table 1. Simple C++ syntax and its natural Python counterpart
Numba's type system has much in common with C++-like languages. Of course, C++ also has features with no Python counterpart, such as pointer semantics and template-based metaprogramming.
Numbast is the middle layer that encapsulates the similarities and differences between C++ and Python.
Lowering: The bigger picture
Bindings generated by Numbast are lowered with a Numba feature called foreign function invocation (FFI). Numba generates ABI-compatible shim functions that wrap the native CUDA function calls, then compiles them with NVRTC. Expect the same optimized performance that a CUDA C++ developer enjoys, minus the overhead of the FFI calls.
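The underlying FFI mechanism is also available directly in Numba. The following minimal sketch declares and calls an external device function by hand; the shim symbol __myfloat16_hsqrt and the file shims.cu are hypothetical, and Numbast generates and links the real shims for you through shim_writer.links():

from numba import cuda, float32

# Declare an external device function by symbol name and Numba signature.
hsqrt_ffi = cuda.declare_device("__myfloat16_hsqrt", float32(float32))

# Link the translation unit that defines the shim when compiling the kernel.
@cuda.jit(link=["shims.cu"])
def kernel(out):
    out[0] = hsqrt_ffi(out[0])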
Future versions of Numba-cuda will introduce link-time optimization (LTO) support, further narrowing the performance gap between accelerated Numba kernels and native CUDA C++.
Caveats
Both AST_Canopy and Numbast have caveats worth noting. AST_Canopy depends on clangTooling, so new CUDA language features that clangTooling does not yet support, and libraries that depend on those features, may not be parsed correctly. However, the majority of libraries and headers use features that clangTooling supports.
Conclusion
In this post, we introduced a new Numba binding generation tool, Numbast. We showed that, by using Numbast, you can quickly benefit from the ever-growing feature set of CUDA C++.
Numbast v0.1.0 provides the new bfloat16 data type to Numba. You can expect to see more bindings generated by Numbast, including new data types, NVSHMEM bindings, and CCCL bindings.