By enabling CUDA kernels to be written in Python, much as they would be in C++, Numba bridges the gap between the Python ecosystem and the performance of CUDA.
However, CUDA C++ developers have access to many libraries that presently have no exposure in Python. These include the CUDA Core Compute Libraries (CCCL), cuRAND, and header-based implementations of numeric types like bfloat16, to name a few.
While each CUDA C++ library can introduce itself to Python in its own way, manually writing bindings for each library is laborious, repetitive, and prone to inconsistency. For example, the float16 and bfloat16 data types each define over 60 similar standalone functions, so near-duplicate bindings would have to be written for both types.
In addition, manually created bindings often fall out of sync when the underlying CUDA C++ library introduces new features.
Solution: Numbast
Numbast establishes an automated pipeline that converts CUDA C/C++ APIs into Numba bindings.
At a high level, top-level declarations are read from CUDA C++ header files, serialized, and passed to Python APIs. Numba binding generators then iterate through these declarations and generate Numba extensions for each of the APIs.
Demo: C++ declaration of a simple struct
To show Numbast in action, the following example demonstrates how Numba bindings are created for a demo myfloat16 type. These C++ declarations are inspired by those in the CUDA float16 header, simplified to demonstrate the binding generation in practice.
C++ declaration
The demo consists of the following C++ declarations:
// demo.cuh
struct __attribute__((aligned(2))) myfloat16
{
public:
    half data;

    __host__ __device__ myfloat16();
    __host__ __device__ myfloat16(double val);
    __host__ __device__ operator float() const;
};

__host__ __device__ myfloat16 operator+(const myfloat16 &lh, const myfloat16 &rh);
__host__ __device__ myfloat16 hsqrt(const myfloat16 a);
These declarations show the following elements in C++ syntax:
- A struct declaration, which has:
  - Device constructors
  - A few device methods, including conversion and arithmetic operators
- Two function declarations: an arithmetic operator overload and a square root function
For more information about the language features supported in Numbast, see Supported CUDA C++ declarations.
Setup script with Numbast
Numbast usage normally involves two steps:
- Parsing header files with AST_Canopy.
- Generating Numba bindings from the parsed headers.
The following code example sets up the Numba bindings by implementing these two steps:
import os
from ast_canopy import parse_declarations_from_source
from numbast import bind_cxx_struct, bind_cxx_function, MemoryShimWriter
from numba import types, cuda
from numba.core.datamodel.models import PrimitiveModel
import numpy as np
# Step 1:
# Use `AST_Canopy` to parse demo.cuh into an AST and read all its declarations.
source = os.path.join(os.path.dirname(__file__), "demo.cuh")
# Assume that you want to generate bindings for a machine with "sm_80"
# capability.
structs, functions, *_ = parse_declarations_from_source(source, [source], "sm_80")
shim_writer = MemoryShimWriter(f'#include "{source}"')
# Step 2:
# Make Numba bindings from the declarations.
# New type "myfloat16" is a Number type, data model is `PrimitiveModel`.
myfloat16 = bind_cxx_struct(shim_writer, structs[0], types.Number, PrimitiveModel)

# `bind_cxx_function` returns the generated binding to the C++ declaration.
# The first function binds to an operator; it is bound to `operator.add`,
# so you can directly use `myfloat16 + myfloat16` in kernels.
bind_cxx_function(shim_writer, functions[0])

# The second function is `hsqrt`, for which Numbast creates a new Python
# handle and returns it.
hsqrt = bind_cxx_function(shim_writer, functions[1])
Data models are the different ways that Numba represents the underlying data. The PrimitiveModel data model used for myfloat16 is well-suited for scalars. A StructModel (not used here) is useful for classes and structs. Other data models are less commonly used.
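For illustration, the following sketch shows how a data model is attached to a type with the standard Numba extension API. Numbast performs the equivalent registration for you; the Interval type here is hypothetical and is not part of Numbast:

from numba import types
from numba.core.extending import register_model
from numba.core.datamodel.models import StructModel

# A hypothetical front-end type, defined only to illustrate registration.
class Interval(types.Type):
    def __init__(self):
        super().__init__(name="Interval")

# StructModel lays the type out like a C struct with named members,
# which is why it suits classes and structs with multiple fields.
@register_model(Interval)
class IntervalModel(StructModel):
    def __init__(self, dmm, fe_type):
        members = [("lo", types.float64), ("hi", types.float64)]
        super().__init__(dmm, fe_type, members)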
Usage in the most natural way
In CUDA C++, you can construct a myfloat16 object and use it as follows:
__global__ void kernel()
{
    auto one = myfloat16(1.0);
    auto two = myfloat16(2.0);
    auto three = one + two;
    auto sqrt3 = hsqrt(three);
}
In a Numba kernel, you use the bindings the same way:
@cuda.jit(link=shim_writer.links())
def kernel():
    one = myfloat16(1.0)
    two = myfloat16(2.0)
    three = one + two
    sqrt3 = hsqrt(three)
Thanks to type inference in Numba, the code is even cleaner than the original C++.
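To run it, launch the kernel with a standard Numba grid configuration; a single thread is enough for this scalar demo:

# One block of one thread suffices; the kernel only manipulates scalars.
kernel[1, 1]()
cuda.synchronize()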
First supported bindings: bfloat16 data type
The first Numba binding supported through Numbast is a new bfloat16 data type. The bfloat16 data type can interoperate with PyTorch's torch.bfloat16 data type, so you can efficiently develop custom compute kernels with it.

The following code example shows how to use the new bfloat16 data type to develop a Numba kernel that performs computation on torch tensors. It passes PyTorch arrays of torch.bfloat16 type into a Numba compute kernel and performs math operations using CUDA intrinsics bound through Numbast.
from numba import float32
import numba.cuda as cuda
import torch
from numbast_extensions.bf16 import get_shims, hsin, nv_bfloat16

@cuda.jit(link=get_shims())
def torch_add_sin(a, b, out):
    i, j = cuda.grid(2)
    if i < out.shape[0] and j < out.shape[1]:
        # Arithmetic of bfloat16 type
        sum = a[i, j] + b[i, j]
        # bfloat16 native intrinsics
        sin_of_sum = hsin(sum)
        # bf16 to f32 upcast
        f32 = float32(sin_of_sum)
        # f32 to bf16 downcast
        bf16 = nv_bfloat16(f32)
        # Assignment to external array
        out[i, j] = bf16
a = torch.ones([2, 2], device=torch.device("cuda:0"), dtype=torch.bfloat16)
b = torch.ones([2, 2], device=torch.device("cuda:0"), dtype=torch.bfloat16)
expected = torch.sin(a + b)
out = torch.zeros([2, 2], device=torch.device("cuda:0"), dtype=torch.bfloat16)
threadsperblock = (16, 16)
blockspergrid = (1, 1)
torch_add_sin[blockspergrid, threadsperblock](a, b, out)
assert torch.equal(expected, out)
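The 2×2 tensors in this example fit comfortably in a single 16×16 block. For larger tensors, you would scale the grid with the array shape; a minimal sketch:

import math

# Scale the grid so every element of `out` is covered by one thread.
threadsperblock = (16, 16)
blockspergrid = (
    math.ceil(out.shape[0] / threadsperblock[0]),
    math.ceil(out.shape[1] / threadsperblock[1]),
)
torch_add_sin[blockspergrid, threadsperblock](a, b, out)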
You can download Numbast and the bfloat16 Numba bindings from conda-forge:
conda install -c nvidia -c rapidsai -c conda-forge ml_dtypes numbast-extensions
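After installation, a quick import check confirms that the bindings are available (these are the same imports used in the example above):

# Smoke test: these imports should succeed after the conda install above.
from numbast_extensions.bf16 import get_shims, hsin, nv_bfloat16
print(nv_bfloat16)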
Architecture
Numbast consists of two components:
- AST_Canopy: An underlying layer that parses and serializes C++ headers.
- Numbast: A user-facing layer that consumes the parsed results and builds Numba bindings dynamically.
AST_Canopy: declaration parser
In forest ecology, a canopy refers to the upper layer of a forest habitat zone. AST_Canopy is the package that inspects the top-level declarations in the forest of abstract syntax trees, extracting information from them and passing them to the Python layer. Here, top level refers to the public, user-facing APIs that a CUDA C++ library exposes to users.

Figure 1 shows the architecture of AST_Canopy, which consists of the following layers:
- clangTooling: A Clang library used to support writing standalone tools, such as Numbast.
- libastcanopy: Implements the declaration-parsing logic using clangTooling.
- pylibastcanopy: A binding that directly exposes libastcanopy APIs in Python.
- AST_Canopy: A layer on top of pylibastcanopy that provides a pleasantly Pythonic user experience.
In addition to header parsing and serialization, AST_Canopy also provides the following features:
- Environment detection at runtime: Auto-detects the libstdcxx and CUDA headers installed through Conda packages, and sets up the clang compilers accordingly.
- Flexibility in compute capability parsing: Enables configuring AST parsing based on different compute capabilities. Some headers conditionally expose code based on compute capability, and this feature supports cases where the header serialization and runtime environments differ (see the sketch after this list).
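As a sketch of that flexibility, the same header can be parsed for different targets by changing the final argument of parse_declarations_from_source, using the same signature as in the demo:

from ast_canopy import parse_declarations_from_source

# Parse the same header for two different compute capabilities. Headers
# that guard declarations with __CUDA_ARCH__ checks may yield different
# results for each target.
structs_80, functions_80, *_ = parse_declarations_from_source(
    "demo.cuh", ["demo.cuh"], "sm_80"
)
structs_90, functions_90, *_ = parse_declarations_from_source(
    "demo.cuh", ["demo.cuh"], "sm_90"
)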
Numbast: binding generator
Numbast sits downstream of AST_Canopy, consuming the declaration information and generating Numba bindings automatically. At a high level, Numbast provides a translation layer between C++ and Python syntax. As the demo suggests, most simple C++ syntax finds a natural counterpart in Python (Table 1).
| Operation | CUDA C++ | Numba |
| --- | --- | --- |
| Object construction | auto hpi = myfloat16(3.14); | hpi = myfloat16(3.14) |
| Attribute access | auto data = hpi.data; | data = hpi.data |
| Function invocation | auto r = hsqrt(hpi); | r = hsqrt(hpi) |
| Type casts | auto fpi = float(hpi); | fpi = types.float32(hpi) |
| Arithmetic operations | auto pi2 = hpi + hpi; | pi2 = hpi + hpi |

Table 1. Simple C++ syntax and its natural Python counterpart
Numba's type system has much in common with C++-like languages. Of course, C++ also has features with no Python counterpart, such as pointer semantics and template-based metaprogramming.
Numbast is the middle layer that encapsulates the similarities and differences between C++ and Python.
Lowering: The bigger picture
Bindings generated by Numbast are lowered with a Numba feature called foreign function invocation (FFI). Numba generates ABI-compatible shim functions that wrap the native CUDA function calls, then compiles them with NVRTC. Expect the same optimized performance that a CUDA C++ developer enjoys, minus the overhead of the FFI calls.
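The underlying FFI mechanism is also available directly in Numba. The following minimal sketch declares and calls an external device function by hand; the shim symbol __myfloat16_hsqrt and the file shims.cu are hypothetical, and Numbast generates and links the real shims for you through shim_writer.links():

from numba import cuda, float32

# Declare an external device function by symbol name and Numba signature.
hsqrt_ffi = cuda.declare_device("__myfloat16_hsqrt", float32(float32))

# Link the translation unit that defines the shim when compiling the kernel.
@cuda.jit(link=["shims.cu"])
def kernel(out):
    out[0] = hsqrt_ffi(out[0])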
Future versions of Numba-cuda will introduce link-time optimization (LTO) support, further narrowing the performance gap between accelerated Numba kernels and native CUDA C++.
Caveats
Both AST_Canopy and Numbast have caveats worth noting. AST_Canopy depends on clangTooling, so new CUDA language features that clangTooling does not yet support, and libraries that depend on those features, may not be parsed correctly. However, the majority of libraries and headers use features that clangTooling supports.
Conclusion
In this post, we introduced a new Numba binding generation tool, Numbast. We showed that, by using Numbast, you can quickly benefit from the ever-growing feature set of CUDA C++.
Numbast v0.1.0 provides the new bfloat16 data type to Numba. You can expect to see more bindings generated by Numbast, including new data types, NVSHMEM bindings, and CCCL bindings.