A Compilation Pipeline Taking User-Defined Functions in Python to CUDA Kernels

Numba is the just-in-time compiler used in RAPIDS cuDF to implement high-performance user-defined functions (UDFs) by turning user-supplied Python functions into CUDA kernels. But how does it go from Python code to CUDA kernel?

In this post, I discuss Numba’s compilation pipeline. If you enjoy diving into Numba’s internals, see the accompanying notebook, which covers each stage in more depth. It includes code for inspecting the internal representation at each pipeline stage, and it also walks through the linking, loading, and kernel launch process.

The challenge and an overview of the pipeline

Compiling Python source code to machine code is challenging because the two representations are quite different (Table 1).

	Python	Machine code
Typing	No type information; dynamic; “duck typing”	Every instruction and value has a type
Abstraction level	Very expressive: classes, objects, methods, comprehensions, etc.	Simple instructions: 2 or 3 operands, mostly a single operation per instruction
Target	Runs on any machine on which the Python interpreter runs	Highly specific to one architecture

Table 1. Comparison of Python to machine code

This comparison is not exhaustive, but it highlights the problem. Numba must take an expressive, dynamic language and translate it into one that uses simple, specific instructions specialized for particular types. Numba’s pipeline does this with a sequence of stages, each moving the representation further from the Python source, and closer to executable machine code.
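To make the typing row of Table 1 concrete, here is a minimal sketch that uses only the Python standard library, not Numba itself. The function name `add` and the sample inputs are illustrative and do not come from the accompanying notebook. It shows that one untyped Python function accepts integers, floats, and strings, and that its bytecode contains a single generic binary-operation instruction that carries no type information. Machine code for the same three calls would need different instructions or routines in each case, which is why a compiler has to specialize the function for particular argument types.

```python
import dis


def add(a, b):
    # No type annotations: the meaning of "+" is decided at run time
    # from the types of the arguments.
    return a + b


# Duck typing: the same source handles integers, floats, and strings.
results = [add(1, 2), add(1.5, 2.25), add("py", "thon")]

# The bytecode holds one generic binary-operation instruction for "+"
# (named BINARY_ADD before Python 3.11 and BINARY_OP from 3.11 onward).
generic_ops = [
    ins.opname
    for ins in dis.get_instructions(add)
    if ins.opname.startswith("BINARY_")
]

print(results)
print(generic_ops)
```

The exact opcode name depends on the Python version, so the sketch filters on the `BINARY_` prefix rather than on a specific name.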

The Numba compilation pipeline starts with Python source code and takes it through the following seven stages to generate PTX code for CUDA GPUs: