Simulation / Modeling / Design

C++11 in CUDA: Variadic Templates

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. In my post “The Power of C++11 in CUDA 7” I covered some of the major new features of C++11, such as lambda functions, range-based for loops, and automatic type deduction (auto). In this post, I’ll cover variadic templates.

There are times when you need to write functions that take a variable number of arguments: variadic functions. To do this in a typesafe manner for polymorphic functions, you really need to take a variable number of types in a template. Before C++11, the only way to write variadic functions was with the ellipsis (...) syntax and the va_* facilities. These facilities did not enable type safety and can be difficult to use.

As an example, let’s say we want to abstract the launching of GPU kernels. In my case, I want to provide simpler launch semantics in the Hemi library. There are many cases where you don’t care to specify the number and size of thread blocks—you just want to run a kernel with “enough” threads to fully utilize the GPU, or to cover your data size. In that case we can let the library decide how to launch the kernel, simplifying our code. But to launch arbitrary kernels, we have to support arbitrary type signatures. Well, we can do that like this:

template <typename... Arguments>
void cudaLaunch(const ExecutionPolicy &p, 
                Arguments... args);

Here, Arguments... is a “type template parameter pack”. We can use it to refer to the type signature of our kernel function pointer f, and to the arguments of cudaLaunch. To do the same thing before C++11 (and CUDA 7) required providing multiple implementations of cudaLaunch, one for each number of arguments we wanted to support. That meant you had to limit the maximum number of arguments allowed, as well as the amount of code you had to maintain. In my experience this was prone to bugs. Here’s the implementation of cudaLaunch.

// Generic simplified kernel launcher
// configureGrid uses the CUDA Occupancy API to choose grid/block dimensions
template <typename... Arguments>
void cudaLaunch(const ExecutionPolicy &policy, 
                void (*f)(Arguments...), 
                Arguments... args)
    ExecutionPolicy p = policy;
    checkCuda(configureGrid(p, f));
    f<<<p.getGridSize(), p.getBlockSize(), p.getSharedMemBytes()>>>(args...);

// and a wrapper for default policy -- i.e. automatic execution configuration
template <typename... Arguments>
void cudaLaunch(void(*f)(Arguments... args), Arguments... args)
    cudaLaunch(ExecutionPolicy(), f, args...);

Here you can see how we access the types of the arguments (Arguments...) in the definition our variadic template function, in order to specify the type signature of the kernel function pointer *f. Inside the function, we unpack the parameters using args... and pass them to our kernel function when we launch it. C++11 also lets you query the number of parameters in a pack using sizeof...().

Using hemi::cudaLaunch, I can launch any __global__ kernel, regardless of how many parameters it has, like this (here I’m launching my xyzw_frequency kernel from my post The Power of C++11 in CUDA 7.

hemi::cudaLaunch(xyzw_frequency, count, text, int n);

Here we leave the launch configuration up to the runtime, and if we write our kernel in a portable way, this code can be made fully portable. This simplified launch code is currently available in a development branch of Hemi, which you can find on Github.

Variadic Kernels

Of course, you can also define kernel functions and __device__ functions with variadic arguments. I’ll finish up with a little program that demonstrates a few things. The __global__ function Kernel is a variadic template function which just forwards its parameter pack to the function adder, which is where the really interesting use of variadic templates happens. (I borrowed the adder example from an excellent post on variadic templates by Eli Bendersky.)

adder demonstrates how a variadic parameter pack can be unpacked recursively to operate on each parameter in turn. Note that to terminate the recursion we define the “base case” function template adder(T v);, so that when the parameter pack is just a single parameter it just returns its value. The second adder function unpacks one argument at a time because it is defined to take one parameter and then a parameter pack. Clever trick, and since all the recursion happens at compile time, the resulting code is very efficient.

We define a utility template function print_it with various specializations that print the type of an argument and its value. We launch the kernel with four different lists of arguments. Each time, we vary the type of the first argument to demonstrate how our variadic adder can handle multiple types, and the output has a different type each time. Note another C++11 feature is used here: static_assert and type traits. Our adder only works with integral and floating point types, so we check the types at compile time using static_assert to check if an arithmetic type is used. This allows us to print a custom error message at compile time when the function is misused.

#include <type_traits>
#include <stdio.h>

template<typename T>
__host__ __device__
T adder(T v) {
  return v;

template<typename T, typename... Args>
__host__ __device__
T adder(T first, Args... args) {
  static_assert(std::is_arithmetic<T>::value, "Only arithmetic types supported");
  return first + adder(args...);

template<typename T> 
__host__ __device__ 
void print_it(T x) { printf("Unsupported type\n"); }

__host__ __device__
void print_it(int x) { printf("int %d\n", x); }
__host__ __device__
void print_it(long int x) { printf("long int %ld\n", x); }
__host__ __device__
void print_it(float x) { printf("float %f\n", x); }
__host__ __device__
void print_it(double x) { printf("double %lf\n", x); }

template <typename... Arguments>
void Kernel(Arguments... args)
    auto sum = adder(args...);

struct { int x; } s;

int main(void) {
    Kernel<<<1, 1>>>(1, 2.0f, 3.0, 4, 5.0);    // "int 15"
    Kernel<<<1, 1>>>(1l, 2.0f, 3.0, 4, 5.0);   // "long int 15"
    Kernel<<<1, 1>>>(1.0f, 2.0f, 3.0, 4, 5.0); // "float 15.000000"
    Kernel<<<1, 1>>>(1.0, 2.0f, 3.0, 4, 5.0);  // "double 15.000000"
    // Kernel<<<1, 1>>>("1.0", 2.0f, 3.0, 4, 5.0);  // static assert!
    cudaDeviceReset(); // to ensure device print happens before exit
    return 0;

You can compile this code with nvcc --std=c++11 -o variadic.

Note that in CUDA 7, A variadic __global__ function template has the following (documented) restrictions:

  • Only a single pack parameter is allowed.
  • The pack parameter must be listed last in the template parameter list.

In practice I don’t find these limitations too constraining.

Try CUDA 7 Today

The CUDA Toolkit version 7 is available now, so download it today and try out the C++11 support and other new features.

Discuss (6)