We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule that will help you avoid it within your own software (spoiler alert: it’s in the title!).
Let’s review how to select which GPU to execute CUDA calls on. The CUDA runtime API is state-based, and threads execute cudaSetDevice() to set the current GPU.
cudaError_t cudaSetDevice(int device)
After this call all CUDA API commands go to the current set device until cudaSetDevice() is called again with a different device ID. The CUDA runtime API is thread-safe, which means it maintains per-thread state about the current device. This is very important as it allows threads to concurrently submit work to different devices, but forgetting to set the current device in each thread can lead to subtle and hard-to-find bugs like the following example.
cudaSetDevice(1);
cudaMalloc(&a,bytes);
#pragma omp parallel
{
  kernel<<<blocks,threads>>>(a);
}
While at first glance this code may seem bug free, it is incorrect. The problem here is that we have set device 1 current on the OpenMP master thread but then used OpenMP to spawn more threads which will use the default device (device 0) because they never call cudaSetDevice(). This code would actually launch multiple kernels that run on device 0 but access memory allocated on device 1. This will cause either invalid memory access errors or (in the case where peer-to-peer access is enabled) it will be limited by low PCIe memory bandwidth to the array a.
Here is a correct implementation of the code, where every thread sets the correct device.
cudaSetDevice(1);
cudaMalloc(&a,bytes);
#pragma omp parallel
{
  cudaSetDevice(1);
  kernel<<<blocks,threads>>>(a);
}
If it’s not obvious from the title of this post, there’s a simple rule to follow to avoid bugs like this…
Always Set the Device in New Host Threads
Make it a habit to call cudaSetDevice() wherever your code could potentially spawn new host threads. The following example has a potential bug depending on whether the OpenMP library chooses to spawn new threads or reuse old ones.
cudaSetDevice(1);
cudaMalloc(&a,bytes);
#pragma omp parallel
{
  cudaSetDevice(1);
  kernel<<<blocks,threads>>>(a);
}
#pragma omp parallel
{
    kernel<<<blocks,threads>>>(a);
}
In this example, threads in the second omp parallel region don’t set the current device so there is no guarantee that it is set for each thread. This problem is not restricted to OpenMP; it can easily happen with any threading library, and in any CUDA-accelerated Language.
To save yourself from a variety of multithreading bugs, remember: always call cudaSetDevice() first when you spawn a new host thread.
 
         
           
     
     
     
     
     
    