
3 Versatile OpenACC Interoperability Techniques

OpenACC is a high-level programming model for accelerating applications with GPUs and other devices. It uses compiler directives to mark loops and regions of standard C, C++, and Fortran code for offload from a host CPU to an attached accelerator, which greatly simplifies GPU acceleration.


An often-overlooked feature of OpenACC is its ability to interoperate with the broader parallel programming ecosystem. In this post I’ll teach you 3 powerful interoperability techniques for combining OpenACC and CUDA: the host_data construct, the deviceptr clause, and the acc_map_data() API function.

I’ll demonstrate these techniques with several examples that mix OpenACC with CUDA C++, CUDA Fortran, Thrust, and GPU-accelerated libraries. If you’d like to follow along at home, grab the source code for the examples from GitHub and try them out with your OpenACC compiler and the CUDA Toolkit. Don’t have an OpenACC compiler? You can download a free 30-day trial of the PGI accelerator compiler.

You may already be thinking to yourself, “If OpenACC is so great, why would I need to use it with CUDA?” OpenACC’s interoperability features open the door to the GPU-computing ecosystem, letting you leverage more than 10 years of code development. Need to multiply two matrices together? Don’t write your own function; call the cuBLAS library, which has been heavily optimized for GPUs. Does a colleague already have a CUDA routine you could use in your code? Use it! Interoperability means that you can always use the best tool for the job. Accelerate your application with OpenACC, but call an optimized library. Expand an existing CUDA application by adding OpenACC to unaccelerated routines. Your choice isn’t OpenACC or CUDA, it’s OpenACC and CUDA.

Using CUDA in OpenACC Programs

To start, let’s assume that you’ve accelerated your application with OpenACC, but want to call an accelerated library or CUDA function using the arrays you’ve already moved to the GPU with OpenACC. The host_data OpenACC construct makes the address of device data available on the host, so you can pass it to functions that expect CUDA device pointers. Whenever you use an array listed in the use_device clause within the host_data region, the compiler generates code that uses the device copy of the array instead of the host copy. It’s important to understand that the function you call (cublasSaxpy in the following example) is still a host function; it just expects pointers to GPU memory. The example below uses OpenACC to call cuBLAS, a linear algebra library provided with the CUDA Toolkit.

#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>   /* legacy cuBLAS API */

int main(int argc, char **argv)
{
  float *x, *y;
  int n = 1<<20, i;

  x = (float*)malloc(n*sizeof(float));
  y = (float*)malloc(n*sizeof(float));

  #pragma acc data create(x[0:n]) copyout(y[0:n])
  {
    #pragma acc kernels
    {
      for( i = 0; i < n; i++)
      {
        x[i] = 1.0f;
        y[i] = 0.0f;
      }
    }

    #pragma acc host_data use_device(x,y)
    {
      cublasSaxpy(n, 2.0, x, 1, y, 1);
    }
  }

  fprintf(stdout, "y[0] = %f\n",y[0]);
  return 0;
}

In this example I’ve done all CPU and GPU memory management with the OpenACC data region, offloaded the array initialization loop to the GPU using the OpenACC kernels directive, and then passed my device arrays to the cublasSaxpy function using the host_data directive. Notice that when using host_data you must list the arrays whose device copies you want in the use_device clause. That’s all there is to it. Want to see the same example in Fortran?

program main
  use cublas
  integer, parameter :: N = 2**20
  real, dimension(N) :: X, Y

  !$acc data create(x,y)
  !$acc kernels
  X(:) = 1.0
  Y(:) = 0.0
  !$acc end kernels

  !$acc host_data use_device(x,y)
  call cublassaxpy(N, 2.0, x, 1, y, 1)
  !$acc end host_data
  !$acc update self(y)
  !$acc end data

  print *, y(1)
end program

In this case I’m using the cublas Fortran module provided by the PGI CUDA Fortran compiler. Even though I’ve used the cuBLAS library in both of these examples, the functions could just as easily have been CUDA C or CUDA Fortran API functions, or host functions that launch CUDA kernels, as in the sketch below. Please see the GitHub repository above for more examples using CUDA C and CUDA Fortran.
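To make that concrete, here’s a minimal sketch of calling a handwritten CUDA kernel through host_data. The saxpy_kernel and saxpy_cuda names are hypothetical (not part of the examples repository); saxpy_cuda would be compiled with nvcc and linked into the OpenACC program.

// CUDA C: a simple SAXPY kernel plus a host-side launcher.
__global__ void saxpy_kernel(int n, float a, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] += a * x[i];
}

// Host function; like cublasSaxpy, it expects device pointers.
extern "C" void saxpy_cuda(int n, float a, float *x, float *y)
{
  saxpy_kernel<<<(n + 255) / 256, 256>>>(n, a, x, y);
}

From the OpenACC side, the call site looks exactly like the cuBLAS example:

#pragma acc host_data use_device(x,y)
{
  saxpy_cuda(n, 2.0f, x, y);
}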

Using OpenACC in CUDA Programs

The above examples assume you want to use CUDA or an accelerated library from an existing OpenACC program, but what if you want to do the reverse? OpenACC has that covered too. Let’s say that I already have GPU arrays that I’ve allocated using CUDA. In that case, I use the deviceptr data clause instead of the host_data directive to tell the compiler that my data already resides on the GPU. In the examples below, I’ve written a simple implementation of the SAXPY routine in C and Fortran, which I’ve accelerated with OpenACC. All I have to do to use an array that is already on the GPU is to add the deviceptr clause to my OpenACC region.

void saxpy(int n, float a, float * restrict x, float * restrict y)
{
  #pragma acc kernels deviceptr(x,y)
  {
    for(int i=0; i<n; i++)
    {
      y[i] += a*x[i];
    }
  }
}

And in Fortran:

subroutine saxpy(n, a, x, y)
  integer :: n
  real    :: a, x(:), y(:)
  !$acc parallel deviceptr(x,y)
  y(:) = y(:) + a * x(:)
  !$acc end parallel
end subroutine

You can use the deviceptr clause on any parallel, kernels, or data construct. If you have several OpenACC regions that use these arrays, you can simply wrap them all in a data region that declares the arrays as device pointers, as sketched below. But that only really works if your OpenACC code lives in just a few places, so what if you want to use the CUDA copy of x and y everywhere x and y appear in the code? In that case, the acc_map_data() function is your new best friend.
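Here’s a minimal sketch of that data-region approach, assuming x and y already hold device addresses returned by cudaMalloc:

// x and y are CUDA device pointers; the enclosing data region tells
// every nested OpenACC construct to use them as-is.
#pragma acc data deviceptr(x,y)
{
  #pragma acc kernels
  for(int i=0; i<n; i++)
    y[i] += a*x[i];

  // ...any other OpenACC regions that use x and y...
}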

acc_map_data() instructs the OpenACC runtime that, rather than allocating its own device memory for a variable, it should map an existing device array to the host array. In other words, any time the runtime sees x it should use dx on the device. (Note: acc_map_data and the accompanying acc_unmap_data function are only specified for C and C++, although a compiler may choose to make them available in Fortran as well.)

For this example I’ve created a couple of helper routines: map, which calls acc_map_data() for a given pair of pointers, and set, which sets all values of the given array to a specified value.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

void map(float *harr, float *darr, size_t size);
void set(int n, float val, float *arr);
void saxpy(int n, float a, float *x, float *y);

int main(int argc, char **argv)
{
  float *x, *y, *dx, *dy, tmp;
  int n = 1<<20;

  x = (float*) malloc(n*sizeof(float));
  y = (float*) malloc(n*sizeof(float));
  cudaMalloc((void**)&dx,(size_t)n*sizeof(float));
  cudaMalloc((void**)&dy,(size_t)n*sizeof(float));

  /* Use dx/dy on the device whenever x/y appear in OpenACC regions. */
  map(x, dx, n*sizeof(float));
  map(y, dy, n*sizeof(float));

  set(n,1.0f,x);
  set(n,0.0f,y);

  saxpy(n, 2.0, x, y);
  cudaMemcpy(&tmp,dy,(size_t)sizeof(float),cudaMemcpyDeviceToHost);
  printf("%f\n",tmp);
  return 0;
}
#include <openacc.h>

/* Associate an existing device allocation with a host array so the
   OpenACC runtime uses it instead of allocating its own copy. */
void map(float * restrict harr, float * restrict darr, size_t size)
{
  acc_map_data(harr, darr, size);
}

/* One possible implementation of the set helper described above:
   fill the (mapped) device copy of arr with val. */
void set(int n, float val, float * restrict arr)
{
  #pragma acc kernels present(arr[0:n])
  {
    for(int i=0; i<n; i++)
      arr[i] = val;
  }
}

void saxpy(int n, float a, float * restrict x, float * restrict y)
{
  #pragma acc kernels present(x,y)
  {
    for(int i=0; i<n; i++)
    {
      y[i] += a*x[i];
    }
  }
}

The acc_map_data routine is a great way to “set it and forget it” when it comes to sharing memory between CUDA and OpenACC.
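When you’re finished sharing, the accompanying acc_unmap_data() routine removes the association. A minimal cleanup sketch for the example above (the free calls are added here for illustration; the original listing omits them):

// Remove the host-to-device mappings before freeing the memory yourself.
acc_unmap_data(x);
acc_unmap_data(y);
cudaFree(dx);
cudaFree(dy);
free(x);
free(y);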

Combining OpenACC and Thrust

If you’re a C++ programmer, you may be thinking, “What about mixing OpenACC and Thrust?” Well, you’re in luck. Even though OpenACC and Thrust share the goal of raising GPU programming to a higher level, it’s possible to mix and match between them too. In the example below I create Thrust device vectors, initialize them with the fill function, and then pass them to the same OpenACC saxpy routine I showed above with the deviceptr clause. There’s nothing particularly tricky here once you know how to expose the raw device pointer of a Thrust vector with thrust::raw_pointer_cast().

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/fill.h>
#include <cstdio>

extern "C" void saxpy(int,float,float*,float*);

int main(int argc, char **argv)
{
  int N = 1<<20;
  thrust::host_vector<float> y(N);

  thrust::device_vector<float> d_x(N);
  thrust::device_vector<float> d_y(N);

  thrust::fill(d_x.begin(),d_x.end(), 1.0f);
  thrust::fill(d_y.begin(),d_y.end(), 0.0f);

  saxpy(N, 2.0, thrust::raw_pointer_cast(d_x.data()), 
        thrust::raw_pointer_cast(d_y.data()));

  y = d_y;  // copy the result back to the host
  printf("%f\n",y[0]);
  return 0;
}

If you weren’t already convinced to try OpenACC by our past posts, I hope I’ve convinced you that you’ve got nothing to lose by trying it. OpenACC is a great way for both GPU novices and experts to rapidly accelerate applications and it’s made even more powerful by its ability to interoperate with the rest of the GPU-computing ecosystem.
