You may want to read the more recent post Getting Started with OpenACC by Jeff Larkin.
In my previous post I added 3 lines of OpenACC directives to a Jacobi iteration code, achieving more than a 2x speedup by running it on a GPU. In this post I’ll continue where I left off and demonstrate how we can use clauses on OpenACC directives to take more explicit control over how the compiler parallelizes our code. This will provide us with significantly higher speedup.
Example Source Code
You can browse and download all source code from the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles so you can see and try out the changes in each step (step 1, step 2, step 3). The examples use OpenACC 1.0 syntax, so you’ll need a compiler that supports it. I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.
Let’s pick up where we left off last week.
STEP 3: Tuning Parallelization Configuration
To refresh, this is what our C code looks like after step 2 from the last post.
#pragma acc data copy(A, Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.f;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
        for( int i = 1; i < m-1; i++ )
        {
            A[j][i] = Anew[j][i];
        }
    }

    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

    iter++;
}

double runtime = GetTimer();

printf(" total: %f s\n", runtime / 1000.f);
}
Here’s the Fortran code.
!$acc data copy(A, Anew)
do while ( error .gt. tol .and. iter .lt. iter_max )
    error=0.0_fp_kind

!$omp parallel do shared(m, n, Anew, A) reduction( max:error )
!$acc kernels
    do j=1,m-2
        do i=1,n-2
            Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
                                         A(i  ,j-1) + A(i  ,j+1) )
            error = max( error, abs(Anew(i,j)-A(i,j)) )
        end do
    end do
!$acc end kernels
!$omp end parallel do

    if(mod(iter,100).eq.0 ) write(*,'(i5,f10.6)'), iter, error
    iter = iter + 1

!$omp parallel do shared(m, n, Anew, A)
!$acc kernels
    do j=1,m-2
        do i=1,n-2
            A(i,j) = Anew(i,j)
        end do
    end do
!$acc end kernels
!$omp end parallel do

end do
!$acc end data
To begin, let’s run the C code from step 2 on the GPU again, this time enabling the -ta=nvidia,time compiler option so that accelerator timing information is printed when the program exits.
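If you’re building by hand rather than with the Makefile in the step 2 directory, the command looks something like this (the exact flags below are a sketch rather than the post’s Makefile; the important part is adding the time suboption to -ta=nvidia):

pgcc -acc -ta=nvidia,time -Minfo=accel -o laplace2d laplace2d.c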
Running the program, we see output like the following.
total: 8.940413 s

Accelerator Kernel Timing data
/home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step2/laplace2d.c
  main
    92: region entered 1000 times
        time(us): total=2336562 init=298 region=2336264
                  kernels=2295838 data=0
        w/o init: total=2336264 max=2441 min=2318 avg=2336
        95: kernel launched 1000 times
            grid: [256x256]  block: [16x16]
            time(us): total=2295838 max=2301 min=2293 avg=2295
/home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step2/laplace2d.c
  main
    80: region entered 1000 times
        time(us): total=6489988 init=190 region=6489798
                  kernels=6184945 data=0
        w/o init: total=6489798 max=6960 min=6397 avg=6489
        83: kernel launched 1000 times
            grid: [256x256]  block: [16x16]
            time(us): total=6072138 max=6186 min=6012 avg=6072
        87: kernel launched 1000 times
            grid: [1]  block: [256]
            time(us): total=112807 max=221 min=111 avg=112
/home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step2/laplace2d.c
  main
    74: region entered 1 time
        time(us): total=8936004
        data=96980
acc_init.c
  acc_init
    29: region entered 1 time
        time(us): init=82801
We can see 3 regions. The first one is the “memcpy” loop nest starting on line 92, which takes about 2.3s out of 8.9s total. The second one is the main computation loop region starting on line 80, which takes about 6.5s. The third region mentioned is the enclosing while loop data region starting on line 74. We can see that this region takes 8.9s, which is good—that means there is no other part of the program that takes significant time. If we look at the main loop nests, we can see these lines:
grid: [256x256] block: [16x16]
The terms grid and block come from the CUDA programming model. The GPU executes groups of threads called thread blocks. To execute a kernel, the application launches a grid of these thread blocks. Each block runs on one of the GPU’s multiprocessors and is assigned a certain range of IDs that it uses to address a unique data range. In this case our thread blocks have 256 threads each, arranged in 16×16 two-dimensional blocks. The grid is also 2D, 256 blocks wide and 256 blocks tall. This is just enough to cover our 4096×4096 grid. But we don’t really need that many blocks—if we tell the compiler to launch fewer, it will automatically generate a sequential loop over data blocks within the kernel code run by each thread.
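To make that concrete, here’s a plain C sketch of the idea (an illustration only, with names I made up for this example; the compiler decides the actual iteration scheme inside the kernel it generates):

/* Conceptual sketch: if gang(32) launches only 32 row-blocks but the mesh
   needs 256 of them, each gang covers several row-blocks in a sequential
   loop inside the kernel. The striding shown here is illustrative only.  */
#include <stdio.h>

int main(void)
{
    const int rows           = 4096;                   /* mesh rows               */
    const int rows_per_block = 16;                     /* block height            */
    const int num_row_blocks = rows / rows_per_block;  /* 256 row-blocks          */
    const int num_gangs      = 32;                     /* what gang(32) asks for  */

    for (int gang = 0; gang < num_gangs; gang++)       /* gangs run in parallel   */
    {
        int handled = 0;
        for (int jb = gang; jb < num_row_blocks; jb += num_gangs)
            handled++;                                 /* this gang's share       */
        if (gang == 0)
            printf("each gang covers %d row-blocks\n", handled);
    }
    return 0;
}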
I have tried launching fewer blocks and found that it is more efficient for this kernel. Also, blocks 32 threads wide execute this code faster than 16-wide blocks, because the GPU processes instructions 32 threads at a time (a group of 32 threads is called a warp in the CUDA architecture). To specify these dimensions, let’s modify the code for the main computational loop nest as follows.
#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop gang(32), vector(16)
for( int j = 1; j < n-1; j++) {
#pragma acc loop gang(16), vector(32)
    for( int i = 1; i < m-1; i++ )
The gang(32) clause on the outer loop tells the compiler to launch 32 blocks in the Y (row) direction. The gang(16) clause on the inner loop tells it to launch 16 blocks in the X (column) direction. The vector(16) clause on the outer loop tells the compiler to use blocks that are 16 threads tall, thus processing the loop iterations in SIMD groups of 16. Finally, the vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads wide (one warp wide). I also found that making a similar change to the copy loop nest benefits performance a small amount. In this case I found it best to apply the gang and vector clauses only to the inner loop, to ensure the blocks used are as wide as a warp. Here are the changes to the C code.
#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop
for( int j = 1; j < n-1; j++) {
#pragma acc loop gang(16), vector(32)
    for( int i = 1; i < m-1; i++ )
The changes to the Fortran code are very similar. Here is the first loop nest.
!$omp parallel do shared(m, n, Anew, A) reduction( max:error )
!$acc kernels loop gang(32), vector(16)
do j=1,m-2
!$acc loop gang(16), vector(32)
    do i=1,n-2
        Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
                                     A(i  ,j-1) + A(i  ,j+1) )
        error = max( error, abs(Anew(i,j)-A(i,j)) )
    end do
!$acc end loop
end do
!$acc end kernels
!$omp end parallel do
And here is the second Fortran loop nest.
!$omp parallel do shared(m, n, Anew, A)
!$acc kernels loop
do j=1,m-2
!$acc loop gang(16), vector(32)
    do i=1,n-2
        A(i,j) = Anew(i,j)
    end do
!$acc end loop
end do
!$acc end kernels
!$omp end parallel do
While we’re at it, let’s make one other small change. We don’t need to copy Anew to and from the GPU, since it is only accessed on the GPU. We can add the create clause to the data directive to specify that Anew should be allocated on the GPU but never copied.
#pragma acc data copy(A), create(Anew)
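Putting the gang, vector, and create changes together, the main loop of the tuned C code looks roughly like this (a sketch assembled from the snippets above; the step 3 directory in the GitHub repository has the complete source):

#pragma acc data copy(A), create(Anew)
while ( error > tol && iter < iter_max )
{
    error = 0.f;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop gang(32), vector(16)
    for( int j = 1; j < n-1; j++)
    {
#pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ )
        {
            /* 4-point stencil average and per-point error */
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop
    for( int j = 1; j < n-1; j++)
    {
#pragma acc loop gang(16), vector(32)
        for( int i = 1; i < m-1; i++ )
        {
            /* copy the updated grid back into A for the next iteration */
            A[j][i] = Anew[j][i];
        }
    }

    if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

    iter++;
}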
Here is the performance after these changes.
Execution     | Time (s) | Speedup vs. 1 CPU thread | Speedup vs. 4 CPU threads
--------------|----------|--------------------------|---------------------------
CPU 1 thread  | 34.14    | —                        | —
CPU 4 threads | 21.16    | 1.61x                    | 1.0x
GPU           | 5.32     | 6.42x                    | 3.98x
We’ve increased performance of this application by 4x (comparing CPU “socket” to GPU “socket”) by adding just a few lines of compiler directives!
Wrapping Up
OpenACC is a promising new standard that I think will be a powerful tool for high-productivity acceleration of existing C and Fortran code. It will also be very valuable for prototyping; programmers can use it to determine if it is worthwhile to spend more time writing lower-level CUDA code for GPUs, for example. I hope you have found this overview interesting—please let me know what you think in the comments! I’ll revisit OpenACC in a future post, possibly to discuss some of the challenges you might face in applying OpenACC to more complicated codes, or to provide some guidelines on getting the best results.
If you are interested in trying OpenACC, you can download a free trial of PGI Accelerator. The OpenACC version 1.0 specification, along with a reference card and guides, is available on the OpenACC website.