By Adnan Boz (GTC 2012 Guest Blogger)
It’s my first day at the GPU Technology Conference and I’ve already had the opportunity to meet gurus like Mark Harris (Chief Technologist, GPU Computing, NVIDIA, and founder of GPGPU.org) and learn about the latest advancements in the GPU and HPC arena from people like NVIDIA’s Will Ramey and Duncan Poole.
One of the hot topics so far is OpenACC, an open GPU directives standard that makes GPU programming straightforward and portable across parallel and multi-core processors (see: http://developer.nvidia.com/openacc).
Getting started with OpenACC can be as simple as adding the “#pragma acc kernels” directive above a parallelizable part of your code. (If you are familiar with OpenMP, you may have noticed the similarity to using “#pragma omp parallel”.)
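To make that concrete, here is a minimal sketch of my own (not code from the talk). The function name saxpy and its parameters are illustrative assumptions; the only OpenACC-specific line is the pragma.

```c
/* Hypothetical helper (not from the presentation):
   computes y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* Ask the compiler to generate accelerator kernels for this loop.
       A compiler without OpenACC support ignores the pragma and simply
       runs the loop on the CPU. Some compilers may also need the array
       sizes spelled out, e.g. copyin(x[0:n]) copy(y[0:n]). */
    #pragma acc kernels
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The `restrict` qualifiers tell the compiler that x and y do not overlap, which usually makes it easier for it to prove that the loop iterations are independent.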
The best part is that we can use OpenACC together with the existing CUDA programming languages and GPU-accelerated libraries, like cuFFT, cuBLAS, cuRAND, cuSPARSE, NPP, Thrust and others (see: http://developer.nvidia.com/gpu-accelerated-libraries).
An important point Mark Harris made in his presentation is how to analyze and optimize the use of accelerator directives: for example, using the PGI_ACC_TIME environment variable as well as the -Minfo=accel option of the pgcc compiler to determine where time is being spent.
Using these options, Mark identified a memory transfer bottleneck in the Jacobi Iteration example. I think it was a great example of how easy it is to write a program the wrong way and end up with a very slow application.
It is always important to know the capabilities of the tool we are using. In this case, Mark used data constructs, specifically the “#pragma acc data” directive, which lets you manage data movement between the host and the accelerator, to achieve a 3x improvement compared to CPU alternatives.
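The idea can be sketched roughly as follows. This is my own reconstruction under stated assumptions, not the code from Mark's slides: the grid size, the function name jacobi and the fixed iteration count are all illustrative (the real example also tracks a convergence error, which I left out to keep the sketch short).

```c
#define N 16  /* illustrative grid height (assumption) */
#define M 16  /* illustrative grid width (assumption) */

/* Hypothetical helper: runs a fixed number of Jacobi sweeps on A,
   using Anew as scratch space. Boundary cells of A stay unchanged. */
void jacobi(float A[N][M], float Anew[N][M], int iters)
{
    /* Keep both arrays on the accelerator for the whole loop:
       A is copied in once and copied back once at the end,
       Anew is only allocated on the device and never transferred.
       Without this data region, each kernels region below could
       move the arrays across the bus on every iteration. */
    #pragma acc data copy(A[0:N][0:M]) create(Anew[0:N][0:M])
    for (int iter = 0; iter < iters; ++iter) {
        /* Average the four neighbours of every interior cell. */
        #pragma acc kernels
        for (int j = 1; j < N - 1; ++j) {
            for (int i = 1; i < M - 1; ++i) {
                Anew[j][i] = 0.25f * (A[j][i + 1] + A[j][i - 1] +
                                      A[j - 1][i] + A[j + 1][i]);
            }
        }

        /* Copy the interior back so the next sweep reads new values. */
        #pragma acc kernels
        for (int j = 1; j < N - 1; ++j) {
            for (int i = 1; i < M - 1; ++i) {
                A[j][i] = Anew[j][i];
            }
        }
    }
}
```

With the PGI compiler, building with something like `pgcc -acc -Minfo=accel` and running with `PGI_ACC_TIME=1` set in the environment should show how much time goes to data transfers with and without the data region.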
If you are planning to step into using OpenACC, here are some tips and tricks:
- Eliminate pointer arithmetic (for example, use subscripted arrays instead of pointer-indexed arrays);
- Inline function calls in directive regions;
- Use contiguous memory for multi-dimensional arrays;
- Avoid excessive memory transfers by using data regions;
- Apply directives to (nested) for loops for best parallelization;
- Use array shaping to indicate data size to the compiler.
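Several of these tips can be combined in one small sketch. Again, this is my own illustration rather than code from the session, and the name scale_matrix and its parameters are assumptions.

```c
/* Hypothetical helper: out = s * in for a rows x cols matrix stored
   as one contiguous block in row-major order. */
void scale_matrix(int rows, int cols, float s,
                  const float *restrict in, float *restrict out)
{
    /* Array shaping: in[0:rows*cols] tells the compiler the start index
       and the number of elements to transfer, which it cannot infer
       from a bare pointer. copyin moves data to the accelerator only,
       copyout moves results back only. */
    #pragma acc kernels copyin(in[0:rows*cols]) copyout(out[0:rows*cols])
    for (int r = 0; r < rows; ++r) {          /* nested loops, both */
        for (int c = 0; c < cols; ++c) {      /* open to parallelization */
            /* Subscripts instead of pointer arithmetic such as *p++. */
            out[r * cols + c] = s * in[r * cols + c];
        }
    }
}
```

Storing the matrix as a single allocation, rather than as an array of row pointers, means one transfer covers the whole thing and the compiler can reason about the access pattern.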
That’s all for now from Day One at GTC! Be sure to watch the streamcast of this presentation.
About our Guest Blogger
Adnan Boz is an NVIDIA certified CUDA programmer working at Yahoo! He is also the organizer of the South Florida GPU Meetup.