Simulation / Modeling / Design

Using Fortran Standard Parallel Programming for GPU Acceleration

Jun 12, 2022

By Miko Stulajter, Ronald M. Caplan, Jeff Larkin and Jon Linker

Discuss (8)

AI-Generated Summary

Dislike

Using Fortran's do concurrent feature allows for cleaner, more portable code and future-proofs it by relying on ISO language standards, as seen in the refactoring of the POT3D code.
Replacing OpenACC directives with do concurrent in POT3D reduced the number of directives from 80 to 3, and resulted in a ~10% performance hit on NVIDIA A100 GPUs when using managed memory.
The performance loss was attributed to unified memory, and regaining original performance required adding data movement directives back in, resulting in 50% fewer directives than the original code while maintaining similar performance on multiple GPUs with MPI.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Standard languages have begun adding features that compilers can use for accelerated GPU and CPU parallel programming, for instance, do concurrent loops and array math intrinsics in Fortran.

This is the fourth post in the Standard Parallel Programming series, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing:

Using standard language features has many advantages to offer, the chief advantage being future-proofness. As Fortran’s do concurrent is a standard language feature, the chances of support being lost in the future is slim.

This feature is also relatively simple to use in initial code development and it adds portability and parallelism. Using do concurrent for initial code development has the mental advantage of encouraging you to think about parallelism from the start as you write and implement loops.

For initial code development, do concurrent is a great way to add GPU support without having to learn directives. However, even code that has already been GPU-accelerated through the use of directives like OpenACC and OpenMP can benefit from refactoring to standard parallelism for the following reasons:

Cleaning up the code for those who do not know directives, or removing the large numbers of directives that make the source code distracting.
Increasing the portability of the code in terms of vendor support and longevity of support.
Future-proofing the code, as ISO-standard languages have a proven track record for stability and portability.

Replacing directives on multi-core CPUs and GPUs

POT3D is a Fortran code that computes potential field solutions to approximate the solar coronal magnetic field using surface field observations as input. It continues to be used for numerous studies of coronal structure and dynamics.

The code is highly parallelized using MPI and is GPU-accelerated using MPI with OpenACC. It is open source and available on GitHub. It is also a part of the SpecHPC 2021 Benchmark Suites.

We recently refactored another code example with do concurrent at WACCPD 2021. The results showed that you could replace directives with do concurrent without losing performance on multicore CPUs and GPUs. However, that code was somewhat simple in that there is no MPI.

Now, we want to explore replacing directives in more complicated code. POT3D contains nontrivial features for standard Fortran parallelism to handle: reductions, atomics, CUDA-aware MPI, and local stack arrays. We want to see if do concurrent can replace directives and retain the same performance.

To establish a performance baseline for refactoring the code to do concurrent, first review the initial timings of the original code in Figure 1. The CPU result was run on 64 MPI ranks (32 per socket) on a dual-socket AMD EPYC 7742 server, while the GPU result was run on one MPI rank on an NVIDIA A100 (40GB). The GPU code relies on data movement directives for data transfers (we do not use managed memory here) and is compiled with -acc=gpu -gpu=cc80,cuda11.5. The runtimes are an average over four runs.

The following highlighted text shows the number of lines of code and directives for the current version of the code. You can see that there are 80 directives, but we hope to reduce this number by refactoring with do concurrent.

	POT3D (Original)
Fortran	3,487
Comments	3,452
OpenACC Directives	80
Total	7,019

Table 1. Code facts for POT3D including Fortran, comments, OpenACC Directives, and total lines.

Do concurrent and OpenACC

Here are some examples of do concurrent compared to OpenACC from the code POT3D, such as a triple-nested OpenACC parallelized loop:

!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
…
!$acc parallel loop default(present) collapse(3) async(1)
do k=1,np
 do j=1,nt
  do i=1,nrm1
   br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
  enddo
 enddo
enddo
…
!$acc wait
!$acc exit data delete(phi,dr_i,br)

As mentioned earlier, this OpenACC code is compiled with the flag -acc=gpu -gpu=cc80,cuda11.5 to run on an NVIDIA GPU.

You can parallelize this same loop with do concurrent and rely on NVIDIA CUDA Unified Memory for data movement instead of directives. This results in the following code:

do concurrent (k=1:np,j=1:nt,i=1:nrm1)
 br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k ))*dr_i(i)
enddo

As you can see, the loop has been condensed from 12 lines to three, and CPU portability and GPU parallelism are retained with the nvfortran compiler from the NVIDIA HPC SDK.

This reduction in the number of lines is thanks to collapsing multiple loops into one loop and relying on managed memory, which removes all data movement directives. Compile this code for the GPU with -stdpar=gpu -gpu=cc80,cuda11.5.

For nvfortran, activating standard parallelism (-stdpar=gpu) automatically activates managed memory. To use OpenACC directives to control data movement along with do concurrent, use the following flags: -acc=gpu -gpu=nomanaged.

The nvfortran implementation of do concurrent also allows for locality of variables to be defined:

do concurrent (k=1:N, j(i)>0) local(M) shared(J,K)
 M = mod(K(i), J(i))
 K(i) = K(i)- M
enddo

This may be necessary for some code. For POT3D, the default locality of variables performs as needed. The default locality is the same as OpenACC with nvfortran.

do concurrent CPU performance and GPU implementation

Replacing all the OpenACC loops with do concurrent and relying on managed memory for data movement leads you to code with zero directives and fewer lines. We removed 80 directives and 66 lines of Fortran.

Figure 2 shows that this do concurrent version of the code has nearly the same performance on the CPU as the original GitHub code. This means that you haven’t broken CPU compatibility by using do concurrent. Instead, multi-core parallelism has also been added, which can be used by compiling with the flag -stdpar=multicore.

Unlike the case of the CPU, to be able to run POT3D on a GPU, you must add a couple of directives.

First, to take advantage of multiple GPU with MPI, you need a directive to specify the GPU device number. Otherwise, all MPI ranks would use the same GPU.

!$acc set device_num(mpi_shared_rank_num)

In this example, mpi_shared_rank_num is the MPI rank within a node. It’s assumed that the code is launched such that the number of MPI ranks per node is the same as the number of GPUs per node. This can also be accomplished by setting CUDA_VISIBLE_DEVICES for each MPI rank, but we prefer doing this programmatically.

When using managed memory with multiple GPUs, make sure that the device selection (such as !$acc set device_num(N)) is done before any data is allocated. Otherwise, an additional CUDA context is created, introducing additional overhead.

Currently, the nvfortran compiler does not support array reductions on concurrent loops, which are required in two places in the code. Fortunately, an OpenACC atomic directive can be used in place of an array reduction:

do concurrent (k=2:npm1,i=1:nr)
!$acc atomic
 sum0(i)=sum0(i)+x(i,2,k)*dph(k )*pl_i
enddo

After adding this directive, change the compiler options to enable OpenACC explicitly by using -stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5. This allows you to use only three OpenACC directives. This is the closest this code can come to not having directives at this time.

All the data movement directives are unnecessary, since CUDA managed memory is used for all data structures. Table 2 shows the number of directives and lines of code needed for this version of POT3D.

	POT3D (Original)	POT3D (Do Concurrent)	Difference
Fortran	3487	3421	(-66)
Comments	3452	3448	(-4)
OpenACC Directives	80	3	(-77)
Total	7019	6872	(-147)

Table 2. Number of lines of code for the GPU compatible do concurrent version of POT3D with a breakdown of the number of Fortran lines, directives, and comments lines.

For the reduction loops in POT3D, you relied on implicit reductions, but that may not always work. Recently, nvfortran has added the upcoming Fortran 202X reduce clause, which can be used on reduction loops as follows:

do concurrent (k=1:N) reduce(+:cgdot)
 cgdot=cgdot+x(i)*y(i)
enddo

GPU performance, unified memory, and data movement

You’ve developed code with the minimal number of OpenACC directives and do concurrent that relies on managed memory for data movement. This is the closest directive-free code that you can have at this time.

Figure 3 shows that this code version takes a small performance hit of ~10% when compared to the original OpenACC GPU code. The cause of this could be do concurrent, managed memory, or a combination.

To see if managed memory causes the small performance hit, compile the original GitHub code with managed memory turned on. This is done by using the compile flag -gpu=managed in addition to the standard OpenACC flags used before for the GPU.

Figure 4 shows that the GitHub code now performs similar to the minimum directives code with managed memory. This means that the culprit for the small performance loss is unified memory.

To regain the performance of the original code with the minimal directives code, you must add the data movement directives back in. This mixture of do concurrent and data movement directives would look like the following code example:

!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
 br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k ))*dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)

This results in the code having 41 directives, with 38 being responsible for data movement. To compile the code and rely on the data movement directives, run the following command:

-stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5,nomanaged

nomanaged turns off managed memory and -acc=gpu turns on the directive recognition.

Figure 5 shows that there is nearly the same performance as the original GitHub code. This code has 50% fewer directives than the original code and gives the same performance!

MPI + DO CONCURRENT scaling

Figure 7 shows the timing results using multiple GPUs. The primary takeaway is that do concurrent works with MPI over multiple GPUs.

Looking at the codes with managed memory turned on (blue lines), you can see that the original code and the minimal directives code gave nearly the same performance as multiple GPUs were used.

Looking at the codes with managed memory turned off (green lines), you can again see the same scaling between the original GitHub code and the do concurrent version of the code. This indicates that do concurrent works with MPI and has no impact on the scaling you should see.

What you might also notice is that managed memory causes an overhead as the GPUs are scaled. The managed memory runs (blue lines) and data directive lines (green lines) are parallel to each other, meaning the overhead scales with the number of GPUs.

Fortran standard parallel programming review

You may be wondering, “Standard Fortran sounds too good to be true, what is the catch?”

Fortran standard parallel programming enables cleaner looking code and increases the future proofness of your code by relying on ISO language standards. Using the latest nvfortran compiler, you gain all the benefits mentioned earlier.

Although you lose the current GCC OpenACC/MP GPU support when you transition to do concurrent, we expect to gain more GPU support in the future as other vendors add support of do concurrent on GPUs. Given the track record of ISO language standards, we believe that this support will come.

Using do concurrent does currently come with a small number of limitations, namely the lack of support for atomics, device selection, asynchrony, or optimizing data movement. As we have shown, however, each of these limitations can be easily worked around using compiler directives. Far fewer directives are required thanks to the native parallel language features in Fortran.

Ready to get started? Download the free NVIDIA HPC SDK, and start testing! If you are also interested in more details on our findings, see the From Directives to DO CONCURRENT: A Case Study in Standard Parallelism GTC session. For more information about standard language parallelism, see Developing Accelerated Code with Standard Language Parallelism.

Acknowledgments

This work was supported by the National Science Foundation, NASA, and the Air Force Office of Scientific Research. Computational resources were provided by the Computational Science Resource Center at San Diego State University.

Discuss (8)

About the Authors

About Miko Stulajter
Miko Stulajter is currently working as a research assistant at Predictive Science, focusing on high performance computing and algorithm development. His current research interests are in high performance computing and machine learning. He's also pursuing his Ph.D. in computational science in the joint doctoral program at San Diego State University and the University of California, Irvine. He has an M.S.E. in materials science and engineering from the University of Pennsylvania and a B.S. in mathematics and computer science and physics from DePaul University.

View all posts by Miko Stulajter

About Ronald M. Caplan
Dr. Ronald M. Caplan is a computational scientist whose main interests are in developing and optimizing numerical methods for simulating physics-based models and their implementations in parallel high-performance-computing environments. His research currently focuses on the continued development and optimization of Predictive Science's magnetohydrodynamic codes used to study the solar corona and heliosphere, as well as providing computational solutions for additional projects.

View all posts by Ronald M. Caplan

About Jeff Larkin
Jeff is a Principal HPC Application Architect in NVIDIA's HPC Software team. He is passionate about the advancement and adoption of parallel programming models for High Performance Computing. He was previously a member of NVIDIA's Developer Technology group, specializing in performance analysis and optimization of high performance computing applications. Jeff is also the chair of the OpenACC technical committee and has worked in both the OpenACC and OpenMP standards bodies. Before joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence, located at Oak Ridge National Laboratory.

View all posts by Jeff Larkin

About Jon Linker
Dr. Linker's interests in solar and heliospheric physics include the structure of the large-scale corona, the evolution and eruption of coronal magnetic fields, the initiation of coronal mass ejections, and the application of coronal and solar wind models to space weather forecasting.

View all posts by Jon Linker