
Accelerating Fortran DO CONCURRENT with GPUs and the NVIDIA HPC SDK

Fortran developers have long been able to accelerate their programs using CUDA Fortran or OpenACC.

For more up-to-date information, please read Using Fortran Standard Parallel Programming for GPU Acceleration, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing.

Now with the latest 20.11 release of the NVIDIA HPC SDK, the included NVFORTRAN compiler automatically accelerates DO CONCURRENT, allowing you to get the benefit of the full power of NVIDIA GPUs using ISO Standard Fortran without any extensions, directives, or non-standard libraries.

You can now write standard Fortran, remaining fully portable to other compilers and systems, and still benefit from the full power of NVIDIA GPUs. NVFORTRAN was built so that you can spend less time porting and more time on what really matters—solving the world’s problems with computational science.

ISO Standard Fortran 2008 introduced the DO CONCURRENT construct to allow you to express loop-level parallelism, one of the various mechanisms for expressing parallelism directly in the Fortran language. In this post, we focus on DO CONCURRENT. For more information about how the NVIDIA HPC SDK automatically accelerates Fortran array intrinsics on NVIDIA GPUs, see Bringing Tensor Cores to Standard Fortran.

NVIDIA HPC SDK

The NVIDIA HPC SDK is a comprehensive suite of compilers, libraries, and tools used to GPU-accelerate HPC applications. With support for NVIDIA GPUs and x86-64, OpenPOWER, or Arm CPUs running Linux, the NVIDIA HPC SDK provides proven tools and technologies for building cross-platform, performance-portable, and scalable HPC applications.

The NVIDIA HPC SDK includes the NVIDIA HPC Fortran compiler, NVFORTRAN. NVFORTRAN supports Fortran 2008, CUDA Fortran, OpenACC, and OpenMP.

Enabling Standard Parallelism

GPU acceleration of DO CONCURRENT is enabled with the -stdpar command-line option to NVFORTRAN. When -stdpar is specified, the compiler parallelizes DO CONCURRENT loops and offloads them to the GPU. All data movement between host memory and GPU device memory is performed implicitly and automatically under the control of CUDA Unified Memory.

A program can be compiled and linked with the following command:

nvfortran -stdpar program.f90 -o program

It is also possible to target a multi-core CPU with the following command:

nvfortran -stdpar=multicore program.f90 -o program

You can also compile a program to run on either a CPU or GPU using the following command. If your system has a GPU, the program runs on the GPU. If not, the program runs on the CPU.

nvfortran -stdpar=gpu,multicore program.f90 -o program

Start accelerating Fortran DO CONCURRENT with NVIDIA GPUs

To get started with GPU acceleration of Standard Fortran, all that’s required is expressing parallelism with DO CONCURRENT. The NVFORTRAN compiler generates code to run the iterations across GPU threads and thread blocks, or run them across host threads. The compiler can choose different mappings depending on the underlying hardware.

SAXPY

SAXPY is a simple vector operation that computes a vector-scalar multiplication with a scalar addition: y = a*x + y. A Fortran subroutine that does this might look like the following:

1  subroutine saxpy(x,y,n,a)
2    real :: a, x(:), y(:)
3    integer :: n, i  
4    do i = 1, n 
5      y(i) = a*x(i)+y(i)
6    enddo  
7  end subroutine saxpy

For this code, the compiler generates code that executes line 5, y(i) = a*x(i) + y(i), n times. The n loop iterations are executed sequentially when the program runs.

However, this algorithm can be parallelized to improve performance. Given enough resources, each of the n iterations can be executed simultaneously. In ISO Standard Fortran, you can express this by rewriting the SAXPY example with DO CONCURRENT:

1  subroutine saxpy(x,y,n,a)
2    real :: a, x(:), y(:)
3    integer :: n, i  
4    do concurrent (i = 1: n) 
5      y(i) = a*x(i)+y(i)
6    enddo  
7  end subroutine saxpy

By changing the DO loop to DO CONCURRENT, you are telling the compiler that there are no data dependencies between the n loop iterations. This leaves the compiler free to generate code that executes the iterations in any order, including simultaneously. Be aware that the compiler parallelizes the loop even if data dependencies are actually present, which results in race conditions and likely incorrect results. It is your responsibility to ensure that the loop is safe to parallelize.
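
By contrast, a loop with a loop-carried dependence is not a candidate for DO CONCURRENT. As a hypothetical illustration (not part of the SAXPY example), the following prefix-sum loop must remain an ordinary DO loop because each iteration reads the result of the previous one:

do i = 2, n
  y(i) = y(i-1) + x(i)  ! iteration i depends on iteration i-1
enddo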

Now, compile this program with -stdpar, which enables parallelism on an NVIDIA GPU. To see more information about the parallelization that the compiler enables, use the -Minfo flag with NVFORTRAN:

nvfortran -stdpar -Minfo=accel saxpy.f90 -o saxpy

NVFORTRAN prints the following -Minfo message, which indicates that the compiler parallelizes the DO CONCURRENT loop. For the NVIDIA GPU target, the compiler applies fine-grained parallelism. It distributes DO CONCURRENT iterations across the thread blocks and threads of the GPU kernel. As a result, the code takes advantage of the massive parallelism available in the GPU automatically.

saxpy:
      4, Offloading Do concurrent
         Generating Tesla code
         4, Loop parallelized across CUDA thread blocks, CUDA threads(128) blockidx%x threadidx%x

The full SAXPY example is included with the NVIDIA HPC SDK. After installation, the code is located at /path/to/nvhpc/<platform>/<version>/examples/stdpar/saxpy/saxpy.f90.

Coding guidelines

In this section, we make some recommendations for getting optimal results when expressing parallelism in standard Fortran using DO CONCURRENT.

Express more parallelism

Nested loops are a common code pattern encountered in HPC applications. A simple example might look like the following:

24  do i=2, n-1
25    do j=2, m-1
26      a(i,j) = w0 * b(i,j) 
27    enddo
28  enddo

It is straightforward to write such patterns with a single DO CONCURRENT statement, as in the following example. It is easier to read, and the compiler has more information available for optimization.

25  do concurrent(i=2 : n-1, j=2 : m-1)
26    a(i,j) = w0 * b(i,j) 
27  enddo

If you compile this code with -stdpar -Minfo, you can see how the compiler performs the parallelization. It collapses the two loops as expressed by DO CONCURRENT and parallelizes them together across the thread blocks and threads of the GPU kernel.

     25, Offloading Do concurrent
         Generating Tesla code
         25, Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(2) ! blockidx%x threadidx%x collapsed-innermost
             Loop parallelized across  ! blockidx%x threadidx%x auto-collapsed

Array assignments

Fortran has a commonly used syntax for array assignments. The compiler can automatically detect and parallelize array assignments inside a DO CONCURRENT loop. Here is an example:

4  do concurrent(i=1:N)
5     i0 = max(i-1,1)
6     i1 = min(i+1,N)
7     b(:,:,i) = 0.5 * (a(:,:,i0) + a(:,:,i1)) ! array assignment
8  end do

Look at the following parallelization message. The compiler first parallelizes the DO CONCURRENT loop across the thread blocks and then parallelizes the array assignment across the threads. In doing so, it generates a multidimensional kernel and makes full use of the massive parallelism available on the GPU.

      4, Offloading Do concurrent
         Generating Tesla code
         4, Loop parallelized across CUDA thread blocks ! blockidx%x
         7, Loop parallelized across CUDA threads(128) collapse(2) ! threadidx%x

Using localities with DO CONCURRENT

Another requirement in HPC programs is data privatization. This can be expressed with the standard Fortran DO CONCURRENT statement:

DO CONCURRENT (concurrent-header) [locality-spec]

In that statement, the locality-spec value is one of the following:

local(variable-name-list)
local_init(variable-name-list)
shared(variable-name-list)
default(none)

Each iteration of the DO CONCURRENT loop can have its own copy of a variable by using the local clause. If you also want that copy initialized with the value the variable had before entering the loop, use local_init. Look at the following example:

2  real::vals(100)
...
5  do concurrent(i=0:n) local(vals)
6    do j=0, 100 
7      vals(...) = ...
8    enddo
9    ... = vals(...)
10 enddo

The compiler creates a copy of vals for each processing unit of the outer loop and places vals in the faster CUDA shared memory.

5, Generating Tesla code
5, Loop parallelized across CUDA thread blocks ! blockidx%x
6, Loop parallelized across CUDA threads(128) ! threadidx%x
5, CUDA shared memory used for vals

However, be careful when privatizing variables. If the compiler cannot place them in shared memory, they are created in the GPU's global memory, which can hurt performance. For this reason, it is important to pay attention to the -Minfo output when privatizing variables.

Limitations

There are a few limitations on the usage of DO CONCURRENT with NVFORTRAN. First, procedure calls inside a DO CONCURRENT loop are not permitted in the 20.11 release of NVFORTRAN. Second, the Fortran 2018 standard does not specify a reduction clause for DO CONCURRENT. However, as this feature makes its way into the working draft of the next Fortran standard, NVFORTRAN will provide a preview implementation so that you can use standard parallelism features in Fortran programs that also need to perform a reduction.
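
As a preview of what such code could look like, the following sketch uses the REDUCE locality specifier proposed in the Fortran working draft. This syntax is not supported in the 20.11 release and is shown only for illustration:

subroutine dot(x, y, n, s)
  real :: x(:), y(:), s
  integer :: n, i
  s = 0.0
  do concurrent (i = 1:n) reduce(+:s)  ! draft-standard reduction; not available in 20.11
    s = s + x(i)*y(i)
  enddo
end subroutine dot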

Jacobi example

Here's a look at a slightly larger example, using the Jacobi method to solve the two-dimensional heat equation. A simple application of this equation is predicting the steady-state temperature distribution in a square plate that is heated on one side or corner. The Jacobi method approximates the square plate with a two-dimensional grid of points, and a two-dimensional array stores the temperature at each of these points. Each iteration updates the elements of the array from the values computed at the previous step.

In the following code example, you see the main computational kernel that updates the temperature on a 2D grid of points. It uses DO CONCURRENT loops to step through the 2D iteration space using the combined indexing capability of DO CONCURRENT as described earlier. This expresses more parallelism and allows the compiler to easily collapse the loops for greater optimization opportunities.

21  subroutine smooth( a, b, w0, w1, w2, n, m, niters )
22    real, dimension(:,:) :: a,b
23    real :: w0, w1, w2
24    integer :: n, m, niters
25    integer :: i, j, iter
26    do iter = 1,niters
27      do concurrent(i=2 : n-1, j=2 : m-1)
28        a(i,j) = w0 * b(i,j) + &
29                 w1 * (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) + &
30                 w2 * (b(i-1,j-1) + b(i-1,j+1) + b(i+1,j-1) + b(i+1,j+1))
31      enddo
32      do concurrent(i=2 : n-1, j=2 : m-1)
33        b(i,j) = w0 * a(i,j) + &
34                 w1 * (a(i-1,j) + a(i,j-1) + a(i+1,j) + a(i,j+1)) + &
35                 w2 * (a(i-1,j-1) + a(i-1,j+1) + a(i+1,j-1) + a(i+1,j+1))
36      enddo
37    enddo
38  end subroutine

If you compile this code with -stdpar -Minfo, you can see how the compiler parallelizes the loops:

smooth:
      27, Offloading Do concurrent
         Generating Tesla code
         27, Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(2) ! blockidx%x threadidx%x collapsed-innermost
             Loop parallelized across  ! blockidx%x threadidx%x auto-collapsed
     32, Offloading Do concurrent
         Generating Tesla code
         32, Loop parallelized across CUDA thread blocks, CUDA threads(128) collapse(2) ! blockidx%x threadidx%x collapsed-innermost
             Loop parallelized across  ! blockidx%x threadidx%x auto-collapsed

Jacobi performance

Figure 1 shows an almost 13x improvement in performance by running your completely ISO Standard Fortran code on a GPU compared to all 40 cores of a CPU. This means that you don’t have to “port” your code to run on the GPU. Recompile using NVFORTRAN and standard Fortran DO CONCURRENT, and you are ready to gain the acceleration benefits of an NVIDIA GPU. Because the parallelism is expressed in Fortran with no other additional languages or hints, the same code can be compiled with multiple compilers and run efficiently on both multicore CPUs and NVIDIA GPUs.

Figure 1. CPU vs. GPU performance comparison across programming models.

In this case, the ISO Fortran implementation can show nearly the same performance as the OpenACC implementation, which targeted heterogeneous systems explicitly. While performance varies depending on application characteristics, this shows the power of ISO language parallelism leveraging CUDA Unified Memory, allowing you to program in a single, familiar, parallel programming language.

You can find the full example included with the HPC SDK at /path/to/nvhpc/<platform>/<version>/examples/stdpar/jacobi/jacobi.f90.

Start running your Fortran programs on GPUs using DO CONCURRENT

You can start using DO CONCURRENT to get parallelism in your ISO Standard Fortran programs today on all NVIDIA GPUs, starting from the Pascal architecture and newer. To get started, download and install the NVIDIA HPC SDK on your x86-64, OpenPOWER, or Arm CPU-based system running a supported version of Linux.

The NVIDIA HPC SDK is freely available. After you have the NVIDIA HPC SDK installed on your system, the NVFORTRAN compiler is available under the /opt/nvidia/hpc_sdk directory structure.

  • To use the 20.11 version of the compilers including NVFORTRAN on a Linux/x86-64 system, add the directory /opt/nvidia/hpc_sdk/Linux_x86_64/20.11/compilers/bin to your path (see the example command after this list).
  • On an OpenPOWER or Arm CPU-based system, replace Linux_x86_64 with Linux_ppc64le or Linux_aarch64, respectively.
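
For example, assuming a bash-like shell on a Linux/x86-64 system, you can set the path as follows:

export PATH=/opt/nvidia/hpc_sdk/Linux_x86_64/20.11/compilers/bin:$PATH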

Supported NVIDIA GPUs

The NVFORTRAN compiler can automatically offload DO CONCURRENT loops to all NVIDIA GPUs starting from the Pascal architecture and newer. By default, NVFORTRAN auto-detects and generates GPU code for the type of GPU that is installed on the system on which the compiler is running.

To generate code for a specific GPU architecture, which may be necessary when the application is compiled and run on different systems, add the -gpu=ccXX command-line option to specify the desired compute capability. In addition, the compiler can generate executables targeted for multiple GPU architectures. For example, you can use the -gpu=cc70,cc80 option to generate code for Volta and Ampere architectures in a single binary.
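
For instance, combining the architecture list with -stdpar, such a build might look like the following command:

nvfortran -stdpar -gpu=cc70,cc80 program.f90 -o program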

The NVFORTRAN compiler can use CUDA libraries and technologies, so the NVIDIA HPC SDK compilers ship with an integrated CUDA toolchain, header files, and libraries to use during compilation. You don’t have to have the CUDA Toolkit installed on the system to use NVFORTRAN or the other HPC compilers from the HPC SDK.

When -stdpar is specified, NVFORTRAN uses the CUDA toolchain version that matches the CUDA driver installed on the system where the compilation is performed. To compile using a different version of the CUDA toolchain, use the -gpu=cudaX.Y option. For example, use the -gpu=cuda11.0 option to specify that your program should be compiled for a CUDA 11.0 system using the 11.0 toolchain.
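
For example, a command along these lines selects the CUDA 11.0 toolchain for a -stdpar build:

nvfortran -stdpar -gpu=cuda11.0 program.f90 -o program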

Summary

Parallel programming using standard language constructs offers clear benefits to HPC programmers. Using standard languages, you can compile your code to run on GPUs with no changes to your underlying source code, and have them perform at a level comparable to code using other programming models. Codes written using standard language constructs can be compiled today to run on all HPC platforms. Finally, standard languages enhance code efficiency and aid programmer productivity so that you can maximize your throughput and minimize your coding time.

The Fortran standard allows you to write code that expresses parallelism using DO CONCURRENT. Starting with the 20.11 release, NVFORTRAN enables truly performance-portable parallel programs across multicore CPUs and NVIDIA GPUs. In this post, we showed you how to get an almost 13x performance improvement when moving a Jacobi solver from all 40 cores of a CPU to an NVIDIA A100 GPU, all without any changes to the standard Fortran code.
