Bringing Tensor Cores to Standard Fortran

Tuned math libraries are an easy and dependable way to extract the ultimate performance from your HPC system. However, for long-lived applications or those that need to run on a variety of platforms, adapting library calls for each vendor or library version can be a maintenance nightmare.

A compiler that can automatically generate calls to tuned math libraries gives you the best of both worlds: easy portability and ultimate performance. In this post, I show how you can seamlessly accelerate many standard Fortran array intrinsics and language constructs on GPUs. The nvfortran compiler enables this acceleration automatically by mapping Fortran statements to the functions available in the NVIDIA cuTENSOR library, a first-of-its-kind, GPU-accelerated, tensor linear algebra library providing tensor contraction, reduction, and element-wise operations.

An easy on-ramp to NVIDIA GPUs

Here’s how standard Fortran array intrinsic functions can map to GPU-accelerated math libraries. At the simplest level, only two Fortran statements are required to take advantage of the outstanding performance provided by the cuTENSOR library:

use cutensorex
...
c = matmul(a,b)

The first statement using the cutensorex predefined module includes interfaces to the cuTENSOR library in the form of overloaded Fortran intrinsic procedures, array expressions, and overloaded assignments. The interfaces are written to map only arrays located in GPU device memory. Later in this post, I discuss what that means from the OpenACC and CUDA Fortran programmer point-of-view. With these interfaces defined, the second statement containing the matmul() intrinsic call is automatically mapped to a cuTENSOR function call.

The interfaces implement deferred execution by recognizing and matching several commonly used patterns that can map to a single cuTENSOR kernel invocation. In all cases, multiple cuTENSOR functions are called to set up the handle, descriptor data structures, and work buffers that cuTENSOR requires.

However, only one kernel is launched onto the GPU. It is important for performance reasons to map the entire statement, including the assignment to the left-side array. You don’t want the compiler creating temporary arrays, as is common in Fortran, for either the inputs or results (intermediate or final) of the right-side operation.

Supported standard Fortran operations

The cuTENSOR library contains general permutation and contraction operations. The result of the permutation can optionally be operated on by an elemental function, and optionally scaled.

The nvfortran compiler can recognize and map a variety of Fortran transformational intrinsics and elemental intrinsic functions used in combination with general array syntax to cuTENSOR functionality. A few of the more straightforward translations include the following:

d = transpose(a)
d = func(transpose(a))
d = alpha * func(transpose(a)
 
d = reshape(a,shape=[...])
d = reshape(a,shape=[...],order=[...])
d = func(reshape(a,...))
d = alpha * func(reshape(a,...))
 
d = spread(a,dim=k,ncopies=n)
d = func(spread(a,dim=k,ncopies=n))
d = alpha * func(spread(a,dim=k,ncopies=n))

The inputs to matmul() can also be permuted in cuTENSOR, and the result can be scaled and accumulated. That leads to several possible combinations, such as the following statements:

c = matmul(a,b)
c = c + matmul(a,b)
c = c - matmul(a,b)
c = c + alpha * matmul(a,b)
d = alpha * matmul(a,b) + beta * c
 
c = matmul(transpose(a),b)
c = matmul(reshape(a,shape=[...],order=[...]),b)
c = matmul(a,transpose(b))
c = matmul(a,reshape(b,shape=[...],order=[...]))

Using NVIDIA Tensor Cores from standard Fortran

Taking advantage of cuTENSOR and NVIDIA Tensor Cores can be as easy as the following code example when you use features for random number generation that are included in the cutensorex module:

program main
      use cutensorex
      integer, parameter :: ni=5120, nj=5120, nk=5120, ntimes=10
      real(8), allocatable, dimension(:,:) :: a, b, d
      allocate(a(ni,nk),b(nk,nj),d(ni,nj))
      call random_number(a)
      call random_number(b)
      d = 0.0d0
 
      print *,"cutensor"
      call cpu_time(t1)
      do nt = 1, ntimes
        d = d + matmul(a,b)
      end do
      call cpu_time(t2)
 
      flops = 2.0*ni*nj*nk
      flops = flops*ntimes
      print *,"times",t2,t1,t2-t1
      print *,"GFlops",flops/(t2-t1)/1.e9
      end program

The matmul() intrinsic call is mapped to cuTENSOR calls that seamlessly use Tensor Cores wherever possible. I show some performance results later in this post.

Compiling the program with nvfortran

You may ask how this program uses cuTENSOR, when I stated earlier that the cutensorex interfaces only map operations on GPU device arrays to cuTENSOR calls. The answer lies in how the program is compiled:

% nvfortran -acc -gpu=managed -cuda -cudalib main.f90

Here, I am compiling the program as an OpenACC program and taking advantage of the OpenACC managed memory mode in which all allocatable arrays are allocated in CUDA Unified Memory. With the addition of -cuda that enables CUDA Fortran extensions as well, the arrays are essentially CUDA Fortran–managed arrays. One rule of CUDA Fortran generic interface matching is to prefer the device interface for managed actual arguments, when both host and device interfaces exist.

The nvfortran compiler provides some shortcuts when the declarations, allocations, and use are in the same program unit. In general, it’s better to use OpenACC directives to instruct the compiler to pass device addresses, as in the following code example:

!$acc host_data use_device(a, b, d)
      do nt = 1, ntimes
        d = d + matmul(a,b)
      end do
!$acc end host_data

In this case, the -cuda compiler option is not required.

Using cuTENSOR from CUDA Fortran

For CUDA Fortran users, the cutensorex module and Fortran transformational intrinsics become a fast path to high performance and fully portable code. Use the !@cuf sentinel to add lines of code that are interpreted and compiled by the nvfortran CUDA Fortran compiler, or ignored as a comment by a standard Fortran compiler:

      program main
!@cuf use cutensorex
!@cuf use cudafor
      integer, parameter :: ni=5120, nj=5120, nk=5120, ntimes=10
      real(8), allocatable, dimension(:,:) :: a, b, d
!@cuf attributes(device) :: a, b, d
      allocate(a(ni,nk),b(nk,nj),d(ni,nj))
      call random_number(a)
      call random_number(b)
      d = 0.0d0
 
      print *,"cutensor"
      call cpu_time(t1)
      do nt = 1, ntimes
        d = d + matmul(a,b)
      end do
      call cpu_time(t2)
 
      flops = 2.0*ni*nj*nk
      flops = flops*ntimes
      print *,"times",t2,t1,t2-t1
      print *,"GFlops",flops/(t2-t1)/1.e9
      end program

On line 6, I declared the arrays with the device attribute, which places them in GPU device memory. However, they could be declared with the managed attribute as well. This program can be compiled and linked with the following command:

% nvfortran -Mcudalib main.cuf

Measured performance on real(8) data

Here’s a look at performance, starting with real(8) (double precision) data as used in the earlier examples. You can measure matrix multiply performance in several ways:

Single-threaded CPU implementation
Multi-threaded or multicore CPU implementation
Naively coded matrix multiply offloaded using directives
The matmul() intrinsic mapped to cuTENSOR

To get the best threaded-CPU performance, use the basic linear algebra subprogram (BLAS) library routine DGEMM. The equivalent DGEMM call to the earlier operation is the following command:

call dgemm('n','n',ni,nj,nk,1.0d0,a,ni,b,nk,1.0d0,d,ni)

To get an understanding of what tuned libraries can provide over a naive implementation, use the following OpenACC loop structure to run on the GPU. The loop structure uses no special tiling or hardware instructions.

!$acc kernels
      do j = 1, nj
         do i = 1, ni
            do k = 1, nk
               d(i,j) = d(i,j) + a(i,k) * b(k,j)
            end do
         end do
      end do
!$acc end kernels

Table 1 shows the real(8) performance attained on one NUMA node of a dual-socket AMD EPYC 7742 Rome CPU-based server, a single NVIDIA V100, and a single NVIDIA A100 GPU.

Implementation / Processor	TFLOPs
nvfortran matmul on a single CPU core	0.010
MKL DGEMM on 64 CPU cores	1.674
Naive OpenACC on V100	0.235
Naive OpenACC on A100	0.447
nvfortran matmul on V100	6.866
nvfortran matmul on A100	17.660

Table 1. real(8) performance on one NUMA node of a dual-socket AMD EPYC 7742 Rome CPU-based server, a single V100, and a single A100 GPU.

Not only do you get automatic GPU acceleration on V100 and A100 GPUs using the matmul() intrinsic, but on the A100 the mapping of matmul() to cuTENSOR calls gets you automatic use of FP64 Tensor Cores.

Measured performance on real(4) and real(2) data

You can perform the same set of runs using real(4) (single precision) data and calling SGEMM rather than DGEMM. In addition, the CUDA 11.0 cuTENSOR Fortran wrappers can take advantage of the A100 TF32 data type and Tensor Cores. Table 2 shows the performance for those runs.

Implementation / Processor	TFLOPs
nvfortran matmul on a single CPU core	0.025
MKL SGEMM on 64 CPU cores	3.017
Naive OpenACC on V100	0.460
Naive OpenACC on A100	0.946
nvfortran matmul on V100	10.706
nvfortran matmul on A100	14.621
nvfortran matmul on A100 using TF32	60.358

Table 2. real(4) performance on one NUMA node of a dual-socket AMD EPYC 7742 Rome CPU-based server, a single V100, and a single A100 GPU.

Why stop there? The nvfortran compiler supports 16-bit floating-point format (FP16) using the real(2) data type. You can change the types of the arrays in the earlier tests and run timings on half-precision as well.

Tensor Core operations were introduced on V100 for half-precision data, then extended on the A100 GPU to support TF32 and full double-precision DP64 Tensor Cores. While nvfortran supports real(2) and Tensor Cores on V100 and A100, it doesn’t support full and optimized real(2) on CPUs yet and neither do the standard BLAS libraries. In this case, it only makes sense to compare the performance of the GPU-accelerated versions (Table 3).

Implementation / Processor	TFLOPs
Naive OpenACC on V100	0.490
Naive OpenACC on A100	2.058
nvfortran matmul on V100	68.242
nvfortran matmul on A100	92.810

Table 3. real(2) performance on a single V100 and a single A100 GPU.

While the A100 performance is impressive and the code is fully portable, it is significantly below peak for TF32 and FP16. There is fixed overhead: at each call, you create and destroy the cuTENSOR tensor descriptors and create the contraction plan. You must also query and manage the workspace requirements used in the contraction, which may ultimately invoke cudaMalloc and cudaFree. If the overhead is 5–10% for FP64, that becomes closer to 25% for TF32 and around 35% for FP16, for this size of problem.

For developers who need ultimate performance, nvfortran does directly support Fortran interfaces to the C cuTENSOR API in the Fortran cutensor module, also provided within the HPC SDK. You can manage the tensor descriptors, plans, and workspace yourself.

Conclusion

In this post, I’ve shown some simple programs and the types of Fortran intrinsics calls and code patterns that can be automatically accelerated on GPUs. They can even automatically take advantage of Tensor Cores through cuTENSOR. With programs that are almost completely standard Fortran and fully portable to other compilers and systems, you can achieve near-peak performance on NVIDIA GPUs on matrix multiplication, matrix transpose, elemental array intrinsics, and multiple combinations of array syntax.

It’s impossible to predict what you might do or achieve with these new features. I look forward to seeing your feedback and results. NVIDIA continues to add more features that allow you to program NVIDIA GPUs with maximum performance using standard Fortran constructs.