
Simplifying GPU Programming for HPC with NVIDIA Grace Hopper Superchip

The new hardware developments in NVIDIA Grace Hopper Superchip systems enable some dramatic changes to the way developers approach GPU programming. Most notably, the bidirectional, high-bandwidth, and cache-coherent connection between CPU and GPU memory means that the user can develop their application for both processors while using a single, unified address space. 

Each processor retains its own physical memory, designed with the bandwidth, latency, and capacity characteristics matched to the workloads most suited for that processor. Code written for existing discrete-memory GPU systems will continue to run efficiently, without modification, on the new NVIDIA Grace Hopper architecture.

Our recent post, Simplifying GPU Application Development with Heterogeneous Memory Management, details some of the benefits that a single address space brings to developers and how it works on systems with NVIDIA GPUs connected to x86_64 CPUs through PCIe. All application threads (GPU or CPU) can directly access all of the application’s system-allocated memory, removing the need to copy data between processors. 

This new ability to directly read or write to the full application memory address space significantly improves programmer productivity for all programming models built on top of CUDA: CUDA C++, CUDA Fortran, standard parallelism in ISO C++ and ISO Fortran, OpenACC, OpenMP, and many others. 

This post continues the Heterogeneous Memory Management (HMM) discussion in the context of Grace Hopper hardware, which provides all of the same programming model improvements as HMM-enabled systems, but with added hardware support to make it even better. 

Notably, any workload that is bottlenecked by host-to-device or device-to-host transfers can get up to a 7x speedup due to the chip-to-chip (C2C) interconnect in Grace Hopper systems. Cache coherency enables this performance without having to pin the memory (using cudaHostRegister, for example) if huge pages are used. While HMM and CUDA Managed Memory have historically been limited to migrating whole pages of data reactively on page faults, Grace Hopper is able to make better decisions on where data should reside and when it should migrate. 
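
For a concrete sense of what this enables, consider the minimal CUDA C++ sketch below (an illustrative example, not taken from any production code): a kernel operates directly on memory obtained from a plain malloc call, with no cudaMemcpy, cudaMallocManaged, or cudaHostRegister involved. On Grace Hopper, hardware coherence over the C2C link serves these accesses; on HMM-enabled systems, the same code works through on-demand migration.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double* data, double factor, size_t n) {
  size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
  if (i < n) data[i] *= factor;            // direct access to system-allocated memory
}

int main() {
  const size_t n = 1 << 20;
  double* data = static_cast<double*>(std::malloc(n * sizeof(double)));  // plain malloc, no cudaMalloc
  for (size_t i = 0; i < n; ++i) data[i] = 1.0;

  scale<<<(n + 255) / 256, 256>>>(data, 2.0, n);   // no explicit copies or pinning
  cudaDeviceSynchronize();

  std::printf("data[0] = %f\n", data[0]);          // CPU reads the result in place
  std::free(data);
  return 0;
}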

We will detail in this post how the NVIDIA HPC compilers take advantage of these new hardware capabilities to simplify GPU programming with ISO C++, ISO Fortran, OpenACC, and CUDA Fortran.

NVIDIA Grace Hopper systems provide the best performance available combined with a simplified GPU developer experience. HMM also brings this simplified developer experience to systems that are not Grace Hopper, while providing the optimal performance available when using PCIe. Developers can use these improved and simplified programming models in a portable way to get the best performance available on a wide variety of systems using NVIDIA GPUs.

Expanding stdpar with Grace Hopper unified memory 

Standard languages such as ISO C++ and ISO Fortran have been gaining features in recent years that enable developers to express the parallelism present in their applications directly in the base language itself, without using extensions or compiler directives. The NVIDIA HPC compilers can build these applications to run with high performance on NVIDIA GPUs. These features have been detailed in previous posts.

More specifically, we showed how standard language parallelism, also known as stdpar, can be used to greatly improve developer productivity and simplify GPU application development. However, we also pointed out a few limitations due to the nature of the separate memory spaces of the CPU and GPU. These include the inability to use some types of data in the C++ parallel algorithms, such as data allocated on the stack, global data, or data captured by reference in lambdas. 

For Fortran do concurrent loops, global variables could not be used in routines called from within the do concurrent loop, and size detection of data by the compiler for assumed-size arrays was limited. Now, Grace Hopper and its unified memory capabilities clear these limitations, and developing applications using stdpar becomes even simpler.

Simplifying OpenACC and CUDA Fortran 

GPU application developers have long preferred both OpenACC and CUDA Fortran due to their convenience and power. Along with CUDA C++, they have stood the test of time and are used in production by a large number of applications in HPC centers around the world. Both of these models provide robust ways of managing data residency and optimizing data transfer and movement. 

Now with the unified memory capabilities of Grace Hopper, application development can be dramatically simplified because these considerations for data location and movement can be automatically handled by the system. This reduces the effort spent on porting applications to run on GPUs and leaves more time for algorithm development. 

For fine-tuning performance, the developer may choose to selectively add information about data locality using the facilities already available in OpenACC and CUDA Fortran. The data directives and attributes already present in applications written for discrete-memory devices can be used to optimize for Grace Hopper’s unified memory without code changes.

Evaluating application performance using unified memory

The following sections explore several benchmarks and applications to understand how these new features not only simplify code development, but also impact expected runtime performance. 

SPECaccel 2023 benchmark

The SPECaccel® 2023 benchmark suite focuses on single-accelerator performance using the directive-based programming models OpenACC and OpenMP. The benchmarks are intended to show general GPU performance and are a good proxy for how many HPC applications can use the new unified memory features of Grace Hopper. 

Figure 1 compares the performance of OpenACC data directives to unified memory enabled through the NVHPC SDK compiler flag -gpu=unified. While the results followed the benchmark’s run rule requirements, they were measured on preproduction hardware, so they are considered estimates.

Figure 1. Estimated performance of multiple SPECaccel 2023 benchmarks using data directives compared to unified memory (most of the 13 benchmarks show very little difference between the two versions)

Most of the benchmarks show very little difference between using unified memory and memory managed with the OpenACC data directives with an overall slow-down of only ~1%.  463.swim, which primarily measures the memory performance, gains 28% with unified memory. With the data directives, an entire array is copied back for each time cycle, though only the inner triangular portion of the array is used on the host. 

Given that the printed data is non-contiguous, with data directives, it is advantageous to copy the entire array as one large block rather than many smaller blocks. With unified memory, far less data is accessed on the host, with only a portion of the array fetched from GPU memory.

The only significant slow-down is with the 404.lbm benchmark at 22%. The kernel times for each iteration have a slight overhead of 2 ms when using unified memory. Given that the kernel is executed 2,000 times, the overhead gets accumulated for about 3% of the difference. The larger issue is the entire 5 GB results array is check-pointed every 63 iterations, which needs to be accessed from the host. In this case, the CPU accessing the GPU memory roughly doubles the time, accounting for the remaining 19% of the difference.

Unified memory makes porting code significantly easier and, as in the case of SPECaccel, generally gives the same performance as using data directives. Programmers still need to be mindful of data placement, as on any other multi-socket system with non-uniform memory access (NUMA) characteristics. In most cases, however, data directives can now be considered a performance tuning option for cases such as 404.lbm, where large portions of data are accessed on both the CPU and GPU, rather than a requirement for porting code to the GPU.
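
As a sketch of what that tuning option can look like (a hypothetical checkpointing pattern, not the actual 404.lbm source), the data clauses and update directive below are no longer needed for correctness under -gpu=unified, but they tell the OpenACC implementation that the results array is used on both processors:

#include <stddef.h>

void checkpoint(const double* results, size_t n);   // assumed host-side I/O routine

void simulate(double* results, size_t n, int steps) {
  // With -gpu=unified this data region is optional; it serves only as a
  // placement hint that results is accessed on both CPU and GPU.
  #pragma acc data copy(results[0:n])
  {
    for (int step = 0; step < steps; ++step) {
      #pragma acc parallel loop
      for (size_t i = 0; i < n; ++i)
        results[i] += 1.0;                          // GPU updates every iteration

      if (step % 63 == 0) {
        #pragma acc update host(results[0:n])       // stage data before the host reads it
        checkpoint(results, n);                     // CPU accesses the results array
      }
    }
  }
}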

SPEC and SPECaccel are registered trademarks of the Standard Performance Evaluation Corporation.

LULESH

LULESH is a mini-application designed to simulate a simplified version of shock hydrodynamics representative of LLNL’s ALE3D application. It has been used for over a decade to understand C++ parallel programming models and their interaction with compilers and memory allocators.  

The stdpar implementation of LULESH uses C++ standard library containers for all the data structures on the GPU, which rely on the automatic migration of memory between CPU and GPU.  

Figure 2 shows that using unified memory does not affect the performance of LULESH, which is expected given that its data structures already rely on automatic migration. Both the managed and unified memory options lead to a LULESH figure-of-merit (FOM) of 2.09e5 on NVIDIA DGX GH200, which is 40% higher than the FOM with an NVIDIA H100 PCIe GPU and 6.5x faster than a 56-core Intel Xeon 8480+ CPU system.

Figure 2. Comparison of LULESH performance using managed and unified memory options on NVIDIA GH200 with NVIDIA H100 PCIe and a modern CPU (relative to the Intel Xeon 8480+ baseline: H100 PCIe 4.61x, GH200 managed memory 6.51x, GH200 unified memory 6.49x)

POT3D

POT3D approximates solar coronal magnetic fields by computing potential field solutions. It is developed by Predictive Science Inc. using modern Fortran. The application has historically run on GPUs using OpenACC, but the authors have now adopted a mixture of Fortran do concurrent to express data-parallel loops and OpenACC to optimize data movement to and from the GPU. 

As presented in the GTC session, From Directives to DO CONCURRENT: A Case Study in Standard Parallelism, the stdpar version of the code performed roughly 10% slower than the optimized OpenACC code. If OpenACC is used to optimize the data movement of the stdpar version, the performance is nearly identical. This means that the same performance was achieved while maintaining roughly 2,000 fewer lines of code. Does unified memory change this?

Figure 3 shows the performance of POT3D on Grace Hopper built in two ways. The blue bar is the performance baseline, which is Fortran do concurrent loops for parallelism and OpenACC data directives to optimize data movement. The green bar is built using the -gpu=unified option on Grace Hopper and with all OpenACC directives removed. 

The performance of the code is now the same as the fully optimized code without requiring any OpenACC. With the performance and productivity enhancements unified memory brings, POT3D can now be written in pure Fortran and get the same performance as the previously tuned OpenACC code.

Figure 3. POT3D performance using OpenACC data directives compared to Grace Hopper unified memory (the two versions perform equally)

How to enable and use unified memory in the NVIDIA HPC SDK 

As of NVHPC SDK release 23.11, developers aiming to use GPUs with unified memory capability can benefit from simplified programming interfaces. This release introduces a new compilation mode in the nvc++, nvc, and nvfortran compilers, which can be enabled by passing the flag -gpu=unified.

This section is a deep dive into the specific enhancements for unified memory across various programming models supported by the NVHPC SDK, which leverage the capabilities of the underlying hardware and the CUDA runtime to automatically handle data placement and memory migration between CPU and GPU physical memory.

stdpar

For stdpar, all data access restrictions have been removed. This means global variables can be accessed from both CPU and GPU, and unified memory compilation is now the default setting on compatible machines. However, when cross-compiling for different targets, the -gpu=unified flag needs to be passed explicitly to enable the new programming interface.

In the original release of stdpar C++ within nvc++, lambda functions in parallel algorithms had several restrictions. These have now been completely lifted. Developers can freely use data across different parallel algorithms and sequential code. This enables capturing variables by reference and accessing global variables within parallel algorithms:

#include <algorithm>
#include <execution>
#include <ranges>

constexpr int ARRAY_SIZE = 10000;  // example size
int init_val = 123;

void foo() {
  int my_array[ARRAY_SIZE];
  auto r = std::views::iota(0, ARRAY_SIZE);
  std::for_each(std::execution::par_unseq, r.begin(), r.end(),
                [&](auto i) { my_array[i] = init_val; });
}

If this code is compiled as shown below, the array my_array can be safely initialized on the GPU with each element being set in parallel using the value from the global variable init_val. Previously, accessing both my_array and init_val was not supported.

nvc++ -std=c++20 -stdpar -gpu=unified example.cpp

It is now also possible to use std::array safely with parallel algorithms, as illustrated by the example:

std::array<int, 10000> my_array = ...;
std::sort(std::execution::par, my_array.begin(), my_array.end());

The removal of data access limitations is a notable improvement, but it is important to remember that data races are still possible: for example, when different lambda instances running on the GPU simultaneously update the same global variable within a parallel algorithm.
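
The short sketch below (an assumed example, not from the compiler documentation) shows the hazard and one way to avoid it: summing into a global variable from within a parallel algorithm is a race, whereas expressing the same operation as a reduction is safe.

#include <execution>
#include <numeric>
#include <vector>

int global_sum = 0;   // reachable from GPU code when built with -stdpar -gpu=unified

int main() {
  std::vector<int> v(1000, 1);

  // Racy: every lambda instance would update the same global concurrently.
  // std::for_each(std::execution::par_unseq, v.begin(), v.end(),
  //               [](int x) { global_sum += x; });

  // Safe: express the accumulation as a reduction instead.
  global_sum = std::reduce(std::execution::par_unseq, v.begin(), v.end(), 0);
}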

Porting existing code to stdpar C++ and integrating third-party libraries is also simplified. When pointers to data used in parallel algorithms originate from allocation statements in separate files, those files no longer need to be compiled with nvc++ or with -stdpar.
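
A minimal sketch of that scenario, with hypothetical file names and a hypothetical helper function, might look as follows; only the file containing the parallel algorithm needs nvc++ and -stdpar:

// alloc.c, built separately, for example with: gcc -c alloc.c
//   #include <stdlib.h>
//   double* make_buffer(size_t n) { return malloc(n * sizeof(double)); }

// main.cpp, built with: nvc++ -std=c++20 -stdpar -gpu=unified main.cpp alloc.o
#include <algorithm>
#include <cstddef>
#include <execution>

extern "C" double* make_buffer(std::size_t n);   // hypothetical helper defined in alloc.c

int main() {
  const std::size_t n = 1 << 20;
  double* buf = make_buffer(n);                  // allocated outside any stdpar translation unit
  std::for_each_n(std::execution::par_unseq, buf, n,
                  [](double& x) { x = 42.0; });  // accessed from the GPU via unified memory
}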

For standard Fortran, some variable uses were previously unsupported. Now, it is possible to access global variables in routines called from do concurrent loops. Additionally, there were cases where the compiler could not accurately determine variable sizes for implicit data movements between GPU and CPU. These cases can now be handled correctly on targets with unified memory:

subroutine r(a, b)
  integer :: a(*)
  integer :: b(:)
  do concurrent (i = 1 : size(b))
    a(b(i)) = i 
  enddo
end subroutine

In the example above, the access region of the assumed-size array a inside the do concurrent construct cannot be determined at compile time, because the element index positions are taken from another array b that is initialized outside of the routine. This is no longer an issue when such code is compiled as follows:

nvfortran -stdpar -gpu=unified example.f90

The crucial aspect is that the compiler no longer requires precise knowledge of the data segment accessed within the loop. Automatic data transfers between GPU and CPU are now handled seamlessly by the CUDA runtime.

OpenACC

Now with unified memory mode, OpenACC programs no longer require explicit data clauses and directives. All variables are now accessible from the OpenACC compute regions. This implementation closely adheres to the shared memory mode detailed in the OpenACC specification.

The following C example illustrates an OpenACC parallel loop region that can now be executed correctly on GPUs without requiring any data clauses: 

int someval(int i, int j);   // assumed to be defined elsewhere

void set(int* ptr, int i, int j, int dim){
  int idx = i * dim + j;
  ptr[idx] = someval(i, j);
}

void fill2d(int* ptr, int dim){
#pragma acc parallel loop
  for (int i = 0; i < dim; i++)
    for (int j = 0; j < dim; j++)
      set(ptr, i, j, dim);
}

In C/C++, native language arrays are implicitly decayed into pointers when passed to functions, so the original array shape and size information is not preserved across function invocations. Moreover, dynamically sized arrays are represented by pointers. The use of pointers poses major challenges for automatic code optimization, as compilers lack essential information about the original data. 

While the OpenACC compiler has strong support for detecting the data segment accessed in loops to perform the data movement to the GPU implicitly, it cannot determine the segment in this case because the array is updated through ptr in another function, set, called inside the loop. Previously, supporting such cases in C was not possible. However, with unified memory mode enabled as shown below, such examples are now fully supported:

nvc -acc -gpu=unified example.c

Without -gpu=unified, the only way to guarantee correctness for this example is to change the line with the pragma directive:

#pragma acc parallel loop create(ptr[0:dim*dim]) copyout(ptr[0:dim*dim])

This explicitly instructs the OpenACC implementation about the precise data segment used within the parallel loop.

The Fortran example below illustrates how a global variable can now be accessed in the OpenACC routine without requiring any explicit annotations.

module m
integer :: globmin = 1234
contains
subroutine findmin(a)
!$acc routine seq
  integer, intent(in)  :: a(:)
  integer :: i
  do i = 1, size(a)
    if (a(i) .lt. globmin) then
      globmin = a(i)
    endif
  end do
end subroutine
end module m

When this example is compiled as shown below, the source does not need any OpenACC directives for the routine, invoked from CPU or GPU, to read or update the value of the module variable globmin.

nvfortran -acc -gpu=unified example.f90

Moreover, any access to globmin will refer to the exact same instance of the variable from both CPU and GPU, keeping its value synchronized automatically. Previously, such behavior could only be achieved by adding a combination of OpenACC declare and update directives in the source code.

In binaries compiled with -gpu=unified, the OpenACC runtime leverages data action information such as create/delete or copyin/copyout as optimizations, indicating preferable data placement to the CUDA runtime by means of memory hint APIs. For more details, see Simplifying GPU Application Development with Heterogeneous Memory Management.

Such actions originate either from explicit data clauses in the source code or are determined implicitly by the compiler. These optimizations can be used to fine-tune application performance by minimizing the amount of automatic data migration. 
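
For a rough idea of the kind of hint involved, the sketch below uses the CUDA runtime memory advice API directly. This is an illustration of the mechanism only, not the exact calls the OpenACC runtime issues, and it assumes a system where the runtime accepts advice on system-allocated memory:

#include <cstdlib>
#include <cuda_runtime.h>

int main() {
  const std::size_t bytes = 1 << 20;
  char* data = static_cast<char*>(std::malloc(bytes));  // system-allocated memory

  int device = 0;
  cudaGetDevice(&device);

  // Advise the CUDA runtime that this range should preferably reside in GPU
  // memory and will be accessed by that device, rather than relying purely
  // on reactive page-fault-driven migration.
  cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
  cudaMemAdvise(data, bytes, cudaMemAdviseSetAccessedBy, device);

  // ... launch kernels that use data ...

  std::free(data);
  return 0;
}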

For the C example above, while adding data clauses create(ptr[0:dim*dim]) and copyout(ptr[0:dim*dim]) is optional with -gpu=unified, their use in the OpenACC parallel loop directive may lead to a performance uplift.

CUDA Fortran

The addition of -gpu=unified also simplifies CUDA Fortran programming by removing restrictions on CPU-declared variables passed as arguments to global or device routines executing on the GPU. Moreover, it now permits referencing module or common block variables in such routines without requiring explicit attributes. This change does not affect variables explicitly annotated with existing data attributes: device, managed, constant, shared, or pinned.

module m
integer :: globval
contains 
attributes(global) subroutine fill(a)
  integer :: a(*)
  i = threadIdx%x
  a(i) = globval
end subroutine
end module m
program example
  use m
  use cudafor
  integer, parameter :: N = 32  ! example size
  integer :: a(N)
  globval = 123
  call fill<<<1, N>>> (a)
  e = cudaDeviceSynchronize()
end program

In the example above, the CPU stack-allocated array a is initialized in the kernel fill on the GPU with the value from the global variable globval assigned in the CPU code. As shown, a kernel routine, which is an entry point for execution on the GPU, is now able to directly access variables declared in the regular CPU host code. 

Details common across programming models

Binaries that have not been compiled with the new -gpu=unified flag will retain their existing performance characteristics on systems with and without unified memory alike. However, binaries compiled with -gpu=unified are not guaranteed to execute correctly on targets without unified memory capability. When linking the final binary for unified memory targets, passing -gpu=unified on the linker command line is required for correctness.
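
For example, a final link step for a unified memory target might look like the following (the object file names are illustrative):

nvc++ -stdpar -gpu=unified main.o solver.o -o app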

Many applications transitioning to architectures with unified memory can be recompiled with -gpu=unified seamlessly, without any code modifications. In addition, stdpar C++ and CUDA Fortran object files, whether compiled with or without -gpu=unified, can be linked together. However, linking object files that contain OpenACC directives or Fortran do concurrent and were compiled differently, with and without -gpu=unified, is currently not supported. 

Manual performance tuning for memory usage is currently achievable through the CUDA memory hints APIs for all the programming models that support unified memory, as well as through data directives for OpenACC programs.

The HPC SDK will continue to enhance support for unified memory in upcoming releases. For in-depth information regarding the current status, limitations, and future updates on this new functionality, refer to the NVIDIA HPC SDK documentation.

Summary 

The features and performance explained in this post are just the beginning of what NVIDIA Grace Hopper Superchip architecture and the NVIDIA software stack are bringing to developers. Future developments in the driver, CUDA software stack, and the NVIDIA HPC compilers are expected to remove even more restrictions on the way that users write their code, and to improve the performance of the resulting applications. 

