NVIDIA HPC Fortran, C++ and C Compilers with OpenACC

With the NVIDIA HPC compilers for NVIDIA data center GPUs and x86-64, OpenPOWER and Arm Server multicore CPUs, programmers can accelerate science and engineering applications using Standard C++ and Fortran parallel constructs, OpenACC directives and CUDA Fortran.


Code samples using three different GPU programming models. The NVIDIA HPC compilers split execution of an application across multicore CPUs and NVIDIA GPUs using standard language constructs, directives, or CUDA.


Key Advantages

The NVIDIA Fortran, C++ and C compilers enable cross-platform HPC programming for NVIDIA GPUs and multicore CPUs. They are fully interoperable with NVIDIA optimized math libraries, communication libraries, and performance tuning and debugging tools. Commercial support is available with NVIDIA HPC Compiler Support Services (HCSS).

Full C++17 including Parallel Algorithms

The NVC++ compiler supports all features of C++17 including automatic acceleration of the C++17 Parallel Algorithms on NVIDIA GPUs.

OpenACC for Fortran, C++ and C Applications

Accelerate HPC applications with OpenACC directives, the proven solution used by over 200 HPC applications for performance-portable GPU programming.

Easy Access to NVIDIA Tensor Cores

The NVFORTRAN compiler can automatically accelerate standard Fortran array intrinsics and array syntax on NVIDIA Tensor Core GPUs.

For x86-64, Arm and OpenPOWER CPUs

Develop HPC applications on servers containing any mainstream CPU. The NVIDIA HPC compilers are supported on over 99% of Top 500 systems.




World-Class CPU Performance, GPU Acceleration

NVIDIA HPC compilers deliver the performance you need on CPUs, with OpenACC and CUDA Fortran for HPC applications development on GPU-accelerated systems. OpenACC and CUDA programs can run several times faster on a single NVIDIA A100 GPU compared to all the cores of a dual-socket server, and interoperate with MPI and OpenMP to deliver the full power of today’s multi-GPU servers.



SPEC ACCEL OpenACC and OpenMP performance comparison.

On HPC industry-standard benchmarks running on the latest multicore processors, NVIDIA OpenACC performance is equivalent to its OpenMP performance. The same OpenACC code running on the latest A100 GPU is over 4x faster.



Relative performance of C++ parallel algorithms and OpenACC of LULESH on GPUs compared to CPU.

Now, with the NVIDIA NVC++ compiler, standard C++17 parallel algorithms can run on either the CPU or the GPU without code changes. On the popular LULESH mini-app, A100 GPU performance is comparable to that of OpenACC, and over 7x faster than the multithreaded CPU version.


Multi-GPU performance comparison of CloverLeaf.

The exact same OpenACC source code can be compiled to target both GPUs and multicore CPUs by simply changing the target architecture during compilation. A single A100 is 5x faster than a fast CPU node, and it scales well to 2, 4, or 8 GPUs using MPI+OpenACC.



Features



C++ Parallel Algorithms: Accelerated

The C++17 Standard introduced higher-level parallelism features that allow users to request parallelization of Standard Library algorithms by adding an execution policy as the first parameter to any algorithm that supports them. Most of the existing Standard C++ algorithms now support execution policies, and C++17 defined several new parallel algorithms, including the useful std::reduce and std::transform_reduce. The NVIDIA NVC++ compiler offers a comprehensive and high-performance implementation of the Parallel Algorithms for NVIDIA V100 and A100 datacenter GPUs, so you can get started with GPU programming using standard C++ that is portable to most C++ implementations for Linux, Windows, and macOS. The NVIDIA C++ Parallel Algorithms implementation is fully interoperable with OpenACC and CUDA for use in the same application.
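As a minimal sketch of what this looks like in source, the following standard C++17 code passes an execution policy as the first argument to `std::transform_reduce` and `std::transform`. The function names `dot` and `saxpy` are illustrative and not part of any NVIDIA API. The sketch assumes that NVC++ is invoked with a GPU option such as `-stdpar=gpu` so that calls made with `std::execution::par` are offloaded; other C++17 compilers run the same source on the CPU.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <utility>
#include <vector>

// Dot product written once against an execution policy (hypothetical helper).
// Assumes y holds at least as many elements as x.
template <class Policy>
double dot(Policy&& policy, const std::vector<double>& x,
           const std::vector<double>& y) {
    // transform_reduce multiplies element pairs and sums the products.
    return std::transform_reduce(std::forward<Policy>(policy),
                                 x.begin(), x.end(), y.begin(), 0.0);
}

// y = a * x + y, expressed with the policy-taking overload of std::transform.
// Assumes y holds at least as many elements as x.
template <class Policy>
void saxpy(Policy&& policy, float a, const std::vector<float>& x,
           std::vector<float>& y) {
    std::transform(std::forward<Policy>(policy),
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
```

A caller requests parallel execution with `dot(std::execution::par, x, y)` and sequential execution with `dot(std::execution::seq, x, y)`; the algorithm body is unchanged in both cases.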



Leverage NVIDIA Tensor Cores

NVIDIA A100 and V100 Datacenter GPU Tensor Cores enable fast FP16 matrix multiplication and accumulation into FP16 or FP32 results with performance 8x to 16x faster than pure FP32 or FP64 in the same power envelope. NVIDIA A100 GPUs add Tensor Cores support for TF32 and FP64 data types, enabling scientists and engineers to dramatically accelerate suitable math library routines and applications using mixed-precision, single-precision or full double-precision. With the NVIDIA HPC Fortran compiler, you can leverage Tensor Cores in your CUDA Fortran and OpenACC applications through automatic mapping of Fortran array intrinsics to cuTENSOR library calls, or by using the CUDA API interface to Tensor Core programming in a pre-defined CUDA Fortran module.





OpenACC for GPUs and CPUs

NVIDIA HPC compilers support full OpenACC 2.6 and many OpenACC 2.7 features on both NVIDIA datacenter GPUs and multicore CPUs. Use OpenACC directives to incrementally parallelize and accelerate applications, starting with your most time-intensive loops and routines and gradually accelerating all appropriate parts of your application while retaining full portability to other compilers and systems. NVIDIA compilers leverage CUDA Unified Memory to simplify OpenACC programming on GPU-accelerated x86-64, Arm and OpenPOWER processor-based servers. When OpenACC allocatable data is placed in CUDA Unified Memory, no explicit data movement or data directives are needed, simplifying GPU acceleration of applications and allowing you to focus on parallelization and scalability of your algorithms.
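The incremental approach described above can be sketched with two annotated loops. The function names `saxpy_acc` and `sum_acc` are illustrative. The sketch assumes compilation with an OpenACC-enabled compiler, for example NVC++ with an option along the lines of `-acc=gpu` or `-acc=multicore`; a compiler without OpenACC support ignores the directives and runs the loops serially, which is what keeps the source portable. With CUDA Unified Memory, the `copyin` and `copy` data clauses shown here could be omitted.

```cpp
#include <cstddef>

// y = a * x + y over raw arrays (hypothetical helper).
void saxpy_acc(std::size_t n, float a, const float* x, float* y) {
    // copyin: x is only read on the device; copy: y is read and written back.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Sum of an array using an OpenACC reduction (hypothetical helper).
double sum_acc(std::size_t n, const double* x) {
    double total = 0.0;
    // reduction(+:total) combines the per-thread partial sums safely.
    #pragma acc parallel loop reduction(+:total) copyin(x[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}
```

Starting from serial code, each directive is a one-line addition, so hot loops can be annotated one at a time and verified before moving on.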



OpenMP for Multicore CPUs

NVIDIA HPC Fortran, C++ and C compilers include support for OpenMP 4.5 syntax and features. You can compile OpenMP 4.5 programs for parallel execution across all the cores of a multicore CPU or server. TARGET regions are implemented with default support for the multicore host as the target, and PARALLEL and DISTRIBUTE loops are parallelized across all OpenMP threads. Multicore CPU performance remains one of the key strengths of the NVIDIA compilers, which now support all three major CPU families used in HPC systems: x86-64, 64-bit Arm and OpenPOWER. NVIDIA compilers deliver state-of-the-art SIMD vectorization and benefit from optimized single and double precision numerical intrinsic functions that use a uniform implementation across all types of CPUs to deliver consistent results across systems for both scalar and SIMD execution.
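A minimal sketch of an OpenMP 4.5 loop of the kind described above follows. The function name `dot_omp` is illustrative. It assumes the compiler is run with OpenMP enabled (for example, an option along the lines of `-mp` with NVC++ or `-fopenmp` with GCC); with the multicore host as the default target, the `target` region executes on the CPU and the loop iterations are spread across the OpenMP threads.

```cpp
// Dot product using an OpenMP 4.5 combined target construct (hypothetical helper).
double dot_omp(int n, const double* x, const double* y) {
    double sum = 0.0;
    // map(to: ...) makes the inputs visible in the target region;
    // map(tofrom: sum) carries the reduction result back to the caller.
    #pragma omp target teams distribute parallel for \
        reduction(+:sum) map(to: x[0:n], y[0:n]) map(tofrom: sum)
    for (int i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

Without OpenMP enabled, the directive is ignored and the loop runs serially with the same result, apart from floating-point summation order on general inputs.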





Debug Programs with PCAST

NVIDIA Parallelizing Compiler Assisted Software Testing (PCAST) detects where and why results diverge between CPU and GPU-accelerated versions of code, between successive versions of a program you are optimizing incrementally, or between the same program executing on two different processor architectures. OpenACC auto-compare runs compute regions redundantly on both the CPU and GPU, and compares the GPU and CPU results. Difference reports are controlled by environment variables, letting you pinpoint where results diverge. The PCAST API lets you capture selected data and compare it against a separate execution of the program, and NVIDIA HPC compilers include a directive-based interface for the PCAST API, maintaining portability to other compilers and platforms.
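To illustrate the kind of check that such a comparison performs, here is a conceptual sketch in plain C++. It is not the PCAST API: the `Divergence` struct, the `first_divergence` function, and the tolerance parameters are hypothetical names chosen for this example, and the combined absolute-plus-relative tolerance rule is an assumption. PCAST performs comparisons of this sort for you without hand-written code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

// Location and values of the first mismatch (hypothetical type).
struct Divergence {
    std::size_t index;
    double reference;
    double candidate;
};

// Returns the first element where the candidate results (for example, from a
// GPU run) differ from the reference results (for example, from a CPU run) by
// more than abs_tol + rel_tol * |reference|. Only the common prefix of the two
// vectors is compared.
std::optional<Divergence> first_divergence(const std::vector<double>& reference,
                                           const std::vector<double>& candidate,
                                           double rel_tol, double abs_tol) {
    const std::size_t n = std::min(reference.size(), candidate.size());
    for (std::size_t i = 0; i < n; ++i) {
        const double diff = std::fabs(reference[i] - candidate[i]);
        const double limit = abs_tol + rel_tol * std::fabs(reference[i]);
        // Written as !(diff <= limit) so that NaN values count as divergent.
        if (!(diff <= limit)) {
            return Divergence{i, reference[i], candidate[i]};
        }
    }
    return std::nullopt;
}
```

Reporting the first divergent index, rather than only a pass or fail flag, is what makes it possible to trace a mismatch back to the loop or kernel that produced it.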

Technical Blog: Detecting Divergence Using PCAST to Compare GPU to CPU Results




Who’s Using NVIDIA HPC Compilers

More than 200 GPU-accelerated application ports using OpenACC and the NVIDIA HPC compilers have been initiated or are in production, including three of the top five HPC applications as reported in a 2016 Intersect360 site census survey: ANSYS Fluent, Gaussian and VASP.





What Users are Saying


We have with delight discovered the NVIDIA “stdpar” implementation of C++17 Parallel Algorithms. … We believe that the result produces state-of-the-art performance, is highly didactical, and introduces a paradigm shift in cross-platform CPU/GPU programming in the community.

Professor Jonas Latt, University of Geneva



Technical Blogs



Accelerating Standard C++ with GPUs Using stdpar

NVC++, the NVIDIA HPC C++ compiler, has recently added support for accelerating C++17 parallel algorithms on GPUs. Read the blog to get started.



Bringing Tensor Cores to Standard Fortran

Learn how to use NVIDIA Tensor Cores and the cuTENSOR library to seamlessly accelerate many Fortran array intrinsics and language constructs. Get started with this blog.



Accelerating Fortran DO CONCURRENT with GPUs

ISO Standard Fortran 2008 introduced the DO CONCURRENT construct, which allows you to express loop-level parallelism. It is one of several mechanisms for expressing parallelism directly in the Fortran language.







Get started with the HPC SDK.


Download Now