NVIDIA HPC Fortran, C++ and C Compilers with OpenACC
With the NVIDIA HPC compilers for NVIDIA data center GPUs and x86-64, OpenPOWER and Arm server multicore CPUs, programmers can accelerate science and engineering applications using Standard C++ and Fortran parallel constructs, OpenACC directives and CUDA Fortran.
The NVIDIA Fortran, C++ and C compilers enable cross-platform HPC programming for NVIDIA GPUs and multicore CPUs. They are fully interoperable with NVIDIA optimized math libraries, communication libraries, and performance tuning and debugging tools. Commercial support is available with NVIDIA HPC Compiler Support Services (HCSS).
Full C++17 including Parallel Algorithms
The NVC++ compiler supports all features of C++17 including automatic acceleration of the C++17 Parallel Algorithms on NVIDIA GPUs.
OpenACC for Fortran, C++ and C Applications
Accelerate HPC applications with OpenACC directives, the proven solution used by over 200 HPC applications for performance-portable GPU programming.
Easy Access to NVIDIA Tensor Cores
The NVFORTRAN compiler can automatically accelerate standard Fortran array intrinsics and array syntax on NVIDIA Tensor Core GPUs.
For x86-64, Arm and OpenPOWER CPUs
Develop HPC applications on servers containing any mainstream CPU. The NVIDIA HPC Compilers are supported on over 99% of Top 500 systems.
World-Class CPU Performance, GPU Acceleration
NVIDIA HPC compilers deliver the performance you need on CPUs, with OpenACC and CUDA Fortran for HPC application development on GPU-accelerated systems. OpenACC and CUDA programs can run several times faster on a single NVIDIA A100 GPU than on all the cores of a dual-socket server, and they interoperate with MPI and OpenMP to deliver the full power of today’s multi-GPU servers.
C++ Parallel Algorithms: Accelerated
The C++17 Standard introduced higher-level parallelism features that allow users to request parallelization of Standard Library algorithms by adding an execution policy as the first parameter to any algorithm that supports them. Most of the existing Standard C++ algorithms now support execution policies, and C++17 defined several new parallel algorithms, including the useful std::reduce and std::transform_reduce. The NVIDIA NVC++ compiler offers a comprehensive and high-performance implementation of the Parallel Algorithms for NVIDIA V100 and A100 datacenter GPUs, so you can get started with GPU programming using standard C++ that is portable to most C++ implementations for Linux, Windows, and macOS. The NVIDIA C++ Parallel Algorithms implementation is fully interoperable with OpenACC and CUDA for use in the same application.
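As a minimal sketch of the execution-policy mechanism described above, the following C++17 functions pass `std::execution::par_unseq` as the first argument to `std::transform` and `std::transform_reduce`. The function names `scale_in_place` and `dot` are hypothetical and chosen only for illustration. The code is standard C++ and should build with any conforming C++17 toolchain; with NVC++, the `-stdpar` option is what requests GPU acceleration of these calls.

```cpp
#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// Multiply every element of v by factor.
// The execution policy is the first argument; the rest of the call is
// the same as the serial std::transform.
void scale_in_place(std::vector<double>& v, double factor) {
    std::transform(std::execution::par_unseq,
                   v.begin(), v.end(), v.begin(),
                   [factor](double x) { return x * factor; });
}

// Dot product of two equally sized vectors using std::transform_reduce,
// one of the parallel algorithms introduced in C++17.
// The default operations are multiply (transform) and add (reduce).
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    return std::transform_reduce(std::execution::par_unseq,
                                 a.begin(), a.end(), b.begin(), 0.0);
}
```

For example, scaling a vector of 1000 elements equal to 2.0 by 0.5 and taking its dot product with a vector of 1000 elements equal to 3.0 gives 3000. Keeping the data in `std::vector` (heap memory) is a deliberate choice in this sketch, since a GPU implementation needs the data to be reachable from the device.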
Leverage NVIDIA Tensor Cores
NVIDIA A100 and V100 Datacenter GPU Tensor Cores enable fast FP16 matrix multiplication and accumulation into FP16 or FP32 results with performance 8x to 16x faster than pure FP32 or FP64 in the same power envelope. NVIDIA A100 GPUs add Tensor Core support for TF32 and FP64 data types, enabling scientists and engineers to dramatically accelerate suitable math library routines and applications using mixed-precision, single-precision or full double-precision arithmetic. With the NVIDIA HPC Fortran compiler, you can leverage Tensor Cores in your CUDA Fortran and OpenACC applications through automatic mapping of Fortran array intrinsics to cuTENSOR library calls, or by using the CUDA API interface to Tensor Core programming in a pre-defined CUDA Fortran module.
OpenACC for GPUs and CPUs
NVIDIA HPC compilers support full OpenACC 2.6 and many OpenACC 2.7 features on both NVIDIA datacenter GPUs and multicore CPUs. Use OpenACC directives to incrementally parallelize and accelerate applications, starting with your most time-intensive loops and routines and gradually accelerating all appropriate parts of your application while retaining full portability to other compilers and systems. NVIDIA compilers leverage CUDA Unified Memory to simplify OpenACC programming on GPU-accelerated x86-64, Arm and OpenPOWER processor-based servers. When OpenACC allocatable data is placed in CUDA Unified Memory, no explicit data movement or data directives are needed, simplifying GPU acceleration of applications and allowing you to focus on parallelization and scalability of your algorithms.
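The incremental, directive-based approach described above can be sketched as follows. This is an illustrative example, not code from the NVIDIA documentation: the functions `saxpy` and `sum` are hypothetical, and the data clauses are written out explicitly so the sketch does not depend on unified memory. A compiler without OpenACC support ignores the pragmas and runs the loops serially, which is the portability property the paragraph refers to.

```cpp
// y = a * x + y, with the loop offloaded by a single OpenACC directive.
// copyin: x is only read on the device; copy: y is read and written back.
// Under CUDA Unified Memory these data clauses could be omitted.
void saxpy(int n, float a, const float* x, float* y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Sum of the elements of x; the reduction clause tells the compiler
// how to combine the per-thread partial sums.
float sum(int n, const float* x) {
    float total = 0.0f;
    #pragma acc parallel loop reduction(+:total) copyin(x[0:n])
    for (int i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}
```

With the NVIDIA compilers, OpenACC is enabled with the `-acc` option. A managed-memory build (assumed here to be selected with an option such as `-gpu=managed`) is the case in which the explicit data clauses become unnecessary.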
OpenMP for Multicore CPUs
NVIDIA HPC Fortran, C++ and C compilers include support for OpenMP 4.5 syntax and features. You can compile OpenMP 4.5 programs for parallel execution across all the cores of a multicore CPU or server. TARGET regions are implemented with default support for the multicore host as the target, and PARALLEL and DISTRIBUTE loops are parallelized across all OpenMP threads. Multicore CPU performance remains one of the key strengths of the NVIDIA compilers, which now support all three major CPU families used in HPC systems: x86-64, 64-bit Arm and OpenPOWER. NVIDIA compilers deliver state-of-the-art SIMD vectorization and benefit from optimized single and double precision numerical intrinsic functions that use a uniform implementation across all types of CPUs to deliver consistent results across systems for both scalar and SIMD execution.
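To illustrate the TARGET behavior described above, here is a small sketch of an OpenMP 4.5 loop written with `target teams distribute parallel for`. The function name `scaled_sum` is hypothetical. When the target is the multicore host, as the paragraph describes, the loop iterations are shared among the host OpenMP threads; a compiler invoked without OpenMP enabled ignores the pragma and runs the loop serially.

```cpp
// Returns scale * (x[0] + ... + x[n-1]).
// map(to:) makes x available in the target region; the reduction clause
// combines the partial sums computed by the individual threads.
double scaled_sum(const double* x, int n, double scale) {
    double sum = 0.0;
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: sum) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += scale * x[i];
    }
    return sum;
}
```

With the NVIDIA compilers, OpenMP is enabled with the `-mp` option, and the number of threads can be controlled in the usual way through the `OMP_NUM_THREADS` environment variable.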
Debug Programs with PCAST
NVIDIA Parallelizing Compiler Assisted Software Testing (PCAST) detects where and why results diverge between CPU and GPU-accelerated versions of code, between successive versions of a program you are optimizing incrementally, or between the same program executing on two different processor architectures. OpenACC auto-compare runs compute regions redundantly on both the CPU and GPU, and compares the GPU and CPU results. Difference reports are controlled by environment variables, letting you pinpoint where results diverge. The PCAST API lets you capture selected data and compare it against a separate execution of the program. NVIDIA HPC compilers also include a directive-based interface for the PCAST API, which maintains portability to other compilers and platforms.
Developer Blog: Detecting Divergence Using PCAST to Compare GPU to CPU Results
What Users are Saying
We have with delight discovered the NVIDIA “stdpar” implementation of C++17 Parallel Algorithms. … We believe that the result produces state-of-the-art performance, is highly didactical, and introduces a paradigm shift in cross-platform CPU/GPU programming in the community.
Professor Jonas Latt, University of Geneva
Accelerating Standard C++ with GPUs Using stdpar
NVC++, the NVIDIA HPC C++ compiler, has recently added support for accelerating C++17 parallel algorithms on GPUs. Read the blog to get started.
Bringing Tensor Cores to Standard Fortran
Learn how to use NVIDIA Tensor Cores and the cuTENSOR library to seamlessly accelerate many Fortran array intrinsic and language constructs. Get started with this blog.
Accelerating Fortran DO CONCURRENT with GPUs
ISO Standard Fortran 2008 introduced the DO CONCURRENT construct, which lets you express loop-level parallelism and is one of several mechanisms for expressing parallelism directly in the Fortran language.