Building High-Performance Applications in the Era of Accelerated Computing

AI is augmenting high-performance computing (HPC) with novel approaches to data processing, simulation, and modeling. Because of the computational requirements of these new AI workloads, HPC is scaling up at a rapid pace.

To enable applications to scale to multi-GPU and multi-node platforms, HPC tools and libraries must support that growth. NVIDIA provides a comprehensive ecosystem of accelerated HPC software solutions to help your application meet the demands of modern AI-driven workloads.

HPC SDK 24.3

In addition to bug fixes and improvements in the compile-time performance of the HPC compilers, HPC SDK 24.3 has new features supporting better development on the latest NVIDIA Grace Hopper systems.

The NVIDIA HPC compilers now provide a unified memory compilation mode when using OpenMP Target Offload directives for GPU programming. This adds to the existing unified memory support for Grace Hopper and heterogeneous memory management (HMM) systems in the OpenACC, CUDA Fortran, and Standard Parallelism (stdpar) programming models, which is enabled in nvc++ and nvfortran through the -gpu=unified command line flag.
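As a minimal sketch of what this mode permits, the C++ function below offloads a loop with OpenMP target directives and hands the device plain heap pointers, with no map clauses for the arrays. The name axpy_sum is illustrative and does not come from the HPC SDK. The sketch assumes a build along the lines of nvc++ -mp=gpu -gpu=unified on a unified memory system; only -gpu=unified is named in this post, so confirm the remaining flags against the compiler documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = a * x + y element by element and returns the sum of the
// updated y. Illustrative helper, not part of any NVIDIA library.
//
// Assumption: the code is built in a unified memory mode, so the device can
// dereference the same heap pointers as the host. That is why the arrays
// carry no map() clauses. Only the scalar reduction result is mapped.
double axpy_sum(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    const double* xp = x.data();
    double* yp = y.data();
    const std::size_t n = y.size();
    double sum = 0.0;

    #pragma omp target teams distribute parallel for \
        map(tofrom : sum) reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
        sum += yp[i];
    }
    return sum;
}
```

Calling axpy_sum(2.0, x, y) with x = {1, 2, 3} and y = {10, 20, 30} leaves y = {12, 24, 36} and returns 72. When the file is compiled without OpenMP offload support, the directive is ignored or falls back to the host, so the same source still runs on the CPU. Outside a unified memory mode, the arrays would typically need explicit map clauses.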

For CUDA Fortran programs, the unified attribute has been added to provide additional type information that enables applications to be further optimized for unified memory systems, like Grace Hopper.

All these features and other performance enhancements are available now in the HPC SDK 24.3 release. For more information, see the HPC SDK 24.3 release notes.

NVIDIA Performance Libraries for the Grace CPU

AI models are transforming cloud, hyperscale, and scientific workloads. They are being rapidly scaled across these diverse configurations. The NVIDIA Grace CPU addresses the growing complexity and size of AI models by offering high performance, power efficiency, and high-bandwidth connectivity. It tightly couples the CPU and GPU in the NVIDIA data center.

To accelerate the CPU workloads in your application, NVIDIA Performance Libraries (NVPL) provide drop-in replacements for the industry-standard math libraries many applications use today. NVPL is optimized for the Grace CPU and enables you to port applications to the Grace architecture with no source code changes required.

NVPL is available now in HPC SDK 24.3. This release includes the following:

BLAS and LAPACK libraries implementing the Netlib API
An FFT library implementing the FFTW API
Random number generator and sparse matrix BLAS libraries

NVPL is also available for standalone download. The standalone package includes NVPL TENSOR for accelerating deep learning and inference on Grace CPUs with tensor contraction, reduction, and elementwise operations.

Tools for building and optimizing microservices

The demand for scalable solutions in cloud and high-performance computing applications is increasing rapidly. As applications scale across data centers and clouds, NVIDIA Nsight Developer Tools are evolving to help.

Nsight Systems 2024.2 introduces new features to help you build and optimize microservices.

Video 1. Scale AI Applications to the Data Center and Cloud with NVIDIA Nsight Systems

Profiling support has been enhanced for container systems like Kubernetes and Docker, including the managed Kubernetes services from major cloud service providers (CSPs) such as Azure, Amazon, Oracle, and Google.

Python scripts called recipes enable you to do single- and multi-node analysis as applications execute across the data center. Nsight Systems then visualizes key metrics using JupyterLab integration.

Available now, recipes for networking analysis reveal how compute cold-spots relate to communication. You can generate multi-node heat maps that identify where to optimize InfiniBand and NVLink throughput for peak performance.

To meet you where you write code, a remote GUI streaming container enables development on servers. Nsight Systems also integrates with JupyterLab, enabling you to profile code and view textual results directly in Jupyter or launch the GUI streaming container for in-depth analysis.

Download Nsight Systems 2024.2 today. Get started with tools and tutorials.

CUDA GPU-accelerated math libraries

CUDA GPU-accelerated math libraries enable peak performance in HPC applications. Available now, cuDSS (preview) is a GPU-accelerated direct sparse solver library for solving sparse linear systems, which are common in autonomous driving and process simulations. For more information, see Spotlight: Honeywell Accelerates Industrial Process Simulation with NVIDIA cuDSS.

Basic linear algebra subroutines (BLAS) are foundational for AI and HPC applications. cuBLAS provides GPU-accelerated BLAS routines that execute at peak performance. Available in CUDA Toolkit 12.4, cuBLAS adds experimental support for grouped batched GEMM (general matrix multiplication) in single and double precision. Grouped batched mode enables you to concurrently solve GEMMs that differ in any of the following parameters:

Dimensions (m, n, k)
Leading dimensions (lda, ldb, ldc)
Transpositions (transa, transb)
Scaling factors (alpha, beta)
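To make the grouping concrete, here is a host-side reference sketch in C++ of what a grouped batched GEMM computes. It is not the cuBLAS API: the GemmGroup struct, the op_at helper, and gemm_grouped_batched_reference are hypothetical names introduced only for illustration, and the sketch assumes column-major storage as in standard BLAS. Every problem inside a group shares the parameters listed above, while different groups are free to differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one group of GEMM problems (not a cuBLAS type).
struct GemmGroup {
    bool transa, transb;  // true when op(A) or op(B) is the transpose
    int m, n, k;          // op(A) is m x k, op(B) is k x n, C is m x n
    int lda, ldb, ldc;    // column-major leading dimensions
    double alpha, beta;   // C = alpha * op(A) * op(B) + beta * C
    // One pointer per problem in the group; all share the fields above.
    std::vector<const double*> a, b;
    std::vector<double*> c;
};

// Returns element (row, col) of op(X) for a column-major matrix X.
inline double op_at(const double* x, int ld, bool trans, int row, int col) {
    return trans ? x[col + static_cast<std::size_t>(row) * ld]
                 : x[row + static_cast<std::size_t>(col) * ld];
}

// Sequential reference: a GPU library would run these problems concurrently.
void gemm_grouped_batched_reference(const std::vector<GemmGroup>& groups) {
    for (const GemmGroup& g : groups) {
        assert(g.a.size() == g.b.size() && g.b.size() == g.c.size());
        for (std::size_t p = 0; p < g.c.size(); ++p) {
            for (int j = 0; j < g.n; ++j) {
                for (int i = 0; i < g.m; ++i) {
                    double acc = 0.0;
                    for (int l = 0; l < g.k; ++l) {
                        acc += op_at(g.a[p], g.lda, g.transa, i, l) *
                               op_at(g.b[p], g.ldb, g.transb, l, j);
                    }
                    double& cij = g.c[p][i + static_cast<std::size_t>(j) * g.ldc];
                    cij = g.alpha * acc + g.beta * cij;
                }
            }
        }
    }
}
```

In this sketch, one call could combine a group of 2 x 2 products using alpha = 1 and beta = 0 with a second group of transposed 1 x 3 by 3 x 1 products using alpha = 2 and beta = 1. A plain batched GEMM, by contrast, would require every problem in the call to share a single set of parameters.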

Fusing numerical operations within a CUDA kernel reduces memory access overhead and kernel launch overhead, improving performance for GPU-accelerated applications. Two device extension libraries support this approach, and both are available now for standalone download:

cuBLASDx enables you to exploit fusing numerical operations for BLAS.
cuFFTDx provides this same functionality for Fast Fourier Transforms (FFT), frequently used in deep learning and computer vision applications.
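The benefit of fusion can be illustrated with a host-side analogy in C++, where each function stands in for one kernel launch. This is only a sketch of the idea and does not use the cuBLASDx or cuFFTDx APIs; the function names are hypothetical and the matrices are assumed to be column-major. The unfused path writes the full GEMM result to memory and then reads it back to apply a ReLU, while the fused path applies the ReLU to each accumulator before its single store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused step 1: C (m x n) = A (m x k) * B (k x n), written out in full.
std::vector<double> gemm(const std::vector<double>& a, const std::vector<double>& b,
                         int m, int n, int k) {
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);
    std::vector<double> c(static_cast<std::size_t>(m) * n);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int l = 0; l < k; ++l) {
                acc += a[i + static_cast<std::size_t>(l) * m] *
                       b[l + static_cast<std::size_t>(j) * k];
            }
            c[i + static_cast<std::size_t>(j) * m] = acc;
        }
    }
    return c;
}

// Unfused step 2: a second pass that re-reads and rewrites every element.
void relu_in_place(std::vector<double>& c) {
    for (double& v : c) v = std::max(v, 0.0);
}

// Fused: the ReLU is applied while the result is still in a local
// accumulator, so C is written once and never re-read.
std::vector<double> gemm_relu_fused(const std::vector<double>& a,
                                    const std::vector<double>& b,
                                    int m, int n, int k) {
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);
    std::vector<double> c(static_cast<std::size_t>(m) * n);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double acc = 0.0;
            for (int l = 0; l < k; ++l) {
                acc += a[i + static_cast<std::size_t>(l) * m] *
                       b[l + static_cast<std::size_t>(j) * k];
            }
            c[i + static_cast<std::size_t>(j) * m] = std::max(acc, 0.0);
        }
    }
    return c;
}
```

Both paths produce identical results. The fused version simply avoids one full round trip through memory and one extra launch, which is the overhead the paragraph above describes.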

Furthermore, cuTENSOR 2.0 is available now, an overhaul of the cuTENSOR library that improves speed and flexibility. cuTENSOR provides optimized routines for tensor computations (elementwise, reduction, and contraction) that accelerate training and inference for neural networks.

Version 2.0 upgrades the library in both performance and functionality, including support for just-in-time kernel compilation. For more information, see cuTENSOR 2.0: A Comprehensive Guide for Accelerating Tensor Computations.

Multi-GPU multi-node math libraries

Distributed computing provides infrastructure for the computational demands of AI. Large-scale data processing tasks are distributed and parallelized across multiple nodes and GPUs to speed up training and inference time. As HPC applications scale, foundational math libraries must also support the new multi-GPU multi-node landscape of computing.

CUDA math libraries provide key mathematical algorithms for these compute-intensive applications. Available now, host API extensions enable math libraries to solve exascale problems.

cuBLASMp (preview) is a high-performance, multi-process library for distributed basic dense linear algebra. It’s available in the HPC SDK and for standalone download. The library harnesses Tensor Core acceleration while efficiently communicating between GPUs and synchronizing their processes.

NVIDIA also provides cuSOLVERMp for solving distributed dense linear systems and eigenvalue problems, as well as cuFFTMp for computing FFTs on multi-GPU multi-node platforms.

Get started with CUDA math libraries today.

Conclusion

To enable applications to scale across multi-GPU multi-node platforms, NVIDIA provides an ecosystem of tools, libraries, and compilers for accelerated computing at scale. Accelerated computing is the engine for AI-powered HPC applications. Dive deeper into accelerated computing topics in the Accelerated Computing developer forum.