
Unlock the Power of NVIDIA Grace and NVIDIA Hopper Architectures with Foundational HPC Software


High-performance computing (HPC) powers applications in simulation and modeling, healthcare and life sciences, industry and engineering, and more. In the modern data center, HPC synergizes with AI, harnessing data in transformative new ways.

The performance and throughput demands of next-generation HPC applications call for an accelerated computing platform that can handle diverse workloads and tightly couples the CPU and GPU. The NVIDIA Grace CPU and NVIDIA Hopper GPU form an industry-leading hardware platform for HPC development.

NVIDIA provides tools, libraries, and compilers to help developers take advantage of the NVIDIA Grace and NVIDIA Grace Hopper architectures. These tools support innovation and help applications make full use of accelerated computing. This foundational software stack provides the means for GPU acceleration and for porting and optimizing your applications on NVIDIA Grace-based systems. For more information about NVIDIA Grace compilers, tools, and libraries, see the NVIDIA Grace product page.

NVIDIA HPC SDK 23.11

The new hardware developments in the NVIDIA Grace Hopper systems enable dramatic changes to the way developers approach GPU programming. Most notably, the bidirectional, high-bandwidth, and cache-coherent connection between CPU and GPU memory means that you can develop your application for both processors while using a single, unified address space. 

Each processor retains its own physical memory, designed with the bandwidth, latency, and capacity characteristics matched to the workloads best suited to that processor. Code written for existing discrete-memory GPU systems continues to run efficiently, without modification, on the new NVIDIA Grace Hopper architecture.

All application threads (GPU or CPU) can directly access the application’s system-allocated memory, removing the need to copy data between processors. This new ability to read or write directly to the full application memory address space significantly improves programmer productivity for all programming models built on top of NVIDIA CUDA:

  • CUDA C++
  • CUDA Fortran
  • Standard parallelism in ISO C++, ISO Fortran, OpenACC, OpenMP
  • …and many others

NVIDIA HPC SDK 23.11 introduces new unified memory programming support, enabling workloads bottlenecked by host-to-device or device-to-host transfers to achieve up to a 7x speedup due to the chip-to-chip (C2C) interconnect in NVIDIA Grace Hopper systems. Application development can also be dramatically simplified because considerations for data location and movement are handled automatically by the system.

For more information about how HPC compilers use these new hardware capabilities to simplify GPU programming with ISO C++, ISO Fortran, OpenACC, and CUDA Fortran, see Simplifying GPU Programming for HPC with the NVIDIA Grace Hopper Superchip.

Get started with the NVIDIA HPC SDK for free and download version 23.11 now. 

NVIDIA Performance Libraries 

NVIDIA has grown to become a full-stack, enterprise platform provider, now offering CPUs as well as GPUs and DPUs. NVIDIA math software offerings now support CPU-only workloads in addition to existing GPU-centric solutions. 

NVIDIA Performance Libraries (NVPL) are a collection of essential math libraries optimized for Arm 64-bit architectures. Many HPC applications rely on mathematical APIs like BLAS and LAPACK, which are crucial to their performance. NVPL math libraries are drop-in replacements for these standardized math APIs. 

NVPL is optimized for the NVIDIA Grace CPU, so applications ported to or built on NVIDIA Grace-based platforms can take full advantage of its high-performance, high-efficiency architecture. A primary goal of NVPL is to give developers and system administrators the smoothest possible experience porting and deploying existing HPC applications to the NVIDIA Grace platform, with no source code changes required to achieve maximum performance from CPU-based, standardized math libraries.
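The standardized APIs in question can be illustrated with a plain BLAS call. The sketch below uses SciPy's BLAS bindings on the CPU purely for illustration; NVPL exposes the same standard DGEMM interface natively on the NVIDIA Grace CPU:

```python
import numpy as np
from scipy.linalg.blas import dgemm

# A standard BLAS DGEMM call: C = alpha * A @ B.
# NVPL provides this same interface as a drop-in replacement on
# the Grace CPU; SciPy's bindings are used here only to illustrate.
rng = np.random.default_rng(0)
a = rng.standard_normal((256, 128))
b = rng.standard_normal((128, 64))

c = dgemm(alpha=1.0, a=a, b=b)

# The result matches a plain matrix product.
assert np.allclose(c, a @ b)
```

Because the API is standardized, relinking an application against NVPL instead of another BLAS implementation requires no source changes.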

The beta release of NVPL, available now, includes BLAS, LAPACK, FFT, RAND, and SPARSE to accelerate your applications on the NVIDIA Grace CPU. 

Learn more and download the NVPL beta.

NVIDIA CUDA Direct Sparse Solvers

A new standard math library is being introduced to the suite of NVIDIA GPU-accelerated libraries. The NVIDIA CUDA Direct Sparse Solvers library, NVIDIA cuDSS, is optimized for solving linear systems with very sparse matrices. The first version of cuDSS supports execution on a single GPU; multi-GPU and multi-node support will be added in an upcoming release.
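To make the workload concrete, the sketch below solves the same class of problem on the CPU with SciPy's sparse direct solver. It illustrates the operation cuDSS accelerates on the GPU, not the cuDSS API itself:

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

# Build a very sparse system A x = b: a 1D Poisson-style
# tridiagonal matrix, the kind of problem a direct sparse
# solver such as cuDSS factorizes and solves.
n = 1000
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsc()
b = np.ones(n)

x = spsolve(A, b)  # sparse LU factorization + triangular solves

# Verify the residual is tiny.
assert np.linalg.norm(A @ x - b) < 1e-6
```

A direct solver factorizes the matrix once, so subsequent solves with new right-hand sides reuse the factorization, which is where GPU acceleration pays off for large systems.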

Honeywell is one of the early adopters of cuDSS and is in the final phase of performance benchmarking in its UniSim Design process simulation product.

The cuDSS preview is available to download now. For more information about supported features, see the NVIDIA cuDSS documentation.

NVIDIA cuTENSOR 2.0

NVIDIA cuTENSOR 2.0 is a performant and flexible library for accelerating your applications at the intersection of HPC and AI.

In this major release, cuTENSOR 2.0 adds new features and performance improvements, including support for arbitrarily high-dimensional tensors. To extend the new optimizations uniformly across all tensor operations while delivering high performance, the cuTENSOR 2.0 APIs have been completely revised with a focus on flexibility and extensibility.
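The core operation cuTENSOR accelerates is tensor contraction. As a CPU-side illustration only (using NumPy's einsum notation rather than the cuTENSOR API), a contraction over two shared modes looks like:

```python
import numpy as np

# A tensor contraction C[m,n] = sum over k,l of A[m,k,l] * B[l,k,n],
# the kind of operation cuTENSOR executes on the GPU. The einsum
# subscripts play the role of the mode labels in tensor descriptors.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 16, 32))   # modes m, k, l
B = rng.standard_normal((32, 16, 4))   # modes l, k, n

C = np.einsum("mkl,lkn->mn", A, B)     # contract over k and l

# Cross-check against an explicit reshape-and-matmul.
ref = A.reshape(8, 16 * 32) @ B.transpose(1, 0, 2).reshape(16 * 32, 4)
assert np.allclose(C, ref)
```

High-dimensional contractions like this, with many possible mode orderings, are exactly where runtime kernel selection and JIT compilation matter.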

Figure 1. cuTENSOR APIs are now shared across different tensor operations

The plan-based multi-stage API now extends to all operations through a set of shared entry points. These APIs accept opaque, heap-allocated data structures as input, so any operation-specific problem descriptor defined for that execution can be passed through the same interface.

cuTENSOR 2.0 also adds support for just-in-time (JIT) kernels. 

Figure 2. Average incremental performance improvements from using JIT for various input tensor types compared for two benchmarks: QC-like and Rand1000. Performance improvements from JIT are significant for QC-like test cases with high dimensional tensors

JIT kernels realize this performance by tuning the right configuration and optimization knobs for the target problem at runtime, covering the myriad of high-dimensional tensor shapes that generic pre-compiled kernels shipped with the library cannot.

Figure 3. cuTENSOR 2.0.0 performance gains over the previous 1.7.0 version when tuned with JIT and other capabilities

Learn about migration and download cuTENSOR 2.0 now.

NVIDIA Grace CPU performance tuning with NVIDIA Nsight Systems 2023.4 

Applications on NVIDIA Grace-based platforms benefit from tuning instruction execution on the CPU cores, and from optimizing the CPU’s interaction with other hardware units in the system. When porting applications to NVIDIA Grace CPUs, insight into functions at the hardware level helps you configure your software for the new platform. 

NVIDIA Nsight Systems is a system-wide performance analysis tool that collects hardware and API metrics and correlates them on a unified timeline. For NVIDIA Grace CPU performance tuning, Nsight Systems samples instruction pointers and backtraces to visualize where CPU code is busiest, and how the CPU is using resources across the system. Nsight Systems also captures context switching to build a utilization graph for all the NVIDIA Grace CPU cores.

NVIDIA Grace CPU core event rates, like CPU cycles and instructions retired, show how the NVIDIA Grace cores are handling work. The summary view for backtrace samples also helps you quickly identify which instruction pointers are causing hotspots. 

Now available in Nsight Systems 2023.4, NVIDIA Grace CPU uncore event rates monitor activity outside of the cores—like NVLink-C2C and PCIe activity. Uncore metrics show how activity between sockets supports the work of the cores, helping you find ways to improve the NVIDIA Grace CPU’s integration with the rest of the system.

NVIDIA Grace CPU uncore and core event sampling in Nsight Systems 2023.4 help you find the best optimizations for code running on NVIDIA Grace. For more information about performance tuning, as well as tips on optimizing your CUDA code alongside it, see the following video.

Video 1. NVIDIA Grace CPU Performance Tuning with NVIDIA Nsight Tools

Learn more and get started with Nsight Systems 2023.4. Nsight Systems is also available in the HPC SDK and CUDA Toolkit.

Accelerated computing for HPC

NVIDIA provides an ecosystem of tools, libraries, and compilers for accelerated computing on the NVIDIA Grace and Hopper architectures. The HPC software stack is foundational for research and science on NVIDIA data center silicon. 

Dive deeper into accelerated computing topics in the Developer Forums.
