Fast Fourier Transforms (FFTs) are widely used across scientific computing, from molecular dynamics and signal processing to computational fluid dynamics (CFD), wireless multimedia, and machine-learning applications. As computational problem sizes scale to increasingly large domains, researchers require the capability to distribute FFT computations across hundreds or thousands of GPUs spanning multiple nodes.
NVIDIA cuFFTMp is a multi-GPU, multi-node (MGMN) extension to cuFFT that enables scientists and engineers to solve challenging problems on exascale platforms. Building on the success of NVIDIA cuFFTMp’s scaling on Volta and Ampere supercomputers, as detailed in a previous blog, we continue to demonstrate the library’s robust scaling capabilities across the latest NVIDIA GPU architectures.
This blog presents new performance results for MGMN FFTs on cutting-edge cluster configurations, showcasing cuFFTMp’s ability to leverage the full potential of modern MGMN supercomputers.
How cuFFTMp performs and scales on NVIDIA Hopper platforms
We benchmarked cuFFTMp on the NVIDIA Eos supercomputer and compared results against the Selene supercomputer from our previous blog. Eos is built with NVIDIA DGX H100 systems, where each node contains two Intel Xeon Platinum 8480C 56-core CPUs and eight H100s connected by an NVLink 4 Switch (900 GB/s, bidirectional). All nodes are interconnected by NVIDIA Quantum-2 InfiniBand (400 Gb/s/node, bidirectional).
Compared to the networking hardware in Selene, Eos has 1.5x higher NVLink bandwidth and 2x higher InfiniBand bandwidth. This comparison is particularly valuable because both systems represent state-of-the-art configurations from consecutive GPU generations, allowing us to quantify the real-world performance improvements researchers can expect when upgrading their infrastructure from Ampere to Hopper platforms.

Figure 1 above demonstrates cuFFTMp’s robust weak scaling across different problem sizes on the NVIDIA Eos supercomputer, and Figure 2 below shows the relative speedup of Eos over Selene. At every GPU count, cuFFTMp consistently achieves approximately a 2x speedup (up to 2.09x) and reaches 1.69 PFlop/s at 2,048 GPUs.
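As a point of reference for how throughput figures like these are typically derived, the short sketch below converts a measured execution time into a flop rate using the common 5 N log₂ N operation-count convention for a complex FFT of N total elements. The problem size and timing in the example are illustrative placeholders, not measurements from Figure 1.

import math

def fft_flops(nx, ny, nz):
    # Conventional operation count for a complex FFT: 5 * N * log2(N), with N = nx * ny * nz.
    n = nx * ny * nz
    return 5.0 * n * math.log2(n)

# Hypothetical example: a 4,096^3 complex transform completing in 25 ms
# corresponds to roughly 0.5 PFlop/s under this convention.
elapsed_s = 25e-3
pflops = fft_flops(4096, 4096, 4096) / elapsed_s / 1e15
print(f"{pflops:.2f} PFlop/s")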

Figure 3 below shows strong scaling of cuFFTMp for two problem sizes, 1,024³ and 2,048³. We observe a performance transition from a single node (eight GPUs) to two nodes (16 GPUs) due to the lower peak bandwidth of InfiniBand compared to NVLink, followed by near-linear scaling up to 256 nodes (2,048 GPUs). Performance begins to plateau beyond 64 nodes for the 1,024³ case and beyond 128 nodes for the 2,048³ case.
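For readers who want to run this kind of analysis on their own measurements, the sketch below computes strong-scaling speedup and parallel efficiency relative to a single-node baseline. The timings are placeholders, not values read off Figure 3.

# Hypothetical timings (seconds) for one fixed problem size at increasing GPU counts.
timings = {8: 1.00, 16: 0.70, 32: 0.36, 64: 0.18}

base_gpus = min(timings)
base_time = timings[base_gpus]

for gpus in sorted(timings):
    speedup = base_time / timings[gpus]
    # An efficiency of 1.0 means perfect linear scaling relative to the baseline.
    efficiency = speedup / (gpus / base_gpus)
    print(f"{gpus:4d} GPUs: speedup {speedup:5.2f}x, efficiency {efficiency:4.2f}")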

As shown in Figure 4 below, compared with DGX A100 results for the 2,048³ problem size, we obtain approximately a 2x speedup when the InfiniBand bandwidth is doubled. In some cases the speedup exceeds 2x (e.g., 2.58x at 128 nodes), thanks to cuFFTMp’s state-of-the-art fused kernel implementations, which address performance gaps identified in previous iterations.
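A rough back-of-the-envelope model helps explain why doubling the inter-node bandwidth can translate into a near-2x end-to-end speedup: at large scale, a distributed FFT is dominated by its all-to-all exchanges, whose time is approximately the exchanged data volume divided by the available inter-node bandwidth. The sketch below estimates that lower bound under simplifying assumptions (complex-double data, a fixed number of global transposes, traffic limited by each node’s injection bandwidth); the constants are illustrative and do not describe cuFFTMp’s internal algorithm.

def alltoall_time_s(n, nodes, node_bw_gbytes, transposes=2):
    # Rough lower bound on inter-node all-to-all time for an n^3 complex-double FFT,
    # assuming the data is spread evenly across nodes (a simplifying model only).
    total_bytes = n**3 * 16                   # complex128 elements, 16 bytes each
    off_node_fraction = (nodes - 1) / nodes   # share of each node's data that must leave the node
    bytes_per_node = total_bytes / nodes * off_node_fraction
    return transposes * bytes_per_node / (node_bw_gbytes * 1e9)

# 2,048^3 on 128 nodes: doubling the usable per-node InfiniBand bandwidth
# (roughly 25 GB/s vs. 50 GB/s unidirectional) halves this communication estimate,
# which is consistent with a ~2x speedup once communication dominates.
for bw_gbytes in (25.0, 50.0):
    print(f"{bw_gbytes:4.0f} GB/s/node -> {alltoall_time_s(2048, 128, bw_gbytes):.3f} s")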

How cuFFTMp performs and scales on NVIDIA Blackwell platforms
The NVIDIA Blackwell architecture delivers significant improvements in interconnect bandwidth. Importantly, it introduces Multi-Node NVLink (MNNVL) technology, which extends high-bandwidth NVLink connectivity across multiple nodes within a rack. cuFFTMp leverages the underlying hardware topology and network characteristics to achieve optimal performance.
Figure 5 below shows the results for 1,024³ FFTs on the latest NVIDIA Blackwell systems. The NVIDIA DGX B200 system consists of eight fully connected B200 GPUs with fifth-generation NVLink and NVLink Switch, providing 2x the bandwidth of the previous generation (1.8 TB/s/GPU, bidirectional). The NVIDIA GB200 NVL72 is the new MNNVL rack-scale system with two 72-core Grace CPUs and four Blackwell GPUs per node, and 18 nodes per rack; GPUs in the same rack are interconnected by NVLink 5 Switches at full NVLink bandwidth. For both systems, different racks are connected by NVIDIA Quantum-2 InfiniBand (400 Gb/s/node, bidirectional).

We first observe approximately a 3.1x speedup from Ampere to Blackwell systems on up to eight GPUs (1.7x from Hopper to Blackwell). For DGX B200 systems, the same performance transition at 16 GPUs observed on DGX H100 systems occurs because we hit the peak bandwidth of InfiniBand. However, GB200 NVL72 demonstrates consistent linear scaling up to 64 GPUs, since all 64 GPUs remain within the same NVLink domain at full NVLink bandwidth.
Notably, the execution time achieved by NVIDIA GB200 NVL72 at 64 GPUs cannot be attained even by 1,024 H100 GPUs due to accumulated communication latency overhead, demonstrating the fundamental architectural advantage of MNNVL for latency-sensitive workloads. At 128 GPUs, performance becomes a function of the InfiniBand connections between different NVIDIA GB200 NVL72 racks. Across all benchmarks, performance in the InfiniBand-connected scale-out regime converges, albeit at different GPU counts, because all InfiniBand connections share the same bandwidth characteristics.
How GROMACS uses cuFFTMp for biomolecular simulations
GROMACS, a scientific software package widely used for simulating biomolecular systems, plays a crucial role in understanding biological processes important for disease prevention and treatment. In a previous blog, we described how integrating cuFFTMp into GROMACS enabled huge scalability improvements. We now showcase a further breakthrough by running this software on the new Grace Blackwell MNNVL architecture.
We benchmarked the 5.9 million atom “PRS2” test case, an important system for novel research into synaptic vesicle fusion: these simulations reveal how key proteins involved in neurotransmitter release allow neurotransmitter-carrying synaptic vesicles to fuse with the plasma membrane, a critical step in nerve cell communication. We also include the larger 12.4 million atom “benchPEP-h” case to allow direct comparison with the previous blog.

In the right plot of Figure 6, we see that the combination of NVIDIA Grace Blackwell and MNNVL allows the larger benchPEP-h case to scale to almost 200 ns/day (nanoseconds of simulation performed per day of computation), nearly 2x faster than the x86+Hopper/InfiniBand result and over 3x faster than the equivalent x86+Ampere/InfiniBand result reported in the previous blog.
In the left plot of Figure 6, we see an even larger scaling and performance advantage for the smaller PRS2 system, which is more sensitive to communication cost than benchPEP-h. This demonstrates that the new Grace Blackwell MNNVL architecture is not only vital for scaling modern AI but is also well suited to molecular dynamics: its high-bandwidth, low-latency multi-node communication allows researchers to obtain simulation results faster than ever before.
Get started with cuFFTMp
For researchers and developers interested in leveraging these performance improvements, cuFFTMp can now be downloaded from the NVIDIA Developer Zone as a standalone package or as part of the NVIDIA HPC SDK. Code samples and API usage can be found in the cuFFTMp documentation.
cuFFTMp depends on NVIDIA NVSHMEM for GPU-initiated communication, and its compatibility with NVSHMEM has improved significantly over the past few releases. Starting with cuFFTMp 11.2.6, NVSHMEM ABI backward compatibility between the host and device libraries is supported within a major NVSHMEM version, allowing NVSHMEM to be updated independently within a major version without updating cuFFTMp.
Python users have multiple ways to access cuFFTMp features. Those who need a truly Pythonic experience can scale distributed FFTs to thousands of GPUs with the nvmath-python library, which delivers the same performance as the native CUDA implementation. The library also provides Python bindings for low-level integration. Below is a short example of computing distributed FFTs using nvmath-python with the cuFFTMp backend.
import numpy as np
import cuda.core.experimental
import nvmath.distributed
from mpi4py import MPI

# One MPI process drives one GPU.
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nranks = comm.Get_size()
device_id = rank % cuda.core.experimental.system.num_devices

# Initialize the distributed runtime with the NVSHMEM backend used by cuFFTMp.
nvmath.distributed.initialize(device_id, comm, backends=["nvshmem"])

# Each rank holds its local slab of a global 64 x 256 x 128 complex array,
# partitioned along the second (Y) dimension.
shape = 64, 256 // nranks, 128
a = np.random.rand(*shape) + 1j * np.random.rand(*shape)

# Compute the distributed FFT across all ranks.
b = nvmath.distributed.fft.fft(a, distribution=nvmath.distributed.distribution.Slab.Y)
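This example assumes one MPI process per GPU; on a single DGX node with eight GPUs, it would typically be launched with a command along the lines of mpirun -np 8 python example.py, where the script name is just a placeholder.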
Those who prefer pure cuFFTMp binaries can download them from PyPI or conda-forge.
The best developments come out of your feedback, so please contact us with your suggestions at: mathlibs-feedback@nvidia.com.
Leopold Cambier, Miguel Ferrer Avila, and Lukasz Ligowski contributed to the work mentioned in this blog.