Fast Fourier Transform for NVIDIA GPUs

cuFFT, a library that provides GPU-accelerated Fast Fourier Transform (FFT) implementations, is used for building applications across disciplines, such as deep learning, computer vision, computational physics, molecular dynamics, quantum chemistry, and seismic and medical imaging. With cuFFT, applications automatically benefit from regular performance improvements and new GPU architectures. The cuFFT library is included in both the NVIDIA HPC SDK and the CUDA® Toolkit.

Available now: cuFFT LTO EA Preview

The early access preview of cuFFT adds support for enhanced LTO-enabled callback routines for Linux and Windows, boosting performance in callback use cases.



The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued datasets. It’s one of the most important and widely used numerical algorithms in computational physics and general signal processing. The cuFFT library provides a simple interface for computing FFTs on an NVIDIA GPU, which allows users to quickly leverage the GPU’s floating-point power and parallelism in a highly optimized and tested FFT library.
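To make the divide-and-conquer idea concrete, here is a minimal host-side sketch of a recursive radix-2 (Cooley-Tukey) FFT in plain C++. It is an illustration only and is not how cuFFT is implemented; the function name `fft_radix2` is hypothetical, and the sketch assumes the input length is a power of two.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Recursive radix-2 Cooley-Tukey FFT (forward, unnormalized).
// Assumption: x.size() is a power of two.
std::vector<Complex> fft_radix2(const std::vector<Complex>& x) {
    const std::size_t n = x.size();
    if (n <= 1) return x;  // a length-1 transform is the sample itself

    // Divide: split into even- and odd-indexed samples.
    std::vector<Complex> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i] = x[2 * i + 1];
    }

    // Conquer: transform each half recursively.
    const std::vector<Complex> e = fft_radix2(even);
    const std::vector<Complex> o = fft_radix2(odd);

    // Combine: apply twiddle factors exp(-2*pi*i*k/n) to the odd half.
    const double pi = std::acos(-1.0);
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        const double angle =
            -2.0 * pi * static_cast<double>(k) / static_cast<double>(n);
        const Complex t = std::polar(1.0, angle) * o[k];
        out[k] = e[k] + t;
        out[k + n / 2] = e[k] - t;
    }
    return out;
}
```

Each level of recursion does O(n) work across log2(n) levels, which is where the O(n log n) cost comes from, compared with O(n^2) for evaluating the DFT sum directly.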

cuFFT supports a wide range of FFT inputs and options efficiently on NVIDIA GPUs.


  • 1D, 2D, 3D transforms of complex and real data types
  • Familiar API similar to the advanced interface of the Fastest Fourier Transform in the West (FFTW)
  • Flexible data layouts allowing arbitrary strides between individual elements and array dimensions
  • Streamed asynchronous execution
  • Half-, single-, and double-precision transforms
  • Batch execution
  • In-place and out-of-place transforms
  • Thread-safe and callable from multiple host threads
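As a rough illustration of the strided, batched layouts in the list above, the sketch below computes where one element of one batch lives in a flat buffer. The formula follows the indexing that the cuFFT documentation describes for the advanced data layout of `cufftPlanMany` (its `istride`, `idist`, and `inembed` parameters). The helper names `offset_1d` and `offset_2d` are hypothetical and are not part of cuFFT.

```cpp
#include <cstddef>

// Hypothetical helpers (not cuFFT API): flat-buffer offset of one element
// under a strided, batched layout.
//   stride : distance between consecutive elements of one signal
//   dist   : distance between the first elements of consecutive batches

// 1D: element x of batch b.
std::size_t offset_1d(std::size_t b, std::size_t x,
                      std::size_t stride, std::size_t dist) {
    return b * dist + x * stride;
}

// 2D: element (x, y) of batch b, where nembed1 is the size of the
// innermost dimension of the enclosing storage (may exceed the FFT size).
std::size_t offset_2d(std::size_t b, std::size_t x, std::size_t y,
                      std::size_t nembed1,
                      std::size_t stride, std::size_t dist) {
    return b * dist + (x * nembed1 + y) * stride;
}
```

With `stride = 1` and `dist` equal to the signal length, batches sit back to back. With `stride` equal to the batch count and `dist = 1`, the signals are interleaved, so the same buffer can be transformed along either axis without an explicit transpose.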

The cuFFT library is highly optimized for performance on NVIDIA GPUs. The chart below displays the performance boost achieved by moving to newer hardware—with zero code changes.

[Chart: 1D Single Precision FFT]

cuFFT LTO EA Preview

This early access preview of the cuFFT library adds support for new and enhanced LTO-enabled callback routines on Linux and Windows. LTO-enabled callbacks bring callback support to cuFFT on Windows for the first time. On Linux, these callbacks offer a significant performance boost in many callback use cases.

This preview builds on nvJitLink, a library NVIDIA introduced in CUDA Toolkit 12.0, and uses just-in-time link-time optimization (JIT LTO) to fuse user callback code with library kernel code at runtime.


  • Extension to the callback API to support LTO callback routines.
  • No offline device-linking required to use callbacks.
  • Adds callback support to the dynamic cuFFT library.
  • Adds callback support on Windows.
  • Compatible with existing callback device code.
  • Increased performance vs. the non-LTO callback routines for many cases.

The chart below compares the performance of complex-to-complex (C2C) FFTs with minimal load and store callbacks in the cuFFT LTO EA preview against cuFFT in CUDA Toolkit 11.7, both on an A100 (80 GB) GPU.


Multi-GPU Support

When calculations are distributed across GPUs, cuFFT supports using up to 16 GPUs connected to a CPU to perform Fourier Transforms through its cuFFTXt API. Performance is a function of the bandwidth between the GPUs, the computational ability of the individual GPUs, and the type and number of FFTs to be performed.


  • 1D, 2D, 3D transforms of complex and real data types
  • Support for up to 16-GPU systems
  • Support for multi-GPU complex-to-complex (C2C), real-to-complex (R2C), and complex-to-real (C2R) FFTs
  • Streamed asynchronous execution
  • Half-, single-, and double-precision transforms
  • Batch execution
  • In-place and out-of-place transforms

The chart below compares the performance of 16 NVIDIA Volta™ GV100 Tensor Core GPUs to the performance of eight NVIDIA Ampere Architecture GA100 Tensor Core GPUs for 3D C2C FP32 FFTs.

[Chart x-axis: FFT sizes (cubed)]

Multi-Node Support

The multi-node FFT functionality, available through the cuFFTMp API, enables scientists and engineers to solve distributed 2D and 3D FFTs in exascale problems. The library handles all the communications between machines, allowing users to focus on other aspects of their problems.


  • 2D and 3D distributed-memory FFTs
  • Slab (1D) and pencil (2D) data decompositions, with arbitrary block sizes
  • MPI-compatible
  • Low-latency implementation using NVSHMEM, optimized for single-node and multi-node FFTs
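To show what the slab and pencil decompositions in the list above mean in terms of index ranges, here is a small sketch in plain C++. It assumes a near-even split in which any remainder goes to the lowest-numbered ranks. That split is an assumption made for illustration and is not necessarily the default that cuFFTMp uses. The names `Slab`, `Pencil`, `slab_for_rank`, and `pencil_for_rank` are hypothetical.

```cpp
#include <cstddef>

// Hypothetical types and helpers; not part of the cuFFTMp API.
struct Slab {
    std::size_t start;  // first index owned along the split dimension
    std::size_t count;  // number of indices owned
};

// Slab (1D) decomposition: split n indices across num_ranks processes,
// giving one extra index to each of the first (n % num_ranks) ranks.
Slab slab_for_rank(std::size_t n, std::size_t num_ranks, std::size_t rank) {
    const std::size_t base = n / num_ranks;
    const std::size_t extra = n % num_ranks;
    const std::size_t count = base + (rank < extra ? 1 : 0);
    const std::size_t start = rank * base + (rank < extra ? rank : extra);
    return {start, count};
}

struct Pencil {
    Slab y;  // owned range along the first split dimension
    Slab z;  // owned range along the second split dimension
};

// Pencil (2D) decomposition: arrange ranks on a py x pz grid and split
// two dimensions independently; the third dimension stays whole.
Pencil pencil_for_rank(std::size_t ny, std::size_t nz,
                       std::size_t py, std::size_t pz, std::size_t rank) {
    const std::size_t ry = rank / pz;  // row of this rank in the grid
    const std::size_t rz = rank % pz;  // column of this rank in the grid
    return {slab_for_rank(ny, py, ry), slab_for_rank(nz, pz, rz)};
}
```

Slabs limit the number of processes to the size of one dimension, while pencils allow more processes at the cost of an additional data exchange between transform stages.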

The chart below compares multi-node weak-scaling performance for distributed 3D FFTs by precision as the problem size and number of GPUs increase. The benchmark was run on the NVIDIA Selene supercomputer. Note that for FP64 at size 16384³, the data did not fit on the system.


Device Extensions

cuFFT Device Extensions (cuFFTDx) enable you to perform FFT calculations inside your CUDA kernel. Fusing numerical operations can decrease the latency and improve the performance of your application. These extensions can be downloaded in the MathDx package.


  • FFT embeddable into a CUDA kernel
  • High performance, with no unnecessary data movement to and from global memory
  • Customizable with options to adjust selection of FFT routine for different needs (size, precision, batches, etc.)
  • Ability to fuse FFT kernels with other operations, saving global memory trips
  • Compatible with future versions of the CUDA Toolkit
  • Support for Windows

The chart below shows how cuFFTDx can provide more than a 2x performance boost over cuFFT host calls when executing convolution with 1D FFTs.

[Chart x-axis: FFT sizes (1D)]
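For readers unfamiliar with FFT-based convolution, the sketch below shows the three steps that such a workload chains together: forward transform, pointwise multiply, inverse transform. It is plain host-side C++ with a naive O(n^2) DFT standing in for the FFT, so it says nothing about cuFFTDx performance or its API. The function names `dft` and `circular_convolve` are hypothetical, and both inputs are assumed to have the same length.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Complex = std::complex<double>;

// Naive unnormalized DFT. sign = -1 gives the forward transform,
// sign = +1 the inverse (the caller divides by n).
std::vector<Complex> dft(const std::vector<Complex>& x, int sign) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<Complex> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        Complex sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            // Reduce j*k modulo n to keep the angle small and accurate.
            const double angle = static_cast<double>(sign) * 2.0 * pi *
                                 static_cast<double>((j * k) % n) /
                                 static_cast<double>(n);
            sum += x[j] * std::polar(1.0, angle);
        }
        out[k] = sum;
    }
    return out;
}

// Circular convolution via the convolution theorem.
// Assumption: a and b have the same length.
std::vector<Complex> circular_convolve(const std::vector<Complex>& a,
                                       const std::vector<Complex>& b) {
    const std::size_t n = a.size();
    std::vector<Complex> fa = dft(a, -1);        // 1. forward transforms
    const std::vector<Complex> fb = dft(b, -1);
    for (std::size_t k = 0; k < n; ++k) {
        fa[k] *= fb[k];                          // 2. pointwise multiply
    }
    std::vector<Complex> c = dft(fa, +1);        // 3. inverse transform
    for (std::size_t k = 0; k < n; ++k) {
        c[k] /= static_cast<double>(n);          // normalize the inverse
    }
    return c;
}
```

When each of these steps is a separate library call, the intermediate spectra are written to and read back from global memory between steps. Performing all three inside one kernel is the kind of fusion the device extensions are meant to allow.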