NVIDIA CUDA-X math libraries provide the fundamental numerical building blocks that enable developers to deploy accelerated applications across multiple high-performance domains, including AI and scientific computing.
cuBLAS is a CUDA-X math library that provides a highly optimized collection of basic linear algebra subroutines for matrix and vector operations. These routines are tuned to deliver the best possible performance on NVIDIA hardware through familiar, easy-to-use APIs.
The latest cuBLAS update in NVIDIA CUDA Toolkit 13.0 Update 2 introduces new APIs and implementations that significantly boost the performance of double-precision (FP64) matrix multiplications (matmuls). This is achieved through floating-point (FP) emulation on the Tensor Cores of NVIDIA Blackwell architecture GPUs such as the NVIDIA GB200 NVL72 and NVIDIA RTX PRO 6000 Blackwell Server Edition. For comprehensive information on GPU compatibility for both FP32 and FP64 emulation, refer to the cuBLAS documentation.
This new emulated FP64 matmul implementation complements the recently released single-precision (FP32) matmul emulation. Developers can fine-tune the required accuracy for FP64 matrix multiplications, but by default cuBLAS maintains accuracy equivalent to or better than native hardware. It automatically assesses whether an operation will perform better using FP emulation (with accuracy preserved) or native hardware and then selects the optimal implementation.
This post explains cuBLAS capabilities in CUDA Toolkit 13.0 Update 2, including:
- Seamless access to Tensor Core performance through familiar and straightforward developer APIs
- FP32 emulation on Blackwell BF16 Tensor Cores, providing increased performance over native FP32 matrix multiplication while preserving accuracy
- FP64 emulation on Blackwell INT8 Tensor Cores, providing a safe, automatic performance increase with fallback to native execution when needed
- FP emulation for increased performance across a variety of software domains and hardware platforms
This is the first release of FP64 matmul emulation, with more advancements to follow in upcoming releases.
Floating-point emulation in practice
The cuBLAS library exposes two flavors of matmul emulation: the BF16x9 algorithm for FP32 and the Ozaki Scheme for FP64. The BF16x9 algorithm uses a static decomposition that can performantly and safely emulate all normal and subnormal FP32 values using Blackwell BF16 Tensor Cores. A common challenge of emulating FP64 with the Ozaki Scheme, however, is that the numerics of the input data dictate the representation required.
In other words, a single configuration cannot performantly and accurately emulate all FP64 values. Specifically, because the Ozaki Scheme uses a fixed-point representation for the operands after their exponents are aligned, the number of “mantissa bits” required is data dependent and must be greater than or equal to the 53 bits of the IEEE 754 FP64 significand to deliver the same or better accuracy.
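To make the contrast concrete, here is a schematic view of the BF16x9 idea (illustrative only; the exact splitting and accumulation order used inside cuBLAS may differ). Each FP32 operand is split into three BF16 terms, so one FP32 product expands into nine BF16 Tensor Core products accumulated at higher precision. Because three BF16 significands (8 bits each, counting the implicit bit) cover the 24-bit FP32 significand, this split is static and data independent, whereas the Ozaki Scheme's fixed-point slicing must adapt its width to the data.

```latex
% Schematic BF16x9 decomposition (not cuBLAS's exact kernel logic)
a \approx a_{\mathrm{hi}} + a_{\mathrm{mid}} + a_{\mathrm{lo}}, \qquad
b \approx b_{\mathrm{hi}} + b_{\mathrm{mid}} + b_{\mathrm{lo}}, \qquad
a_i,\, b_j \in \mathrm{BF16}

a\,b \;\approx\; \sum_{i \in \{\mathrm{hi},\mathrm{mid},\mathrm{lo}\}}
\;\sum_{j \in \{\mathrm{hi},\mathrm{mid},\mathrm{lo}\}} a_i\, b_j
\quad \text{(9 BF16 Tensor Core products, accumulated in FP32)}
```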
To solve this problem, the cuBLAS library includes an automatic dynamic precision (ADP) framework that seamlessly analyzes the inputs to determine whether emulation can be safely leveraged for increased performance. If so, the emulation parameters are automatically configured to deliver accuracy equal to or better than the native FP64 matmul.
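Opting in is intended to require no changes to the math calls themselves. The minimal sketch below opts a cuBLAS handle into emulation and then issues an ordinary DGEMM. It assumes the handle-level cublasSetEmulationStrategy API and the CUBLAS_EMULATION_STRATEGY_* enumeration described in the cuBLAS documentation; verify the exact symbols and behavior against the docs for your toolkit version.

```cpp
// Minimal sketch (not an official NVIDIA sample): opting a cuBLAS handle into
// matmul emulation and running a plain DGEMM. Assumes the emulation-strategy
// API documented for recent cuBLAS releases is available in your toolkit.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 2048;                      // square DGEMM: C = alpha*A*B + beta*C
    const double alpha = 1.0, beta = 0.0;

    // Simple operands so the result is easy to check: A = 1.0, B = 0.5.
    std::vector<double> hA(static_cast<size_t>(n) * n, 1.0);
    std::vector<double> hB(static_cast<size_t>(n) * n, 0.5);
    double *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(double) * n * n);
    cudaMalloc(&dB, sizeof(double) * n * n);
    cudaMalloc(&dC, sizeof(double) * n * n);
    cudaMemcpy(dA, hA.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(double) * n * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Assumption: handle-level emulation strategy per the cuBLAS docs.
    // EAGER asks cuBLAS to prefer emulation where it judges it profitable;
    // with DEFAULT, cuBLAS decides automatically and ADP preserves accuracy.
    cublasSetEmulationStrategy(handle, CUBLAS_EMULATION_STRATEGY_EAGER);

    // The math call itself is an ordinary DGEMM -- no other code changes.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    double c00 = 0.0;
    cudaMemcpy(&c00, dC, sizeof(double), cudaMemcpyDeviceToHost);
    std::printf("C[0,0] = %.1f (expected %.1f)\n", c00, 0.5 * n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

For FP32, emulation can similarly be requested per call through the CUBLAS_COMPUTE_32F_EMULATED_16BFX9 compute type of cublasGemmEx; availability of these symbols depends on your cuBLAS version, so check them against the documentation linked at the end of this post.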
Application results: ecTrans
When weather forecasting or climate modeling applications simulate the complex physics involved across the Earth’s atmosphere, oceans, and other systems, a grid is needed to discretize the domain and perform the calculations. The open source ecTrans library relies on linear algebra operations to perform the grid-based transformations that are used for the weather predictions of the Integrated Forecasting System (IFS).
As shown in Figure 1, using NVIDIA Blackwell Tensor Cores for FP32 emulation significantly improves performance in ecTrans by providing a 2.4x speedup to the matrix product computations.

In addition to the increased performance, the numerical accuracy achieved with FP emulation is equivalent or superior to the results obtained with native FP32. To validate this, 1,000 consecutive forward and backward spectral transforms were applied to real data fields from an actual simulation.
During this process, the error distributions of the velocities (U and V) and temperature (T) obtained with BF16x9 FP emulation were tracked and compared to the results obtained with standard FP32 precision (the operational precision used at the European Centre for Medium-Range Weather Forecasts for daily forecasts).

The probability density functions of the absolute errors are shown in Figure 2 across FP32, TF32, and BF16x9 FP emulation. These plots correspond to the likelihood of encountering an error if velocities and temperatures are randomly sampled. The closer the curves are to a delta function centered at 0, the more accurate the underlying implementation.
The TF32 results do not appear on the velocity plots because their errors are too large for the displayed range; zooming out, large errors in the velocities and temperatures would become visible, which demonstrates how sensitive weather modeling is to precision. BF16x9 FP emulation, by contrast, not only keeps accuracy within acceptable ranges but matches or exceeds the accuracy of native FP32 while also exceeding its performance.
Application results: BerkeleyGW
The BerkeleyGW code is used by researchers to study physical properties of materials that emerge from how electrons change energy states. It is a massively parallel code that has been run at full scale on leadership-class supercomputers. Using GPUs with BerkeleyGW can deliver an 86x speedup over the CPU-only implementation, and FP emulation accelerates it even further.
Using emulated complex FP64 matmuls (ZGEMM) in the CHISUM routine of the BerkeleyGW Epsilon module allows for some flexibility in determining the optimal balance between accuracy and performance. By default, cuBLAS uses its ADP framework to determine the parameters that will guarantee results as accurate as using native FP64. This is done automatically for users and results in the performance gains shown in Figure 3.

However, the cuBLAS API enables the user to further tune performance by using fewer bits for the emulated FP64 operations. For BerkeleyGW, two cases were measured: FP emulation with the default ADP setting and FP emulation with a manually set 55 mantissa bits. Both resulted in accuracy well within widely accepted tolerances (10⁻¹⁰) of the reference values, with the 55-mantissa-bit case providing even more acceleration.
The performance difference comes from ADP determining that more than 55 mantissa bits are required; in practice, however, the reduced precision of the manually set 55 mantissa bits has no impact on application-level accuracy for these tests. If more performance is desired, cuBLAS APIs enable you to adjust the precision used during emulation and explore whether the resulting accuracy meets application needs.
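As a back-of-the-envelope check (general floating-point reasoning, not a cuBLAS-specific error bound), 55 mantissa bits correspond to a relative rounding granularity several orders of magnitude finer than the accepted tolerance, which helps explain why the 55-bit configuration still lands comfortably inside it:

```latex
2^{-55} \approx 2.8 \times 10^{-17} \;\ll\; 10^{-10}
```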
Application results: Quantum Espresso
The open source Quantum Espresso (QE) collection of applications is used worldwide for materials science calculations based on density functional theory (DFT). The core of these applications is highly optimized both for scale-out distributed computation and for fine-grained parallelism within a node.
QE depends on efficient double-precision GEMMs to apply operators during each step of the fundamental iteration cycle for determining ground-state energies of atoms and materials. This double-precision GEMM usage pattern is shared by many other DFT-based applications, so the performance improvements that FP emulation delivers for QE are expected to translate to those applications as well.
For the results shown in Figure 4, the Ausurf benchmark dataset was used to measure both the quality of the numerical results and the performance of QE with FP emulation enabled in the cuBLAS library on an RTX PRO 6000 Blackwell Server Edition GPU.

Figure 4 shows that FP emulation with ADP provides a significant 1.5x end-to-end speedup, and with further tuning to 39 mantissa bits, a nearly 3x end-to-end speedup is achieved. For all configurations, the accuracy results are indistinguishable from one another until emulated FP64 with 39 mantissa bits is used, which produces application output values that are consistent to 12 (base-10) significant digits.
The performance difference between ADP and 55 mantissa bits is due to the ADP framework determining that more than 55 mantissa bits are required for IEEE 754 FP64 level accuracy; however, in practice, using fewer mantissa bits does not impact the measured application-level accuracy.
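As a rough rule of thumb (not a formal error bound), p mantissa bits translate to about p·log₁₀(2) ≈ 0.30·p significant decimal digits, which lines up with the observations above:

```latex
53 \cdot \log_{10} 2 \approx 16 \text{ digits (native FP64)}, \qquad
55 \cdot \log_{10} 2 \approx 16.6 \text{ digits}, \qquad
39 \cdot \log_{10} 2 \approx 11.7 \approx 12 \text{ digits}
```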
Benchmarking results: Heat maps
In addition to end-to-end application speedups from FP emulation, it is important to understand the range of problem sizes and shapes where emulation applies when analyzing how it can improve your application’s performance. The three heat maps shown in Figures 5-7 demonstrate the performance improvements from using emulated matmuls across different matrix shapes on a GB200 NVL72 GPU for FP32 and FP64 and on an RTX PRO 6000 Blackwell Server Edition for FP64.



All three heat maps demonstrate substantial performance gains on moderate and large problem shapes. Additionally, in Figures 6 and 7 the ADP framework uses 55 mantissa bits, and when problems are too small to benefit from emulation there is no performance penalty for attempting it, because cuBLAS heuristics select the native FP64 algorithms instead. We expect further improvements to both performance and the applicability region in future cuBLAS releases.
What’s next for FP emulation
While FP emulation is already accelerating real applications, NVIDIA is continuing to advance this technology across several key areas. Additional key BLAS level-3 and LAPACK routines within the CUDA-X math libraries will be accelerated through both FP32 and FP64 emulation. The team will continue to improve FP64 emulation with optimizations to the ADP framework and GEMM kernels, reduced workspace memory requirements, and the Ozaki-II Scheme.
Get started with floating point emulation in CUDA Toolkit 13.0 Update 2
Using the strategies discussed in this post, you can take advantage of Tensor Core performance for algorithms that use matrix multiplication without changing your code or requiring tedious performance analysis. cuBLAS will automatically choose the best strategy, delivering high performance while preserving the desired level of accuracy.
To start using FP emulation and exploring its benefits in your own applications, download CUDA Toolkit 13.0 Update 2.
To learn more, check out these related resources:
- Refer to the Floating Point Emulation section of the Official cuBLAS Documentation
- Watch Energy-Efficient Supercomputing Through Tensor Core-Accelerated Mixed-Precision Computing and Floating-Point Emulation
- Browse the slides from Precision Redefined: Unlocking and Delivering the Full Power of Modern GPUs for Scientific Computing
- Check out Emulation in the CUDA Library Samples on GitHub