The latest release of the NVIDIA cuBLAS library, version 12.5, continues to deliver functionality and performance to deep learning (DL) and high-performance computing (HPC) workloads. This post provides an overview of the following updates to cuBLAS matrix multiplications (matmuls) since version 12.0, along with a walkthrough of each:
- Grouped GEMM APIs for single, double, and half precisions
- Latest LLM matmul performance on NVIDIA Hopper (H100 and H200) and NVIDIA Ada (L40S) GPUs
- A note on cuBLAS performance tuning options, benchmarking, and API recommendations
- Improved functional coverage in cuBLASLt
Grouped GEMM APIs
Grouped GEMM APIs can be viewed as a generalization of the batched APIs that enable different matrix sizes, transpositions, and scaling factors to be grouped and parallelized in one kernel launch.
One example where this approach provides a speedup is the generation phase of a mixture-of-experts (MoE) model with batch sizes of 8 and 64 and FP16 inputs and outputs. In this case, the grouped GEMM API achieves a 1.2x speedup over naively looping over the batched GEMM API.
This is notable because the current grouped GEMM kernels use only warp-level MMA instructions, yet they compete with the batched GEMM kernels, which leverage warpgroup-level MMA (wgmma) instructions.
Two new sets of APIs are available in the cuBLAS library for Grouped GEMM support:
- cublas<t>gemmGroupedBatched for FP32 (including TF32) and FP64 precisions, where <t> is S or D for single and double precision, respectively.
- cublasGemmGroupedBatchedEx for FP16, BF16, FP32 (including TF32), and FP64 precisions.
These APIs currently support variable shapes, transpositions, and scaling factors. Visit NVIDIA/CUDALibrarySamples on GitHub to see examples for cuBLAS Extension APIs and cuBLAS Level 3 APIs.
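As a concrete illustration, the sketch below groups two sets of FP32 GEMMs with different shapes and transpositions into a single call to cublasSgemmGroupedBatched. This is a minimal sketch, not a complete sample: error checking and memory allocation are omitted, the pointer arrays dA, dB, and dC are placeholders assumed to be prepared by the caller, and the exact parameter layout and host/device residency requirements of each array should be verified against the cuBLAS documentation for your toolkit version.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Sketch: two groups of FP32 GEMMs with different shapes in one launch.
// Group 0: 8 problems of size 128x128x64; group 1: 4 problems of 256x64x128.
// dA/dB/dC hold one device pointer per problem (8 + 4 = 12 entries total)
// and are assumed to be allocated and filled elsewhere.
void grouped_gemm_example(cublasHandle_t handle,
                          const std::vector<const float*>& dA,
                          const std::vector<const float*>& dB,
                          const std::vector<float*>&       dC) {
  const int groupCount = 2;
  const int groupSize[groupCount] = {8, 4};  // problems per group

  // Per-group problem descriptions (host arrays of length groupCount).
  cublasOperation_t transA[groupCount] = {CUBLAS_OP_N, CUBLAS_OP_N};
  cublasOperation_t transB[groupCount] = {CUBLAS_OP_N, CUBLAS_OP_T};
  int m[groupCount]   = {128, 256};
  int n[groupCount]   = {128,  64};
  int k[groupCount]   = { 64, 128};
  int lda[groupCount] = {128, 256};
  int ldb[groupCount] = { 64,  64};
  int ldc[groupCount] = {128, 256};
  float alpha[groupCount] = {1.0f, 1.0f};
  float beta[groupCount]  = {0.0f, 0.0f};

  // One call replaces a loop of batched GEMM calls with differing shapes.
  cublasSgemmGroupedBatched(handle,
                            transA, transB,
                            m, n, k,
                            alpha,
                            dA.data(), lda,
                            dB.data(), ldb,
                            beta,
                            dC.data(), ldc,
                            groupCount, groupSize);
}
```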
Latest LLM matmul performance on NVIDIA H100, H200, and L40S GPUs
The latest snapshot of matmul performance for NVIDIA H100, H200, and L40S GPUs is presented in Figure 1 for Llama 2 70B and GPT3 training workloads. These speedups are measured without locking GPU clocks and account for the number of times each GEMM is repeated in the workload. Speedups of nearly 3x and 5x over A100 can be seen on H200 for the Llama 2 70B and GPT3 training phases, respectively. The final end-to-end speedup of each workload depends on these GEMM speedups and on the speedup of its non-GEMM fraction.
Library performance and benchmarking
This section covers the following topics:
- Runtime performance heuristics, or how the cuBLAS library dispatches the fastest implementation on average. This explains why users sometimes encounter performance gaps when comparing cuBLAS with other backends.
- The performance tuning API in the cuBLAS library, which unlocks faster implementations when they are available. This explains how users can reach the best performance with cuBLAS before separate specialized kernels are needed.
Runtime heuristics
The cuBLAS library leverages a recommender system at runtime to dispatch the fastest configuration possible for any user-requested matmul. Each configuration consists of an implementation (kernel) and its runtime launch parameters. This recommender system is trained on actual timing data from running a large number of problems (spanning multiple precisions, matrix shapes, layouts, and epilogues) with several available configurations on the GPU.
Figure 2 shows an example of what this data looks like on a Hopper GPU locked at a typical operating clock, with performance normalized to the GPU peak at that clock.
The goal of this recommender system is to choose the fastest implementation for each of these problems at runtime. In the absence of auto-tuning, this can only be achieved on average across the problem space. Figure 3 shows the outcome of training such a recommender system: its runtime heuristics achieve 93% (geomean) of the best available performance across the visualized dataset.
It’s important to note that there are always specific matmul problems where this recommender system would not return the best implementation out of the box, even if that implementation is available in the library.
Performance tuning with cuBLASLt APIs
This section describes how to ensure meaningful comparisons across different backends and how to further close the gap to the best available implementation using the cuBLAS heuristics API.
So far, the assumption has been that users leverage the cuBLAS matmul API (that is, cublasLtMatmul), which in turn dispatches the best implementation option according to the default heuristics summarized earlier. Because this option is not always guaranteed to be strictly the fastest, the cuBLAS library exposes another API, cublasLtMatmulAlgoGetHeuristic, which enables users to easily perform auto-tuning and dispatch a faster implementation than the one returned by the default runtime heuristics.
This API accepts the same problem description that users would otherwise pass to the matmul API. But instead of dispatching a matmul, it returns a number of candidate implementations that users can iterate over, run once each, and select from if one proves faster than the default. Visit NVIDIA/CUDALibrarySamples to see an example of auto-tuning in cuBLAS. Figure 4 shows the main sections of this example.
Since the performance tuning options on recent GPUs are only available in the cublasLt* APIs, any benchmarking must be done with this set of APIs and by leveraging cublasLtMatmulAlgoGetHeuristic. Note that PyTorch matmul does not currently expose this capability, so it does not reflect the best available performance if used as a proxy to benchmark cuBLAS. While a similar option, cublasGemmAlgo_t, is available in the cublasGemmEx API, it is a no-op on NVIDIA Ampere architecture and newer GPUs.
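To make this flow concrete, the sketch below (an illustration of the approach, not the official sample) requests several candidate algorithms from cublasLtMatmulAlgoGetHeuristic, times each with CUDA events, and returns the fastest one for reuse. The descriptors, device buffers, and workspace are assumed to be created beforehand, exactly as for a regular cublasLtMatmul call; error checking is omitted for brevity.

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cfloat>

// Sketch: auto-tune a single matmul by timing the candidate algorithms
// returned by cublasLtMatmulAlgoGetHeuristic. C is used as both C and D
// (in-place output), so Cdesc doubles as Ddesc.
cublasLtMatmulAlgo_t pick_fastest_algo(
    cublasLtHandle_t ltHandle,
    cublasLtMatmulDesc_t operationDesc,
    cublasLtMatrixLayout_t Adesc, cublasLtMatrixLayout_t Bdesc,
    cublasLtMatrixLayout_t Cdesc,
    const void* alpha, const void* A, const void* B,
    const void* beta, void* C,
    void* workspace, size_t workspaceSize, cudaStream_t stream) {
  // Tell the heuristics how much workspace the candidate kernels may use.
  cublasLtMatmulPreference_t preference;
  cublasLtMatmulPreferenceCreate(&preference);
  cublasLtMatmulPreferenceSetAttribute(preference,
      CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
      &workspaceSize, sizeof(workspaceSize));

  // Request up to 8 candidate algorithms for this exact problem.
  const int kRequested = 8;
  cublasLtMatmulHeuristicResult_t results[kRequested] = {};
  int returned = 0;
  cublasLtMatmulAlgoGetHeuristic(ltHandle, operationDesc,
                                 Adesc, Bdesc, Cdesc, Cdesc,
                                 preference, kRequested, results, &returned);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  int best = 0;
  float bestMs = FLT_MAX;
  for (int i = 0; i < returned; ++i) {
    // Warm up once, then time a single run of each candidate.
    cublasLtMatmul(ltHandle, operationDesc, alpha, A, Adesc, B, Bdesc,
                   beta, C, Cdesc, C, Cdesc, &results[i].algo,
                   workspace, workspaceSize, stream);
    cudaEventRecord(start, stream);
    cublasLtMatmul(ltHandle, operationDesc, alpha, A, Adesc, B, Bdesc,
                   beta, C, Cdesc, C, Cdesc, &results[i].algo,
                   workspace, workspaceSize, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < bestMs) { bestMs = ms; best = i; }
  }

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cublasLtMatmulPreferenceDestroy(preference);
  return results[best].algo;  // reuse this algo for subsequent identical calls
}
```

In practice, more repetitions per candidate give a more stable measurement; the selected algorithm can then be cached and passed to cublasLtMatmul for every subsequent call with the same problem description.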
Migrate to cublasLtMatmul API
Advanced users on NVIDIA Ampere architecture and newer GPUs are encouraged to migrate from cublasGemmEx to cublasLtMatmul. In addition to the performance tuning options discussed earlier, the transition to the cublasLtMatmul API unlocks access to fused epilogues and the growing support for mixed-precision matrix multiplications. To facilitate this transition, see the cuBLASLt Library API examples.
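For example, a fused bias-plus-ReLU epilogue can be expressed directly on the matmul descriptor, something cublasGemmEx cannot do. The sketch below is a minimal illustration with placeholder names and no error checking; the matmul call itself then follows the same pattern shown in the auto-tuning sketch above.

```cpp
#include <cublasLt.h>
#include <cuda_fp16.h>

// Sketch: describe an FP16 matmul with a fused bias + ReLU epilogue.
// With cublasGemmEx, the bias add and activation would need separate kernels.
void build_fused_matmul_desc(int m, int n, int k, const __half* d_bias,
                             cublasLtMatmulDesc_t* operationDesc,
                             cublasLtMatrixLayout_t* Adesc,
                             cublasLtMatrixLayout_t* Bdesc,
                             cublasLtMatrixLayout_t* Cdesc) {
  // FP16 inputs/outputs with FP32 accumulation and FP32 alpha/beta.
  cublasLtMatmulDescCreate(operationDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  cublasOperation_t opN = CUBLAS_OP_N;
  cublasLtMatmulDescSetAttribute(*operationDesc, CUBLASLT_MATMUL_DESC_TRANSA,
                                 &opN, sizeof(opN));
  cublasLtMatmulDescSetAttribute(*operationDesc, CUBLASLT_MATMUL_DESC_TRANSB,
                                 &opN, sizeof(opN));

  // Fuse the bias add and ReLU into the matmul epilogue.
  cublasLtEpilogue_t epilogue = CUBLASLT_EPILOGUE_RELU_BIAS;
  cublasLtMatmulDescSetAttribute(*operationDesc, CUBLASLT_MATMUL_DESC_EPILOGUE,
                                 &epilogue, sizeof(epilogue));
  cublasLtMatmulDescSetAttribute(*operationDesc,
                                 CUBLASLT_MATMUL_DESC_BIAS_POINTER,
                                 &d_bias, sizeof(d_bias));

  // Column-major layouts: A is m x k, B is k x n, C/D is m x n.
  cublasLtMatrixLayoutCreate(Adesc, CUDA_R_16F, m, k, m);
  cublasLtMatrixLayoutCreate(Bdesc, CUDA_R_16F, k, n, k);
  cublasLtMatrixLayoutCreate(Cdesc, CUDA_R_16F, m, n, m);
}
```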
Better functionality and performance in cuBLASLt
Since cuBLAS 12.0, the following enhancements have been included in the cuBLAS library:
- Fused epilogue support parity between BF16 and FP16 precisions on NVIDIA Ampere and NVIDIA Ada.
- Additional fused epilogues on NVIDIA Hopper and NVIDIA Ampere.
- Support for FP8 on NVIDIA Ada GPUs and performance updates on the L4, L40, and L40S.
- Removal of M, N, and batch size limitations of the cuBLASLt matmul API, closing functional gaps between cuBLASLt and the cuBLAS gemmEx API.
- Improved performance of the heuristics cache for workloads with high eviction rates.
- cuBLAS symbols are available in the CUDA Toolkit symbols for Linux repository. To learn more, see NVIDIA CUDA Toolkit Symbol Server.
To learn more about cuBLAS, see the documentation and samples.