Selecting the best possible General Matrix Multiplication (GEMM) kernel for a specific problem and hardware is a significant challenge. The performance of a GEMM kernel is determined by an array of compile-time and runtime meta-parameters: CTA, warp, and instruction-level tile sizes; kernel schedules; rasterization strategies; cluster dimensions; split-k factors; and so on.
The traditional approach to finding the optimal kernel involves generating thousands of candidate kernel configurations, compiling them, and running an exhaustive auto-tuning pass to find the fastest one. This entire workflow can take many hours, creating a significant bottleneck for developers and hindering adoption of offline-compiled libraries such as CUTLASS. The problem is even more acute for JIT-compiling libraries such as Torch Inductor or OpenAI Triton, where fast model compilation is critical. This friction can lead users to settle for suboptimal kernels simply to avoid a lengthy and complex tuning process.
This post introduces NVIDIA Matmul Heuristics (nvMatmulHeuristics), a GPU kernel meta-parameter optimization module that provides fast heuristics for GEMMs, to address this challenge. The module analyzes the specific parameters of an operation and the target hardware capabilities to determine a small set of best kernel configurations that will deliver maximum performance. The example featured in this post uses CUTLASS 4.2.
Heuristics-based CUTLASS configuration and auto-tuning
The integration of nvMatmulHeuristics into the CUTLASS ecosystem dramatically improves the user experience by transforming the kernel generation and tuning process.
Instead of a brute-force approach, the integration introduces a new, more efficient workflow. For a given GEMM problem, a user can now leverage nvMatmulHeuristics to predict a small, targeted set of high-potential kernel configurations.
The workflow involves the following steps:
- Heuristic prediction: nvMatmulHeuristics takes the GEMM problem definition (shape, data types, and so on), the target backend (in this case, CUTLASS), detects hardware properties and, based on its internal models, generates a short list of promising kernel configurations. These configurations include optimal CTA shapes, split-k factors, and other meta-parameters.
- Kernel generation: This list of predicted configurations is passed to the CUTLASS Library kernel generator, which generates only this small, relevant set of kernels, shortening what would otherwise be an hours-long process of compiling thousands of unnecessary variants.
- Auto-tuning: The CUTLASS profiler takes the same list and auto-tunes over only the runtime parameters, still following nvMatmulHeuristics choices for the handful of compiled kernels. It then quickly finds the top-performing candidate.
This approach can dramatically reduce the end-to-end time required to find a high-performance kernel.
To use this feature in CUTLASS, first prepare a list of GEMM problems in JSON format. The following example is for a single FP16 GEMM, where tnn refers to row-major (Transposed) A, column-major (Not transposed) B, and column-major C/D.
[
  {
    "m" : 4096,
    "n" : 4096,
    "k" : 4096,
    "batch_count" : 1,
    "layout" : "tnn",
    "dtype_a" : "f16",
    "dtype_b" : "f16",
    "dtype_c" : "f16",
    "dtype_acc" : "f32",
    "dtype_d" : "f16"
  }
]
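For larger sweeps, you can generate this problem list programmatically rather than writing it by hand. The following is a minimal sketch that emits the same JSON schema shown above for a few shapes; the shapes and output filename are arbitrary examples:

# Minimal sketch: emit a CUTLASS heuristics problem list in the JSON schema
# shown above. The (m, n, k) shapes and the filename are arbitrary examples.
import json

shapes = [(4096, 4096, 4096), (8192, 4096, 1024), (4000, 16, 32768)]

problems = [
    {
        "m": m, "n": n, "k": k,
        "batch_count": 1,
        "layout": "tnn",
        "dtype_a": "f16", "dtype_b": "f16", "dtype_c": "f16",
        "dtype_acc": "f32", "dtype_d": "f16",
    }
    for (m, n, k) in shapes
]

with open("problem_list.json", "w") as f:
    json.dump(problems, f, indent=2)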
Next, on a machine with your target GPU, build CUTLASS as usual and provide -DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=<path_to_your_problem_list.json> and -DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=N, where N is the number of configurations nvMatmulHeuristics will emit for each GEMM in the input list. For example:
cmake ${SRC_DIR} \
  -DCUTLASS_NVCC_ARCHS=90a \
  -DCUTLASS_LIBRARY_HEURISTICS_PROBLEMS_FILE=<path_to_your_problem_list.json> \
  -DCUTLASS_LIBRARY_HEURISTICS_CONFIGS_PER_PROBLEM=8 \
  -DCUTLASS_LIBRARY_HEURISTICS_TESTLIST_FILE=my_testlist_file.csv \
  -DCMAKE_BUILD_TYPE=Release
...
...
make cutlass_profiler -j
The cmake step produces a CSV test list, which enumerates all test cases that need to be run to auto-tune over the emitted configurations. You can use this with custom benchmarking code to run your own auto-tuning, or use cutlass_profiler, which can consume the CSV and run the configurations out of the box:
cutlass_profiler --operation=Gemm --testlist-file=my_testlist_file.csv \
  --profiling-iterations=0 --profiling-duration=50 \
  --verification-enabled=false --output=<path_to_outfile>
For consistent profiling results, run with locked clocks (for example, using nvidia-smi --lock-gpu-clocks).
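If you run your own auto-tuning with custom benchmarking code instead, picking the winner is a simple reduction over measured runtimes. The following is a minimal sketch that assumes the profiler's CSV output contains Operation and Runtime (milliseconds) columns; the filename is a placeholder, and the column names may need adjusting to match your cutlass_profiler version's output:

# Minimal sketch: pick the fastest kernel from cutlass_profiler CSV output.
# Assumes "Operation" and "Runtime" (ms) columns exist; adjust the names to
# match your cutlass_profiler version. The path below is a placeholder.
import csv

def best_kernel(csv_path):
    with open(csv_path, newline="") as f:
        rows = [r for r in csv.DictReader(f) if r.get("Runtime")]
    return min(rows, key=lambda r: float(r["Runtime"]))

winner = best_kernel("profiler_results.gemm.csv")  # placeholder path
print(f"Fastest kernel: {winner['Operation']} ({winner['Runtime']} ms)")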
To learn more about this feature, see the CUTLASS documentation.
nvMatmulHeuristics is now available in early access
nvMatmulHeuristics is a core part of the cuBLAS heuristics and is now available in early access for general use, along with an integration into the CUTLASS library. Key aspects include:
- Fast analytical heuristics and automatic GPU configuration detection (for example, green context, MIG, CUDA in graphics, clocks)
- Support for CUTLASS and other GEMM backends; the search space for a backend can also be restricted, for instance, by limiting the supported CTA tile shapes, split-k factors, and so on
- NVIDIA Ampere, Ada, Hopper, and (preliminary) Blackwell GPU architecture support for all Tensor Core-based GEMM precisions (FP4, FP8, FP16/BF16, TF32, INT8, and more)
- Python and C++ APIs
To get started, install nvMatmulHeuristics:
pip install nvidia-matmul-heuristics
You can then query the predicted top-N GEMM configurations. For an FP16 input/output and FP32 compute (HSH) example:
from nvMatmulHeuristics import *

# Load interface
nvmmh = NvMatmulHeuristicsInterface(NvMatmulHeuristicsTarget.CUTLASS3,
                                    precision='HSH',
                                    flags=NvMatmulHeuristicsFlags.PERF_MODEL_BASED_AUTO_TUNING)

# Create hardware descriptor
# hw can be None to use the system's GPU instead
hw = nvmmh.createHardwareDescriptor()
nvmmh.setHardwarePredefinedGpu(hw, NvMatmulHeuristicsNvidiaGpu.H200_SXM)

# Select layout
layout = NvMatmulHeuristicsMatmulLayout.NN_ROW_MAJOR

# Load internal discovery set for improved accuracy
assert nvmmh.loadInternalDiscoverySet(layout, hw)

# Get best configurations
configs = nvmmh.get_with_mnk(4000, 16, 32768, layout, 8, hw)

# Print results
print(f"Found {len(configs)} configurations:\n")

for i, config in enumerate(sorted(configs, key=lambda d: d['runtime']), 1):
    print(f"Configuration {i}:")
    print(f" Kernel: {config['kernel']}")
    print(f" Estimated runtime: {config['runtime'] * 1000:.6f} ms")
Output:
Found 8 configurations:

Configuration 1:
 Kernel: layout(NN_ROW) stages(6) cta(128 16 128) warp(64 8 128) instr(64 8 16) splitK(4) swizz(1) ctaOrder(0) cluster(2 1)
 Estimated runtime: 0.083215 ms
...
Configuration 8:
 Kernel: layout(NN_ROW) stages(8) cta(64 16 64) warp(64 8 64) instr(64 8 16) splitK(1) swizz(1) ctaOrder(0) cluster(4 1)
 Estimated runtime: 0.102996 ms
Given these eight configurations, you can now compile just these eight CUTLASS kernels and select the one that performs best for the problem at hand.
How does nvMatmulHeuristics perform?
By focusing the build and profiling effort on a small number of candidates recommended by nvMatmulHeuristics, you can achieve near-optimal performance in a fraction of the time required by an exhaustive search.
Figure 1 shows the geomean performance score for a set of GEMMs found in Llama 3 405B training workloads on an NVIDIA H100 SXM GPU. The score is the speedup relative to the best-performing kernel found by an exhaustive search (the “baseline” at 1.0), plotted against the total build and profiling time.
As the data shows, an exhaustive search takes over 700 minutes to find the optimal kernel. In contrast, using nvMatmulHeuristics to select just 16 candidate kernels achieves 96% of the peak performance in approximately 150 minutes.
By increasing the number of candidates, users can approach the performance of the exhaustive search baseline while still realizing a massive reduction in build plus tuning time. This makes it practical to find high-performing kernels in environments like PyTorch, where JIT compilation times are critical.
Figure 1. Geomean performance score versus total build and profiling time for GEMMs from Llama 3 405B training workloads on an NVIDIA H100 SXM GPU
Figure 2 shows similar results on a DeepSeek-R1 671B training workload on an NVIDIA B200 GPU. Performance reaches 99% of exhaustive search with a more than 5x speedup in build and auto-tuning time. Because only a few kernels need to be built, it’s possible to get the “JIT” benefit of building more behavior into the kernel statically, whereas pre-compiled sets of kernels often move more behavior to runtime to reduce the number of compiled kernels.
In this case, the baseline uses dynamic cluster sizes, as is typical for precompiled NVIDIA Blackwell kernels from CUTLASS, whereas the kernels suggested by nvMatmulHeuristics were built with static cluster sizes known at compile time. For these GEMMs, this results in achieved performance of 104% of the baseline.
Figure 2. Geomean performance score versus build and auto-tuning time for a DeepSeek-R1 671B training workload on an NVIDIA B200 GPU
Get started with improving GEMM auto-tuning workflows
These results show how a good heuristic can enable you to achieve peak performance for your workloads without the cost of manual or exhaustive tuning. nvMatmulHeuristics can push the envelope for state-of-the-art performance and productivity across DL frameworks, compilers, and kernel libraries.
Download nvMatmulHeuristics to get started. Ask questions and join the conversation in the NVIDIA Developer Forum or email Math-Libs-Feedback@nvidia.com.
- nvMatmulHeuristics documentation
- nvMatmulHeuristics samples
- NVIDIA CUTLASS documentation
- NVIDIA CUTLASS download
- NVIDIA CUTLASS integration documentation
Acknowledgments
We would like to express our gratitude to all of the CUTLASS OSS contributors. Without their foundational contributions, CUTLASS 4 would not have been possible.