Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2

Selecting the best possible General Matrix Multiplication (GEMM) kernel for a specific problem and hardware is a significant challenge. The performance of a GEMM kernel is determined by an array of compile-time and runtime meta-parameters: CTA, warp and instruction level tile sizes, kernel schedules, rasterization strategies, cluster dimensions, split-k factors, and so on. The traditional … Continue reading Improving GEMM Kernel Auto-Tuning Efficiency on NVIDIA GPUs with Heuristics and CUTLASS 4.2