
Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling


As AI models extend their capabilities to solve more sophisticated challenges, a new scaling law known as test-time scaling or inference-time scaling is emerging. Also known as AI reasoning or long thinking, this technique improves model performance by allocating additional computational resources during inference to evaluate multiple possible outcomes and then select the best one. This enables AI to strategize and systematically solve complex problems in much the same way humans dissect a complex problem and solve its parts individually to arrive at a final solution.
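
A minimal sketch of the simplest form of this idea, best-of-N sampling, is shown below: generate several candidate outputs and keep the one a scorer prefers. The generate_candidate and score_candidate callables are hypothetical placeholders, not part of any NVIDIA or DeepSeek API.

import random
from typing import Callable

def best_of_n(generate_candidate: Callable[[], str],
              score_candidate: Callable[[str], float],
              n: int = 8) -> str:
    """Sample n candidates and return the one the scorer rates highest."""
    candidates = [generate_candidate() for _ in range(n)]
    return max(candidates, key=score_candidate)

# Toy usage: candidates are random numbers and the scorer prefers larger ones.
best = best_of_n(lambda: str(random.random()), float, n=16)
print("best candidate:", best)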

In this post, we talk about an experiment done by NVIDIA engineers who used one of the newest open-source models, the DeepSeek-R1 model, together with additional computing power during inference to solve a complex problem. The experiment was to automatically generate GPU attention kernels that were numerically correct and optimized for different flavors of attention without any explicit programming. 

The results turned out to be better than the optimized kernels developed by skilled engineers in some cases. 

The need for optimized attention kernels and associated challenges

Attention is a key concept that revolutionized the development of large language models (LLMs). It’s a powerful mechanism that enables AI models to focus selectively on the most relevant parts of the input when performing tasks. By focusing on important information, the attention operation helps models make better predictions and find hidden patterns in the data.

The computational complexity of the attention operation grows quadratically in relation to the input sequence length. This motivates the need for developing an optimized lower-level implementation (that is, a GPU kernel) to prevent runtime errors arising from simple implementations (for example, out-of-memory errors) and for computational efficiency purposes. 
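
To see where the quadratic cost comes from, the following is a minimal, unoptimized PyTorch reference for scaled dot-product attention (an illustrative sketch, not output from the workflow). The intermediate score matrix holds one entry per query-key pair, so its size, and the memory and compute it requires, grow with the square of the sequence length.

import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    scale = q.shape[-1] ** -0.5
    # The score matrix is [batch, heads, seq_len, seq_len]: quadratic in sequence length.
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    return torch.matmul(F.softmax(scores, dim=-1), v)

q = k = v = torch.randn(1, 8, 1024, 64)
out = naive_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])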

There are multiple variants of attention (causal, relative positional embeddings, alibi, and so on), and engineers often must use a combination of these variants for a given task.
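
As a rough illustration (not code from the experiment), such variants are often expressed as small score or mask modifier functions, in the same style as the relative positional example shown later in this post; the ALiBi slope below is illustrative.

def causal_mask(b, h, q_idx, kv_idx):
    # Each query position may attend only to itself and earlier key positions.
    return q_idx >= kv_idx

def alibi_bias(score, b, h, q_idx, kv_idx):
    # ALiBi: add a head-dependent linear penalty based on query/key distance.
    slope = 2.0 ** (-(h + 1))
    return score + slope * (kv_idx - q_idx)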

Multi-modal models (for example, vision transformers) introduce an additional layer of challenges, as they require specialized attention mechanisms (such as spatial neighborhood attention) to preserve the spatio-temporal information encountered in computer vision, video generation models, and so on.

Figure 1. Neighborhood attention on 2D inputs: increasing the dilation value from 1 to 4 expands the attended region from a single skier to nearly the entire image

Creating an optimized GPU kernel for attention takes a lot of skill and time, even for experienced software engineers.

Recent LLMs like DeepSeek-R1 have shown a lot of promise in code generation tasks, but they still face challenges creating optimized code on the first try. This makes it necessary to use other strategies at inference time to generate optimized code. 

The following prompt is sample user input for a relative positional embeddings attention kernel.

Please write a GPU attention kernel to support relative position encodings. Implement the relative positional encoding on the fly within the kernel. The complete code should be returned, including the necessary modifications.

Use the following function to compute the relative positional encoding:

def relative_positional(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

When implementing the kernel, keep in mind that a constant scaling factor 1.44269504 should be applied to the relative positional encoding due to qk_scale = sm_scale * 1.44269504. The PyTorch reference does not need to scale the relative positional encoding, but in the GPU kernel, use:

qk = qk * qk_scale + rel_pos * 1.44269504

Please provide the complete updated kernel code that incorporates these changes, ensuring that the relative positional encoding is applied efficiently within the kernel operations.
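
For context, a PyTorch reference for this variant might look like the following sketch, using flex attention (available in PyTorch 2.5 and later) with the score_mod function from the prompt. The constant 1.44269504 is log2(e); kernels that evaluate softmax with the faster exp2 instruction typically fold it into the scale, which is why the prompt asks for it inside the GPU kernel but not in the reference. This is one possible reference, not necessarily the one used in the experiment.

import torch
from torch.nn.attention.flex_attention import flex_attention

def relative_positional(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

# Random inputs shaped [batch, heads, seq_len, head_dim]; sizes are arbitrary.
# Assumes a CUDA-capable GPU is available.
q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# The reference applies the raw score_mod; any exp2-related scaling lives in the kernel.
reference_out = flex_attention(q, k, v, score_mod=relative_positional)
print(reference_out.shape)  # torch.Size([1, 8, 2048, 64])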

LLMs can occasionally produce hallucinated code or mix syntax from different languages or frameworks, causing immediate code errors or inefficiencies. Computing the optimal GPU thread mapping is also non-trivial, often requiring iterative refinement to arrive at a correct and efficient kernel.

Inference-time scaling for generating optimized GPU kernels

To get the best results with optimized attention kernels, NVIDIA engineers created a new workflow that pairs the DeepSeek-R1 model with a special verifier during inference, running in a closed loop for a predetermined duration.

Figure 2. Inference-time scaling with DeepSeek-R1 on the NVIDIA Hopper platform: an initial prompt produces a first kernel from DeepSeek-R1, a verifier running on Hopper GPUs checks it, and the prompt is refined and resubmitted until the criteria are met, yielding GPU-optimized kernels

The workflow is first initialized by a manual prompt, and the DeepSeek-R1 model generates the GPU code (that is, the kernel) in the first pass. The verifier runs on an NVIDIA H100 GPU. It analyzes the generated kernel and creates new prompts that are provided as input to the DeepSeek-R1 model.

This closed-loop approach improves the code generation process by steering it with fresh feedback on each iteration. The team found that letting this process continue for 15 minutes resulted in an improved attention kernel.
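
The sketch below outlines the shape of such a loop. The generate_kernel and verify_kernel callables are hypothetical placeholders for a DeepSeek-R1 call and a GPU-side correctness and performance check; this is a schematic, not NVIDIA's actual implementation.

import time
from typing import Callable, Optional, Tuple

def closed_loop_generation(
    initial_prompt: str,
    generate_kernel: Callable[[str], str],             # e.g., a call to DeepSeek-R1
    verify_kernel: Callable[[str], Tuple[bool, str]],  # runs the kernel on the GPU, returns (passed, feedback)
    budget_seconds: float = 15 * 60,
) -> Optional[str]:
    """Generate, verify, and refine until the inference-time budget is spent."""
    prompt, best_kernel = initial_prompt, None
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        kernel = generate_kernel(prompt)
        passed, feedback = verify_kernel(kernel)
        if passed:
            best_kernel = kernel  # keep the latest verified kernel
        # Fold the verifier's feedback into the next prompt.
        prompt = f"{initial_prompt}\n\nFeedback on the previous attempt:\n{feedback}"
    return best_kernel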

Figure 3. Performance of automatically generated optimized attention kernels relative to the PyTorch API (flex attention) baseline on Hopper GPUs: 1.1x speedup for causal mask and document mask, 1.5x for relative positional, 1.6x for alibi bias and full mask, and 2.1x for softcap

This workflow produced numerically correct kernels for 100% of Level-1 problems and 96% of Level-2 problems, as tested by Stanford’s KernelBench benchmark.

The Level-1 solving rate in KernelBench refers to the numerical-correctness metric used to evaluate an LLM’s ability to generate efficient GPU kernels for specific computational tasks. This test is part of a series of challenges designed to test the latest LLMs’ abilities in GPU programming.

Figure 4 shows how the inference-time budget affects the agent’s solving rate. Allocating more than 10 minutes per problem in the Level-1 category enables the workflow to produce numerically correct code for most of the 100 problems.
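
As an illustration of what numerically correct means in practice, a verifier can compare a candidate kernel's output against a trusted PyTorch reference on random inputs, along the lines of the sketch below; the tolerances and trial count are illustrative, not KernelBench's exact settings.

import torch

def is_numerically_correct(candidate_fn, reference_fn, make_inputs,
                           trials=10, rtol=1e-2, atol=1e-2):
    """Return True if the candidate matches the reference on several random inputs."""
    for _ in range(trials):
        inputs = make_inputs()
        if not torch.allclose(candidate_fn(*inputs), reference_fn(*inputs),
                              rtol=rtol, atol=atol):
            return False
    return True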

Figure 4. Inference-time scaling results in optimized GPU kernels: the share of numerically correct kernels approaches 95% at roughly 10 minutes and reaches 100% at 20 minutes

Optimized GPU kernels on DeepSeek-R1

These results show how you can use the latest DeepSeek-R1 model to generate better GPU kernels by applying more computing power during inference. This is still a new research area, with early results on a promising approach that automatically generates effective attention kernels.

While we are off to a good start, more work is needed to generate better results consistently for a wider variety of problems. We’re excited about the recent developments in DeepSeek-R1 and its potential. 

For more information or to get started, see the DeepSeek-R1 NIM microservice, now available on build.nvidia.com.
