
Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling


As AI models extend their capabilities to solve more sophisticated challenges, a new scaling law known as test-time scaling or inference-time scaling is emerging. Also known as AI reasoning or long thinking, this technique improves model performance by allocating additional computational resources during inference to evaluate multiple possible outcomes and then select the best one. This enables AI to strategize and systematically solve complex problems in much the same way humans dissect a complex problem and solve its parts individually to arrive at a final solution.
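
A minimal sketch of the simplest form of this idea, best-of-N sampling, is shown below: generate several candidate outputs and keep the one a scorer prefers. The generate_candidate and score_candidate callables are hypothetical placeholders, not part of any NVIDIA or DeepSeek API.

import random
from typing import Callable

def best_of_n(generate_candidate: Callable[[], str],
              score_candidate: Callable[[str], float],
              n: int = 8) -> str:
    """Sample n candidates and return the one the scorer rates highest."""
    candidates = [generate_candidate() for _ in range(n)]
    return max(candidates, key=score_candidate)

# Toy usage: candidates are random numbers and the scorer prefers larger ones.
best = best_of_n(lambda: str(random.random()), float, n=16)
print("best candidate:", best)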

In this post, we talk about an experiment done by NVIDIA engineers who used one of the newest open-source models, the DeepSeek-R1 model, together with additional computing power during inference to solve a complex problem. The experiment was to automatically generate GPU attention kernels that were numerically correct and optimized for different flavors of attention without any explicit programming. 

The results turned out to be better than the optimized kernels developed by skilled engineers in some cases. 

The need for optimized attention kernels and associated challenges

Attention is a key concept that revolutionized the development of large language models (LLMs). It’s a powerful mechanism that enables AI models to focus selectively on the most relevant parts of the input when performing tasks. By focusing on important information, the attention operation helps models make better predictions and find hidden patterns in the data.

The computational complexity of the attention operation grows quadratically in relation to the input sequence length. This motivates the need for developing an optimized lower-level implementation (that is, a GPU kernel) to prevent runtime errors arising from simple implementations (for example, out-of-memory errors) and for computational efficiency purposes. 
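
To see where the quadratic cost comes from, the following is a minimal, unoptimized PyTorch reference for scaled dot-product attention (an illustrative sketch, not output from the workflow). The intermediate score matrix holds one entry per query-key pair, so its size, and the memory and compute it requires, grow with the square of the sequence length.

import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    # q, k, v: [batch, heads, seq_len, head_dim]
    scale = q.shape[-1] ** -0.5
    # The score matrix is [batch, heads, seq_len, seq_len]: quadratic in sequence length.
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    return torch.matmul(F.softmax(scores, dim=-1), v)

q = k = v = torch.randn(1, 8, 1024, 64)
out = naive_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 1024, 64])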

There are multiple variants of attention (causal, relative positional embeddings, alibi, and so on), and engineers often must use a combination of these variants for a given task.
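
As a rough illustration (not code from the experiment), such variants are often expressed as small score or mask modifier functions, in the same style as the relative positional example shown later in this post; the ALiBi slope below is illustrative.

def causal_mask(b, h, q_idx, kv_idx):
    # Each query position may attend only to itself and earlier key positions.
    return q_idx >= kv_idx

def alibi_bias(score, b, h, q_idx, kv_idx):
    # ALiBi: add a head-dependent linear penalty based on query/key distance.
    slope = 2.0 ** (-(h + 1))
    return score + slope * (kv_idx - q_idx)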

Multi-modal models (for example, vision transformers) introduce an additional layer of challenges, as they require specialized attention mechanisms (such as spatial neighborhood attention) to preserve the spatio-temporal information encountered in computer vision, video generation models, and so on.

Figure 1. Neighborhood attention on 2D inputs: increasing the dilation value from 1 to 4 expands the attended region from a single skier to nearly the entire image

Creating an optimized GPU kernel for attention takes a lot of skill and time, even for experienced software engineers.

Recent LLMs like DeepSeek-R1 have shown a lot of promise in code generation tasks, but they still face challenges creating optimized code on the first try. This makes it necessary to use other strategies at inference time to generate optimized code. 

The following prompt is sample user input for a relative positional embeddings attention kernel.

Please write a GPU attention kernel to support relative position encodings. Implement the relative positional encoding on the fly within the kernel. The complete code should be returned, including the necessary modifications.

Use the following function to compute the relative positional encoding:

def relative_positional(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

When implementing the kernel, keep in mind that a constant scaling factor 1.44269504 should be applied to the relative positional encoding due to qk_scale = sm_scale * 1.44269504. The PyTorch reference does not need to scale the relative positional encoding, but in the GPU kernel, use:

qk = qk * qk_scale + rel_pos * 1.44269504

Please provide the complete updated kernel code that incorporates these changes, ensuring that the relative positional encoding is applied efficiently within the kernel operations.
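
For context, a PyTorch reference for this variant might look like the following sketch, using flex attention (available in PyTorch 2.5 and later) with the score_mod function from the prompt. The constant 1.44269504 is log2(e); kernels that evaluate softmax with the faster exp2 instruction typically fold it into the scale, which is why the prompt asks for it inside the GPU kernel but not in the reference. This is one possible reference, not necessarily the one used in the experiment.

import torch
from torch.nn.attention.flex_attention import flex_attention

def relative_positional(score, b, h, q_idx, kv_idx):
    return score + (q_idx - kv_idx)

# Random inputs shaped [batch, heads, seq_len, head_dim]; sizes are arbitrary.
# Assumes a CUDA-capable GPU is available.
q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# The reference applies the raw score_mod; any exp2-related scaling lives in the kernel.
reference_out = flex_attention(q, k, v, score_mod=relative_positional)
print(reference_out.shape)  # torch.Size([1, 8, 2048, 64])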

LLMs can occasionally produce hallucinated code or mix syntax from different languages or frameworks, causing immediate code errors or inefficiencies. Computing the optimal GPU thread mapping is also non-trivial, often requiring iterative refinement to arrive at a correct and efficient kernel.

Inference-time scaling for generating optimized GPU kernels

To get the best results with optimized attention kernels, NVIDIA engineers created a new workflow that pairs the DeepSeek-R1 model with a special verifier during inference, running in a closed loop for a predetermined duration.

Figure 2. Inference-time scaling with DeepSeek-R1 on the NVIDIA Hopper platform: an initial prompt produces a first kernel from DeepSeek-R1, a verifier running on Hopper GPUs checks it, and the prompt is refined and resubmitted until the criteria are met, yielding GPU-optimized kernels

The workflow is first initialized by a manual prompt, and the DeepSeek-R1 model generates the GPU code (that is, the kernel) in the first pass. The verifier runs on an NVIDIA H100 GPU. It analyzes the generated kernel and creates new prompts that are provided as input to the DeepSeek-R1 model.

This closed-loop approach improves the code generation process by steering it with fresh feedback on each iteration. The team found that letting this process continue for 15 minutes resulted in an improved attention kernel.
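
The sketch below outlines the shape of such a loop. The generate_kernel and verify_kernel callables are hypothetical placeholders for a DeepSeek-R1 call and a GPU-side correctness and performance check; this is a schematic, not NVIDIA's actual implementation.

import time
from typing import Callable, Optional, Tuple

def closed_loop_generation(
    initial_prompt: str,
    generate_kernel: Callable[[str], str],             # e.g., a call to DeepSeek-R1
    verify_kernel: Callable[[str], Tuple[bool, str]],  # runs the kernel on the GPU, returns (passed, feedback)
    budget_seconds: float = 15 * 60,
) -> Optional[str]:
    """Generate, verify, and refine until the inference-time budget is spent."""
    prompt, best_kernel = initial_prompt, None
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        kernel = generate_kernel(prompt)
        passed, feedback = verify_kernel(kernel)
        if passed:
            best_kernel = kernel  # keep the latest verified kernel
        # Fold the verifier's feedback into the next prompt.
        prompt = f"{initial_prompt}\n\nFeedback on the previous attempt:\n{feedback}"
    return best_kernel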

Figure 3. Performance of automatically generated optimized attention kernels relative to the PyTorch API (flex attention) baseline on Hopper GPUs: 1.1x speedup for causal mask and document mask, 1.5x for relative positional, 1.6x for alibi bias and full mask, and 2.1x for softcap

This workflow produced numerically correct kernels for 100% of Level-1 problems and 96% of Level-2 problems, as tested by Stanford’s KernelBench benchmark.

The Level-1 solving rate in KernelBench refers to the numerical-correctness metric used to evaluate an LLM’s ability to generate efficient GPU kernels for specific computational tasks. This test is part of a series of challenges designed to test the latest LLMs’ abilities in GPU programming.

Figure 4 shows how the inference-time budget affects the agent’s solving rate. Allocating more than 10 minutes per problem in the Level-1 category enables the workflow to produce numerically correct code for most of the 100 problems.
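
As an illustration of what numerically correct means in practice, a verifier can compare a candidate kernel's output against a trusted PyTorch reference on random inputs, along the lines of the sketch below; the tolerances and trial count are illustrative, not KernelBench's exact settings.

import torch

def is_numerically_correct(candidate_fn, reference_fn, make_inputs,
                           trials=10, rtol=1e-2, atol=1e-2):
    """Return True if the candidate matches the reference on several random inputs."""
    for _ in range(trials):
        inputs = make_inputs()
        if not torch.allclose(candidate_fn(*inputs), reference_fn(*inputs),
                              rtol=rtol, atol=atol):
            return False
    return True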

Figure 4. Inference-time scaling results in optimized GPU kernels: the share of numerically correct kernels approaches 95% at roughly 10 minutes and reaches 100% at 20 minutes

Optimized GPU kernels on DeepSeek-R1

These results show how you can use the latest DeepSeek-R1 model to generate better GPU kernels by applying more computing power during inference. This is still a new research area, with early results on a promising approach that automatically generates effective attention kernels.

While we are off to a good start, more work is needed to generate better results consistently for a wider variety of problems. We’re excited about the recent developments in DeepSeek-R1 and its potential. 

For more information or to get started, see the DeepSeek-R1 NIM microservice, now available on build.nvidia.com.
