
Benchmarking LLMs on AI-Generated CUDA Code with ComputeEval 2025.2

Over 100 new CUDA challenges added to test LLMs on modern CUDA features

Can AI coding assistants write efficient CUDA code? To help measure and improve their capabilities, we created ComputeEval, a robust, open source benchmark for evaluating AI models and agents on CUDA programming tasks. 

A few months ago, we announced the first release of ComputeEval. Today, we’re introducing its first major expansion, adding more than 100 new CUDA challenges.

With this release, the dataset has grown to a total of 232 CUDA and CUDA Core Compute Libraries (CCCL) problems. We deliberately raised the bar by adding more difficult challenges that require LLMs to use modern CUDA features, such as Tensor Cores, advanced shared memory patterns, and warp-level primitives. The new problems also test the ability to correctly orchestrate features like CUDA Graphs, Streams, and Events, all within the context of real-world applications such as dynamic simulations.
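To make the flavor of these problems concrete, here is a small illustrative sketch, not a problem taken from the dataset, of the kind of pattern the new challenges exercise: a warp-shuffle reduction whose launch is captured into a CUDA Graph through stream capture. It assumes a CUDA 12 or later toolkit (for the three-argument cudaGraphInstantiate signature) and is meant only as an example of the feature mix involved.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Warp-level reduction with shuffle intrinsics: each warp sums its 32 values
// without shared memory, and lane 0 accumulates into the global result.
__global__ void warpReduceSum(const float* in, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (idx < n) ? in[idx] : 0.0f;

    // Butterfly reduction across the warp.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, val);
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the launch into a CUDA Graph so it can be replayed cheaply.
    cudaGraph_t graph;
    cudaGraphExec_t graphExec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    warpReduceSum<<<(n + 255) / 256, 256, 0, stream>>>(in, out, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12+ signature

    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);
    printf("sum = %.0f (expected %d)\n", *out, n);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Challenges in this style check more than kernel correctness: they also probe whether a model sequences stream capture, graph instantiation, and synchronization in the right order.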

LLM performance on CUDA programming

Our team evaluated several leading LLMs on ComputeEval to establish baseline performance metrics and understand the current state of AI-assisted CUDA programming (Table 1).

| Model | ComputeEval 2025.2 (232 problems) pass@1 | ComputeEval 2025.1 (128 problems) pass@1 |
|---|---|---|
| GPT-5 (medium) | 0.5819 | 0.61 |
| Claude Sonnet 4.0 | 0.5517 | 0.64 |
| gpt-oss-20B (high) | 0.5474 | N/A |
| gpt-oss-120B (high) | 0.5302 | N/A |
| Claude Opus 4.0 | 0.5216 | N/A |
| DeepSeek-R1 | 0.4397 | 0.55 |
| gpt-oss-120B (medium) | 0.4224 | N/A |
| gpt-oss-20B (medium) | 0.4224 | N/A |
| gpt-oss-120B (low) | 0.4052 | N/A |
| DeepSeek-V3.1 | 0.3750 | 0.44 |
| Llama 4 Maverick 17B 128E | 0.3448 | 0.47 |
| Llama 3.1 405B | 0.3405 | 0.40 |
| gpt-oss-20B (low) | 0.3319 | 0.41 |
Table 1. Pass@1 accuracy of state-of-the-art LLMs on ComputeEval 2025.1 and 2025.2. The latest version expands the benchmark to 232 CUDA programming challenges, providing a tougher test for AI-assisted coding.
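For context, pass@1 is the fraction of problems for which a single generated solution passes the benchmark’s functional tests. More generally, when $n$ samples are drawn per problem and $c$ of them pass, the standard unbiased pass@k estimator used by HumanEval-style code benchmarks is:

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],$$

which for $k = 1$ reduces to the average of $c/n$ across problems.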

We observed that scores for all models declined with the move to ComputeEval 2025.2. This doesn’t indicate that the models are becoming less capable—rather, it reflects that the benchmark itself has become more challenging. With each release, we’re raising the bar for AI, pushing it to demonstrate a deeper understanding of the nuances of accelerated computing.

What’s next and how to get involved

We’ll continue expanding both the dataset and the capabilities of the evaluation framework. Work is already underway to extend ComputeEval’s coverage to additional CUDA-X libraries, including cuBLAS, CUTLASS, cuDNN, RAPIDS, and more. We invite the broader HPC and AI communities to contribute and collaborate. Explore the code on GitHub and access the dataset on Hugging Face.
