Can AI coding assistants write efficient CUDA code? To help measure and improve their capabilities, we created ComputeEval, a robust, open source benchmark for evaluating AI models and agents on CUDA programming tasks.
A few months ago, we announced the first release of ComputeEval. Today, we’re introducing its first major expansion: more than 100 new CUDA challenges.
With this release, the dataset has grown to 232 CUDA and CUDA Core Compute Libraries (CCCL) problems. We deliberately raised the bar by adding more difficult challenges that require LLMs to use modern CUDA features, such as Tensor Cores, advanced shared memory patterns, and warp-level primitives. The new problems also test the ability to correctly orchestrate CUDA Graphs, Streams, and Events, all within the context of real-world applications like dynamic simulations.
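To give a flavor of what these challenges demand, here’s a minimal sketch of our own (an illustration, not an actual benchmark problem): a block-wide reduction built from warp-level shuffles with a shared memory staging step, launched through a stream-captured CUDA Graph.

```cpp
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Block-wide sum reduction built from warp-level primitives: each warp
// reduces its 32 values with __shfl_down_sync, per-warp partials are
// staged in shared memory, and warp 0 finishes the block.
__global__ void reduceSum(const float* in, float* out, int n) {
    __shared__ float warpSums[32];               // up to 32 warps per block
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    float v = (tid < n) ? in[tid] : 0.0f;
    for (int offset = 16; offset > 0; offset >>= 1)   // warp shuffle reduction
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (lane == 0) warpSums[warp] = v;
    __syncthreads();

    if (warp == 0) {                                  // warp 0 reduces the partials
        v = (lane < blockDim.x / 32) ? warpSums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(out, v);             // one atomic per block
    }
}

int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemcpy(in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the zero-init and kernel launch into a CUDA Graph via stream
    // capture, instantiate it once, then replay it with a single launch call.
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cudaMemsetAsync(out, 0, sizeof(float), stream);
    reduceSum<<<(n + 255) / 256, 256, 0, stream>>>(in, out, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, 0);            // CUDA 12.x signature
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    float result = 0.0f;
    cudaMemcpy(&result, out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("sum = %.0f (expected %d)\n", result, n);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Getting each piece right in isolation is straightforward; the benchmark problems reward getting the interactions right, such as which work is captured in the graph and how streams and events order it.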
LLM performance on CUDA programming
Our team evaluated several leading LLMs on ComputeEval to establish baseline performance metrics and understand the current state of AI-assisted CUDA programming (Table 1).
| Model | ComputeEval 2025.2 pass@1 (232 problems) | ComputeEval 2025.1 pass@1 (128 problems) |
| --- | --- | --- |
| GPT-5 (medium) | 0.5819 | 0.61 |
| Claude Sonnet 4.0 | 0.5517 | 0.64 |
| gpt-oss-20b (high) | 0.5474 | N/A |
| gpt-oss-120b (high) | 0.5302 | N/A |
| Claude Opus 4.0 | 0.5216 | N/A |
| DeepSeek-R1 | 0.4397 | 0.55 |
| gpt-oss-120b (medium) | 0.4224 | N/A |
| gpt-oss-20b (medium) | 0.4224 | N/A |
| gpt-oss-120b (low) | 0.4052 | N/A |
| DeepSeek-V3.1 | 0.3750 | 0.44 |
| Llama 4 Maverick 17B 128E | 0.3448 | 0.47 |
| Llama 3.1 405B | 0.3405 | 0.4 |
| gpt-oss-20b (low) | 0.3319 | 0.41 |

Table 1. Pass@1 scores of leading LLMs on ComputeEval 2025.2 (232 problems) versus ComputeEval 2025.1 (128 problems)
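For context, pass@1 here is the standard functional-correctness metric popularized by HumanEval (our summary of the convention): sample $n$ candidate solutions per problem, count the $c$ that pass all tests, and compute

$$\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

For $k = 1$ this reduces to the average fraction $c/n$ of generated solutions that compile and pass every test, so GPT-5’s score of 0.5819 means roughly 58% of its CUDA solutions were functionally correct.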
We observed that scores for all models declined with the move to ComputeEval 2025.2. This doesn’t indicate that the models are becoming less capable—rather, it reflects that the benchmark itself has become more challenging. With each release, we’re raising the bar for AI, pushing it to demonstrate a deeper understanding of the nuances of accelerated computing.
What’s next and how to get involved
We’ll continue expanding both the dataset and the capabilities of the evaluation framework. Work is already underway to extend ComputeEval’s coverage to additional CUDA-X libraries, including cuBLAS, CUTLASS, cuDNN, RAPIDS, and more. We invite the broader HPC and AI communities to contribute and collaborate. Explore the code on GitHub and access the dataset on Hugging Face.