NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark

Jun 12, 2026

By Eduardo Alvarez, Shobhit Verma and Amit Kushwaha

AI-Generated Summary

Artificial Analysis AA-AgentPerf provides the first open, multi-vendor benchmark for measuring concurrent AI agent support under real-world coding trajectories, with hardware results normalized per accelerator and per megawatt.
The benchmark uniquely captures agentic workload complexity, including non-deterministic sequences, tool call latencies, and variable sequence lengths, using private, representative test sets to avoid benchmark-specific optimization.
At launch, NVIDIA GB300 NVL72 demonstrated up to 20x higher concurrent agent throughput per megawatt than NVIDIA H200, leveraging optimizations such as WideEP/DeepEP, DeepGEMM, fused MoE, and NVLink scale-up, while the upcoming NVIDIA Vera Rubin platform is projected to further increase performance with 50 PFLOPs of NVFP4 compute and enhanced LLM tool call acceleration.

AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how inference systems perform under these conditions. Artificial Analysis AgentPerf (AA-AgentPerf) offers the industry’s first multi-vendor open benchmarks profiling trajectories that are representative of real-world AI agent coding tasks.

This post explains how AA-AgentPerf sets a new standard for measuring agentic workload performance, and how NVIDIA extreme co-design helps deliver up to 20x better agentic coding performance than previous generations.

What is AA-AgentPerf?

AA-AgentPerf is a hardware benchmark created by Artificial Analysis that measures the number of concurrent AI agents an inference system can support while meeting predefined, model-specific performance service level objective (SLO) tiers. An SLO is defined as a specific threshold of output token speed and time-to-first-token (TTFT). The benchmark results are normalized per accelerator and per megawatt to enable comparison across hardware configurations.

Measuring representative agentic coding performance

Agentic workloads are unique because LLM-driven decisions often produce non-deterministic sequences of requests and tool calls. The most difficult part of measuring agent performance is to accurately capture this non-determinism in a representative agent trajectory—the complete sequence of actions, decisions, and observations made by an agent as it traverses through a task from beginning to end (Figure 2).

AA-AgentPerf captures this by measuring GPU performance across prerecorded agentic coding trajectories with interleaved reasoning and tool use, while simulating inter-turn latency with a representative baseline for CPU tool-call performance. These trajectories are built around solving issues in public code repositories across several use cases, 12+ programming languages, and responses from frontier models. In addition to rigorously defining the trajectories, the Artificial Analysis team also:

Leveraged representative cached, input, and output sequence lengths for requests, ranging from 5K to 131K with a mean of approximately 27K.
Mapped tool calls to representative CPU-side tasks in agentic coding workflows and simulated tool calls across a distribution with a one-second median delay time. The same CPU tool-call baseline was then applied across all systems tested.
Kept the test set private to prevent benchmark-targeted optimization.
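
The simulated tool-call delay described above can be illustrated with a minimal sketch. The post states only that delays follow a distribution with a one-second median; the log-normal shape, the `sigma` spread, and the function name `sample_tool_delays` are assumptions made here for illustration, not details of the AA-AgentPerf harness.

```python
import math
import random
import statistics


def sample_tool_delays(n, median_s=1.0, sigma=0.5, seed=0):
    """Draw n simulated tool-call delays in seconds.

    Assumption: delays are log-normal. A log-normal's median is exp(mu),
    so mu = ln(median_s) pins the median at the requested value.
    """
    rng = random.Random(seed)  # fixed seed so every system sees the same delays
    mu = math.log(median_s)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


delays = sample_tool_delays(10_001)
print(f"median simulated delay: {statistics.median(delays):.2f} s")
```

Seeding the generator mirrors the idea of applying the same CPU tool-call baseline to every system tested, so differences in results come from the inference system rather than from the simulated tools.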

AA-AgentPerf testing and measurement methodology

The AA-AgentPerf harness measures the number of concurrent agents an inference system can support while meeting SLO requirements (Figure 3). At launch, this benchmark focuses on testing DeepSeek-V4-Pro across multiple SLO tiers derived from Artificial Analysis serverless API benchmarking data. This ensures that the benchmarks reflect quality-of-service levels observed in production providers today.

During a benchmarking run, AA-AgentPerf sends GPUs thousands of concurrent requests drawn from its prerecorded agent trajectory dataset. To ensure independent results for each run, dynamic prefixes are added at the start of every trajectory phase. Strict SLO thresholds are enforced throughout the trajectory, and the highest concurrency level that satisfies those requirements is recorded as the official benchmark result for a given SLO (Figure 3). This process is then repeated across multiple SLO tiers to capture different user experience targets (Table 1).
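
The selection of the reported result can be sketched as a sweep over concurrency levels. This is a hedged illustration only: the post does not describe how the harness chooses which levels to test, and `highest_passing_concurrency`, the candidate levels, and the `toy_meets_slo` stand-in are all hypothetical.

```python
from typing import Callable, Iterable, Optional


def highest_passing_concurrency(
    levels: Iterable[int],
    meets_slo: Callable[[int], bool],
) -> Optional[int]:
    """Return the highest tested concurrency level that satisfies the SLO.

    Returns None when no tested level passes.
    """
    best = None
    for level in sorted(levels):
        if meets_slo(level):
            best = level
    return best


def toy_meets_slo(concurrency: int) -> bool:
    """Hypothetical stand-in for a real benchmark run at this concurrency."""
    return concurrency <= 48


result = highest_passing_concurrency([8, 16, 32, 64, 128], toy_meets_slo)
print(result)  # 32
```

In a real run, `meets_slo` would replay the trajectory dataset at the given concurrency and evaluate the SLO thresholds. Repeating the sweep with a different SLO tier yields one result per tier.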

Model	SLO tier	P25 output speed (tokens/second)	P95 TTFT (seconds)
DeepSeek-V4-Pro	SLO #1	30	10
DeepSeek-V4-Pro	SLO #2	100	5
DeepSeek-V4-Pro	SLO #3	300	3

Table 1. SLO tiers and TTFT requirements for AA-AgentPerf DeepSeek-V4-Pro tests
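
As a rough sketch of how a run might be checked against the tiers in Table 1, the following assumes that the P25 output speed must be at least the tier's speed threshold and that the P95 TTFT must be at most the tier's TTFT threshold. The nearest-rank percentile method, the function names, and the sample measurements are illustrative assumptions, not the harness's actual implementation.

```python
import math

# (minimum P25 output speed in tokens/second, maximum P95 TTFT in seconds), from Table 1
SLO_TIERS = {
    "SLO #1": (30, 10),
    "SLO #2": (100, 5),
    "SLO #3": (300, 3),
}


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(output_speeds, ttfts, tier):
    """Assumed pass rule: P25 speed >= threshold and P95 TTFT <= threshold."""
    min_speed, max_ttft = SLO_TIERS[tier]
    return percentile(output_speeds, 25) >= min_speed and percentile(ttfts, 95) <= max_ttft


# Hypothetical per-request measurements, for illustration only.
speeds = [120, 110, 95, 140, 105, 130, 101, 150]  # tokens/second
ttfts = [1.2, 0.8, 2.5, 4.0, 1.9, 3.1, 0.9, 1.5]  # seconds

for tier in SLO_TIERS:
    print(tier, meets_slo(speeds, ttfts, tier))
```

With these sample values the P25 output speed is 101 tokens/second and the P95 TTFT is 4.0 seconds, so the run passes SLO #1 and SLO #2 but not SLO #3.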

How to interpret AA-AgentPerf results

The core AA-AgentPerf metric is concurrent agents per megawatt, a practical normalization for representing data center scale performance. Table 2 outlines how to use the reported performance to estimate how many agentic sessions could be supported for a given power budget.

Benchmark	Value of metric	NVIDIA GB300 NVL72	NVIDIA H200
Concurrent agents per MW	Energy efficiency: How many active agents a system can support for a given power budget	61.4K	2.6K
Concurrent agents per GPU	Hardware efficiency: How much serving capacity is achieved per GPU	57.5	1.4

Table 2. How to use the metrics reported by AA-AgentPerf for capacity planning in data centers that support agentic applications at scale. Numbers reflect AA-AgentPerf results for the 30 tokens/second SLO tier (SLO #1 in Table 1).
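
The capacity estimate that Table 2 supports can be sketched as a simple multiplication. This sketch assumes that the "K" suffix denotes thousands and that agent capacity scales linearly with the power budget, which a real deployment may not achieve. The 5 MW budget and the function name `estimated_concurrent_agents` are hypothetical.

```python
# Concurrent agents per MW, taken from Table 2.
AGENTS_PER_MW = {
    "NVIDIA GB300 NVL72": 61_400,
    "NVIDIA H200": 2_600,
}


def estimated_concurrent_agents(system, power_budget_mw):
    """Estimate supported agents, assuming linear scaling with power budget."""
    return int(AGENTS_PER_MW[system] * power_budget_mw)


budget_mw = 5  # hypothetical data center power budget
for system in AGENTS_PER_MW:
    print(system, estimated_concurrent_agents(system, budget_mw))
```

For the hypothetical 5 MW budget, this yields 307,000 concurrent agents on GB300 NVL72 and 13,000 on H200 at the reported SLO tier.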

On launch day, NVIDIA GB300 NVL72 delivers up to 20x more concurrent agents per megawatt than the previous generation, NVIDIA H200 (Figure 4).

This performance highlights how GB300 NVL72 delivers across large-scale agentic coding workloads, from routing long-lived sessions efficiently to keeping mixture-of-experts (MoE) models and GPUs fully utilized across many concurrent agent sessions.

SGLang, TensorRT LLM, or vLLM: Agent runtimes apply optimizations such as WideEP and DeepEP to spread MoE expert execution across the full NVL72 domain, maximizing effective batch sizes and scaling effectively to thousands of agents.
DeepGEMM and Mega MoE optimizations: MXFP4/MXFP8 kernels and fused MoE overlap NVLink communication with tensor core compute to boost token throughput for reasoning and code generation.
NVIDIA NVLink scale-up domain: GB300 NVL72 links 72 GPUs into a single high-bandwidth NVLink fabric, so every GPU can rapidly share parameters, KV cache, and intermediate results—critical for fast, coordinated execution of agentic coding systems.

Looking forward: NVIDIA Vera Rubin platform

AA-AgentPerf establishes the standard for evaluating agentic inference, and the results highlight how tightly integrated hardware and software can unlock step-function gains in concurrency and efficiency. NVIDIA GB300 NVL72 demonstrates up to 20x more concurrent agents per megawatt than NVIDIA H200.

The NVIDIA Vera Rubin platform is expected to extend these gains with 50 PFLOPs of NVFP4 compute and by using the Vera CPU to accelerate LLM tool calls, improving end-to-end performance, economics, and efficiency for agentic workflows.

To learn more about why agentic workloads place unique demands on inference infrastructure and how the NVIDIA Vera Rubin platform optimizes performance, see Building for the Rising Complexity of Agentic Systems with Extreme Co-Design.

Acknowledgments

This work was made possible through the expertise and engineering contributions of Jatin Gangani, Iman Tabrizian, Xiaoming Chen, Peiheng Hu, Taizhong Wu, Shichen Li, Manu Maheswari, and many other talented NVIDIA engineers.

About the Authors

About Eduardo Alvarez
Eduardo Alvarez is a senior technical lead at NVIDIA, where he focuses on AI inference at scale, performance optimization, workload economic analysis, and application enablement. He has a deep background in AI systems engineering, workload optimization, and accelerated computing—focused on translating innovations into real-world applications. Before NVIDIA, Eduardo held engineering roles at various semiconductor and energy tech companies.

About Shobhit Verma
Shobhit Verma is a software engineer on the TensorRT team at NVIDIA, where he focuses on MLPerf Inference. He has experience in the design and verification of ML accelerators and in developing high-performance computing applications and distributed systems. Shobhit holds an M.Sc. in computer science from the University of Chicago and a B.Sc. in computer engineering from Delhi Technological University.

About Amit Kushwaha
Amit Kushwaha is a deep learning solutions architect at NVIDIA, specializing in inference optimization and production deployment of LLM and agentic AI systems. Previously, he served as Director of AI Engineering at SambaNova Systems, architecting Generative AI solutions on novel accelerator hardware, and as Principal Data Scientist at ExxonMobil, applying machine learning to complex physical and industrial systems. He holds a Ph.D. in Engineering from Stanford University, where his research focused on scientific computing and large-scale simulations.