AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how inference systems perform under these conditions. Artificial Analysis AgentPerf (AA-AgentPerf) offers the industry’s first multi-vendor open benchmarks profiling trajectories that are representative of real-world AI agent coding tasks.
This post explains how AA-AgentPerf sets a new standard for measuring agentic workload performance, and how NVIDIA extreme co-design helps deliver up to 20x better agentic coding performance than previous generations.
What is AA-AgentPerf?
AA-AgentPerf is a hardware benchmark created by Artificial Analysis that measures the number of concurrent AI agents an inference system can support while meeting predefined, model-specific performance service level objective (SLO) tiers. An SLO is defined as a specific threshold of output token speed and time-to-first-token (TTFT). The benchmark results are normalized per accelerator and per megawatt to enable comparison across hardware configurations.

Measuring representative agentic coding performance
Agentic workloads are unique because LLM-driven decisions often produce non-deterministic sequences of requests and tool calls. The most difficult part of measuring agent performance is accurately capturing this non-determinism in a representative agent trajectory: the complete sequence of actions, decisions, and observations an agent makes as it works through a task from beginning to end (Figure 2).

AA-AgentPerf captures this by measuring GPU performance across prerecorded agentic coding trajectories with interleaved reasoning and tool use, while simulating interturn latency with a representative baseline for CPU tool-call performance. These trajectories are built around solving issues in public code repositories across several use cases, 12+ programming languages, and responses from frontier models. In addition to rigorously defining the trajectories, the Artificial Analysis team also:
- Leveraged representative cached, input, and output sequence lengths for requests, ranging from 5K to 131K with a mean of approximately 27K.
- Mapped tool calls to representative CPU-side tasks in agentic coding workflows and simulated tool calls across a distribution with a one-second median delay time. The same CPU tool-call baseline was then applied across all systems tested.
- Kept the test set private to prevent benchmark-targeted optimization.
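To make the tool-call simulation above concrete, here is a minimal sketch of drawing interturn delays from a distribution with a one-second median. The text only states the median; the log-normal shape, the `sigma` value, and the function name `sample_tool_delays` are assumptions chosen for illustration, not details of the AA-AgentPerf harness.

```python
import math
import random
import statistics


def sample_tool_delays(n, median_s=1.0, sigma=0.5, seed=0):
    """Draw n simulated tool-call delays in seconds.

    ASSUMPTION: a log-normal distribution with an arbitrary sigma. Only the
    one-second median comes from the text. For a log-normal distribution the
    median equals exp(mu), so mu = log(median_s).
    """
    rng = random.Random(seed)
    mu = math.log(median_s)
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]


# Reusing the same seed gives every system under test the same delay sequence,
# mirroring the idea of one shared CPU tool-call baseline.
delays = sample_tool_delays(10_000)
print(f"median simulated delay: {statistics.median(delays):.2f} s")
```

A long-tailed distribution is a plausible fit here because most tool calls return quickly while a few, such as builds or test runs, take much longer.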
AA-AgentPerf testing and measurement methodology
The AA-AgentPerf harness measures the number of concurrent agents an inference system can support while meeting SLO requirements (Figure 3). At launch, this benchmark focuses on testing DeepSeek-V4-Pro across multiple SLO tiers derived from Artificial Analysis serverless API benchmarking data. This ensures that the benchmarks reflect quality-of-service levels observed in production providers today.

During a benchmarking run, AA-AgentPerf sends GPUs thousands of concurrent requests drawn from its prerecorded agent trajectory dataset. To ensure independent results for each run, dynamic prefixes are added at the start of every trajectory phase. Strict SLO thresholds are enforced throughout the trajectory, and the highest concurrency level that satisfies those requirements is recorded as the official benchmark result for a given SLO (Figure 3). This process is then repeated across multiple SLO tiers to capture different user experience targets (Table 1).
| Model | SLO tier | P25 output speed (tokens/second) | P95 TTFT (seconds) |
| --- | --- | --- | --- |
| DeepSeek-V4-Pro | SLO #1 | 30 | 10 |
| DeepSeek-V4-Pro | SLO #2 | 100 | 5 |
| DeepSeek-V4-Pro | SLO #3 | 300 | 3 |
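The pass/fail logic described above can be sketched in a few lines. This is an illustrative reading, not the AA-AgentPerf harness: it assumes the P25 output speed must be at least the tier threshold and the P95 TTFT at most the tier threshold, and it uses a nearest-rank percentile. The names `SloTier`, `meets_slo`, `highest_passing_concurrency`, and `toy_measure` are hypothetical, and `toy_measure` returns made-up numbers purely so the example runs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SloTier:
    name: str
    min_p25_output_speed: float  # tokens/second
    max_p95_ttft: float          # seconds


# Thresholds taken from the SLO tier table for DeepSeek-V4-Pro.
TIERS = [
    SloTier("SLO #1", 30, 10),
    SloTier("SLO #2", 100, 5),
    SloTier("SLO #3", 300, 3),
]


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(output_speeds, ttfts, tier):
    """ASSUMED interpretation: P25 speed >= threshold and P95 TTFT <= threshold."""
    return (
        percentile(output_speeds, 25) >= tier.min_p25_output_speed
        and percentile(ttfts, 95) <= tier.max_p95_ttft
    )


def highest_passing_concurrency(measure, levels, tier):
    """Return the largest tested concurrency that meets the tier, or 0 if none.

    `measure(level)` is a hypothetical callable returning
    (per-request output speeds, per-request TTFTs) at that concurrency.
    """
    best = 0
    for level in sorted(levels):
        speeds, ttfts = measure(level)
        if meets_slo(speeds, ttfts, tier):
            best = level
    return best


def toy_measure(level):
    """Synthetic stand-in for a real run: speed falls and TTFT rises with load."""
    return [6400 / level] * level, [level / 64] * level


for tier in TIERS:
    result = highest_passing_concurrency(toy_measure, [32, 64, 128, 256], tier)
    print(tier.name, "->", result)
```

Stricter tiers admit fewer concurrent agents on the same system, which is why results are reported separately for each SLO tier.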
How to interpret AA-AgentPerf results
The core AA-AgentPerf metric is concurrent agents per megawatt of runtime power, a practical normalization for representing data center-scale performance. Table 2 shows how to use the reported results to estimate how many agentic sessions a given power budget could support.
| Benchmark | Value of metric | NVIDIA GB300 NVL72 | NVIDIA H200 |
| --- | --- | --- | --- |
| Concurrent agents per MW | Energy efficiency: How many active agents a system can support for a given power budget | 61.4K | 2.6K |
| Concurrent agents per GPU | Hardware efficiency: How much serving capacity is achieved per GPU | 57.5 | 1.4 |
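As a rough illustration of the estimate Table 2 supports, the sketch below multiplies the per-megawatt figures by a power budget and divides an agent target by the per-GPU figures. It assumes capacity scales linearly with power and GPU count, which a real deployment may not achieve. The 10 MW budget, the 1,000-agent target, and the function names are hypothetical.

```python
import math

# Values copied from Table 2.
AGENTS_PER_MW = {"NVIDIA GB300 NVL72": 61_400, "NVIDIA H200": 2_600}
AGENTS_PER_GPU = {"NVIDIA GB300 NVL72": 57.5, "NVIDIA H200": 1.4}


def agents_for_power_budget(system, budget_mw):
    """Estimate concurrent agents for a power budget (ASSUMES linear scaling)."""
    return math.floor(AGENTS_PER_MW[system] * budget_mw)


def gpus_for_agents(system, agents):
    """Estimate GPUs needed for a target agent count (ASSUMES linear scaling)."""
    return math.ceil(agents / AGENTS_PER_GPU[system])


budget_mw = 10          # hypothetical data center power budget
target_agents = 1_000   # hypothetical concurrency target

for system in AGENTS_PER_MW:
    print(
        f"{system}: ~{agents_for_power_budget(system, budget_mw):,} agents "
        f"at {budget_mw} MW, ~{gpus_for_agents(system, target_agents)} GPUs "
        f"for {target_agents:,} agents"
    )
```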
On launch day, NVIDIA GB300 NVL72 delivers up to 20x more concurrent agents per megawatt than the previous generation, NVIDIA H200 (Figure 4).

This performance highlights how GB300 NVL72 delivers across large-scale agentic coding workloads, from routing long-lived sessions efficiently to keeping mixture-of-experts (MoE) models and GPUs fully utilized across many concurrent agent sessions. Several parts of the stack contribute:
- SGLang, TensorRT LLM, or vLLM: Agent runtimes apply optimizations such as WideEP and DeepEP to spread MoE expert execution across the full NVL72 domain, maximizing effective batch sizes and scaling effectively to thousands of agents.
- DeepGEMM and Mega MoE optimizations: MXFP4/MXFP8 kernels and fused MoE overlap NVLink communication with tensor core compute to boost token throughput for reasoning and code generation.
- NVIDIA NVLink scale-up domain: GB300 NVL72 links 72 GPUs into a single high-bandwidth NVLink fabric, so every GPU can rapidly share parameters, KV cache, and intermediate results—critical for fast, coordinated execution of agentic coding systems.
Looking forward: NVIDIA Vera Rubin platform
AA-AgentPerf establishes the standard for evaluating agentic inference, and the results highlight how tightly integrated hardware and software can unlock step-function gains in concurrency and efficiency. NVIDIA GB300 NVL72 demonstrates up to 20x higher agentic coding performance than the previous-generation NVIDIA H200.
The NVIDIA Vera Rubin platform is expected to extend these gains by leveraging 50 PFLOPs of NVFP4 compute and using the Vera CPU to accelerate LLM tool calls and improve end-to-end performance, economics, and efficiency for agentic workflows.
To learn more about why agentic workloads place unique demands on inference infrastructure and how the NVIDIA Vera Rubin platform optimizes performance, see Building for the Rising Complexity of Agentic Systems with Extreme Co-Design.
Acknowledgments
This work was made possible through the expertise and engineering contributions of Jatin Gangani, Iman Tabrizian, Xiaoming Chen, Peiheng Hu, Taizhong Wu, Shichen Li, Manu Maheswari, and many other talented NVIDIA engineers.