NVIDIA Achieves Leading Agentic Coding Performance on First Agentic AI Benchmark

AI agents have fundamentally changed the complexity of inference workloads. Until now, the industry has struggled to define a standard for measuring how inference systems perform under these conditions. Artificial Analysis AgentPerf (AA-AgentPerf) offers the industry’s first multi-vendor open benchmarks profiling trajectories that are representative of real-world AI agent coding tasks. 

This post explains how AA-AgentPerf sets a new standard for measuring agentic workload performance, and how NVIDIA extreme co-design helps deliver up to 20x better agentic coding performance than previous generations.

What is AA-AgentPerf?

AA-AgentPerf is a hardware benchmark created by Artificial Analysis that measures the number of concurrent AI agents an inference system can support while meeting predefined, model-specific performance service level objective (SLO) tiers. An SLO is defined as a specific threshold of output token speed and time-to-first-token (TTFT). The benchmark results are normalized per accelerator and per megawatt to enable comparison across hardware configurations.

Measuring representative agentic coding performance

Agentic workloads are unique because LLM-driven decisions often produce non-deterministic sequences of requests and tool calls. The most difficult part of measuring agent performance is to accurately capture this non-determinism in a representative agent trajectory—the complete sequence of actions, decisions, and observations made by an agent as it traverses through a task from beginning to end (Figure 2). 

AA-AgentPerf captures this by measuring GPU performance across prerecorded agentic coding trajectories with interleaved reasoning and tool use, while simulating interturn latency with a representative baseline for CPU tool-call performance. These trajectories are built around solving issues in public code repositories across several use cases, 12+ programming languages, and responses from frontier models. In addition to rigorously defining the trajectories, the Artificial Analysis team also:

  • Leveraged representative cached, input, and output sequence lengths for requests, ranging from 5K to 131K with a mean of approximately 27K.
  • Mapped tool calls to representative CPU-side tasks in agentic coding workflows and simulated tool calls across a distribution with a one-second median delay time. The same CPU tool-call baseline was then applied across all systems tested.
  • Kept the test set private to prevent benchmark-targeted optimization.
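
As a minimal sketch of the tool-call simulation described above, the Python snippet below samples interturn delays from a log-normal distribution whose median is one second. The source states only the median; the log-normal shape, the `sigma` value, and the function name `sample_tool_delay` are assumptions made for illustration, not details of the AA-AgentPerf harness.

```python
import math
import random
import statistics


def sample_tool_delay(rng: random.Random, median_s: float = 1.0, sigma: float = 0.5) -> float:
    """Draw one simulated tool-call delay in seconds.

    Assumption: delays are log-normal. The median of a log-normal
    distribution is exp(mu), so mu = log(median_s) gives the target median.
    """
    return rng.lognormvariate(math.log(median_s), sigma)


# Fixed seed so the same delay sequence can be replayed against every system,
# mirroring the idea of applying one CPU tool-call baseline to all tests.
rng = random.Random(0)
delays = [sample_tool_delay(rng) for _ in range(10_000)]
observed_median = statistics.median(delays)
print(f"median simulated tool-call delay: {observed_median:.3f} s")
```

Seeding the generator keeps the delay sequence identical from run to run, which is one simple way to hold the CPU side constant while only the inference system changes.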

AA-AgentPerf testing and measurement methodology

The AA-AgentPerf harness measures the number of concurrent agents an inference system can support while meeting SLO requirements (Figure 3). At launch, this benchmark focuses on testing DeepSeek-V4-Pro across multiple SLO tiers derived from Artificial Analysis serverless API benchmarking data. This ensures that the benchmarks reflect quality-of-service levels observed in production providers today. 

During a benchmarking run, AA-AgentPerf sends GPUs thousands of concurrent requests drawn from its prerecorded agent trajectory dataset. To ensure independent results for each run, dynamic prefixes are added at the start of every trajectory phase. Strict SLO thresholds are enforced throughout the trajectory, and the highest concurrency level that satisfies those requirements is recorded as the official benchmark result for a given SLO (Figure 3). This process is then repeated across multiple SLO tiers to capture different user experience targets (Table 1).
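
The search for the highest passing concurrency level can be sketched as follows. This is an illustration under stated assumptions, not the AA-AgentPerf harness: `passes_slo` is a hypothetical callback that runs the workload at a given concurrency and reports whether the SLO held, the search assumes that a system passing at some concurrency also passes at every lower one, and `with_dynamic_prefix` shows one possible way to make each prompt unique per run, presumably so that prefixes cached in one run cannot be reused in the next.

```python
import uuid
from typing import Callable


def with_dynamic_prefix(prompt: str) -> str:
    """Prepend a unique marker so no two runs share an identical prompt prefix.

    The marker format is an assumption made for this sketch.
    """
    return f"[run-{uuid.uuid4().hex}] {prompt}"


def max_concurrency(passes_slo: Callable[[int], bool], upper: int) -> int:
    """Return the largest n in [1, upper] for which passes_slo(n) is True.

    Returns 0 if even a single agent fails the SLO. Assumes monotonicity:
    if concurrency n passes, every level below n passes as well.
    """
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if passes_slo(mid):
            lo = mid       # mid passes, so the answer is at least mid
        else:
            hi = mid - 1   # mid fails, so the answer is below mid
    return lo


# Toy stand-in for a real system: it sustains the SLO up to 40 agents.
def toy_system(concurrency: int) -> bool:
    return concurrency <= 40


best = max_concurrency(toy_system, upper=4096)
print(f"highest passing concurrency: {best}")
```

A binary search needs only about log2(upper) benchmark runs per SLO tier. If compliance is not monotone in practice, a linear sweep over concurrency levels is the safer choice.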

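| Model | SLO tier | P25 output speed (tokens/second) | P95 TTFT (seconds) |
| --- | --- | --- | --- |
| DeepSeek-V4-Pro | SLO #1 | 30 | 10 |
| DeepSeek-V4-Pro | SLO #2 | 100 | 5 |
| DeepSeek-V4-Pro | SLO #3 | 300 | 3 |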
Table 1. SLO tiers with output speed and TTFT requirements for AA-AgentPerf DeepSeek-V4-Pro tests
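
To make the tier definitions concrete, here is a hedged Python sketch of an SLO check using the thresholds from Table 1. The interpretation is an assumption: a run passes a tier when the 25th percentile of per-request output speed is at least the tier's speed and the 95th percentile of TTFT is at most the tier's limit. The `SloTier` class, the nearest-rank `percentile` helper, and the sample measurements are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SloTier:
    name: str
    min_p25_output_speed: float  # tokens/second
    max_p95_ttft: float          # seconds


# Thresholds copied from Table 1.
TIERS = [
    SloTier("SLO #1", 30.0, 10.0),
    SloTier("SLO #2", 100.0, 5.0),
    SloTier("SLO #3", 300.0, 3.0),
]


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def meets_slo(tier: SloTier, output_speeds: Sequence[float], ttfts: Sequence[float]) -> bool:
    """True when both the output-speed and TTFT conditions hold for the tier."""
    return (
        percentile(output_speeds, 25) >= tier.min_p25_output_speed
        and percentile(ttfts, 95) <= tier.max_p95_ttft
    )


# Made-up per-request measurements, for illustration only.
speeds = [120, 110, 105, 95, 130, 140, 101, 115]   # tokens/second
ttfts = [1.0, 2.0, 1.5, 4.0, 0.8, 2.2, 3.1, 1.1]   # seconds

for tier in TIERS:
    print(tier.name, "pass" if meets_slo(tier, speeds, ttfts) else "fail")
```

Using a low percentile for speed and a high percentile for TTFT means the check is driven by the slower requests rather than the average, which is why a system can look fast on mean throughput and still fail a tier.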

How to interpret AA-AgentPerf results

The core AA-AgentPerf metric is concurrent agents per megawatt, a practical normalization for representing data-center-scale performance. Table 2 outlines how to use the reported results to estimate how many agentic sessions can be supported for a given power budget.

| Benchmark | Value of metric | NVIDIA GB300 NVL72 | NVIDIA H200 |
| --- | --- | --- | --- |
| Concurrent agents per MW | Energy efficiency: How many active agents a system can support for a given power budget | 61.4K | 2.6K |
| Concurrent agents per GPU | Hardware efficiency: How much serving capacity is achieved per GPU | 57.5 | 1.4 |
Table 2. How to use the metrics reported by AA-AgentPerf to aid in capacity planning for data centers aiming to support agentic applications at scale. Numbers reflect AA-AgentPerf results for the 30 tokens/second SLO tier (SLO #1).
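
The capacity-planning arithmetic behind Table 2 can be sketched in a few lines of Python. The figures are the Table 2 values; the helper names are hypothetical, and the calculation assumes the per-MW and per-GPU results scale linearly, which ignores real-world overheads such as cooling, networking, and headroom.

```python
import math

# Values taken from Table 2.
AGENTS_PER_MW = {"NVIDIA GB300 NVL72": 61_400, "NVIDIA H200": 2_600}
AGENTS_PER_GPU = {"NVIDIA GB300 NVL72": 57.5, "NVIDIA H200": 1.4}


def agents_for_power(system: str, megawatts: float) -> float:
    """Estimated concurrent agents supported by a given power budget."""
    return AGENTS_PER_MW[system] * megawatts


def gpus_for_agents(system: str, agents: int) -> int:
    """Estimated GPUs needed to serve a target number of concurrent agents."""
    return math.ceil(agents / AGENTS_PER_GPU[system])


for system in AGENTS_PER_MW:
    print(
        f"{system}: ~{agents_for_power(system, 10):,.0f} agents in 10 MW, "
        f"~{gpus_for_agents(system, 1_000)} GPUs for 1,000 agents"
    )
```

Treat the outputs as first-order estimates for sizing discussions rather than as deployment guarantees.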

On launch day, NVIDIA GB300 NVL72 delivers up to 20x more concurrent agents per megawatt than the previous generation, NVIDIA H200 (Figure 4).

This performance highlights how GB300 NVL72 delivers across large-scale agentic coding workloads, from routing long-lived sessions efficiently to keeping mixture-of-experts (MoE) models and GPUs fully utilized across many concurrent agent sessions. Several parts of the software and hardware stack contribute:

  • SGLang, TensorRT LLM, or vLLM: These inference runtimes apply optimizations such as WideEP and DeepEP to spread MoE expert execution across the full NVL72 domain, maximizing effective batch sizes and scaling to thousands of agents.
  • DeepGEMM and Mega MoE optimizations: MXFP4/MXFP8 kernels and fused MoE overlap NVLink communication with tensor core compute to boost token throughput for reasoning and code generation.
  • NVIDIA NVLink scale-up domain: GB300 NVL72 links 72 GPUs into a single high-bandwidth NVLink fabric, so every GPU can rapidly share parameters, KV cache, and intermediate results—critical for fast, coordinated execution of agentic coding systems.

Looking forward: NVIDIA Vera Rubin platform

AA-AgentPerf establishes the standard for evaluating agentic inference, and the results highlight how tightly integrated hardware and software can unlock step-function gains in concurrency and efficiency. NVIDIA GB300 NVL72 demonstrates up to 20x higher agentic coding performance than the previous-generation NVIDIA H200.

The NVIDIA Vera Rubin platform is expected to extend these gains by combining 50 PFLOPs of NVFP4 compute with the Vera CPU to accelerate LLM tool calls and improve end-to-end performance, economics, and efficiency for agentic workflows.

To learn more about why agentic workloads place unique demands on inference infrastructure and how the NVIDIA Vera Rubin platform optimizes performance, see Building for the Rising Complexity of Agentic Systems with Extreme Co-Design.

Acknowledgments

This work was made possible through the expertise and engineering contributions of Jatin Gangani, Iman Tabrizian, Xiaoming Chen, Peiheng Hu, Taizhong Wu, Shichen Li, Manu Maheswari, and many other talented NVIDIA engineers.
