
How the NVIDIA Vera Rubin Platform is Solving Agentic AI’s Scale-Up Problem

Agentic AI has fundamentally changed the runtime dynamics of inference workloads by introducing non-deterministic trajectories: the actions, observations, and decisions an AI agent produces while working through a task. These trajectories compound end-to-end latency across hundreds of inference requests per session.

NVIDIA Vera Rubin NVL72 handles the bulk of that inference load as the core compute engine of the NVIDIA Vera Rubin platform. The most demanding emerging multi-agent workloads require sustained low-latency, high-throughput generation on trillion-parameter MoE models with long context windows.

Until now, no platform has served this emerging workload economically. NVIDIA Groq 3 LPX, paired with Vera Rubin NVL72, is the first to deliver both high throughput and low latency at this point on the Pareto curve.

This post explores how the NVIDIA Vera Rubin Platform solves this challenge through extreme co-design, combining high-throughput compute with low-latency, deterministic execution across hundreds to thousands of chips.

Why agentic workloads require predictable scale-up networking

Conventional data center networking fabrics are optimized for large training jobs and volume inference workloads, where small amounts of network jitter average out inside large batches. Premium AI services, by contrast, demand higher model capability and highly responsive user-visible performance. At this tier, agentic decode brings a fundamentally different set of requirements, including:

  • Multi-turn model requests
  • Smaller batches
  • Extremely low latency

Long context and large MoE models (used in premium AI services) introduce additional networking challenges (Figure 1). Each agent in a multi-agent pipeline carries its own expanding KV cache, system prompt, tool definitions, and conversation history. That KV cache and any new tokens must be routed through trillion-parameter models and their associated experts across different accelerators. 

To pull this off, network-level orchestration must ensure minimal variability in the hops between chips. This cross-chip exchange is unavoidable in any SRAM-based architecture that can’t hold the model on a single chip. The physical mechanism by which the exchange occurs becomes a key bottleneck in the serving system.
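To make the memory pressure concrete, here is a minimal sketch that estimates per-agent KV-cache growth with the standard key/value-per-token formula. The model dimensions, context length, and agent count are hypothetical placeholders, not specifications of any platform mentioned in this post.

```python
# Rough estimate of per-agent KV-cache growth in a multi-agent session.
# All model dimensions below are hypothetical placeholders, not published
# specifications for any product discussed here.

def kv_cache_bytes(tokens: int,
                   layers: int = 64,
                   kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Key + value tensors cached for every ingested or generated token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

# An agent that has accumulated a 200K-token context (system prompt, tool
# definitions, conversation history, observations):
per_agent = kv_cache_bytes(tokens=200_000)
print(f"KV cache per agent: {per_agent / 1e9:.1f} GB")

# A pipeline of 8 such agents multiplies that working set, which is why the
# cache and new tokens must be partitioned and routed across accelerators.
print(f"KV cache for 8 agents: {8 * per_agent / 1e9:.1f} GB")
```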

The industry has traditionally addressed this challenge by using:

  • Runtime-arbitrated networking fabrics where flow control is reactive, and timing is statistically bounded rather than guaranteed.
  • Large concentrations of on-die compute and memory that postpone the networking problem until model and context-window sizes force a scale-up and scale-out, at which point multi-chip performance deteriorates.

Breaking the throughput-latency tradeoff at agentic scale requires a networking fabric designed together with the silicon, compiler, and serving stack. LPU C2C achieves this through extreme co-design, enabling multi-trillion-parameter models at scale.

How NVIDIA Groq 3 LPX addresses scale-up challenges

The NVIDIA Groq 3 LPX LPU C2C is designed to solve the scale-up problem directly. Rather than treating the interconnect as a conventional network that must absorb contention and timing uncertainty at runtime, LPU C2C extends Groq’s deterministic execution model across many LPUs. It does this through three tightly connected technologies:

  • High-radix point-to-point links
  • Compiler-scheduled data movement
  • Hardware-driven plesiosynchronous timing

Together, these technologies give Groq 3 LPU accelerators the flexibility to scale to thousands of chips while preserving predictable communication, fixed latency, and low-jitter execution. The following sections examine each in turn.

High-radix point-to-point links

Each LPU exposes 96 C2C links at 112 Gbps, delivering roughly 2.5 TB/s of scale-up bandwidth per LPU and 640 TB/s at the rack level. Built on the NVIDIA MGX rack-scale architecture, the design uses cableless trays and a point-to-point, high-radix C2C topology to tightly couple compute and communication across trays and racks.
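As a rough sanity check, the arithmetic below relates those figures under two assumptions that this post does not state explicitly: the per-LPU number counts both link directions, and the rack-level number is the per-LPU bandwidth multiplied by the number of LPUs in the rack.

```python
# Back-of-the-envelope check of the scale-up bandwidth figures above.
# Assumptions (not stated in the post): the per-LPU figure counts both
# link directions, and the rack figure is per-LPU bandwidth times LPU count.

LINKS_PER_LPU = 96
GBPS_PER_LINK = 112                                            # per direction

per_direction_tbs = LINKS_PER_LPU * GBPS_PER_LINK / 8 / 1000   # ~1.34 TB/s
bidirectional_tbs = 2 * per_direction_tbs                      # ~2.7 TB/s

print(f"per-LPU, one direction:   {per_direction_tbs:.2f} TB/s")
print(f"per-LPU, both directions: {bidirectional_tbs:.2f} TB/s (~2.5 TB/s cited)")

# Implied LPU count if the 640 TB/s rack figure is per-LPU bandwidth x chips:
print(f"implied LPUs per rack: {640 / 2.5:.0f}")
```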

Direct peer connections, dedicated paths, symmetric routes under load, and low hop counts enable highly efficient collective communication while the compiler plans every transfer statically rather than at runtime.

Compiler-scheduled data movement

LPU C2C scaling is software-scheduled. Communication between LPUs moves in 320-byte vectors, the same fixed-size unit used for compute, and is flow-controlled and scheduled at compile time as a first-class functional unit alongside the matrix, vector, and switch execution modules. The compiler plans every transfer in advance, including when each vector leaves its source LPU, which link it takes, and when it arrives, so load balancing, route selection, and synchronization are resolved statically rather than by hardware schedulers under contention. As a result, the compiler treats thousands of interconnected LPUs as a single scheduled execution surface, closer to wires between functional units on one die than to a network of independent chips.
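To illustrate what compile-time planning means in practice, here is a minimal sketch of a static transfer schedule. The data structure and field names are invented for illustration and do not represent the actual Groq compiler’s internal representation.

```python
# Minimal sketch of compile-time-scheduled communication. Every field here
# is illustrative; it does not reflect the real Groq compiler's IR.
from dataclasses import dataclass

@dataclass(frozen=True)
class ScheduledTransfer:
    """One 320-byte vector moved over one C2C link at a fixed cycle."""
    src_lpu: int        # chip the vector leaves
    dst_lpu: int        # chip the vector arrives at
    link_id: int        # which point-to-point link carries it
    depart_cycle: int   # decided by the compiler, not by a runtime arbiter
    arrive_cycle: int   # fixed, because link latency is known at compile time

# The "program" for the fabric is just a list of such entries, emitted
# alongside the matrix, vector, and switch instruction streams:
schedule = [
    ScheduledTransfer(src_lpu=0, dst_lpu=1, link_id=17,
                      depart_cycle=1_024, arrive_cycle=1_024 + 96),
    ScheduledTransfer(src_lpu=1, dst_lpu=0, link_id=17,
                      depart_cycle=1_120, arrive_cycle=1_120 + 96),
]

# Because departure and arrival cycles are fixed up front, no two transfers
# ever land on the same link in the same cycle window, so there is nothing
# left for hardware to arbitrate at runtime.
print(f"{len(schedule)} transfers planned statically")
```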

Hardware-driven plesiosynchronous timing

Each LPU runs on its own clock, and because clocks naturally drift, LPU C2C scaling uses a plesiosynchronous or near-synchronous C2C protocol to cancel drift and align thousands of LPUs to act as a single core. With predictable data arrival and periodic software synchronization, the runtime avoids defensive buffering, making compile-time-known network latency possible at a scale most architectures can’t match. By eliminating unpredictable network hops, coordinating data movement, and fixing latency at compile time, these scale-up technologies enable Groq 3 LPX to operate hundreds or thousands of LPUs as one coherent, low-jitter system for agentic workloads that must coordinate tools, memory, and multi-step plans at speed.
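As a rough illustration of why periodic synchronization suffices, the toy calculation below bounds the clock offset two free-running chips can accumulate between sync points. The clock rate, drift figure, and sync interval are hypothetical, not Groq specifications.

```python
# Toy model of plesiosynchronous alignment: each LPU free-runs on its own
# clock, drift is bounded, and a periodic correction keeps the offset small
# enough that compile-time arrival cycles stay valid.
# The clock rate, drift, and sync interval below are illustrative only.

NOMINAL_HZ = 1.0e9        # hypothetical nominal clock frequency
DRIFT_PPM = 50            # hypothetical worst-case drift per chip
SYNC_INTERVAL_S = 1e-3    # hypothetical software resynchronization period

def max_offset_cycles() -> float:
    """Worst-case cycle offset two chips accumulate between sync points."""
    # Two chips drifting in opposite directions diverge at 2 * DRIFT_PPM.
    return NOMINAL_HZ * SYNC_INTERVAL_S * (2 * DRIFT_PPM / 1e6)

print(f"worst-case offset per sync interval: {max_offset_cycles():.0f} cycles")

# As long as the compiler pads scheduled arrivals by this bound and the
# periodic sync cancels the accumulated drift, the runtime needs no
# defensive buffering on the receive side.
```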

How agentic workloads benefit from LPU C2C

The core payoff of LPU C2C is rack-scale determinism: 128 GB of unified on-chip SRAM with performance that stays predictable as you scale (Figure 3). This is the largest amount of SRAM in a tensor-parallel domain of any SRAM-based ASIC in production, and it reflects how effectively the LPU architecture scales SRAM capacity.

The LPU compiler partitions trillion-parameter models across that pool using strategies such as layer-wise partitioning, so the union of on-chip SRAM acts as a working memory far larger than any single chip can offer. For agentic workloads, this translates to frontier MoE models that run at low latency without forcing tradeoffs in context window or accuracy. Tail latency stays bounded under the bursty fan-out patterns of multi-agent sessions, and per-token latency is predictable.
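As a simple illustration of layer-wise partitioning, the sketch below greedily packs consecutive layers onto chips subject to a per-chip SRAM budget. The chip count, SRAM size, and layer sizes are illustrative and do not describe the actual LPU compiler’s placement algorithm.

```python
# Minimal sketch of layer-wise partitioning: assign consecutive transformer
# layers to LPUs so each chip's weight slice fits in its on-chip SRAM.
# Chip count, SRAM size, and layer sizes are hypothetical, not product specs.

def partition_layers(layer_bytes: list[int], sram_per_chip: int) -> list[list[int]]:
    """Greedily pack consecutive layers onto chips without exceeding SRAM."""
    chips, current, used = [], [], 0
    for i, size in enumerate(layer_bytes):
        if size > sram_per_chip:
            raise ValueError(f"layer {i} alone exceeds per-chip SRAM")
        if used + size > sram_per_chip:
            chips.append(current)
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        chips.append(current)
    return chips

# Example: 80 identical layers of 2 GiB each, 8 GiB of usable SRAM per chip
# (illustrative numbers) -> 4 layers per chip, 20 chips in the pipeline.
layout = partition_layers([2 << 30] * 80, sram_per_chip=8 << 30)
print(f"chips needed: {len(layout)}, layers on chip 0: {layout[0]}")
```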

Low latency only goes so far on its own. AI factory deployments also need the compute capacity, throughput, and concurrent serving that come from a large GPU pool. That is where co-design with Vera Rubin NVL72 takes over. Vera Rubin NVL72 delivers up to 3,600 PFLOPS of NVFP4 compute, 20.7 TB of HBM4, and 1.6 PB/s of memory bandwidth per rack, handling prefill, long-context decode attention, and high-concurrency serving. When latency budgets tighten further, NVIDIA Dynamo (Figure 4) orchestrates a heterogeneous decode loop using Attention-FFN Disaggregation (AFD), which divides the work as follows:

  • Rubin GPUs run decode attention over the accumulated KV cache
  • LPX accelerates FFN execution
  • Intermediate activations are exchanged each token through low-overhead, KV-aware transfers 

The division of labor works because the two engines target different timing regimes. Prefill and decode attention are throughput-dominated, with large batches and KV-cache reads that amortize over many tokens, a profile well matched to NVLink’s high-bandwidth scale-up interconnect. The FFN decode loop runs at small batch sizes with sequential token generation, where micro-jitter starts to dominate user-visible latency. Compile-time-scheduled C2C is purpose-built for that regime.
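To make the loop concrete, here is a schematic of one AFD decode step. The class and function names are placeholders invented for illustration; this is not the NVIDIA Dynamo API, and the computations are stand-ins for real attention and expert kernels.

```python
# Schematic of one Attention-FFN Disaggregated (AFD) decode step. All class
# and function names are placeholders; this is not the NVIDIA Dynamo API.

NUM_LAYERS = 4  # illustrative

class GpuPool:
    """Stand-in for Rubin GPUs running decode attention over the KV cache."""
    def attention(self, layer, hidden, kv_cache):
        kv_cache[layer].append(hidden)   # cache grows by one entry per token
        return hidden + 0.1 * sum(kv_cache[layer]) / len(kv_cache[layer])

class LpxPool:
    """Stand-in for Groq 3 LPX running the MoE FFN at small batch size."""
    def moe_ffn(self, layer, activations):
        return 2.0 * activations         # placeholder expert computation

def transfer(x):
    """Per-token activation exchange between the two engines (low overhead)."""
    return x

def afd_decode_step(hidden, kv_cache, gpus, lpx):
    for layer in range(NUM_LAYERS):
        attn_out = gpus.attention(layer, hidden, kv_cache)  # GPU: attention
        ffn_out = lpx.moe_ffn(layer, transfer(attn_out))    # LPX: MoE FFN
        hidden = transfer(ffn_out)                          # back for next layer
    return hidden

kv_cache = {layer: [] for layer in range(NUM_LAYERS)}
gpus, lpx = GpuPool(), LpxPool()
hidden = 1.0
for _ in range(3):                                          # generate 3 tokens
    hidden = afd_decode_step(hidden, kv_cache, gpus, lpx)
print(f"toy hidden state after 3 tokens: {hidden:.3f}")
```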

Together, Groq 3 LPX, Vera Rubin NVL72, and Dynamo form a platform that delivers deterministic low latency, frontier-model scale, long-context support, and high throughput in the same serving path. At 400 tokens per second per user on trillion-parameter MoE models with 400K-token context, NVIDIA co-design delivers up to 35x higher throughput per megawatt than NVIDIA GB200 NVL72 and unlocks up to 10x more revenue opportunity for agentic workloads.

For more details on the Vera Rubin platform specs and LPX, explore the related blog posts.
