Generative AI’s explosive first chapter was defined by humans sending requests and models responding. The agentic chapter is different.
Agents don’t follow a pre-determined sequence of actions. They call tools, spawn sub-agents with different tasks and models, retain information in memory, manage their own context window, and decide for themselves when they’re finished. In doing so, these systems push token consumption, context length, and latency requirements into extremely demanding regions — exactly the pressures now shaping the NVIDIA extreme co-design stack and the NVIDIA Vera Rubin platform.
This post analyzes that evolution across three parts:
- How agents consume tokens
- Why their economics break under conventional serving
- What an infrastructure stack purpose-built for agents looks like
The transition from chatbots to agents
As shown in Figure 1, below, the popularization of generative AI began with a simple interaction model: one user message, one chatbot message, repeat. The model responds from memory in the context window, the chat history grows linearly, and demands on the system are predictable.

Figure 1. Three interaction models: simple chat (linear, predictable); chat with tools (bounded, variable); and agentic (chained, high entropy)
The introduction of tool calling fundamentally shifts how an AI chatbot operates. Once a model can call a calculator instead of guessing at math, the entire workload changes. Because tool responses are appended directly to the context window, they make the input sequence unpredictable: the size of a tool's output depends on the specific query and on the tool's design, including how much data it returns. The process is still bounded by a prompt and a final answer, but the simple predictability of a standard chat is lost.
This dynamic becomes even more complex when we introduce agents. If a model has the power to call one tool, it also has the power to decide how many tools to use and in what order. For instance, an agent tasked with drafting an email might take the steps below (a minimal loop sketch follows the list):
- Read existing correspondence
- Check the user's drive for context
- Confirm a recipient’s identity
- Then draft the email
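A minimal sketch of such a loop in Python makes the dynamic concrete. The tool names and the next_action callback below are hypothetical stand-ins, not any particular product's API; the point is that the model, not the program, decides which tool runs next and when to stop:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str                 # "tool_call" or "final_answer"
    tool: str = ""
    arguments: str = ""
    content: str = ""

# Hypothetical tools; a real harness exposes its own set.
TOOLS: dict[str, Callable[[str], str]] = {
    "read_inbox": lambda q: "...existing correspondence...",
    "search_drive": lambda q: "...related documents...",
    "lookup_contact": lambda q: "...recipient details...",
}

def run_agent(next_action: Callable[[list], Action], task: str, max_turns: int = 20) -> str:
    """next_action stands in for a model call that returns the agent's next step."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = next_action(context)
        if action.kind == "final_answer":
            return action.content
        output = TOOLS[action.tool](action.arguments)
        # The tool result lands in the context verbatim, so input length
        # grows by an amount unknown until runtime.
        context.append({"role": "tool", "content": output})
    raise RuntimeError("agent did not finish within max_turns")
```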
This chaining is where models become agents, and where the workload shifts from “linearly predictable with probabilistic spikes” to “structurally probabilistic,” such that each agent session can look very different from the next.
Characteristics of agentic architectures
The modern agentic architecture is composed of a mix of agent hierarchies and optimization techniques that enable effective context management, tool usage, and task optimization:

- Primary agent: Responsible for the delivery of the entire task end-to-end. May orchestrate sub-agents that tackle subtasks. Typically, the primary agent is powered by the smartest model and talks directly with the user.
- Sub-agents: Spawned by the primary agent to handle narrower tasks, with the ability to self-manage their context windows just as the primary agent does. Sub-agents are often architecturally identical or very similar to the primary agent, but with a task scope limited by the prompt the primary agent provides.
- File system statefulness: Additional statefulness derived from agents writing memory and tool call output to files and later searching or re-reading their contents. This serves as a method of context management and memory.
- Summarization and compaction: A technique where the context window of an agent is summarized and thereby compressed to make space for new information and reduce input processing costs (a rough sketch of the trigger logic follows this list).
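To make the compaction mechanic concrete, here is a rough sketch. The threshold, the count_tokens tokenizer, and the model-backed summarize call are all assumptions for illustration, not any product's actual behavior:

```python
from typing import Callable

COMPACTION_THRESHOLD = 150_000   # illustrative trigger, in tokens
KEEP_RECENT_TURNS = 5            # recent turns preserved verbatim

def maybe_compact(context: list[dict],
                  count_tokens: Callable[[list[dict]], int],
                  summarize: Callable[[list[dict]], str]) -> list[dict]:
    """Replace older history with a summary once the window nears its limit."""
    if count_tokens(context) < COMPACTION_THRESHOLD:
        return context
    # Compress everything except the most recent turns into one summary
    # message, freeing space and cutting recurring input-processing cost.
    summary = summarize(context[:-KEEP_RECENT_TURNS])
    return [{"role": "system", "content": summary}, *context[-KEEP_RECENT_TURNS:]]
```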

Some of the most popular agentic tools today follow similar architectures. Primary agents in tools like Claude Code frequently delegate work to sub-agents to exploit smaller context windows and parallelize tasks. Because the system must process input tokens during every single inference step, using smaller contexts drives greater efficiency and lowers input token processing costs. This architecture also provides a necessary defense against a phenomenon called context rot, where an expanding context inevitably degrades output quality. When tasks grow in complexity, deliberate compaction events force sharp drops in the main agent's context window to compensate for the inability to scale tokens infinitely.
Workload dynamics and economics of agentic systems
In their report on building a multi-agent system, Anthropic estimated that these systems consumed up to 15x more tokens than standard chat. This significant increase demands improved per-token unit economics for these applications to become profitable at scale. Addressing this inference economics challenge requires a deep understanding of the system-level token throughput and latency requirements that govern agentic economics.
The cost and complexity of these workloads is best understood through the analysis of a real agentic session. Figure 4 provides a measured example of a Claude Code coding task. The lines on the chart represent the input sequence length (ISL), that is, the accumulated context, at every request made during the session by sub-agents (orange) and the main agent (grey). Even in a single session, the trace makes clear why long-context capacity, cache programmability, and predictable per-token latency matter as much as raw model quality.

This 33-minute session tracks 58 main-agent turns coordinating 225 sub-agent invocations. Across 283 inference requests, the context window grows from 15K tokens to a peak of 156K before a context compaction event reduces it to approximately 20K. The trace makes it clear that agent token consumption is shaped as much by agentic system behavior as by the nature of the tasks.
The primary agent accumulates input context quickly when it is not delegating or compacting, which causes cache-read input token costs to recur every turn. Across the first 40 turns, the main agent averages roughly 85K tokens of context and accumulates around 3.5 million total processed input tokens, adding another million later in the session following a compaction. These are exactly the conditions where high-bandwidth-memory (HBM), high-throughput platforms such as Vera Rubin NVL72 become relevant, because long-context prompts need to stay economically tractable while prefill demand continues to scale.
Prompt caching is what makes this pattern workable. Without KV cache re-use, every input token would need to be fully reprocessed. Popular API providers discount cache hits by approximately 90%, so at a 95% cache hit rate, input processing cost drops by about 85%; without prompt caching, the cost here would be roughly 6x higher. Coding agents commonly sustain 95-98% cache hit rates, especially when tool output stays small. That is why prompt caching is increasingly a systems problem rather than just an API feature: Sustaining high cache hit rates depends on efficient CPU-side KV cache management and purpose-built high-capacity context storage, such as NVIDIA CMX, to preserve long prefixes and restore them quickly as sessions scale.
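A quick sanity check of that arithmetic, using the approximate discount and hit rates quoted above rather than any provider's exact pricing:

```python
def effective_input_cost(hit_rate: float, cache_discount: float = 0.90) -> float:
    """Blended per-token input cost relative to fully uncached processing."""
    return hit_rate * (1 - cache_discount) + (1 - hit_rate)

blended = effective_input_cost(0.95)
print(f"blended cost: {blended:.3f} of uncached")   # 0.145 -> ~85% cheaper
print(f"uncached multiplier: {1 / blended:.1f}x")   # ~6.9x, in line with the rough figures above
```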
The 225 sub-agent requests in the trace represent separate inference sessions, each with its own context and tool definitions. Sub-agents often increase total output token volume, but they lower input cost by starting from fresh context windows and carrying forward only what is relevant to the delegated task. They can also run on smaller models, which reduces latency and cost while still preserving accuracy for narrower tasks.
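A sketch of that delegation pattern, with every name a hypothetical stand-in; the essential move is that the sub-agent's context starts fresh and carries only what the primary agent chose to forward:

```python
from typing import Callable

def spawn_subagent(run: Callable[..., str], task: str,
                   relevant_snippets: list[str],
                   model: str = "small-model") -> str:
    """Run a delegated task in a fresh context window that carries only
    the snippets the primary agent selected, not its full history."""
    context = [
        {"role": "system", "content": f"Subtask: {task}"},
        *({"role": "user", "content": s} for s in relevant_snippets),
    ]
    return run(model=model, messages=context)   # fresh window: no inherited context
```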
Context compaction is equally important. It provides a mechanism to avoid hitting the context window limit, reduces the effects of context rot, and yields cost savings as a side effect. Reducing the context window from 156K tokens to 20K forces an immediate reduction in cached input token spend and creates room for the next set of tasks.
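In back-of-envelope terms, and with an illustrative placeholder price for cache reads, the compaction's effect on recurring input spend looks like this:

```python
before_ctx, after_ctx = 156_000, 20_000
cache_read_price = 0.30 / 1e6   # illustrative $/token for cache reads, not a quoted rate

# Every subsequent turn re-reads the whole window, so window size sets
# the recurring cache-read bill per request.
for label, ctx in (("before compaction", before_ctx), ("after compaction", after_ctx)):
    print(f"{label}: {ctx:>7} tokens -> ${ctx * cache_read_price:.4f} per turn")
print(f"reduction: {1 - after_ctx / before_ctx:.0%}")   # ~87%
```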
In Figure 5, below, it is qualitatively evident that most processed tokens are retrieved from cache. Once that happens, network and memory-system behavior start to affect user-perceived latency directly, and low-latency fabrics such as NVLink 6, ConnectX-9, BlueField-4, and Spectrum-X help keep shared context accessible and reduce recomputation penalties as sessions fan out across multiple agents.

From this example, it becomes clear that agent token dynamics are quite complex and token consumption can quickly scale across primary and sub-agents. To understand the challenges of scaling these applications under this growing token demand, we must consider the delivered performance requirements.
Performance requirements of agentic workloads
Unlocking the value of agentic workloads requires high model intelligence, large context, and low latency. The faster these agents produce insights, the more valuable they become: speed shortens R&D cycles, improves harness control, and enables complex multi-agent loops. Because the tokens enabling these capabilities are inherently expensive to process, delivered performance is the critical lever for making these systems both scalable and profitable.
Driving down the cost of these tokens requires producers to sustain scale in the high-interactivity region for large models across large contexts. Figure 6, below, illustrates this bottleneck through a standard inference performance Pareto frontier. The left side of the curve offers high throughput, but at the lower extremes of interactivity where agentic workloads cannot function.

These workloads must instead shift to the high interactivity side of the curve (right) to operate successfully. Agentic systems consume massive token volumes while demanding fast generation speeds to maintain end-user interactivity. The problem is that achieving this low latency typically causes system throughput to drop dramatically. Diminished throughput leads to prohibitive per-token costs, making agentic systems economically challenging at scale.
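The throughput-to-cost relationship behind that statement can be sketched directly; the dollar figure below is an illustrative placeholder, not a quoted price:

```python
def cost_per_million_tokens(node_cost_per_hour: float, tokens_per_second: float) -> float:
    """Serving cost per million tokens scales inversely with sustained throughput."""
    return node_cost_per_hour / (tokens_per_second * 3600) * 1e6

NODE_COST = 100.0   # illustrative $/hour for a serving node
for tps in (50_000, 5_000):   # high-throughput vs. high-interactivity operating points
    print(f"{tps:>6} tok/s -> ${cost_per_million_tokens(NODE_COST, tps):.2f}/Mtok")
# 50000 tok/s -> $0.56/Mtok; 5000 tok/s -> $5.56/Mtok (10x costlier per token)
```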

Breaking this bottleneck requires a complete shift in infrastructure design. Modern GPUs offer enormous compute and substantial bandwidth, but sustaining scale at low latency demands more than any single architecture can provide. The answer is extreme co-design. This approach optimizes inference across hardware specialized for each phase and delegates these unique challenges to an entire platform rather than just one processor.
Why one processor isn’t enough
These unique demands won't be resolved by simply adding more compute FLOPs and memory capacity. They stem from the architectural properties of how agents work, and no single processor can solve them all simultaneously.

What is needed is a platform where each bottleneck maps to specialized hardware, orchestrated as a unified system with extreme co-design (see Figure 8, above):
- Platform Components:
- Vera Rubin NVL72 handles capacity and compute at one-tenth the cost per million tokens of Blackwell. The HBM capacity is what makes long-context pipelines tractable; the compute density absorbs prefill cost at scale.
- Vera CPU closes the tool-execution gap with lower agent latency, seamless KV cache offload, and unified CPU-GPU execution.
- Groq 3 LPX breaks the throughput-latency tradeoff. Its SRAM-first architecture delivers tightly bounded, low-jitter token generation, which is critical because variance in any single agent propagates through the entire pipeline.
- The Networking Chips (NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, and Spectrum-X Ethernet) create a unified, low-latency serving fabric for agentic workloads, so agents can coordinate faster, keep shared context accessible, and avoid costly recomputation as sessions grow.
- Software Stack Components:
- Dynamo and Attention-FFN Disaggregation (AFD) create a coherent serving path by splitting work across the best-suited processors and coordinating execution to reduce resource contention and latency. Additionally, Dynamo exposes cache programmability to the agent harness.
- NVFP4 lowers precision overhead so MoE agents can run with lower latency, higher throughput, and lower memory pressure without sacrificing intelligence.
- TRT-LLM WideEP optimizes large expert parallelism for frontier MoEs, allowing agents to deliver high-intelligence responses with lower latency and higher throughput.
- Speculative Decoding cuts agent response latency by generating likely tokens in parallel and verifying them quickly, accelerating low-latency inference for large models (a toy sketch follows this list).
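As a toy illustration of the draft-and-verify idea: the stub models below stand in for a cheap draft model and an expensive target model, and acceptance is greedy. A real system verifies all draft positions in a single target-model forward pass rather than a per-token loop:

```python
import random

random.seed(0)
VOCAB_SIZE = 100

def draft_next(prefix: list[int]) -> int:
    """Cheap draft model (stub): proposes the next token."""
    return (prefix[-1] + 1) % VOCAB_SIZE

def target_next(prefix: list[int]) -> int:
    """Expensive target model (stub): usually agrees with the draft."""
    if random.random() < 0.8:
        return (prefix[-1] + 1) % VOCAB_SIZE
    return random.randrange(VOCAB_SIZE)

def speculative_step(prefix: list[int], k: int = 4) -> list[int]:
    """Draft k tokens cheaply, then keep the longest prefix the target
    agrees with, plus the target's correction at the first mismatch."""
    draft = [prefix[-1]]
    for _ in range(k):
        draft.append(draft_next(draft))
    draft = draft[1:]                        # k proposed tokens

    accepted: list[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        accepted.append(expected)            # target's token wins on mismatch
        if tok != expected:
            break
    return prefix + accepted                 # always emits >= 1 token per step

seq = [1]
for _ in range(5):
    seq = speculative_step(seq)
print(seq)   # several tokens emitted per verification step when drafts hit
```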
By combining these seven chips and a software stack through extreme co-design, the Vera Rubin platform can deliver 400+ tokens per second per user on trillion-parameter MoE models with large 400K contexts. This level of performance shifts the historical trade-off paradigm for agents: you no longer need to compromise on quality with smaller models and limited context windows to deliver high per-user speeds and high system throughput. In this region, agentic architectures become viable products at scale rather than expensive experiments.
For more details on the Vera Rubin platform specs and LPX, explore their respective launch-day blogs.