NVIDIA Vera CPU Sets a New Standard for Agentic Workloads in AI Factories

Each wave of AI has created a new scaling law. Pretraining scaled intelligence through larger datasets, more parameters, and massively parallel GPU systems. Post-training scaled usefulness through instruction tuning and rebalancing GPUs for generative inference. Test-time scaling improved reasoning by giving models more generated tokens for thinking.

Now, agentic AI and reinforcement learning scale actions. Models take more steps, call more tools, run more evaluations, and interact with execution environments to perform tasks.

This post explains how NVIDIA Vera CPUs help AI factories scale agentic AI and reinforcement learning by shortening CPU execution time, increasing task throughput, improving overall AI factory output, and enabling smarter, longer-thinking agents.

Why CPUs matter more in the agentic era

GPUs remain essential for model inference and training. But across agentic AI, reinforcement learning, and data-intensive AI services, much of the execution surrounding the model runs on CPUs, such as:

  • Sandboxed code and tool execution
  • Data retrieval and data processing 
  • Results computation
  • Scheduling and orchestration 

The loop proceeds as follows:

  • A prompt (either from a user, reasoning tokens, or a previous turn’s result) kicks off generation: “I should compile and run hello.c.”
  • The GPU generates the parameters of the tool call to be performed on the CPU: gcc -o hello hello.c ; ./hello
  • The CPU executes the tool call, producing results that are fed back to the GPUs to update weights during reinforcement learning, or used by the agent to generate the next prompt: Output: ‘Hello, world!’ – Task Returned (0) – Successful
  • The GPU generates reasoning tokens prompted by the result: “Hmm! It looks like that worked!”
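
The CPU side of this loop can be sketched in a few lines of Python. This is a minimal illustration, not how any particular agent framework or sandbox works: `ToolResult` and `run_tool_call` are hypothetical names, a plain subprocess stands in for a real sandbox, and a Python one-liner replaces the `gcc` command so the sketch runs without a C toolchain.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class ToolResult:
    stdout: str
    returncode: int
    elapsed_s: float  # wall-clock time the tool call spent on the CPU side


def run_tool_call(argv, timeout_s=30.0):
    """Execute one tool call as a subprocess and capture what the model sees next."""
    start = time.perf_counter()
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    elapsed = time.perf_counter() - start
    return ToolResult(proc.stdout.strip(), proc.returncode, elapsed)


# Stand-in for the model-generated command (assumption: the article's example
# is `gcc -o hello hello.c ; ./hello`; this one-liner avoids needing a compiler).
tool_call = [sys.executable, "-c", "print('Hello, world!')"]

result = run_tool_call(tool_call)

# The observation is what gets fed back to the model for the next turn.
observation = f"Output: {result.stdout!r} - Task Returned ({result.returncode})"
print(observation)
print(f"tool call took {result.elapsed_s:.3f} s")
```

The time recorded in `elapsed_s` is the part of each step during which the model is waiting on the CPU.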

As agents become more capable, they take more steps, call more tools, and run more checks. CPU time compounds across the request.

This makes the CPU part of the critical path. It’s no longer just a host processor feeding the GPU. It shapes latency, accelerator utilization, and AI factory output per watt and per dollar.
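
A back-of-the-envelope model shows how CPU time compounds. It assumes each step runs strictly in sequence (GPU generation, then CPU tool execution, with no overlap), and the per-step timings are hypothetical values chosen only to illustrate the arithmetic, not measurements.

```python
def request_latency(steps, gpu_s_per_step, cpu_s_per_step):
    """Latency of a request whose steps alternate GPU generation and CPU tool execution."""
    return steps * (gpu_s_per_step + cpu_s_per_step)


def cpu_share(gpu_s_per_step, cpu_s_per_step):
    """Fraction of the critical path spent waiting on the CPU."""
    return cpu_s_per_step / (gpu_s_per_step + cpu_s_per_step)


# Hypothetical per-step timings in seconds (illustrative only).
GPU_S = 0.5
CPU_S = 0.5

for steps in (1, 10, 100):
    total = request_latency(steps, GPU_S, CPU_S)
    print(f"{steps:>3} steps: {total:6.1f} s total, {steps * CPU_S:5.1f} s on CPU")

# Halving CPU time per step shortens every step of the request.
faster = request_latency(100, GPU_S, CPU_S / 2)
print(f"100 steps with a 2x faster CPU step: {faster:.1f} s")
```

Under these assumptions the accelerator sits idle for `cpu_share(GPU_S, CPU_S)` of each request, so any reduction in CPU time per step shows up directly in both latency and accelerator utilization.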

For the last decade, much of the data center CPU market optimized for cloud economics: more cores, more virtual machines, and lower cost per core. This remains important for general-purpose cloud services, but performance per core has not improved at the same rate.

This is further compounded by the end of Moore’s law, which limited generation-on-generation performance improvements in CPUs, even while GPU architectures and workloads benefited from a continuous cycle of co-optimization.

AI factories shift the metric from cores per dollar to tokens per dollar—from how many CPU cores a data center can rent, to how much AI output it can produce.

This demands a new CPU design point for AI factories:

  • High core counts to run thousands of concurrent agents, RL environments, sandboxes, and services.
  • High per-core performance, because each agentic step is gated by sequential execution. 
  • Energy-efficient memory bandwidth to keep data moving without turning CPU infrastructure into a bottleneck.

The NVIDIA Vera CPU: Built for AI agents

The NVIDIA Vera CPU is designed for the reality of modern workloads, with fast per-core performance, high concurrency, and power-efficient memory bandwidth to keep the AI factory moving.

The Vera CPU combines 88 NVIDIA Olympus cores with up to 1.2 TB/s of LPDDR5X memory bandwidth to keep cores fed through tool calls, sandboxed execution of both native code and languages like Python or JavaScript, data retrieval, data processing, and orchestration.

The key requirement is fast per-core performance, sustained at all times. Unlike cloud virtual machines, the CPU sockets stay fully loaded, doing the work of many concurrent agents. Cores that remain fast under high system load reduce task completion time, delivering faster results while freeing up resources to serve the next request.

For agents, this means lower latency across multistep requests. For reinforcement learning, this means more completed evaluations and more data from each training window, helping models reach a higher quality bar faster. For AI factories, fast cores keep accelerators from waiting on orchestration, tool execution, or data movement.

Delivering this requires the core, memory subsystem, and fabric to be designed together for branch-heavy code, high-bandwidth data movement, and predictable performance under load.

This starts with the NVIDIA custom Olympus core inside the Vera CPU.

NVIDIA Olympus core and memory subsystem

The NVIDIA Olympus core delivers up to 50% higher IPC than NVIDIA Grace, combining a wide front end, advanced branch prediction, deep out-of-order instruction scheduling, and specialized memory prefetching to sustain high throughput on branch-heavy, memory-sensitive agentic code.

Olympus uses a neural branch predictor to reduce stalls in branch-heavy code. Combined with other prediction mechanisms, it can sustain two taken branches per cycle with zero penalty, maintaining throughput for deep software stacks such as PyTorch, graph workloads, and scripting engines.

Olympus also includes a 10-wide decode unit and a deep out-of-order engine designed to sustain high instructions per cycle. Large buffers and advanced instruction scheduling help the core maintain forward progress as code paths, dependencies, and memory access patterns shift.

Sustaining high IPC under load requires keeping the cores fed with data. Vera CPUs deliver up to 1.2 TB/s of LPDDR5X memory bandwidth, sustaining over 90% of peak memory bandwidth under load. They also offer 40% lower peak memory latency compared to x86 CPUs, ensuring Olympus cores are fed on time through retrieval, analytics, sandbox execution, and orchestration.
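
To put the bandwidth figure in per-core terms, the short calculation below divides it across the 88 cores. It is a rough sketch that assumes decimal units (1 TB = 1000 GB), an even split across cores, and treats "over 90%" as a floor of exactly 90%; real workloads will not share bandwidth this evenly.

```python
CORES = 88                 # Olympus cores per Vera CPU (from the article)
PEAK_TB_S = 1.2            # peak LPDDR5X bandwidth (from the article)
SUSTAINED_FRACTION = 0.90  # article says "over 90%"; treated here as a floor

# Assumption: bandwidth is divided evenly across all cores.
peak_gb_s_per_core = PEAK_TB_S * 1000 / CORES
sustained_tb_s = PEAK_TB_S * SUSTAINED_FRACTION
sustained_gb_s_per_core = sustained_tb_s * 1000 / CORES

print(f"peak per core:      {peak_gb_s_per_core:.1f} GB/s")
print(f"sustained total:    {sustained_tb_s:.2f} TB/s")
print(f"sustained per core: {sustained_gb_s_per_core:.1f} GB/s")
```

That works out to roughly 13.6 GB/s per core at peak and at least about 12.3 GB/s per core sustained with every core active.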

Olympus also adds a novel graph prefetcher built for indirect memory access patterns common in graph analytics and agent memory traversal. Combined with high per-core memory bandwidth, Vera CPUs deliver more than 3x performance on graph traversal workloads compared with x86-based architectures.

The NVIDIA Scalable Coherency Fabric (SCF) connects all cores and a unified cache across a monolithic mesh, delivering predictable latency and 50% faster core-to-core data movement compared with CPUs that fragment compute across dies. For reinforcement learning and agentic AI, that predictability helps keep evaluation loops sustained under full load.

Together, the Olympus core, NVIDIA SCF, and LPDDR5X memory subsystem enable the Vera CPU to deliver more than 1.8x higher sandbox performance across agentic workloads under full load compared with the competition, as shown in Figure 4.

System efficiency

Beyond performance, agentic AI places increasing pressure on infrastructure efficiency. As AI factories scale to thousands of CPUs, memory power can become a major contributor to platform power, cooling demand, and operating cost.

The Vera CPU pairs its architecture with high-bandwidth SOCAMM LPDDR5X memory to reduce memory power compared with traditional DDR server designs. The LPDDR5X subsystem typically consumes less than 30 watts, compared with well over 100 watts for DDR5 configurations. MRDIMM-based systems can drive memory power even higher.
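
The fleet-level effect of that gap can be estimated with simple arithmetic. The sketch below uses the two figures above as bounds (30 W as an upper bound for LPDDR5X, 100 W as a lower bound for DDR5), so the result is a conservative estimate of the savings. The fleet size and the assumption of constant draw all year are hypothetical.

```python
LPDDR5X_W = 30  # article: "typically less than 30 watts" (used as an upper bound)
DDR5_W = 100    # article: "well over 100 watts" (used as a lower bound)
HOURS_PER_YEAR = 8760


def memory_power_saved_kw(num_cpus):
    """Conservative estimate of memory power saved across a fleet, in kW."""
    return num_cpus * (DDR5_W - LPDDR5X_W) / 1000


fleet = 10_000  # hypothetical number of CPU sockets
saved_kw = memory_power_saved_kw(fleet)

# Assumption: constant draw all year; real utilization will vary.
saved_mwh_per_year = saved_kw * HOURS_PER_YEAR / 1000

print(f"{fleet} CPUs: at least {saved_kw:.0f} kW of memory power saved")
print(f"about {saved_mwh_per_year:.0f} MWh per year at constant draw")
```

This counts memory power only; cooling overhead would add to the difference.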

With a configurable 250 W to 450 W TDP range, the Vera CPU reduces combined CPU and memory subsystem power while delivering the bandwidth needed for agentic inference and reinforcement learning environments. For AI factories, this translates into better performance per watt, lower operating costs, and more efficient use of power and cooling infrastructure.

The AI factory CPU for agents

The era of agentic AI requires a shift in CPU design, from maximizing cores per dollar to maximizing AI factory output per watt and per dollar. The NVIDIA Vera CPU is the CPU for agents, combining fast per-core performance, high concurrency, and power-efficient memory bandwidth. With the custom Olympus core, LPDDR5X memory, and NVIDIA Scalable Coherency Fabric, the Vera CPU delivers more than 1.8x higher agentic sandbox performance than traditional x86 architectures, helping AI factories complete more tool calls, return more evaluations, and keep accelerators moving.

Learn more about the Vera CPU, the NVIDIA Vera Rubin NVL2, and the Vera CPU benchmarking by Phoronix.

Relative performance based on measured data, and subject to change. NVIDIA Vera CPU with LPDDR5X performance baselined to the latest x86 CPU. 
