NVIDIA Vera CPU Delivers High Performance, Bandwidth, and Efficiency for AI Factories

AI is evolving, and reasoning models are increasing token demand, placing new requirements on every layer of AI infrastructure. More than ever, compute must scale efficiently to maximize token production and improve productivity for model creators and users.

Modern GPUs operate at peak capacity and push throughput higher every generation, but system performance is increasingly gated by the CPU-bound serial tasks within an agentic loop. This is a classic example of Amdahl's law, a core computer science principle: the portion of a workload that is not accelerated limits the overall speedup.
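Amdahl's law can be made concrete with a short calculation. The sketch below is illustrative only: the 80/20 split between accelerator work and CPU-bound tool execution is an assumed figure, not a measurement from this article, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(accelerated_fraction: float, accelerator_speedup: float) -> float:
    """Overall speedup when only part of the work gets faster (Amdahl's law)."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if accelerator_speedup <= 0:
        raise ValueError("accelerator_speedup must be positive")
    serial_fraction = 1.0 - accelerated_fraction
    return 1.0 / (serial_fraction + accelerated_fraction / accelerator_speedup)


# Assumed agentic loop: 80% of wall-clock time is accelerator work,
# 20% is CPU-bound tool execution that the accelerator cannot speed up.
for gpu_speedup in (2, 10, 100):
    overall = amdahl_speedup(0.8, gpu_speedup)
    print(f"accelerator {gpu_speedup:>3}x faster -> whole loop {overall:.2f}x faster")
```

Under this assumed split, the whole loop cannot run much more than 5x faster however fast the accelerator becomes, which is why the speed of the CPU-bound portion matters.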

This dynamic is especially visible in two classes of workloads: reinforcement learning (RL) for training models with new specialized skills such as coding or engineering, and agentic actions, which enable AI agents to use tools like web browsers, databases, code interpreters, and other software to complete tasks in real environments, or sandboxes. 

Both workloads combine two historically separate CPU characteristics. Individual environments require strong single-threaded performance to execute complex code quickly, similar to a workstation. At the same time, modern AI systems launch thousands of these environments concurrently, creating large-scale throughput demands typical of server infrastructure.

The NVIDIA Vera CPU is designed for modern AI workloads, with key design features including:

  • Extreme single-core performance: Fast execution of individual tasks is critical, and performance must be sustained under constant load with many concurrent users and agentic tasks.
  • High memory and fabric bandwidth per core: Consistent SLAs under load depend on moving large volumes of data efficiently for real-time analysis and context-switching tasks.
  • Efficient rack-scale co-design: AI factories must rapidly deploy and manage capacity to fulfill agentic demand while maximizing power efficiency.

Data centers built with Vera maximize AI infrastructure investments, whether Vera CPUs are directly connected to accelerators or performing tasks on standalone CPU capacity at the end of a wire.

The post-training reality

Reinforcement learning requires models to constantly evaluate their outputs, recognizing which results succeed or fail. For example, a model learning software development generates large amounts of code on accelerators; that code is then shipped to clusters of CPUs to be built, run, and tested, closing a feedback-reward loop (see Figure 1).

These tasks span codebase research, compilation, runtime execution, scripting, data conversion, and other common operations. Overall, this flow requires many concurrent sandbox-like environments, each with a full complement of tools. Often, a single CPU core executes each lightly threaded case end-to-end from a set of accelerator-generated requests.
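The flow above can be sketched in a few lines of Python. This is a minimal illustration, not NVIDIA's pipeline: the function names, the pass/fail reward, and the sample candidates are all hypothetical, and a plain subprocess stands in for a real sandbox (it provides no security isolation).

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_in_sandbox(candidate_source: str, timeout_s: float = 30.0) -> float:
    """Run one candidate in its own interpreter process; return a 0/1 reward."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", candidate_source],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # a candidate that runs too long counts as a failure
    return 1.0 if result.returncode == 0 else 0.0


def evaluate_batch(candidates: list[str], max_workers: int = 4) -> list[float]:
    """Evaluate candidates concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_in_sandbox, candidates))


# Hypothetical accelerator-generated outputs: a candidate solution plus its test.
candidates = [
    "def add(a, b): return a + b\nassert add(2, 3) == 5",  # passes its test
    "def add(a, b): return a - b\nassert add(2, 3) == 5",  # fails its test
    "def add(a, b) return a + b",                          # does not parse
]
rewards = evaluate_batch(candidates)
print(rewards)
```

Each candidate is lightly threaded and runs end-to-end in one process, so throughput scales with the number of cores available to the worker pool, while the latency of any single reward depends on single-core speed.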

To maximize accelerator utilization and keep model iteration rapid, the token generation and training phases of the cycle operate on a tight schedule (or policy). Often, some evaluation jobs running on CPUs finish too late to influence the next step in the cycle. When this happens, the model takes longer to reach the same quality, and valuable tokens are wasted.

Agentic loops demand a unique blend of high single-core performance, massive data bandwidth, and deterministic execution with minimal tail latencies from the CPUs they employ. 

These requirements are a central focus of the NVIDIA Vera CPU design (Figure 2), which delivers up to 50% faster sandbox performance than competing platforms, 1.2 TB/s of memory bandwidth, and 88 Olympus cores with NVIDIA Spatial Multithreading (SMT) for the task concurrency that AI factories require.

NVIDIA Olympus core

The need for higher-performance cores that support AI led to the NVIDIA Olympus core, the first fully custom data center CPU core from NVIDIA. Olympus debuts in Vera alongside the second generation of the NVIDIA Scalable Coherency Fabric (SCF), originally developed for the NVIDIA Grace CPU.

Built for sustained high instructions-per-cycle (IPC) operation on memory-intensive workloads with control-flow logic, Olympus uses a 10-wide instruction fetch and decode front end and a neural branch predictor capable of evaluating two taken branches per cycle. It is fully compatible with the Armv9.2 instruction set and with existing software, delivering high performance on Arm-based containers, binaries, libraries, and operating systems.

With NVIDIA SMT, users can choose between performance per thread and thread count at runtime. Traditional SMT relies on time-shared resources and frequent context switching between threads, which introduces performance variation. NVIDIA SMT instead partitions core resources between threads, giving each thread stable performance, stronger isolation, and predictable tail latency under heavy load.

NVIDIA Scalable Coherency Fabric and memory subsystem

The Vera CPU is built on a single monolithic compute die and fabric, with adjacent dielets implementing memory and I/O subsystems while preserving the uniformity of the compute topology.

From the point of view of an application, every core is the same practical distance from resources such as other cores, caches, memory, and networking, and is provisioned with uniform, high-throughput bandwidth. Most latency-sensitive operations remain local, avoiding the unnecessary cross-die traffic typically observed on traditional CPUs.

The runtime paths of agentic tasks, analytics operations, KV and blob caches, orchestration, and control planes are inherently unpredictable in an AI factory. In traditional implementations, the topology of the processor and the usage patterns of neighboring tasks running on it must be considered ahead of time to maximize application performance. Vera's uniform design enables optimal performance without this style of tuning.

The second-generation SCF connects all 88 Olympus cores to a shared L3 cache and memory subsystem, delivering consistent latency and 3.4 TB/s of bisection bandwidth, which enables the Vera CPU to sustain over 90% of peak memory bandwidth under load. Each core is provisioned with up to 14 GB/s of memory bandwidth, roughly 3x the per-core rate of traditional data center CPUs, so Extract-Transform-Load (ETL), real-time analytics, and memory-bound workloads maintain throughput when every core is active.

Feeding SCF is Vera's second-generation LPDDR5X memory subsystem, delivering up to 1.2 TB/s of total bandwidth and up to 1.5 TB of capacity (a 3x increase over the prior generation) at less than half the memory power of traditional DDR configurations. Small Outline Compression-Attached Memory Modules (SOCAMM) bring low-power memory into the data center for the first time, replacing soldered memory with detachable, upgradable modules that combine LPDDR efficiency with server-class serviceability.
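The per-core bandwidth figure follows from the totals quoted above. The short check below is a sketch that assumes decimal units (1 TB/s = 1000 GB/s) and treats "over 90% of peak" as exactly 90% for illustration; the variable names are not from the article.

```python
TOTAL_MEMORY_BW_GBPS = 1200.0  # 1.2 TB/s from the article, decimal units assumed
CORES = 88                     # Olympus cores per Vera CPU
SUSTAINED_FRACTION = 0.90      # "over 90% of peak", taken as exactly 90% here

# Peak bandwidth available per core when all 88 cores share memory evenly.
peak_per_core = TOTAL_MEMORY_BW_GBPS / CORES
# Per-core bandwidth if the socket sustains 90% of peak under full load.
sustained_per_core = peak_per_core * SUSTAINED_FRACTION
# Per-core rate implied for a "traditional" CPU by the article's roughly-3x claim.
implied_traditional_per_core = peak_per_core / 3

print(f"peak per core:        {peak_per_core:.1f} GB/s")
print(f"sustained per core:   {sustained_per_core:.1f} GB/s")
print(f"implied traditional:  {implied_traditional_per_core:.1f} GB/s")
```

The peak result, about 13.6 GB/s, is consistent with the article's "up to 14 GB/s" per core.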

Performance across the AI factory 

Together, these architectural elements enable the Vera CPU to deliver up to 1.5x the agentic sandbox performance of competing x86 platforms under full-socket load, across compilers, scripting tools, runtime engines, compression, and agentic tool calls (Figure 3).

This advantage compounds across three dimensions. In RL post-training, a 1.5x faster sandbox returns evaluation results within tighter time windows, enabling models to capture the best gradient tokens and accelerating training cycles. 

In agentic inference, it reduces users’ wait time, improving accelerator utilization and easing pressure on KV cache offloading. 

For frontier training problems, 50% higher single-core performance means more sequential tests complete before hitting time limits, expanding the range of hard problems a model can learn from.

Agentic environments by the rack

Every AI factory requires millions of CPU cores to enable the agentic loop of RL and tool use. To unlock the potential of AI infrastructure, deployment must be rapid. For many AI factory operators, the Vera CPU will be the first in their fleet, arriving in data centers designed for high rack power and liquid cooling.

The new NVIDIA Vera CPU Rack offers incredible density and performance within the same planning constraints, rack infrastructure, cooling, and power as the NVL72 products being deployed today.

With a capacity of more than 22.5K sandboxes, the Vera CPU Rack delivers over 4x the capacity and 2x the performance per watt of x86-based server racks (Figure 4). AI factories deploy and manage capacity at the rack level, radically reducing build-out times and improving time-to-market for new capacity while simplifying site planning.
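The rack capacity figure is consistent with one sandbox per core, using the 256 Vera CPUs per rack listed in Table 1. The calculation below is a sketch under that assumption; the one-sandbox-per-core mapping and the one-million-core target are illustrative choices, not stated sizing guidance.

```python
import math

CPUS_PER_RACK = 256      # Vera CPU Rack maximum, from Table 1
CORES_PER_CPU = 88       # Olympus cores per Vera CPU
SANDBOXES_PER_CORE = 1   # assumption: one lightly threaded sandbox per core

sandboxes_per_rack = CPUS_PER_RACK * CORES_PER_CPU * SANDBOXES_PER_CORE
print(f"sandboxes per rack: {sandboxes_per_rack:,}")


def racks_needed(target_sandboxes: int) -> int:
    """Whole racks required to host a target number of concurrent sandboxes."""
    return math.ceil(target_sandboxes / sandboxes_per_rack)


# Illustrative target: one million concurrent single-core sandboxes.
print(f"racks for 1,000,000 sandboxes: {racks_needed(1_000_000)}")
```

The product, 22,528, matches the "more than 22.5K sandboxes" quoted for the rack.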

Each Vera CPU is connected with NVIDIA BlueField-4 SmartNICs containing dedicated Grace-based management cores, offloading networking tasks like security and management, and ensuring the most performant capacity in the system is fully available to agentic tasks.

Vera platforms and configurations 

In addition to the Vera CPU rack, NVIDIA has engineered a complete family of Vera-based platforms for the diverse workloads of modern AI factories. By delivering many choices of densities, cooling capabilities, configurations, and form factors, Vera’s design and system partners are enabling rapid deployment and capacity build-out, adaptable to the constraints of space available in any data center facility.

| Platform | Description | Scenarios |
|---|---|---|
| NVIDIA Vera Rubin NVL72 | Integrated AI factory rack tightly couples Vera host CPUs and Rubin GPUs through high-bandwidth NVIDIA NVLink-C2C and NVIDIA NVLink scale-up fabric. | Large-scale AI factories, frontier model training, reasoning, and high-throughput inference. |
| NVIDIA Vera CPU Rack | Liquid-cooled (LC) CPU rack architecture with up to 4 nodes per 1U tray, scaling to 256 Vera CPUs per rack for dense, efficient compute. Build capacity rapidly at rack-scale alongside NVL72. | AI factory infrastructure, agentic pipelines, orchestration layers, data processing, HPC, and CPU-dense services. |
| Single and dual-socket Vera platforms | Flexible server platforms built around one or two Vera CPUs, with up to 1.5 TB LPDDR5X per socket and 1.8 TB/s NVLink-C2C between CPUs in dual-socket designs, suitable for any facility. | Cloud infrastructure, enterprise, analytics, storage, HPC, NVIDIA PCIe GPU-equipped servers, and AI factories. |
| NVIDIA HGX Rubin NVL8 | Accelerated computing platform pairing Vera host CPUs with Rubin GPUs over PCIe, enabling balanced CPU-GPU performance across multiple server designs. | AI inference, technical computing, analytics, and enterprise HPC deployments. |
Table 1. Vera platform options for modern AI factories

Platform availability 

Vera systems will be available from major OEMs, including Cisco, Dell, HPE, Lenovo, and Supermicro, in the second half of 2026. See the Vera CPU webpage for more details.

NVIDIA Vera performance compared to AMD EPYC Turin and Intel Xeon 6 Granite Rapids, across a variety of workloads, including code compilation, interpreters, scripting, runtime engines, ETL, data analytics, and graph. 
