
NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer

Built on the third-generation NVIDIA MGX rack architecture and co-designed from grid to chip for the era of agentic AI

Artificial intelligence is token-driven. Every prompt, reasoning step, and agent interaction generates tokens. Over the past year, token consumption has grown many times over and now exceeds 10 quadrillion tokens per year. And while the majority of tokens to date have been generated by humans interacting with AI, the new era is one in which most tokens will be generated by AI interacting with AI.

Modern agentic systems plan tasks, invoke tools, execute code, retrieve data, and coordinate across continuous multistep workflows with numerous AI agents. These interactions generate large volumes of reasoning tokens, expand KV cache, and require CPU-based sandboxed environments to test and validate results generated by accelerated computing systems. This places low-latency, high-throughput demands across GPUs, CPUs, scale-up domains, scale-out networks, and storage.

Delivering useful intelligence for these modern agentic systems requires fleets of purpose-built rack-scale systems that function together as one coherent AI supercomputer. This post introduces the NVIDIA Vera Rubin POD, a set of five specialized rack-scale systems built on the third-generation NVIDIA MGX rack architecture for the era of agentic AI.

Introducing NVIDIA Vera Rubin POD 

Built through extreme co-design of seven chips spanning compute, networking, and storage, NVIDIA Vera Rubin introduces the most sophisticated POD-scale AI platform. The platform features 40 racks, 1.2 quadrillion transistors, nearly 20,000 NVIDIA dies, 1,152 NVIDIA Rubin GPUs, 60 exaflops, and 10 PB/s total scale-up bandwidth. 

The Vera Rubin POD introduces five distinct, purpose-built rack-scale systems for agentic AI workloads that require high throughput, extreme low-latency inference, dense CPU sandboxing, and massive context memory storage. Together, these racks form one cohesive system that will power the world’s most energy- and cost-efficient data centers.

Each chip in the POD scales with a third-generation NVIDIA MGX rack, supported by an ecosystem of more than 80 partners with a global supply chain experienced in bringing large-scale AI systems to market. This enables fast deployments and seamless transitions, with each NVIDIA MGX rack sharing the same power, cooling, and mechanical envelopes.

There are two types of MGX racks, both with copper spines designed for performance, resiliency, and energy efficiency. The MGX NVL rack is connected by NVIDIA NVLink, and the new NVIDIA MGX ETL rack is connected by one of two spine types: NVIDIA Spectrum-X Ethernet or NVIDIA Groq 3 LPU direct chip-to-chip links.

NVIDIA Vera Rubin NVL72: Platform for the four scaling laws

NVIDIA Vera Rubin NVL72 is the core rack-scale compute engine of the latest AI factory. Integrating 72 NVIDIA Rubin GPUs and 36 NVIDIA Vera CPUs connected through a massive NVLink copper spine, it acts as one giant GPU. NVIDIA Vera Rubin NVL72 is designed for the four scaling laws of AI: pretraining, post-training, test-time scaling, and agentic scaling. It can be optimized for complex mixture-of-experts (MoE) routing, and the heavy compute-bound context phase of AI inference. It delivers up to 4x better training performance and up to 10x better inference performance per watt, and one-tenth the token cost relative to NVIDIA Blackwell.

NVIDIA Groq 3 LPX: Inference accelerator racks

Co-designed with the NVIDIA Vera Rubin platform for the massive context and low-latency demands of agentic AI, NVIDIA Groq 3 LPX features 256 language processing units (LPUs) per rack. It pairs with Vera Rubin NVL72 to eliminate the tradeoff between high-speed interactivity and throughput. By fusing high-bandwidth SRAM-only LPUs with Rubin GPUs with large HBM capacity, the system delivers low latency and high throughput at long context lengths—supercharging user interactivity for trillion-parameter models without sacrificing system throughput. Vera Rubin NVL72 plus LPX delivers up to 35x more tokens and up to 10x more revenue opportunity for trillion-parameter models relative to Blackwell. To learn more, see Inside NVIDIA Groq 3 LPX.

NVIDIA Vera CPU rack: Agentic AI and reinforcement learning at scale 

The NVIDIA Vera CPU rack integrates up to 256 NVIDIA Vera CPUs in a dense, liquid-cooled rack to provide scalable, energy-efficient capacity. A single rack can sustain over 22,500 concurrent reinforcement learning (RL) or agent sandbox environments, maximizing environments to test, execute, and validate results from the Vera Rubin NVL72 and LPX racks. Vera CPU racks provide the foundation for large-scale agentic AI and reinforcement learning, delivering twice the efficiency and 50% faster results than traditional rack-scale CPUs. Learn more about how the Vera CPU delivers high-performance bandwidth and efficiency for AI factories.
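As a back-of-envelope check on the density figure above (the rack totals come from the post; the per-CPU split is our own arithmetic, not an NVIDIA specification):

```python
# Rough density implied by the figures above: concurrent sandbox
# environments per CPU in a fully populated Vera CPU rack.
ENVIRONMENTS = 22_500       # concurrent RL/agent sandboxes per rack
CPUS_PER_RACK = 256         # Vera CPUs per rack

per_cpu = ENVIRONMENTS / CPUS_PER_RACK
print(f"~{per_cpu:.0f} environments per CPU")   # roughly 88
```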

NVIDIA BlueField-4 STX: AI-native storage

The NVIDIA BlueField-4 STX rack is built with the NVIDIA BlueField-4 processor, which combines the Vera CPU and ConnectX-9 SuperNIC, and scales out with Spectrum-X Ethernet networking. 

It hosts the NVIDIA CMX context memory storage platform, a new class of AI-native storage infrastructure that seamlessly extends GPU context capacity across the POD and accelerates inference by offloading KV cache into a dedicated, high-bandwidth storage layer. CMX is optimized to store and serve massive context memory (KV cache), treating temporary inference context as an AI-native, shared data type that can be reused across turns, sessions, and agents. This delivers up to 5x higher tokens per second and up to 5x better power efficiency than traditional storage approaches.
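The KV-cache reuse idea can be sketched in miniature. This is a hypothetical illustration of tiered context memory, not the CMX API: hot context stays in a small fast tier, colder entries spill to a larger shared tier, and on reuse they are restored instead of being recomputed.

```python
# Hypothetical sketch of KV-cache offload: keep hot context in limited
# GPU memory and spill colder entries to a larger shared storage tier so
# they can be reused across turns, sessions, and agents. Class names and
# capacities are illustrative only.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity=2):
        self.hbm = OrderedDict()   # fast tier, kept in LRU order
        self.cmx = {}              # large shared context-memory tier
        self.hbm_capacity = hbm_capacity

    def put(self, session_id, kv_blocks):
        self.hbm[session_id] = kv_blocks
        self.hbm.move_to_end(session_id)
        while len(self.hbm) > self.hbm_capacity:
            cold_id, cold_kv = self.hbm.popitem(last=False)
            self.cmx[cold_id] = cold_kv          # offload, don't discard

    def get(self, session_id):
        if session_id in self.hbm:
            self.hbm.move_to_end(session_id)
            return self.hbm[session_id]
        if session_id in self.cmx:               # reuse across turns/agents
            self.put(session_id, self.cmx.pop(session_id))
            return self.hbm[session_id]
        return None                              # true miss: must prefill

cache = TieredKVCache(hbm_capacity=2)
cache.put("agent-a", ["kv0", "kv1"])
cache.put("agent-b", ["kv2"])
cache.put("agent-c", ["kv3"])                    # evicts agent-a to the big tier
assert cache.get("agent-a") == ["kv0", "kv1"]    # restored, not recomputed
```

The point of the sketch is the last line: a context that would otherwise require a full prefill pass is served from the storage tier instead.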

NVIDIA Spectrum-6 SPX: Networking racks

Connecting the entire POD into a single supercomputer are the NVIDIA Spectrum-6 SPX networking racks. The Spectrum-6 SPX networking rack is engineered to accelerate east-west and north-south traffic across AI factories. Configurable with either Spectrum-X Ethernet or NVIDIA Quantum-X800 InfiniBand switches, it delivers low-latency, high-throughput rack-to-rack connectivity at scale. 

The Spectrum-6 SPX rack now includes the 102.4 Tb/s Spectrum-6 switch, which features 512 lanes and 200 Gb/s co-packaged optics (CPO) in single- and multi-chip switch offerings. This silicon photonics integration replaces pluggable transceivers, delivering the highest power efficiency and resiliency, low latency and jitter, and nearly perfect effective bandwidth to keep AI workloads across compute and storage environments perfectly synchronized.

By co-designing these purpose-built racks to operate as one, the Vera Rubin POD is positioned to accelerate every component of agentic AI workloads. This begins with the streamlined NVIDIA MGX rack design that forms the foundation of every rack in the POD. 

Third-generation NVIDIA MGX rack-scale architecture 

Production-grade AI racks must excel across several critical areas: rapid time to volume, proven performance at scale, deep hardware-software co-design, resiliency and energy efficiency, seamless data center deployment and logistics, readiness for future architectures, and more.

The third-generation NVIDIA MGX rack-scale architecture sets the standard across all categories with engineering breakthroughs integrated throughout its mechanical, power, and cooling design.

Enabling resiliency and scalability

The NVIDIA MGX rack prioritizes PCB-based connections with its single-wide design. It unlocks completely modular, cable-free, hose-free, and fanless compute and NVLink switch trays, enabling maximum reliability, scalability, and serviceability. Single 19-inch-wide racks also simplify shipping and logistics, accelerating deployment across AI factories.

The rack features a highly modular spine as its backplane, consisting of up to four preintegrated and prevalidated copper cable cartridges that connect each tray as one. The spine holds thousands of cables and shares the same mechanical form factor for both MGX NVL and MGX ETL racks. 

Ensuring peak energy efficiency from chip to grid

At the component level, NVIDIA MGX racks feature dynamic power steering, where the system provisions power to the components that need it most. This feature can move power between the CPUs, GPUs, and NVLink switch trays to ensure components in the rack operate at peak energy efficiency, improving performance per watt.

AI training and inference workloads create large load swings. If not managed effectively, load swings can cause significant stress on the electrical grid, data center power infrastructure, and IT equipment. 

To protect against power swings, MGX racks feature rack-level energy storage that cushions power transients with capacitors. When workloads suddenly demand more power, the capacitors supply the additional power while the grid power draw remains flat or ramps up gradually. When workloads suddenly stop, the capacitors recharge while the grid power draw remains flat or ramps down gradually.

NVIDIA Vera Rubin NVL72 now introduces Intelligent Power Smoothing. It features 6x more rack-level energy storage (400 J per GPU) versus prior generations, and introduces a new closed-loop system that enables the GPUs to continuously monitor the state of charge of the capacitors to more efficiently flatten power profiles. This achieves much smaller AC power variation per minute, reduces peak current demands by up to 25%, and eliminates the need for massive battery packs to protect against large-scale power transients.
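A minimal simulation can illustrate the closed-loop idea, assuming a simple slew-rate limit on the grid draw. The 400 J-per-GPU figure comes from the post; the time step, ramp limit, and load numbers are purely illustrative:

```python
# Hypothetical sketch of rack-level power smoothing: on-rack capacitors
# absorb or supply the difference between a spiky workload profile and a
# slowly ramping grid draw. Only the 400 J/GPU figure is from the post.

GPUS_PER_RACK = 72
CAP_CAPACITY_J = GPUS_PER_RACK * 400.0      # 28,800 J of storage per rack

def smooth_power(workload_w, dt_s=0.1, grid_ramp_w_per_s=5_000.0):
    """Limit the grid-draw slew rate; capacitors cover the residual."""
    grid_w = workload_w[0]
    charge_j = CAP_CAPACITY_J / 2           # start half charged
    grid_profile = []
    for demand_w in workload_w:
        # Grid power may only ramp slowly toward the workload demand.
        max_step_w = grid_ramp_w_per_s * dt_s
        grid_w += max(-max_step_w, min(max_step_w, demand_w - grid_w))
        # Capacitors discharge when demand exceeds grid draw, else recharge.
        charge_j -= (demand_w - grid_w) * dt_s
        charge_j = max(0.0, min(CAP_CAPACITY_J, charge_j))
        grid_profile.append(grid_w)
    return grid_profile

# A sudden 100 kW -> 110 kW load step: grid draw ramps in small increments
# while the capacitors bridge the gap.
workload = [100_000.0] * 10 + [110_000.0] * 10
grid = smooth_power(workload)
print(f"largest grid step: {max(abs(b - a) for a, b in zip(grid, grid[1:])):.0f} W")
```

Even this toy loop shows the key property: the workload steps by 10 kW instantly, but the grid-facing profile never moves faster than the configured ramp limit.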

At the facility level, provisioning racks at a static Max-P level strands power capacity that could otherwise be used to generate tokens. Static provisioning assumes homogeneous workloads that always require peak power, when in reality AI factories run a mix of workloads with varying power needs.

By provisioning MGX racks at a lower dynamic Max-Q level, data centers can maximize AI data center throughput by dynamically provisioning the correct amount of power to each rack depending on the workload. This frees up stranded power, unlocks up to 30% more GPUs in the same power budget with 45°C liquid cooling, and boosts performance per watt.
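A toy calculation shows why dynamic provisioning frees stranded power. All numbers here are hypothetical; only the idea of Max-P versus Max-Q provisioning comes from the text:

```python
# Illustrative Max-P vs Max-Q provisioning math (all numbers assumed).
# Provisioning every rack at its static worst case (Max-P) strands power;
# provisioning at a dynamic Max-Q level fits more racks in the same budget.

FACILITY_BUDGET_KW = 10_000
RACK_MAX_P_KW = 150          # static worst-case rack power (assumed)
RACK_MAX_Q_KW = 115          # dynamic provisioning level (assumed)

racks_max_p = FACILITY_BUDGET_KW // RACK_MAX_P_KW
racks_max_q = FACILITY_BUDGET_KW // RACK_MAX_Q_KW

print(f"Max-P provisioning: {racks_max_p} racks")
print(f"Max-Q provisioning: {racks_max_q} racks "
      f"(+{100 * (racks_max_q - racks_max_p) // racks_max_p}% more)")
```

With these assumed numbers the same facility budget carries about 30% more racks, in line with the gain the post cites for Max-Q provisioning.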

Unlocking larger energy budgets for compute

All MGX racks are universally designed to operate with 45°C (113°F) warm-water inlet temperatures so data centers already designed for liquid cooling are guaranteed a seamless transition without redesigning cooling infrastructure. Figure 5 shows a schematic representation of infrastructure layout to provide 41°C (105.8°F) water to coolant distribution units (CDUs) that in turn supply coolant at 45°C (113°F) to AI racks.

Operating at 45°C enables data centers in many climates to use ambient air and closed loop dry coolers for cooling, reducing the need for compressors, driving down PUE, and unlocking larger energy budgets for compute. Lower inlet temperatures of 35°C require data centers to divert massive amounts of facility power or water for cooling, while higher inlet temperatures maximize the amount of grid power converted directly into tokens. This yields significant data center power savings—enough to allocate up to 10% additional Vera Rubin NVL72 racks for more token generation in the same power budget. 

MGX racks can be 100% liquid-cooled leveraging the same data center cooling infrastructure as prior generations. The third-generation MGX rack features new internal tray manifolds, rack UQD08 manifolds, and liquid-cooled busbars supporting up to 5,000 A. The coolant used for the rack will depend on the customer and data center, but many will continue to use de-ionized water or propylene glycol-based fluid (PG25), which can last up to 10 years in a closed-loop system with minimal liquid maintenance.

Open standard

Underpinning these features is an open, standardized MGX rack architecture. The first mass-production MGX rack-scale system shipped with NVIDIA Blackwell in 2024. NVIDIA contributed the design to the Open Compute Project (OCP), reinforcing the commitment to open source technologies and enabling the entire ecosystem to rapidly innovate and accelerate adoption. NVIDIA has built an ecosystem of more than 80 global partners, creating a highly efficient, globally diversified supply chain that is experienced in bringing rack-scale AI systems to market.

NVIDIA MGX NVL racks 

As independent third-party SemiAnalysis InferenceMax benchmarks demonstrate, NVIDIA rack-scale systems deliver 50x better performance per watt and 35x lower cost per token (NVIDIA GB300 NVL72 versus NVIDIA H200), which translates directly into higher revenues and better operating margins. 

In 2024, NVIDIA shipped the first NVIDIA GB200 NVL72 rack-scale systems. In 2025, NVIDIA GB300 NVL72 was shipped. Now, NVIDIA Vera Rubin NVL72 is in full production, on track to ship in the second half of 2026. 

Streamlined design of NVIDIA Vera Rubin NVL72

NVIDIA Vera Rubin NVL72 is an engineering marvel designed to drop seamlessly into existing data center footprints. It will feature nearly twice as many transistors as NVIDIA GB200 NVL72 while delivering 10x more performance per watt through extreme co-design. The rack integrates 72 NVIDIA Rubin GPUs, 36 NVIDIA Vera CPUs, ConnectX-9 SuperNICs, and BlueField-4 DPUs across 18 compute trays, alongside 9 NVLink switch trays. In total, the rack houses 1.3 million individual components and nearly 1,300 chips, all packed into a single-wide third-generation NVIDIA MGX rack weighing roughly 4,000 lbs, or about the weight of a pickup truck.

Compute and NVLink Switch trays

Enabling these 72 GPUs to act as a single unified engine is the sixth-generation NVLink. It delivers 3.6 TB/s of bandwidth per GPU and 260 TB/s of scale-up bandwidth per rack, more than the bandwidth of the entire global internet. This high-speed data transfer happens in the NVLink spine at the back of the rack, which features four modular preintegrated cable cartridges housing 5,000 copper cables totaling over two miles in length.
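As a quick sanity check of the arithmetic quoted above:

```python
# 72 GPUs x 3.6 TB/s of sixth-generation NVLink bandwidth per GPU.
GPUS = 72
PER_GPU_TBPS = 3.6
rack_tbps = GPUS * PER_GPU_TBPS
print(f"{rack_tbps:.1f} TB/s scale-up bandwidth per rack")   # 259.2, ~260 TB/s
```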

Video 1. Key differences between the NVIDIA Vera Rubin compute tray and the NVIDIA Grace Blackwell compute tray

The compute trays inside the Vera Rubin NVL72 are completely redesigned from NVIDIA Blackwell. Each tray features a robust PCB midplane designed to fit in a single-wide rack, unlocking a cable-free, hose-free, and fanless design. This simplification drops compute tray assembly time from nearly two hours to just five minutes, up to 20x faster assembly and serviceability.

Each compute tray features two NVIDIA Vera Rubin superchips with 17,000 components each, approximately five times as many components as a modern smartphone. Through the PCB midplane, the superchips connect to front modular bays that house eight ConnectX-9 SuperNICs and one BlueField-4 DPU.

Vera Rubin NVL72 introduces new rack-scale resiliency features designed to maximize uptime and goodput for large AI clusters. The NVLink switch trays support operational resiliency features that allow administrators to place switches into maintenance mode and replace them while the rack continues operating. The architecture also supports continued operation even if multiple switch trays are unavailable, minimizing disruption during maintenance. 

At the silicon level, NVIDIA Rubin GPUs continuously run nondisruptive health checks and NVIDIA Vera CPUs feature in-system testing and SOCAMM memory for faster serviceability. Together, these chip-to-rack innovations reduce operational overhead and build on the resiliency improvements seen with Blackwell clusters.

NVIDIA Vera Rubin Ultra NVL576

NVIDIA Vera Rubin Ultra introduces a new two-layer all-to-all NVLink topology that will enable developers to scale up to 576 GPUs. Vera Rubin Ultra NVL576 will combine eight separate MGX NVL racks, each with 72 Rubin Ultra GPUs, into a single 576-GPU NVLink domain with copper and direct optical connections. It will be built using the same MGX rack-scale ecosystem for the fastest time to production.

Demonstrating this massive multirack NVLink topology is Polyphe, NVIDIA's internal, fully functional GB200-based prototype of the multirack NVL576 scale-up architecture.

NVIDIA Kyber NVL1152: The next generation 

To scale beyond NVL576, a new MGX rack, NVIDIA Kyber, will be introduced. NVIDIA Kyber is the next-generation MGX NVL rack design that will double the NVLink domain per rack to fit 144 GPUs.

NVIDIA Kyber will scale up into a massive all-to-all NVL1152 supercomputer using similar direct optical interconnects for rack-to-rack scale-up. Kyber provides the foundation for the next era of extreme scale-up AI computing using NVIDIA Feynman. Kyber will first be introduced with Vera Rubin Ultra as a standalone NVL144 system, providing customers with three options for Vera Rubin Ultra NVLink scale-up domains: NVL72, NVL144, and the flagship NVL576.

NVIDIA MGX ETL racks 

While NVIDIA MGX NVL racks provide massive scale-up compute domains, agentic AI workflows demand highly specialized nodes for extreme low-latency inference, CPU sandboxing, and accelerated context memory for KV cache. To support these diverse needs, Vera Rubin introduces the MGX ETL rack architecture, a new fully configurable MGX rack designed with a Spectrum-X Ethernet spine or a direct chip-to-chip spine, leveraging the same rack-scale ecosystem as MGX NVL racks.

MGX ETL shares the same form factor and physical infrastructure as MGX NVL racks and is designed to operate under the same mechanical, power, and cooling envelope. Both racks will share the same key rack components built by the experienced MGX ecosystem: racks, chassis, trays, cable cartridges, liquid cooling manifolds, quick disconnects, busbars (standard and liquid cooled), support bracketry, side rails, power shelves, leak containment trays, tray handles, and more. 

MGX ETL will use preintegrated and prevalidated copper cable cartridges with either a Spectrum-X Ethernet spine or a direct chip-to-chip spine. MGX ETL will leverage the established MGX ecosystem and supply chain that has built this rack architecture in high volume for multiple years.

NVIDIA Spectrum-X Ethernet spine

MGX ETL with a Spectrum-X Ethernet spine will be the foundation for the Vera CPU rack and the BlueField-4 STX storage rack in the Vera Rubin POD. The rack is highly configurable and can also house up to 256 Rubin GPUs (HGX Rubin NVL8 systems), XPUs, or more.

In this design, 1U MGX ETL switch trays (based on Spectrum-6) sit in the middle of the rack. Rear-facing ports connect to the copper spine, while 32 front-facing OSFP cages provide optical transceiver connectivity to the rest of the POD.

MGX ETL leverages a Spectrum-X Multiplane topology that fans out the 200 Gb/s lanes across multiple switches, delivering full all-to-all connectivity among nodes within the rack while maintaining a single network tier. The preintegrated copper spine provides resilient, power-efficient connectivity (enabling connectivity between ETL racks with a single tier of optics) and extends purpose-built Spectrum-X Ethernet with zero jitter, noise isolation, and load balancing across the entire 256-chip rack.

Direct chip-to-chip spine 

Designed for extreme low-latency inference, the LPX rack connects 256 LPUs as one. It features 32 compute trays, each with eight LPUs, connected by a direct chip-to-chip spine at the back of the rack: two copper cable cartridges, in the same mechanical form factor as other MGX cable cartridges, that create an intricate point-to-point topology over thousands of paired copper cable connections. This massive interconnected fabric enables the entire 256-LPU rack to act as a single fast inference engine deployed with Vera Rubin NVL72.

When scaled to multiple LPX racks in data center deployments, the direct chip-to-chip links are maintained across racks, enabling multiple LPX racks to operate as a single, incredibly fast inference engine.

NVIDIA Vera Rubin DSX AI factory platform

NVIDIA Vera Rubin DSX is the AI factory platform that provides a blueprint and reference design for co-designed AI infrastructure from chip to grid. It maximizes grid-power-to-token efficiency and goodput, and accelerates time to first production.

NVIDIA Vera Rubin DSX unifies chips, systems, software libraries, APIs, and a global partner ecosystem into a single architecture that tightly integrates compute, networking, storage, power, cooling, and facility controls across the entire AI factory. This enables ecosystem partners to rapidly design, deploy, and scale gigawatt AI factories with maximum token throughput per watt and improved uptime from resiliency and energy efficiency built into the DSX platform end-to-end.

Learn more about NVIDIA Vera Rubin POD 

AI infrastructure is rapidly evolving from discrete chips, standalone servers, and rack-scale systems to co-designed POD-scale supercomputers and AI factories. Modern agentic AI workloads are driving a shift toward purpose-built AI infrastructure that integrates compute, networking, and storage into a single cohesive supercomputer. The NVIDIA Vera Rubin POD unifies five rack-scale systems with key mechanical, power, and cooling innovations from the third-generation NVIDIA MGX rack, delivering scalability, resiliency, and energy efficiency.

At AI factory scale, the NVIDIA Vera Rubin DSX Reference Design and the NVIDIA Omniverse DSX Blueprint for AI factory digital twins provide a unified framework for building and operating AI factories. Together, these innovations deliver dramatic gains in performance, cost efficiency, and energy savings to power the era of agentic applications.

Join us for NVIDIA GTC 2026 and watch the GTC keynote with NVIDIA founder and CEO Jensen Huang.
