The exponential growth of generative AI, large language models (LLMs), and high-performance computing has created unprecedented demands on data center infrastructure. Traditional server architectures struggle to accommodate the power density, thermal requirements, and rapid iteration cycles of modern accelerated computing.
This post explains the benefits of NVIDIA MGX, a modular reference architecture for accelerated computing that is redefining how enterprises and cloud providers build scalable AI factories.
Why modular architecture matters now
With NVIDIA MGX, partners can use a building-block approach to design multiple systems, cutting development costs and time to market. NVIDIA MGX is designed to span multiple product generations and supports hundreds of GPU, DPU, CPU, storage, and networking combinations for AI, high-performance computing (HPC), and digital twins.
Three major trends are driving the adoption of NVIDIA MGX:
- Power density and cooling: The requirements of modern AI compute are driving an increase in power density and liquid-cooled infrastructure. For example, NVIDIA Blackwell GPUs draw up to 120 kW per rack, which demands a full rack-scale solution to meet the many technical requirements that follow. MGX addresses these demands with liquid-cooled busbars and manifolds, maintaining a coolant temperature differential of less than 15°C even under 1400A loads (see the coolant-flow sketch after this list). This enables high-density, rack-scale deployments without compromising performance or reliability.
- Heterogeneous workload support: Enterprises are managing a growing diversity of workloads within single data centers, including AI post-training using 72-GPU NVIDIA GB200 NVL72 clusters, inference tasks that require test-time scaling, and digital twin simulations. The modular, mix-and-match compatibility of MGX enables organizations to tailor their infrastructure for specific workloads without the need to redesign entire racks.
- Supply chain agility: MGX enables pre-integration of approximately 80% of components—including busbars, coldplates, and power whips—at the factory. This streamlines the build process, allowing ODMs to cut deployment timelines from 12 months to less than 90 days.
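To put the cooling figures above in perspective, the sketch below estimates the coolant flow needed to remove 120 kW per rack while holding the temperature differential under 15°C. It is a back-of-the-envelope calculation assuming a water-like coolant; actual MGX coolant properties, flow rates, and manifold designs are vendor-specified.

```python
# Back-of-the-envelope coolant flow estimate for a ~120 kW MGX rack.
# Assumes a water-like coolant (c_p ~= 4186 J/(kg*K), density ~= 1 kg/L);
# real deployments use vendor-specified coolants and flow rates.

RACK_POWER_W = 120_000      # ~120 kW per Blackwell rack (from the post)
MAX_DELTA_T_K = 15.0        # target coolant temperature rise < 15°C
SPECIFIC_HEAT = 4186.0      # J/(kg*K), water at room temperature
DENSITY_KG_PER_L = 1.0      # approximate for a water-based coolant

# Q = m_dot * c_p * delta_T  ->  m_dot = Q / (c_p * delta_T)
mass_flow_kg_s = RACK_POWER_W / (SPECIFIC_HEAT * MAX_DELTA_T_K)
volume_flow_lpm = mass_flow_kg_s / DENSITY_KG_PER_L * 60.0

print(f"Required coolant flow: {mass_flow_kg_s:.2f} kg/s "
      f"(~{volume_flow_lpm:.0f} L/min) to hold delta-T under {MAX_DELTA_T_K:.0f} K")
```

For this rack power, the estimate works out to roughly 1.9 kg/s (about 115 L/min), which illustrates why rack-level manifolds and quick disconnects, rather than per-server air cooling, become the practical design point.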
Building on these trends, standardized and stable architectures like MGX ensure reliable, compatible server deployments that support evolving performance needs without sacrificing interoperability. This stability is essential for enterprises looking to future-proof their infrastructure investments while maintaining the flexibility to adapt to new workloads and technologies as they emerge.
Diverse sourcing options within the MGX ecosystem minimize investment risks, shorten lead times, and reduce uncertainty by allowing flexible selection of components and avoiding vendor lock-in. By enabling partners to choose from a broad array of certified components, MGX empowers organizations to optimize their data center builds for cost, performance, and supply chain resilience.
Streamlined integration through the modular, standards-based MGX design eliminates the need for custom solutions, enabling rapid, cost-effective deployment and easier scaling. This approach not only accelerates time to market but also simplifies ongoing maintenance and upgrades, enabling enterprises to efficiently expand their AI factories as demand grows and technology evolves.
Inside the MGX rack system
Two module types are central to the NVIDIA MGX rack system: compute trays and NVLink switch trays. Each compute tray houses powerful combinations of CPUs and GPUs, such as NVIDIA Grace CPUs paired with NVIDIA Blackwell GPUs, delivering the core accelerated computing performance required for AI training, inference, and simulation workloads. NVLink switch trays provide the high-speed, low-latency interconnect fabric that links the compute trays together, enabling seamless GPU-to-GPU communication and efficient scaling across the entire rack.
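As a minimal illustration of how these two module types compose into a rack, the sketch below models a GB200 NVL72-style configuration: 18 compute trays (each pairing two Grace CPUs with four Blackwell GPUs) and nine NVLink switch trays. The class and field names are hypothetical, for illustration only, not an NVIDIA API.

```python
from dataclasses import dataclass

# Illustrative model of MGX tray-level building blocks; class and field
# names are hypothetical, not an NVIDIA API. Tray counts follow the
# published GB200 NVL72 layout.

@dataclass
class ComputeTray:
    grace_cpus: int = 2       # NVIDIA Grace CPUs per compute tray
    blackwell_gpus: int = 4   # NVIDIA Blackwell GPUs per compute tray

@dataclass
class NVLinkSwitchTray:
    nvlink_switches: int = 2  # NVLink switch chips per switch tray

@dataclass
class MGXRack:
    compute_trays: list[ComputeTray]
    switch_trays: list[NVLinkSwitchTray]

    @property
    def total_cpus(self) -> int:
        return sum(t.grace_cpus for t in self.compute_trays)

    @property
    def total_gpus(self) -> int:
        return sum(t.blackwell_gpus for t in self.compute_trays)

# A GB200 NVL72-style configuration: 18 compute trays + 9 switch trays
rack = MGXRack(compute_trays=[ComputeTray() for _ in range(18)],
               switch_trays=[NVLinkSwitchTray() for _ in range(9)])
print(f"{rack.total_cpus} Grace CPUs, {rack.total_gpus} Blackwell GPUs")  # 36, 72
```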
However, a fully functional MGX rack is much more than just compute and switch trays. To operate at the scale and efficiency demanded by modern AI factories, the system relies on a robust foundation of mechanical, electrical, and plumbing (cooling) infrastructure, including:
- Mechanical components: The modular MGX rack itself provides the structural integrity and serviceability needed for high-density data center deployments. The Power Shelf Bracket secures power shelves within the rack, while the Slide Rail enables smooth installation and maintenance of rack-mounted equipment.
- Electrical components: Essential for power delivery and connectivity, the MGX 54V Busbar and MGX 1400A Busbar distribute power efficiently across the rack, supporting high-performance computing loads. The 33 kW Power Shelf supplies substantial power to the system, while the MGX Power Whip offers flexible connections between power shelves and busbars. High-speed data transfer is facilitated by the MGX Highspeed Cable, ensuring optimal communication for compute and switch trays.
- Plumbing or cooling components: The MGX Coldplate provides efficient liquid cooling for GPUs, maintaining optimal operating temperatures. The MGX 44RU Manifold manages coolant distribution within the rack. Quick disconnects, such as the MGX NVQD (NVIDIA quick disconnect) and MGX UQD (Universal quick disconnect), enable fast and secure connections for liquid cooling lines, simplifying maintenance and minimizing downtime.
This modular approach supports significant time savings, as standard components can be pre-installed at the factory and integrated on-site with plug-and-play power and cooling units.
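One way to picture the factory-versus-site split is a simple checklist like the sketch below. The component names come from the lists above, but which items ship pre-integrated, and the resulting share, are illustrative assumptions rather than an official MGX manufacturing breakdown.

```python
# Illustrative factory-vs-site integration checklist for an MGX rack build.
# Component names come from the post; the split between factory and site
# integration is an assumption for illustration, not an official MGX spec.

FACTORY_PREINTEGRATED = [
    "MGX 1400A Busbar", "MGX 54V Busbar", "MGX Power Whip",
    "MGX Coldplate", "MGX 44RU Manifold", "Power Shelf Bracket",
    "Slide Rail", "MGX Highspeed Cable",
]
SITE_INSTALLED = [
    "Compute trays", "NVLink switch trays",  # plug-and-play units on-site
]

total = len(FACTORY_PREINTEGRATED) + len(SITE_INSTALLED)
share = len(FACTORY_PREINTEGRATED) / total
print(f"Factory pre-integration share: {share:.0%} of line items")
```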
MGX components in the NVIDIA GB200 NVL72 and GB300 NVL72 systems serve as the foundational infrastructure for managing power density and thermal loads, enabling these liquid-cooled, rack-scale platforms to deliver unprecedented AI performance. By integrating the liquid-cooled MGX architecture into the Blackwell compute nodes, NVIDIA addresses the 120 kW per rack energy demands of the GB200 NVL72, while the GB300 NVL72, with its 72 Blackwell Ultra GPUs, requires even greater thermal coordination to achieve its 50x higher AI reasoning output.
This design philosophy requires tight collaboration among mechanical engineering teams for optimized coolant distribution, power supply experts for high-efficiency voltage regulation, and manufacturing partners implementing front-access serviceability features. These efforts are unified through NVIDIA chip-to-chip NVLink interconnect technology, which binds 36 Grace CPUs and 72 to 144 GPUs into a coherent compute domain. The resulting co-engineered solution achieves 25x better energy efficiency than previous NVIDIA H100 clusters, demonstrating how MGX-enabled system integration transforms raw compute power into scalable AI infrastructure.
Transforming AI factory design and deployment
NVIDIA MGX delivers tangible benefits across the data center ecosystem.
For system builders, MGX reduces R&D costs by $2–4 million per platform through shared reference designs, and allows teams to certify once for the full NVIDIA software stack, including NVIDIA CUDA-X, NVIDIA AI Enterprise, and NVIDIA Omniverse.
Data center operators benefit from the ability to scale seamlessly from 8-GPU nodes to 144-GPU racks using consistent power and cooling interfaces, while also achieving up to 50% lower total cost of ownership thanks to 94% power supply efficiency and reusable plumbing.
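The efficiency figure translates directly into facility-level savings. The sketch below compares conversion losses at the post's 94% power shelf efficiency against an assumed 90% conventional baseline for a 120 kW rack; the baseline value is an assumption for comparison, not a measured competitor figure.

```python
# Rough power-delivery loss comparison at the rack level. The 94% power
# shelf efficiency is from the post; the 90% baseline is an assumed
# comparison point, not a measured figure.

IT_LOAD_W = 120_000           # rack IT load (~120 kW, from the post)
MGX_EFFICIENCY = 0.94         # MGX power shelf efficiency (from the post)
BASELINE_EFFICIENCY = 0.90    # assumed conventional PSU efficiency

def wall_power(load_w: float, efficiency: float) -> float:
    """Facility power needed to deliver load_w at a given PSU efficiency."""
    return load_w / efficiency

mgx_losses = wall_power(IT_LOAD_W, MGX_EFFICIENCY) - IT_LOAD_W
base_losses = wall_power(IT_LOAD_W, BASELINE_EFFICIENCY) - IT_LOAD_W
print(f"Conversion losses per rack: {mgx_losses/1e3:.1f} kW (MGX) "
      f"vs {base_losses/1e3:.1f} kW (90% baseline)")
```

At this load, the difference is several kilowatts of waste heat per rack, which compounds across a data hall in both electricity and cooling costs.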
For AI workloads, MGX empowers organizations to train models with up to 1.8 trillion parameters on coherent 72-GPU domains using NVLink switches, and to deploy inference clusters with less than 5 milliseconds of latency variance across 72-node racks.
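A rough capacity check shows why a coherent 72-GPU domain matters at this scale. The sketch below estimates the weight footprint of a 1.8-trillion-parameter model; the FP8 precision and the approximate per-GPU HBM capacity are assumptions for illustration, not a deployment recipe.

```python
# Rough memory-footprint check for a 1.8-trillion-parameter model on a
# 72-GPU NVLink domain. Bytes-per-parameter (FP8) and per-GPU HBM capacity
# are assumptions for illustration; real footprints also include KV cache,
# activations, and (for training) optimizer state.

PARAMS = 1.8e12               # model size from the post
BYTES_PER_PARAM_FP8 = 1       # assumed inference precision
GPUS_PER_DOMAIN = 72          # coherent NVLink domain size
HBM_PER_GPU_GB = 186          # approximate HBM3e per Blackwell GPU

weights_tb = PARAMS * BYTES_PER_PARAM_FP8 / 1e12
per_gpu_gb = PARAMS * BYTES_PER_PARAM_FP8 / GPUS_PER_DOMAIN / 1e9
print(f"FP8 weights: {weights_tb:.1f} TB total, {per_gpu_gb:.0f} GB per GPU "
      f"of ~{HBM_PER_GPU_GB} GB HBM each")
```

Even at 1.8 TB of weights, sharding across the NVLink domain leaves each GPU with ample headroom for activations and KV cache, which is what makes single-domain training and low-variance inference practical.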
Get started
NVIDIA MGX is more than a rack standard—it is the foundation for the AI factory era. With more than 200 ecosystem partners already adopting MGX components, enterprises gain a future-proof path to exascale AI. As NVIDIA Blackwell, NVIDIA Rubin, and beyond push computing boundaries, the MGX modular architecture ensures AI factories can evolve alongside silicon innovations while protecting data center investments through modular upgrade paths.
Get started with NVIDIA MGX. To learn more, join NVIDIA founder and CEO Jensen Huang for the COMPUTEX 2025 keynote and attend GTC Taipei sessions at COMPUTEX 2025.