Maximize AI Factory Energy Efficiency Through Full-Stack Inference and Training Optimizations

Jun 23, 2026

By Sachin Idgunji, Kibibi Moseley and Harry Petty

AI-Generated Summary

NVIDIA achieves industry-leading cost efficiency for AI inference and training through extreme system co-design, integrating power, cooling, and infrastructure optimization, as well as collaboration across OEM, ODM, CSP, NCP, systems integrators, ISVs, and model ecosystem partners.
Performance per watt is maximized by combining advanced hardware (such as the NVIDIA GB200 NVL72 and liquid-cooled rack-scale designs), system-level controls (dynamic power allocation, in-rack power smoothing, real-time telemetry), and software innovations (NVIDIA DSX, NVIDIA Dynamo, NVIDIA TensorRT-LLM, and precision formats like NVFP4).
Energy-aware training techniques, including coordinated GPU speed tuning and fine-grained profiling (pioneered in collaboration with the ML.ENERGY Initiative and Megatron-LM), reduce idle time and optimize parallelism, achieving up to 25% energy savings without increasing training time and enabling greater token output within fixed power budgets.

Power can account for 40% of the operating expenses (OpEx) to run an AI factory. Each watt can be spent on overhead, data ingestion, training, or generating tokens for customers. And most sites are capped at a fixed power level provided by a regional provider. Under these conditions, performance per watt becomes a key efficiency metric that directly translates to token costs.

NVIDIA delivers the lowest cost per token for AI inference workloads and the lowest cost to train large models. This is possible through extreme co-design with power, cooling, and system infrastructure, and through deep collaboration with OEM, ODM, CSP, NCP, systems integrator, ISV, and model ecosystem partners.

This post explores the levers that an operator can use to maximize performance per watt and minimize token cost in an AI factory.

Why is inference optimization important for AI factories?

Inference drives revenue, so it is the key workload to optimize. When operators increase inference throughput per watt, they directly increase the number of tokens they can sell or insights they can create. This also translates to additional revenue per unit of time.

At the hundred megawatt to gigawatt scale, even a few percentage points of throughput improvement per megawatt can translate into meaningful gains in profit.
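
As a rough illustration of how a small throughput gain compounds at this scale, the sketch below computes daily token output and electricity cost per million tokens for a power-capped site. All inputs (a 100 MW cap, 50,000 tokens per second per megawatt, USD 80 per MWh) and the function names are hypothetical and chosen only to make the arithmetic easy to follow.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_HOUR = 3_600


def tokens_per_day(site_power_mw: float, tokens_per_sec_per_mw: float) -> float:
    """Daily token output of a site running at a fixed power cap."""
    return site_power_mw * tokens_per_sec_per_mw * SECONDS_PER_DAY


def energy_cost_per_million_tokens(tokens_per_sec_per_mw: float, usd_per_mwh: float) -> float:
    """Electricity cost in USD to generate one million tokens."""
    # 1 MW sustained for 1 hour consumes 1 MWh and yields this many tokens.
    tokens_per_mwh = tokens_per_sec_per_mw * SECONDS_PER_HOUR
    return usd_per_mwh / tokens_per_mwh * 1_000_000


# Hypothetical site: 100 MW cap, 50,000 tokens/s per MW, USD 80 per MWh.
baseline = tokens_per_day(100, 50_000)
improved = tokens_per_day(100, 50_000 * 1.03)  # 3% better throughput per MW
extra_tokens_per_day = improved - baseline

baseline_cost = energy_cost_per_million_tokens(50_000, 80.0)
improved_cost = energy_cost_per_million_tokens(50_000 * 1.03, 80.0)
```

With these made-up inputs, a 3% gain adds roughly 13 billion tokens per day at the same site power, and the electricity cost per million tokens falls in proportion.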

Model architecture is also important. Mixture-of-experts (MoE) models are typically more energy efficient per unit of intelligence compared to dense models with similar total parameters because only a subset of experts is active per token. For example, DeepSeek-R1 has a large parameter count, a fraction of which is activated for each token. It achieves higher task performance at a similar or lower per‑token compute cost than dense predecessors. In other words, the MoE design delivers more intelligence for the same or less energy spent producing each token.
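
A back-of-the-envelope way to see this is the common rule of thumb that a forward pass costs about two floating point operations per active parameter per token. The sketch below applies it to two hypothetical models with similar total size. The parameter counts are illustrative and are not DeepSeek-R1 figures, and the estimate ignores attention over the context, routing overhead, and memory traffic, so energy will not track it exactly.

```python
def forward_flops_per_token(active_params: float) -> float:
    """Approximate forward-pass FLOPs per token: one multiply and one add per active weight."""
    return 2.0 * active_params


# Hypothetical models of similar total size.
dense_active = 600e9  # dense: every parameter is used for every token
moe_active = 40e9     # MoE: only the routed experts and shared layers are used

dense_flops = forward_flops_per_token(dense_active)
moe_flops = forward_flops_per_token(moe_active)

# How many times more compute the dense model spends per token.
compute_ratio = dense_flops / moe_flops
```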

How to optimize for system-level energy use and performance per watt

NVIDIA architectures and platforms are engineered to increase the amount of intelligence produced per watt with each generation. Across six architecture generations, NVIDIA has improved inference throughput per megawatt by 1,000,000x.

The NVIDIA GB200 NVL72 rack-scale system increases energy efficiency through extreme co-design, with dense, direct-to-chip liquid-cooled architecture that delivers more throughput per watt. It uses in-rack power smoothing to flatten peak current spikes, enabling operators to safely deploy more GPUs within the same power and infrastructure budget.

In addition, NVIDIA DSX is an open, AI factory-scale platform that provides dynamic power allocation, real-time telemetry, and advanced rack-level controls that recover stranded power and increase tokens per watt.

Floating point precision adds another layer: higher-precision calculations are generally slower and consume more energy, while narrow-precision formats like NVFP4 are more energy-efficient and can deliver higher throughput at accuracy equivalent to FP8.

Equally important, NVIDIA Dynamo and NVIDIA TensorRT-LLM help translate these gains into real-world inference performance by boosting throughput, lowering costs, and scaling reasoning models more efficiently across GPU infrastructure.

Overall energy use is governed by the amount of computation, hardware efficiency, GPU utilization, and where the system operates on the speed/energy tradeoff frontier. As a result, system design, removing non‑GPU bottlenecks, and tuning batch size for use case, memory, and parallelism are key levers for optimizing energy use and throughput per watt.
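
One way to reason about the batch-size lever is to measure throughput, average power, and latency at several batch sizes, then pick the configuration with the most tokens per joule that still meets the latency target. The sketch below assumes such measurements already exist. The `Sample` type, the `best_batch` helper, and every number in `SAMPLES` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    batch_size: int
    tokens_per_sec: float
    avg_power_w: float
    p95_latency_ms: float


def tokens_per_joule(sample: Sample) -> float:
    """Throughput divided by power: (tokens/s) / (J/s) = tokens/J."""
    return sample.tokens_per_sec / sample.avg_power_w


def best_batch(samples: List[Sample], max_latency_ms: float) -> Sample:
    """Most energy-efficient measured configuration that meets the latency target."""
    eligible = [s for s in samples if s.p95_latency_ms <= max_latency_ms]
    if not eligible:
        raise ValueError("no measured configuration meets the latency target")
    return max(eligible, key=tokens_per_joule)


# Made-up measurements for a single GPU serving one model.
SAMPLES = [
    Sample(batch_size=1, tokens_per_sec=100, avg_power_w=400, p95_latency_ms=20),
    Sample(batch_size=8, tokens_per_sec=700, avg_power_w=600, p95_latency_ms=35),
    Sample(batch_size=32, tokens_per_sec=2000, avg_power_w=900, p95_latency_ms=80),
    Sample(batch_size=128, tokens_per_sec=4000, avg_power_w=1000, p95_latency_ms=300),
]

interactive_choice = best_batch(SAMPLES, max_latency_ms=100)
offline_choice = best_batch(SAMPLES, max_latency_ms=1000)
```

In this made-up data, larger batches are more efficient per token, so the latency target of the use case decides how far along that curve an operator can go.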

Optimizing energy efficiency in LLM training

Large model training requires the distribution of work across multiple GPUs using a combination of multiple parallelization methods. During training, pushing for maximum iteration speed comes at the cost of very large energy consumption.

Further, work is rarely balanced perfectly across GPUs, so some GPUs sit idle while others finish their computations. Energy is wasted when every GPU sprints to complete its task, only to wait idle for the others to finish and synchronize.

Researchers from the ML.ENERGY Initiative at the University of Michigan have shown that tuning the processing speed of individual GPUs can reduce energy bloat in large model training. GPUs with more work are on the critical path (the slowest chain of tasks in the pipeline) and run at maximum speed, while GPUs with less work are intentionally slowed down.

This achieves the following:

Idle time from GPUs finishing early is minimized
GPUs running at lower speed use less energy
End-to-end training time remains unchanged
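
The effect of these three points can be seen in a toy model. The sketch below is not the ML.ENERGY algorithm. It assumes that each GPU has a fixed amount of work per iteration, that run time scales inversely with clock speed, and that busy power is a static term plus a dynamic term growing with the cube of the speed fraction. The cubic power model, the wattages, and the function names are assumptions made for illustration.

```python
from typing import List, Tuple


def plan_speeds(work_s: List[float], min_speed: float = 0.5) -> List[float]:
    """Speed fraction per GPU so that each finishes as close to the critical path as allowed."""
    critical = max(work_s)  # time of the slowest GPU at full speed
    return [max(t / critical, min_speed) for t in work_s]


def iteration_energy_j(
    work_s: List[float],
    speeds: List[float],
    static_w: float = 100.0,   # assumed power drawn even when idle
    dynamic_w: float = 600.0,  # assumed extra power at full speed
) -> Tuple[float, float]:
    """Return (total energy in joules, iteration time in seconds) under the toy model."""
    busy_times = [t / s for t, s in zip(work_s, speeds)]
    iteration_time = max(busy_times)
    total = 0.0
    for busy, s in zip(busy_times, speeds):
        busy_power = static_w + dynamic_w * s ** 3  # assumed cubic scaling
        total += busy_power * busy + static_w * (iteration_time - busy)
    return total, iteration_time


# Seconds of work per GPU at full speed; the first GPU is on the critical path.
work = [1.0, 0.8, 0.5, 0.9]

sprint_energy, sprint_time = iteration_energy_j(work, [1.0] * len(work))
tuned_energy, tuned_time = iteration_energy_j(work, plan_speeds(work))
savings_fraction = 1.0 - tuned_energy / sprint_energy
```

Under these assumptions the iteration time is identical in both cases, while the tuned plan spends less energy because no GPU runs faster than it needs to. Real savings depend on the actual power curve of the hardware and on how finely clocks can be set.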

Megatron-LM is the NVIDIA open source reference implementation for training large-scale language models. In collaboration with the ML.ENERGY team, NVIDIA continues to advance Megatron-LM training energy efficiency by profiling power and performance behavior at the kernel, scheduling, and parallelism levels, and then using those measurements to guide targeted, energy‑aware optimizations.

This work includes:

Implementing fine‑grained kernel and phase‑level energy profiling to identify compute, memory, communication, and power‑limited regions
Analyzing how parallelism configurations, pipeline imbalance, and communication overlap impact performance‑per‑watt
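
A minimal sketch of what phase-level energy accounting can look like is shown below. It is not Megatron-LM code. It assumes a cumulative energy counter that can be read at phase boundaries, and it takes that reader as a plain callable so the example runs without a GPU. On NVIDIA GPUs such a reader could plausibly be backed by the NVML cumulative energy counter (`nvmlDeviceGetTotalEnergyConsumption`, which reports millijoules), with a device synchronization before each read because GPU work is launched asynchronously. The class and method names here are hypothetical.

```python
import time
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


class PhaseEnergyProfiler:
    """Accumulates energy and wall time per named phase from a cumulative energy counter."""

    def __init__(
        self,
        read_energy_j: Callable[[], float],
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._read = read_energy_j
        self._clock = clock
        self.energy_j: Dict[str, float] = {}
        self.time_s: Dict[str, float] = {}

    @contextmanager
    def phase(self, name: str) -> Iterator[None]:
        e0, t0 = self._read(), self._clock()
        try:
            yield
        finally:
            # Add the deltas so repeated phases (one per iteration) accumulate.
            self.energy_j[name] = self.energy_j.get(name, 0.0) + (self._read() - e0)
            self.time_s[name] = self.time_s.get(name, 0.0) + (self._clock() - t0)

    def avg_power_w(self, name: str) -> float:
        return self.energy_j[name] / self.time_s[name]


# Fake counters stand in for real telemetry so the sketch runs anywhere.
fake_energy = iter([0.0, 50.0, 50.0, 80.0])  # joules, cumulative
fake_clock = iter([0.0, 0.1, 0.1, 0.2])      # seconds

profiler = PhaseEnergyProfiler(lambda: next(fake_energy), lambda: next(fake_clock))
with profiler.phase("forward"):
    pass  # compute-heavy work would run here
with profiler.phase("all_reduce"):
    pass  # communication would run here
```

Comparing average power per phase is one simple way to tell compute-heavy regions from communication-bound ones, which typically draw less power.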

These insights are used to design energy-aware scheduling and GPU frequency/power-cap tuning aligned with the true critical path of training iterations. The next step is to outline how these techniques will be applied to larger-scale Megatron-LM training.

This work aims to increase energy efficiency so that model training can be completed faster within the same power envelope or achieve the same training throughput with less energy. As a result, power can be redirected to additional training runs or from training to inference on the same optimized infrastructure—increasing token generation without raising total site power. To learn more, see Kareus: Joint Reduction of Dynamic and Static Energy in Large Model Training.

How does NVIDIA DSX optimize AI factory performance?

The ML.ENERGY Initiative has developed a leaderboard and benchmark for sharing observations from their measurements and a reasoning framework that explains why they observe certain energy behaviors.

These benchmarks can be tied into energy-aware operations: telemetry-driven systems that show how to run an AI factory under real deployment constraints, including power cost, carbon intensity, thermals, cooling capacity, and grid limits.

NVIDIA DSX provides these energy-aware operations. The platform delivers a coordinated view across compute, racks, cooling, facility power, and workload scheduling. It provides a common operational architecture that can connect design-time simulation with runtime telemetry, helping operators understand where power is being used, where it is stranded, and how much additional useful compute can fit within a fixed site envelope.

DSX defines how AI factories are designed, built, and optimized across the full stack, from chips and systems to infrastructure software, facilities, digital twins, and partner technologies. It combines open software libraries, workflow guides, and reference designs with NVIDIA compute platforms and co-designed OEM infrastructure to enable a broad ecosystem of software and hardware solutions.

By aligning every layer through a common architecture, DSX improves tokens per watt, accelerates deployment, and strengthens operational reliability and resiliency.

DSX manages power efficiency and behaviors within the rack, at the AI factory level, and between the AI factory and the grid. DSX MaxLPS operates inside the AI factory, while DSX Flex operates between the grid and the factory.

DSX MaxLPS is a suite of technologies for maximizing AI factory throughput, including:

45°C liquid cooling: By leveraging integrated chip, thermal, and system-level innovations, operators can utilize higher 45°C inlet temperatures to improve power usage effectiveness (PUE), ensuring that a larger portion of AI factory power is redirected toward revenue-generating compute.
Dynamic power allocation: Software continuously monitors GPU and rack-level power consumption, reallocating it where needed to unlock stranded capacity and optimize overall utilization. It operates within defined power budgets, adapts to budget changes in real time, and ensures safe, compliant execution.
Advanced techniques: Integrated directly into NVIDIA GPUs, advanced methodologies boost performance per watt at iso-performance. These include power steering, optimized workload profiles for rapid GPU configuration, and software such as NVIDIA Dynamo for orchestrating inter-rack power and performance optimization.
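
To make the idea of dynamic power allocation concrete, the sketch below splits a fixed budget across racks with a max-min fair ("water-filling") rule: racks that need little power keep only what they use, and the headroom they leave is shared among racks that can use more. This is an illustrative policy, not a description of how DSX allocates power. The function name, rack names, and kilowatt figures are hypothetical.

```python
from typing import Dict


def allocate_power(budget_kw: float, demand_kw: Dict[str, float], cap_kw: float) -> Dict[str, float]:
    """Max-min fair split of a power budget, bounded by each rack's demand and a per-rack cap."""
    want = {rack: min(demand, cap_kw) for rack, demand in demand_kw.items()}
    alloc = {rack: 0.0 for rack in want}
    remaining = budget_kw
    unsatisfied = set(want)
    while unsatisfied and remaining > 1e-9:
        share = remaining / len(unsatisfied)  # equal share of what is left
        for rack in sorted(unsatisfied):
            give = min(share, want[rack] - alloc[rack])
            alloc[rack] += give
            remaining -= give
        # Racks that reached their demand or cap drop out; their unused share is re-split.
        unsatisfied = {r for r in unsatisfied if want[r] - alloc[r] > 1e-9}
    return alloc


# Hypothetical telemetry: rack-a is lightly loaded, the others could use more.
demand = {"rack-a": 60.0, "rack-b": 150.0, "rack-c": 200.0}
allocation = allocate_power(budget_kw=300.0, demand_kw=demand, cap_kw=132.0)
```

A static equal split of the same budget would reserve 100 kW per rack and leave 40 kW stranded at rack-a. Here that headroom moves to rack-b and rack-c, and the total never exceeds the budget.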

DSX Flex is the grid-aware power orchestration layer that connects the AI factory to grid signals and external energy sources.

With power, cooling, and grid integration optimized end to end, attention can shift to extracting maximum efficiency from the workloads themselves.

The key opportunity is to use benchmarks to guide model, batching, and precision choices on top of the optimized AI factory. By aligning workload placement, scheduling, and power allocation with the most efficient compute and cooling zones, operators can stack workload-level optimizations on top of infrastructure-level gains.

This includes rebalancing workloads under a fixed power budget, identifying workloads where power can be reduced through more efficient configurations or model families, and prioritizing workloads that justify higher power budgets because they generate more revenue per token. In doing so, we continuously steer the AI factory toward maximum tokens per watt, driving down cost per token over time.
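
One simple way to express this prioritization is to rank workloads by revenue per unit of energy and admit them in that order until the power budget is used up. The sketch below does exactly that. It is a greedy heuristic and can miss the best combination in some cases. The `Workload` fields, the workload names, and all prices and power figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Workload:
    name: str
    power_kw: float
    tokens_per_sec: float
    usd_per_million_tokens: float

    @property
    def revenue_per_kwh(self) -> float:
        """Revenue earned per kWh of energy spent on this workload."""
        usd_per_hour = self.tokens_per_sec * 3_600 / 1_000_000 * self.usd_per_million_tokens
        return usd_per_hour / self.power_kw


def schedule(workloads: List[Workload], budget_kw: float) -> List[str]:
    """Admit workloads in order of revenue per kWh while they fit in the budget."""
    chosen = []
    remaining = budget_kw
    for w in sorted(workloads, key=lambda w: w.revenue_per_kwh, reverse=True):
        if w.power_kw <= remaining:
            chosen.append(w.name)
            remaining -= w.power_kw
    return chosen


# Made-up workloads competing for an 800 kW slice of the site.
candidates = [
    Workload("batch-summarize", power_kw=500, tokens_per_sec=500_000, usd_per_million_tokens=0.5),
    Workload("chat", power_kw=300, tokens_per_sec=300_000, usd_per_million_tokens=2.0),
    Workload("reasoning", power_kw=400, tokens_per_sec=200_000, usd_per_million_tokens=10.0),
]

admitted = schedule(candidates, budget_kw=800)
```

In this example the lower-throughput but higher-priced reasoning workload is admitted first, which mirrors the point that revenue per token, not raw token count, can justify a larger power budget.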

Looking ahead, AI tokenomics metrics should be regarded as first‑class design goals. Teams should explore combining digital‑twin‑driven infrastructure optimization with benchmark‑driven workload tuning.

This approach turns constrained power into a purpose‑built competitive advantage in both token capacity and revenue.

Learn more

AI factories are fundamentally limited by power, making performance per watt a key driver of token cost and profitability. Optimizing inference is critical because it directly increases revenue through higher token output, while full-stack improvements across hardware, software, and model design boost efficiency.

Training can also be made more energy-efficient without compromising speed by reducing idle GPU time. NVIDIA DSX enables real-time, energy-aware optimization across infrastructure, maximizing tokens per watt and revenue per megawatt.

To learn more about power-constrained AI factory design, simulation, operations, and NVIDIA DSX, visit the NVIDIA booth at ISC 2026.

Acknowledgments

We’d like to thank Mosharaf Chowdhury, Jae-Won Chung, and Ruofan Wu from the ML.ENERGY Initiative at the University of Michigan for their contributions.

About the Authors

About Sachin Idgunji
Sachin Idgunji is a senior director in the AI Computing group at NVIDIA working on high performance energy efficient computing for deep learning applications. Over his career, he has worked across the stack from software optimization through chip design to improve energy efficiency of computing platforms.

About Kibibi Moseley
Kibibi Moseley is a senior product marketing manager at NVIDIA in Energy Efficiency, Sustainability and AI for Science. Previously she was a senior product marketing manager in Data Center and Artificial Intelligence at Intel where she drove critical launch workstreams for 2nd, 3rd, and 4th generation Intel Xeon Scalable Processors and portfolio products. She has a B.S. in industrial engineering from UC Berkeley and an M.S. in management science and engineering and MBA from Stanford University.

About Harry Petty
Harry Petty is a senior technical marketing manager for HPC and AI edge applications at NVIDIA. Previously, he was a principal engineer and marketing director at Cisco Systems where he brought SDN innovations to market for hybrid cloud, multitenant security, and data center application performance. Harry has an MBA from Booth Graduate School of Business and a BS in mathematics and computer science from the University of Dayton.