
Scaling Token Factory Revenue and AI Efficiency by Maximizing Performance per Watt

In the AI era, power is the ultimate constraint, and every AI factory operates within a hard limit. This makes performance per watt—the rate at which power is converted into revenue-generating intelligence—the defining metric for modern AI infrastructure.

AI data centers now operate as token factories tied directly to the energy ecosystem, where access to land, power, and shell determines deployment, and efficiency determines output. Increasing revenue within a fixed power envelope depends entirely on maximizing intelligence per watt across AI infrastructure and across the five-layer AI cake ecosystem.

This post walks through how NVIDIA architectures, systems, and AI factory software maximize performance per watt at every layer of the stack, and how those efficiency gains translate into higher token throughput and revenue per megawatt.

Compounding performance per watt across NVIDIA GPU architectures

NVIDIA architectures and platforms are engineered to increase the amount of intelligence produced per watt with each generation. Across six architecture generations, NVIDIA has improved inference throughput per megawatt by 1,000,000x (Figure 1).

To put this in perspective: if the fuel efficiency of the average car had improved as quickly as chips over a similar time period, one gallon of gas would be enough for a trip to the moon and back.

NVIDIA Hopper introduced many architecture innovations that significantly increased energy efficiency over the prior generation. Key to these gains is the Hopper Transformer Engine, which combines fourth-generation Tensor Core technology with FP8 acceleration and software to dramatically increase performance per watt. 

NVIDIA Blackwell advanced this foundation with improvements across high-bandwidth memory (HBM), the NVIDIA NVLink switch and fabric (for the NVL72 rack-scale design and NVIDIA HGX architecture), and NVFP4-enabled Tensor Cores, increasing throughput per watt. Recent SemiAnalysis InferenceMAX data shows that NVIDIA software optimizations and NVIDIA Blackwell Ultra GB300 NVL72 systems deliver up to 50x higher throughput per megawatt and 35x lower token cost than Hopper for DeepSeek-R1.

The NVIDIA Vera Rubin platform further boosts efficiency. Rubin GPUs, Vera CPUs, NVLink 6, and full‑rack thermals are co-designed as a single AI factory platform. Notably, the NVIDIA Vera CPU delivers 2x the efficiency and 50% higher performance compared to traditional CPUs. This end-to-end approach enables up to 10x higher inference throughput per megawatt and about 10x lower token cost than Blackwell for AI factories running Kimi K2 (32K input/8K output). Paired with NVIDIA Rubin CPX, Vera Rubin delivers up to 35x higher throughput per megawatt and 10x more revenue for trillion-parameter, high-context workloads, creating a new premium tier of ultralow-latency, high-throughput inference.

These efficiency gains are evident in AI workloads and are also reflected in broader measures of compute performance. The HPC and supercomputing community uses the Green500 benchmark to measure high-precision (FP64) efficiency, and NVIDIA-accelerated systems top the leaderboard, with nine of the top ten systems accelerated by NVIDIA technologies.

Building for efficiency with extreme co-design

Achieving these massive efficiency gains over architecture generations requires designing efficiency into every layer of the stack.

NVIDIA approaches this as an extreme co-design problem—optimizing from chip design and manufacturing, through system-level innovations like liquid cooling, to AI factory orchestration. Each layer compounds the next: efficient design reduces wasted energy, cooling shifts power to compute, and software ensures every watt produces useful work.

Engineering efficiency at the source

Efficiency begins before silicon reaches the AI factory. NVIDIA is optimizing the manufacturing pipeline itself to deliver more energy-efficient chips, faster. 

For example, the NVIDIA cuLitho library for accelerated computational lithography re‑implements the core primitives of computational lithography on GPUs. It accelerates mask synthesis by up to 70x and allows a few hundred NVIDIA DGX‑class systems to replace tens of thousands of CPU servers. In practice, this means moving from two‑week photomask cycles to overnight runs, using about one‑ninth the power and one‑eighth the physical footprint, while enabling advanced techniques like inverse lithography and curvilinear masks.

At the materials layer, the NVIDIA cuEST CUDA-X library accelerates first-principles quantum chemistry applications on NVIDIA GPUs, turning electronic-structure calculations into a production tool. By delivering speedups of up to 55x on density functional theory and related workloads, cuEST lets device and process engineers explore new, lower-leakage material stacks at industrial scale instead of evaluating a few handpicked candidates. The result is a pipeline where materials and devices are tuned for lower leakage and better switching behavior, feeding directly into higher performance per watt at the transistor level.

That design-time acceleration is amplified by GPU-accelerated electronic design automation (EDA) flows. In collaboration with EDA leaders, NVIDIA is moving EDA workloads onto GPUs, yielding up to 15x faster iterations on critical blocks. Faster iteration creates more opportunities to optimize design and verification flows, IR drop, clocking, and thermal hotspots. In turn, this yields floorplans and power grids that waste less energy as heat and deliver more of the input power to active compute. In other words, GPU-accelerated EDA and manufacturing tools turn performance per watt into an explicit objective function.

Together, these advances make the design and manufacturing pipeline more efficient—reducing the time, energy, and infrastructure required to deliver next-generation chips.

Cooling as a performance per watt multiplier 

Improving performance per watt does not stop at the chip. How systems are cooled also impacts how much power is available for computation. 

NVIDIA Blackwell systems reduce cooling overhead, operating around 1.25 PUE, with about 20% of capacity air‑cooled. This shifts more energy to compute than previous generations, delivering up to 25x higher energy efficiency and over 300x better water efficiency compared to traditional air‑cooled architectures. 

NVIDIA Vera Rubin further improves energy efficiency by moving to 100% liquid cooling and tightening the die‑to‑water thermal path, enabling AI factories to run at 1.1 PUE without a proportional increase in cooling energy or water draw.  

Maintaining 45°C inlet water preserves silicon temperatures and reliability, while improved thermal transfer delivers higher performance per watt than Blackwell. In many climates, 45°C inlet water can be cooled largely with ambient air, dramatically reducing compressor runtime so chillers run less, while more of the power budget shifts from cooling to generating tokens. By contrast, lower-temperature cooling requirements depend more heavily on compressor‑based systems, diverting a larger share of the facility’s limited grid allocation into cooling instead of compute.
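Because PUE is defined as total facility power divided by IT power, the figures above translate directly into how much of a fixed grid allocation reaches compute. A minimal back-of-the-envelope sketch (the 100 MW facility size is an arbitrary illustration, not a figure from this post):

```python
def it_power_mw(facility_mw: float, pue: float) -> float:
    """PUE = total facility power / IT power, so IT power = facility power / PUE."""
    return facility_mw / pue

facility = 100.0  # hypothetical facility grid allocation, in MW

blackwell_class = it_power_mw(facility, 1.25)   # ~80.0 MW reaches compute
vera_rubin_class = it_power_mw(facility, 1.10)  # ~90.9 MW reaches compute

reclaimed = vera_rubin_class - blackwell_class
print(f"PUE 1.25: {blackwell_class:.1f} MW of compute")
print(f"PUE 1.10: {vera_rubin_class:.1f} MW of compute")
print(f"Reclaimed for tokens: {reclaimed:.1f} MW (+{reclaimed / blackwell_class:.0%})")
```

At a fixed grid allocation, dropping PUE from 1.25 to 1.10 frees roughly 14% more power for token generation, before any chip-level gains are counted.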

Translating efficiency into tokens

As tokens per watt increase, more billable AI work fits within a fixed power envelope, lowering cost per token and expanding margins. Realizing those gains requires closing the gap between grid supply and usable compute: at gigawatt scale, up to 40% of incoming power can be lost before it reaches compute. Cooling inefficiencies dissipate power, traditional overprovisioning strands capacity, and running too close to thermal or electrical limits risks faults.

NVIDIA DSX closes this gap. The Vera Rubin DSX AI Factory reference design and the Omniverse digital twin blueprint treat the AI factory as a dynamic system, continuously monitoring and adjusting power, cooling, and workload behavior. Systems operate at Max-Q—the point of highest performance per watt—rather than at inefficient peaks. Domain Power Service, Workload Power Profiles, and Mission Control orchestrate racks and clusters for energy-efficient operation. For a 500 MW AI factory, DSX Max-Q helps ecosystem partners operate with up to 30% more GPUs within the same power envelope and higher throughput per watt, while DSX Flex aligns demand with real-time grid conditions to unlock stranded capacity.
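Max-Q is the operating point that maximizes performance per watt rather than raw performance. The sketch below uses an invented power/throughput curve (illustrative numbers only, not measured NVIDIA data) to show why filling a fixed power budget with more GPUs at Max-Q can out-produce fewer GPUs running at peak:

```python
# Hypothetical per-GPU operating points: (power in watts, tokens/s).
# These values are invented for illustration.
operating_points = [
    (400, 9_000),
    (500, 11_000),
    (600, 12_500),
    (700, 13_400),  # peak throughput, but the worst tokens-per-watt
]

def max_q(points):
    """Pick the operating point that maximizes tokens/s per watt."""
    return max(points, key=lambda p: p[1] / p[0])

power, tps = max_q(operating_points)
print(f"Max-Q point: {power} W at {tps} tokens/s ({tps / power:.1f} tokens/s per watt)")

# Same fixed power budget, two provisioning strategies:
budget_w = 280_000  # hypothetical cluster power budget
peak_p, peak_tps = operating_points[-1]
print(f"At peak:  {budget_w // peak_p} GPUs -> {budget_w // peak_p * peak_tps:,} tokens/s")
print(f"At Max-Q: {budget_w // power} GPUs -> {budget_w // power * tps:,} tokens/s")
```

Under this invented curve, the Max-Q configuration fits 75% more GPUs into the same envelope and delivers higher aggregate throughput, which is the intuition behind operating clusters at Max-Q rather than at peak clocks.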

Industry leaders demonstrate that AI factories with agentic liquid cooling and Max-Q operation deliver more tokens per watt. Every watt not spent on cooling or idle capacity becomes a watt that generates tokens—and revenue.

Video 1. Learn how NVIDIA DSX helps developers optimize token throughput, resilience, and energy use across physical, electrical, thermal, and network systems

From tokens to revenue per megawatt

Inference drives revenue. Tokens are the unit of intelligence, and throughput per megawatt defines an AI factory's revenue potential. With capped power and exploding demand, operators must track throughput and token rate as closely as revenue and margin.

As models grow, context windows expand, and output lengths increase. As NVIDIA CEO Jensen Huang explained during the GTC 2026 Keynote, AI offerings will form a spectrum: free tiers attract users, mid-tier models balance scale and speed, and premium tiers with massive context windows and extreme throughput command high prices per million tokens. Smarter models command higher prices, making each move up the curve a direct revenue lever.

NVIDIA platforms like Hopper, Blackwell, and Vera Rubin push the tokens-per-watt curve upward, particularly at high-value tiers. Blackwell increased throughput 35x where monetization is concentrated. Vera Rubin moves premium tiers another order of magnitude. Extreme co-design, NVL72-scale systems, and ultralow-latency interconnects enable higher-value tiers at higher density within the same power envelope.

For operators, the metric is simple: revenue per megawatt. A one-gigawatt AI factory allocates power across free, mid, premium, and ultra tiers. The weighted product of throughput and price becomes the revenue engine. Moving to the next hardware generation can yield 5x or more revenue for the same power. Adding specialized systems, like ultralow-latency slices for engineering workloads, unlocks additional step changes. Every gain in inference performance and efficiency compounds economic output.
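The weighted-product calculation above can be made concrete with a small sketch. All tier shares, throughputs, and prices below are invented placeholders, not NVIDIA figures; the point is the structure of the model (revenue = sum over tiers of power share × throughput per MW × price per token):

```python
# Hypothetical tier mix for a 1 GW AI factory (all numbers invented):
# tier -> (share of power, tokens/s per MW, $ per million tokens)
tiers = {
    "free":    (0.30, 2.0e6, 0.00),   # attracts users, no direct revenue
    "mid":     (0.40, 1.5e6, 0.50),
    "premium": (0.25, 0.6e6, 5.00),   # long context, slower, priced higher
    "ultra":   (0.05, 0.2e6, 20.00),  # ultralow-latency slice
}

SECONDS_PER_HOUR = 3600

def revenue_per_hour(tiers: dict, factory_mw: float) -> float:
    """Weighted product of throughput and price across tiers, in $/hour."""
    total = 0.0
    for share, tps_per_mw, usd_per_mtok in tiers.values():
        mw = factory_mw * share
        tokens_per_hour = tps_per_mw * mw * SECONDS_PER_HOUR
        total += tokens_per_hour / 1e6 * usd_per_mtok
    return total

print(f"${revenue_per_hour(tiers, 1000.0):,.0f} per hour")
```

With a model like this, the revenue levers in the text become explicit: raising tokens per watt lifts every tier's throughput, while shifting power share toward premium and ultra tiers raises the weighted price, and both compound within the same megawatt budget.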

In today’s environment of capped power and soaring AI demand, the efficiency and throughput gains achieved with extreme co-design across NVIDIA AI infrastructure only matter if they’re captured at scale. NVIDIA Omniverse DSX Blueprint ensures that AI factories operate continuously at peak efficiency, turning every available watt into useful compute.

Learn more

Power is the ultimate constraint for modern AI: with grid capacity fixed, maximizing performance per watt—the rate at which energy is converted into revenue-generating tokens—is the defining metric for AI infrastructure. Each NVIDIA architecture generation raises the intelligence produced per watt, compounding to a 1,000,000x improvement in inference throughput per megawatt across six generations.

To learn more, explore how industry leaders are scaling intelligence within power constraints, increasing intelligence per watt, and advancing energy-efficient chip design at CERAWeek 2026.
