Large language model (LLM) inference is a full-stack challenge. Powerful GPUs, high-bandwidth GPU-to-GPU interconnects, efficient acceleration libraries, and a highly optimized inference engine are required for high-throughput, low-latency inference.
MLPerf Inference v4.1 is the latest version of the popular and widely recognized MLPerf Inference benchmarks, developed by the MLCommons consortium. The benchmark includes many popular AI models covering diverse use cases, from LLMs and generative AI to recommenders and computer vision. The benchmarks are regularly updated to ensure market relevance.
In this round, NVIDIA submitted many great results, enabled by innovation across the NVIDIA technology stack. Highlights include:
- First submission using the NVIDIA Blackwell architecture, delivering up to 4x more performance on Llama 2 70B compared to the NVIDIA H100 Tensor Core GPU.
- NVIDIA H200 Tensor Core GPU submissions on every data center workload, delivering up to 1.5x more performance compared to the H100 submissions.
- Up to 27% more performance on H200 due to software improvements compared to preview submissions on H200 made in the prior round.
- First Llama 2 70B submissions using NVIDIA Triton Inference Server, delivering similar performance to NVIDIA TensorRT-LLM submissions.
- Up to 6.2x higher performance on the GPT-J benchmark in the edge category compared to the prior round using the NVIDIA Jetson AGX Orin platform.
This post provides a closer look at these results.
NVIDIA Blackwell shines in MLPerf Inference debut
Introduced at NVIDIA GTC 2024, the NVIDIA Blackwell architecture is a new class of AI superchip. Crafted with 208 billion transistors, and using the TSMC 4NP process tailored for NVIDIA, it is the largest GPU ever built. The Blackwell architecture also features the new second-generation Transformer Engine, which uses new Blackwell Tensor Core technology combined with TensorRT-LLM innovations, to enable fast and accurate FP4 AI inference.
In this round of MLPerf Inference, NVIDIA made its first submissions using Blackwell. On the Llama 2 70B LLM benchmark, Blackwell delivered up to 4x higher tokens per second per GPU compared to the H100 GPU.
| MLPerf Inference v4.1 Llama 2 70B | Server tokens/s | Offline tokens/s |
|---|---|---|
| 1 NVIDIA B200 GPU | 10,756 | 11,264 |
| Per-GPU increase | 4x | 3.7x |
MLPerf Inference v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. Blackwell results measured on a single GPU and retrieved from entry 4.1-0074 in the Closed, Preview category. H100 results from entry 4.1-0043 in the Closed, Available category on an eight-H100 system and divided by GPU count for a per-GPU comparison. Per-GPU throughput is not a primary metric of MLPerf Inference. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
The submission made extensive use of the Blackwell FP4 Transformer Engine. It was also made in the Closed division, meaning this performance was delivered without modifications to the model while still meeting the benchmark's high accuracy requirements. FP4 quantization was performed using the NVIDIA TensorRT Model Optimizer library, which incorporates state-of-the-art model optimization techniques, and did not require model re-training.
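For readers who want a sense of what this workflow looks like, the following is a minimal post-training quantization sketch using the TensorRT Model Optimizer Python API. The checkpoint, the calibration prompts, and the `NVFP4_DEFAULT_CFG` configuration name are illustrative assumptions, not the exact recipe used in the submission.

```python
# Minimal post-training quantization sketch with TensorRT Model Optimizer.
# The checkpoint, calibration prompts, and NVFP4_DEFAULT_CFG name are
# illustrative assumptions; the actual MLPerf recipe is not reproduced here.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-70b-hf"
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(name)

def forward_loop(m):
    # Run a small calibration set through the model so Model Optimizer can
    # collect the activation statistics it needs to pick quantization scales.
    for prompt in ["An example calibration prompt.", "Another short prompt."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# Post-training quantization: no re-training or fine-tuning is required.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```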
NVIDIA H200 Tensor Core GPU delivers outstanding performance on every benchmark
The NVIDIA H200 GPU upgrades the NVIDIA Hopper architecture with HBM3e, the industry’s fastest AI memory. Compared to the H100, this increases memory capacity by 1.8x and memory bandwidth by 1.4x, benefiting memory-sensitive use cases.
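As a quick check, those ratios follow from the published specifications of the two GPUs, 141 GB of HBM3e at 4.8 TB/s for H200 versus 80 GB of HBM3 at roughly 3.35 TB/s for the SXM H100:

```python
# Rough ratio check from published specs (H200: 141 GB HBM3e, 4.8 TB/s;
# H100 SXM: 80 GB HBM3, ~3.35 TB/s).
h200_capacity_gb, h100_capacity_gb = 141, 80
h200_bw_tbs, h100_bw_tbs = 4.8, 3.35

print(f"capacity ratio:  {h200_capacity_gb / h100_capacity_gb:.2f}x")  # ~1.76x -> "1.8x"
print(f"bandwidth ratio: {h200_bw_tbs / h100_bw_tbs:.2f}x")            # ~1.43x -> "1.4x"
```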
This round, NVIDIA submitted results using eight H200 GPUs on every workload, and did so in the Available category.
| Benchmark | GPU | Server | Offline |
|---|---|---|---|
| Llama 2 70B | 8 H200 (1,000 W) | 32,790 tokens/s | 34,864 tokens/s |
| Mixtral 8x7B | 8 H200 (700 W) | 57,177 tokens/s | 59,022 tokens/s |
| GPT-J | | 19,243 tokens/s | 20,086 tokens/s |
| Stable Diffusion XL | | 16.78 queries/s | 17.42 samples/s |
| DLRM v2 99% | | 585,208 queries/s | 637,342 samples/s |
| DLRM v2 99.9% | | 370,083 queries/s | 390,953 samples/s |
| ResNet-50 v1.5 | | 632,229 queries/s | 756,960 samples/s |
| BERT 99% | | 57,609 queries/s | 73,310 samples/s |
| BERT 99.9% | | 51,212 queries/s | 63,950 samples/s |
| RetinaNet | | 13,604 queries/s | 14,439 samples/s |
| 3D U-Net | | Not part of benchmark | 54.71 samples/s |
MLPerf Inference v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.1-0046, 4.1-0048, 4.1-0050. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
The following subsections describe the improvements achieved across several benchmarks.
Llama 2 70B
The Llama 2 70B benchmark was first introduced in the prior round and continues to represent popular, 70B-class dense LLMs.
NVIDIA also continues to enhance TensorRT-LLM software, providing users with more LLM inference performance from the GPUs they already have. Through software improvements alone, Llama 2 70B performance improved by up to 14% on H200 compared to the preview submission in the prior round.
| MLPerf Llama 2 70B improvements since v4.0 | Server | Offline |
|---|---|---|
| H200 (700 W) | 1.14x | 1.12x |
| H100 (700 W) | 1.05x | 1.12x |
MLPerf Inference v4.0 and v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.0-0062, 4.0-0070, 4.1-0043, 4.1-0048, 4.1-0050. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
Key improvements this round included XQA kernel optimizations as well as additional layer fusions.
Additionally, NVIDIA submitted Llama 2 70B results on H200 GPUs with a custom thermal solution and the thermal design power (TDP) increased to 1,000 watts. This enabled an additional performance increase of up to 12% on the Llama 2 70B benchmark compared to H200 configured at a 700-watt TDP.
This round, NVIDIA also submitted Llama 2 70B results using H200 GPUs running Triton Inference Server, delivering similar performance to the bare metal submission. In the server scenario, H200 with Triton Inference Server even outperformed H200 without Triton Inference Server.
| MLPerf Llama 2 70B benchmark | Server tokens/s | Offline tokens/s |
|---|---|---|
| 8 H200 with Triton Inference Server | 30,128 | 31,059 |
| 8 H200 without Triton Inference Server | 29,228 | 31,303 |
MLPerf Inference v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.1-0048, 4.1-0050. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
This shows that organizations looking to deploy popular models need not trade functionality for performance when using Triton Inference Server.
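For context, here is a minimal client-side sketch of what querying a TensorRT-LLM model served by Triton Inference Server can look like. The endpoint, model name, and tensor names are assumptions that depend on how the model repository is configured; they are not taken from the MLPerf submission.

```python
# Hypothetical client-side sketch for a TensorRT-LLM model served by Triton
# Inference Server. The URL, model name, and tensor names ("text_input",
# "max_tokens", "text_output") are assumptions that depend on the deployed
# model repository configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["Summarize the following article ..."]], dtype=object)
max_tokens = np.array([[128]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", list(text.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

# Send the request and read back the generated text.
result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```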
Finally, NVIDIA submitted Llama 2 70B results in the Open division using a single H200 GPU, showcasing the performance gains possible from more extensive model-level optimizations.
First, depth and width pruning were applied, greatly reducing the total number of parameters by removing the layers and MLP intermediate dimensions that contribute least to the overall model output.
Then, to recover accuracy, fine-tuning was performed on the model using the MLPerf OpenORCA development dataset. The final pruned model has 32 layers and 14,336 MLP intermediate dimensions—a significant reduction compared to the original model’s 80 layers and 28,672 intermediate dimensions.
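To get a rough sense of how much pruning removes, the parameter count of a Llama-style decoder can be estimated from its layer count, hidden size, and MLP width. The sketch below plugs in the published Llama 2 70B configuration (hidden size 8,192, 64 query heads with 8 KV heads, 32,000-token vocabulary) and ignores small terms such as norms; it is an approximation, not the exact accounting used for the submission.

```python
# Approximate parameter counts for the original and pruned Llama 2 70B
# configurations. Hidden size, head counts, and vocabulary size are the
# published Llama 2 70B values; small terms (norms, biases) are ignored.
def llama_params(layers, hidden=8192, intermediate=28672,
                 q_heads=64, kv_heads=8, head_dim=128, vocab=32000):
    attn = hidden * (q_heads * head_dim) * 2           # Q and output projections
    attn += hidden * (kv_heads * head_dim) * 2         # K and V projections (GQA)
    mlp = 3 * hidden * intermediate                     # gate, up, and down projections
    return layers * (attn + mlp) + 2 * vocab * hidden   # plus embedding and LM head

print(f"original: {llama_params(80) / 1e9:.1f}B")                      # ~69B
print(f"pruned:   {llama_params(32, intermediate=14336) / 1e9:.1f}B")  # ~16.6B
```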
Although model accuracy falls slightly below the 99% threshold, the model is significantly smaller, enabling much higher offline throughput of 11,189 tokens/s, almost 3x the throughput achieved in the Closed division.
MLPerf Inference v4.1, Data Center, Open Division. Result from entry 4.1-0089. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
Mixtral 8x7B
A new LLM workload was added in this round, based on the Mixtral 8x7B model, developed by Mistral AI. Mixtral 8x7B employs a sparse mixture of experts (MoE) architecture with eight experts, 46.7B total parameters, with two experts and 12.9B parameters used per token.
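The defining property of this architecture is that only two of the eight expert MLPs run for each token, which is why roughly 12.9B of the 46.7B parameters are active per token. The following is a generic top-2 MoE routing layer in PyTorch that illustrates the idea; it is not Mixtral's exact implementation, and the dimensions are placeholders.

```python
# Generic top-2 mixture-of-experts layer: a router picks 2 of 8 expert MLPs
# per token, so only a fraction of the total parameters is used per token.
# Conceptual illustration only, not Mixtral's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, hidden=4096, ffn=14336, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: [tokens, hidden]
        logits = self.router(x)                        # [tokens, num_experts]
        scores, expert_idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)            # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # run only the selected experts
            for e in range(len(self.experts)):
                mask = expert_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = Top2MoE()
print(moe(torch.randn(4, 4096)).shape)  # torch.Size([4, 4096])
```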
NVIDIA submitted Mixtral 8x7B results using both H100 and H200 GPUs, running TensorRT-LLM software and making extensive use of FP8 precision.
| MLPerf Mixtral 8x7B benchmark | Server tokens/s | Offline tokens/s |
|---|---|---|
| 8 H200 | 57,177 | 59,022 |
| 8 H100 | 50,796 | 52,416 |
| H200 advantage | 1.13x | 1.13x |
MLPerf Inference v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.1-0043, 4.1-0048. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
Only systems using NVIDIA GPUs submitted Mixtral 8x7B results. NVIDIA continues to submit great results on workloads as they are added to the MLPerf benchmark suite, showing that the NVIDIA platform delivers high performance and exceptional versatility for the large and expanding universe of AI models.
Stable Diffusion XL
This round, H200 performance on Stable Diffusion XL improved to about two images generated per second, a gain of up to 27% compared to the prior round and a new record for the benchmark.
| MLPerf Stable Diffusion XL improvements since v4.0 | Server | Offline |
|---|---|---|
| 8 H200 (700 W) | 1.22x | 1.27x |
| 8 H100 (700 W) | 1.17x | 1.25x |
MLPerf Inference v4.0 and v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.0-0062, 4.0-0070, 4.1-0043, 4.1-0048. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
These gains were due primarily to several key optimizations to the NVIDIA software stack, including:
- UNet FP8 support: By using TensorRT Model Optimizer, the NVIDIA submission this round used FP8 precision while meeting accuracy requirements. This represented the largest portion of the round-to-round performance gain on Hopper GPUs.
- VAE INT8 support: The NVIDIA submission this round was able to quantize certain layers to INT8 and others to FP16, compared to use of FP32 in the prior round. This improved VAE performance by 70%, translating into about a 4% end-to-end speedup.
- Variational autoencoder (VAE) batch splitting: The VAE portion of the SDXL pipeline has a very large memory footprint. By employing batch splitting, the NVIDIA submission this round was able to increase the batch size from 8 to 64, improving performance (see the sketch after this list).
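Below is a minimal sketch of the batch-splitting idea, assuming a generic VAE decoder object with a `decode` method that returns an image tensor: the large batch of denoised latents is decoded in small chunks so that peak activation memory stays bounded while the rest of the pipeline runs at the larger batch size.

```python
# Illustrative batch splitting for a memory-hungry VAE decode stage: decode a
# large batch of SDXL latents in smaller chunks to cap peak activation memory.
# `vae` is assumed to be any decoder whose .decode(latents) returns a tensor.
import torch

def decode_in_chunks(vae, latents: torch.Tensor, chunk_size: int = 8) -> torch.Tensor:
    images = []
    with torch.no_grad():
        for chunk in torch.split(latents, chunk_size, dim=0):
            images.append(vae.decode(chunk))  # only `chunk_size` samples resident at once
    return torch.cat(images, dim=0)

# Usage: the pipeline can run the UNet at batch 64 while the VAE decodes 8 at a time.
# images = decode_in_chunks(pipeline.vae, latents, chunk_size=8)
```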
Additionally, NVIDIA submitted SDXL results in the Open division, combining these optimizations with a Latent Consistency Model (LCM) to reach offline throughput of 11 samples/s on H200, almost 5x the Closed division result. This showcases the further performance gains possible from more extensive model-level optimizations for diffusion models.
A giant generative AI leap on Jetson AGX Orin
Jetson AGX Orin offers high AI compute performance, large unified memory, and comprehensive software for generative AI at the edge.
Through extensive software optimization, the NVIDIA Jetson AGX Orin 64 GB delivers a giant leap for generative AI models at the edge, with up to 6.2x more throughput and 2.4x better latency on the GPT-J 6B-parameter LLM benchmark. Generative AI models at the edge can transform sensor data, such as images and videos, into real-time, actionable insights with strong contextual awareness.
Backed by the NVIDIA software stack, Jetson AGX Orin is uniquely positioned as the leading platform for running transformer models like GPT-J, vision transformers, and Stable Diffusion at the edge. Developers can take advantage of other platform services, like Jetson Generative AI Lab and Jetson Platform Services, to bring great solutions to life.
| GPT-J (Edge) | Single stream latency (ms) | Offline tokens/s |
|---|---|---|
| Jetson AGX Orin 64 GB v4.1 | 4,176 | 64.47 |
| Jetson AGX Orin 64 GB v4.0 | 10,132 | 10.35 |
MLPerf Inference v4.0 and v4.1 Closed, Edge. Results retrieved from www.mlperf.org on August 28, 2024, from the following entries: 4.0-0072, 4.1-0051. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
This performance boost was made possible through numerous software optimizations to TensorRT-LLM, including the use of in-flight batching, as well as the application of INT4 Activation-aware Weight Quantization (AWQ). AWQ keeps the 1% of "salient" weights in higher-precision FP16 and quantizes the remaining weights to four-bit integer (INT4) precision. This technique significantly reduces the memory footprint, enabling larger batches to be processed at once and dramatically increasing inference throughput.
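Conceptually, the weight-only scheme described above can be pictured as in the sketch below: the most activation-salient input channels stay in FP16 while the rest are quantized to 4-bit integers with per-channel scales. This is a simplified illustration of the idea, not the TensorRT-LLM or AWQ reference implementation.

```python
# Conceptual sketch of mixed-precision weight-only quantization in the spirit
# of AWQ: keep ~1% "salient" input channels in FP16 and quantize the rest to
# INT4 with per-output-channel scales. Simplified illustration, not real kernels.
import numpy as np

def quantize_int4(w):
    # Symmetric per-output-channel INT4 quantization: integer values in [-8, 7].
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def awq_style_quantize(weight, act_magnitude, salient_frac=0.01):
    # weight: [out, in]; act_magnitude: mean |activation| per input channel.
    n_salient = max(1, int(salient_frac * weight.shape[1]))
    salient = np.argsort(act_magnitude)[-n_salient:]       # most active channels
    keep_fp16 = weight[:, salient].astype(np.float16)      # stays high precision
    rest = np.delete(weight, salient, axis=1)
    q, scale = quantize_int4(rest)                          # 4-bit for everything else
    return keep_fp16, q, scale, salient

w = np.random.randn(4096, 4096).astype(np.float32)
act = np.abs(np.random.randn(4096)).astype(np.float32)
fp16_cols, int4_w, scales, idx = awq_style_quantize(w, act)
print(fp16_cols.shape, int4_w.shape)  # (4096, 40) (4096, 4056)
```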
NVIDIA also submitted results for the demanding Llama 2 70B model running on Jetson AGX Orin in the Open division, demonstrating the possibilities of more extensive model optimization techniques. The submitted model was the same 16B depth- and width-pruned model used in the H200 Open division submission. INT4 AWQ, which was used in the Closed division GPT-J submission on Jetson AGX Orin, was also applied here. Together, parameter pruning and INT4 quantization shrink the memory footprint of the Llama 2 70B model weights to only around 8 GB.
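The roughly 8 GB figure follows from the pruned parameter count and 4-bit weights; a quick estimate (ignoring the small fraction of weights kept at higher precision) looks like this:

```python
# Back-of-the-envelope weight footprint for the pruned Llama 2 70B model,
# ignoring the small fraction of salient weights kept in FP16.
pruned_params = 16e9      # ~16B parameters after depth and width pruning
bits_per_weight = 4       # INT4 AWQ

print(f"{pruned_params * bits_per_weight / 8 / 1e9:.0f} GB")  # ~8 GB
```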
Conclusion
In its debut submission, NVIDIA Blackwell delivered outstanding performance: up to 4x more than H100 on Llama 2 70B. Among available solutions, Hopper GPUs delivered the highest multi-GPU generative AI performance and the highest performance per accelerator across all workloads, and they continue to benefit from ongoing software optimization. NVIDIA Triton Inference Server also achieved great results this round, delivering similar performance to the bare-metal submissions. For edge and embedded AI, Jetson AGX Orin and the rich NVIDIA software stack enable running capable models, like GPT-J 6B, with performance improving by up to 6.2x in just one round.
NVIDIA continues to innovate rapidly across the full technology stack to deliver world-class inference performance on today’s models as well as tomorrow’s, from the largest AI factories to compact, low-power edge devices.
Acknowledgments
The work of many NVIDIA employees made these outstanding results happen. We would like to acknowledge the tireless efforts of Chen-Han Yu, Kai Xu, Justin Xin, Asma Kuriparambil Thekkumpate, Linnan Wang, Wei-Ming Chen, Kaiyu Xie, Shobhit Verma, and Viraat Chandra, among many others.