Large language model (LLM) inference is a full-stack challenge. Powerful GPUs, high-bandwidth GPU-to-GPU interconnects, efficient acceleration libraries, and a highly optimized inference engine are required for high-throughput, low-latency inference.
MLPerf Inference v4.1 is the latest version of the popular and widely recognized MLPerf Inference benchmarks, developed by the MLCommons consortium. The benchmark includes many popular AI models covering diverse use cases, from LLMs and generative AI to recommenders and computer vision. The benchmarks are regularly updated to ensure market relevance.
In this round, NVIDIA submitted many great results, enabled by innovation across the NVIDIA technology stack. Highlights include:
- First submission using the NVIDIA Blackwell architecture, delivering up to 4x more performance on Llama 2 70B compared to the NVIDIA H100 Tensor Core GPU.
- NVIDIA H200 Tensor Core GPU submissions on every data center workload, delivering up to 1.5x more performance compared to the H100 submissions.
- Up to 27% more performance on H200 due to software improvements compared to preview submissions on H200 made in the prior round.
- First Llama 2 70B submissions using NVIDIA Triton Inference Server, delivering similar performance to NVIDIA TensorRT-LLM submissions.
- Up to 6.2x higher performance on the GPT-J benchmark in the edge category compared to the prior round using the NVIDIA Jetson AGX Orin platform.
This post provides a closer look at these results.
NVIDIA Blackwell shines in MLPerf Inference debut
Introduced at NVIDIA GTC 2024, the NVIDIA Blackwell architecture is a new class of AI superchip. Crafted with 208 billion transistors, and using the TSMC 4NP process tailored for NVIDIA, it is the largest GPU ever built. The Blackwell architecture also features the new second-generation Transformer Engine, which uses new Blackwell Tensor Core technology combined with TensorRT-LLM innovations, to enable fast and accurate FP4 AI inference.
In this round of MLPerf Inference, NVIDIA made its first submissions using Blackwell. On the Llama 2 70B LLM benchmark, Blackwell delivered up to 4x higher tokens per second per GPU compared to the H100 GPU.
| MLPerf Inference v4.1 Llama 2 70B | Server tokens/s | Offline tokens/s |
|---|---|---|
| 1 NVIDIA B200 GPU | 10,756 | 11,264 |
| Per-GPU increase | 4x | 3.7x |
MLPerf Inference v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. Blackwell results measured on a single GPU and retrieved from entry 4.1-0074 in the Closed, Preview category. H100 results from entry 4.1-0043 in the Closed, Available category on an eight-H100 system and divided by GPU count for a per-GPU comparison. Per-GPU throughput is not a primary metric of MLPerf Inference. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
The submission made extensive use of the Blackwell FP4 Transformer Engine. It was also made in the Closed division, meaning this performance was delivered without modifications to the model while still meeting the benchmark's high accuracy requirements. FP4 quantization was performed using the NVIDIA TensorRT Model Optimizer library, which incorporates state-of-the-art model optimization techniques, and did not require model re-training.
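For readers who want a sense of what this workflow looks like, the following is a minimal post-training quantization sketch using the TensorRT Model Optimizer Python API. The checkpoint, the calibration prompts, and the `NVFP4_DEFAULT_CFG` configuration name are illustrative assumptions, not the exact recipe used in the submission.

```python
# Minimal post-training quantization sketch with TensorRT Model Optimizer.
# The checkpoint, calibration prompts, and NVFP4_DEFAULT_CFG name are
# illustrative assumptions; the actual MLPerf recipe is not reproduced here.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-70b-hf"
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(name)

def forward_loop(m):
    # Run a small calibration set through the model so Model Optimizer can
    # collect the activation statistics it needs to pick quantization scales.
    for prompt in ["An example calibration prompt.", "Another short prompt."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(m.device)
        m(**inputs)

# Post-training quantization: no re-training or fine-tuning is required.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```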
NVIDIA H200 Tensor Core GPU delivers outstanding performance on every benchmark
The NVIDIA H200 GPU upgrades the NVIDIA Hopper architecture with HBM3e, the industry’s fastest AI memory. Compared to the H100, this increases memory capacity by 1.8x and memory bandwidth by 1.4x, benefiting memory-sensitive use cases.
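As a quick check, those ratios follow from the published specifications of the two GPUs, 141 GB of HBM3e at 4.8 TB/s for H200 versus 80 GB of HBM3 at roughly 3.35 TB/s for the SXM H100:

```python
# Rough ratio check from published specs (H200: 141 GB HBM3e, 4.8 TB/s;
# H100 SXM: 80 GB HBM3, ~3.35 TB/s).
h200_capacity_gb, h100_capacity_gb = 141, 80
h200_bw_tbs, h100_bw_tbs = 4.8, 3.35

print(f"capacity ratio:  {h200_capacity_gb / h100_capacity_gb:.2f}x")  # ~1.76x -> "1.8x"
print(f"bandwidth ratio: {h200_bw_tbs / h100_bw_tbs:.2f}x")            # ~1.43x -> "1.4x"
```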
This round, NVIDIA submitted results using eight H200 GPUs on every workload, and did so in the Available category.
| Benchmark | GPU | Server | Offline |
|---|---|---|---|
| Llama 2 70B | 8 H200 (1,000 W) | 32,790 tokens/s | 34,864 tokens/s |
| Mixtral 8x7B | 8 H200 (700 W) | 57,177 tokens/s | 59,022 tokens/s |
| GPT-J | | 19,243 tokens/s | 20,086 tokens/s |
| Stable Diffusion XL | | 16.78 queries/s | 17.42 samples/s |
| DLRM v2 99% | | 585,208 queries/s | 637,342 samples/s |
| DLRM v2 99.9% | | 370,083 queries/s | 390,953 samples/s |
| ResNet-50 v1.5 | | 632,229 queries/s | 756,960 samples/s |
| BERT 99% | | 57,609 queries/s | 73,310 samples/s |
| BERT 99.9% | | 51,212 queries/s | 63,950 samples/s |
| RetinaNet | | 13,604 queries/s | 14,439 samples/s |
| 3D U-Net | | Not part of benchmark | 54.71 samples/s |
MLPerf Inference v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.1-0046, 4.1-0048, 4.1-0050. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
The following subsections describe the improvements achieved across several benchmarks.
Llama 2 70B
The Llama 2 70B benchmark was first introduced in the prior round and continues to represent popular, 70B-class dense LLMs.
NVIDIA also continues to enhance TensorRT-LLM software, providing users with more LLM inference performance from the GPUs they already have. Through software improvements alone, Llama 2 70B performance improved by up to 14% on H200 compared to the preview submission in the prior round.
| MLPerf Llama 2 70B improvements since v4.0 | Server | Offline |
|---|---|---|
| H200 (700 W) | 1.14x | 1.12x |
| H100 (700 W) | 1.05x | 1.12x |
MLPerf Inference v4.0 and v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.0-0062, 4.0-0070, 4.1-0043, 4.1-0048, 4.1-0050. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
Key improvements this round included XQA kernel optimizations as well as additional layer fusions.
Additionally, NVIDIA submitted Llama 2 70B results on H200 GPUs with a custom thermal solution and the thermal design power (TDP) increased to 1,000 watts. This enabled an additional performance increase of up to 12% on the Llama 2 70B benchmark compared to H200 configured at a 700-watt TDP.
This round, NVIDIA also submitted Llama 2 70B results using H200 GPUs running Triton Inference Server, delivering similar performance to the bare metal submission. In the server scenario, H200 with Triton Inference Server even outperformed H200 without Triton Inference Server.
| MLPerf Llama 2 70B benchmark | Server tokens/s | Offline tokens/s |
|---|---|---|
| 8 H200 with Triton Inference Server | 30,128 | 31,059 |
| 8 H200 without Triton Inference Server | 29,228 | 31,303 |
MLPerf Inference v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.1-0048, 4.1-0050. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
This shows that organizations looking to deploy popular models need not trade functionality for performance when using Triton Inference Server.
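For context, here is a minimal client-side sketch of what querying a TensorRT-LLM model served by Triton Inference Server can look like. The endpoint, model name, and tensor names are assumptions that depend on how the model repository is configured; they are not taken from the MLPerf submission.

```python
# Hypothetical client-side sketch for a TensorRT-LLM model served by Triton
# Inference Server. The URL, model name, and tensor names ("text_input",
# "max_tokens", "text_output") are assumptions that depend on the deployed
# model repository configuration.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([["Summarize the following article ..."]], dtype=object)
max_tokens = np.array([[128]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", list(text.shape), "BYTES"),
    httpclient.InferInput("max_tokens", list(max_tokens.shape), "INT32"),
]
inputs[0].set_data_from_numpy(text)
inputs[1].set_data_from_numpy(max_tokens)

# Send the request and read back the generated text.
result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```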
Finally, NVIDIA submitted Llama 2 70B results in the Open division using a single H200 GPU, showcasing the performance gains possible from more extensive model-level optimizations.
First, depth and width pruning were applied, greatly reducing the total number of parameters by removing the layers and MLP intermediate dimensions that contribute least to the overall model output.
Then, to recover accuracy, fine-tuning was performed on the model using the MLPerf OpenORCA development dataset. The final pruned model has 32 layers and 14,336 MLP intermediate dimensions—a significant reduction compared to the original model’s 80 layers and 28,672 intermediate dimensions.
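To get a rough sense of how much pruning removes, the parameter count of a Llama-style decoder can be estimated from its layer count, hidden size, and MLP width. The sketch below plugs in the published Llama 2 70B configuration (hidden size 8,192, 64 query heads with 8 KV heads, 32,000-token vocabulary) and ignores small terms such as norms; it is an approximation, not the exact accounting used for the submission.

```python
# Approximate parameter counts for the original and pruned Llama 2 70B
# configurations. Hidden size, head counts, and vocabulary size are the
# published Llama 2 70B values; small terms (norms, biases) are ignored.
def llama_params(layers, hidden=8192, intermediate=28672,
                 q_heads=64, kv_heads=8, head_dim=128, vocab=32000):
    attn = hidden * (q_heads * head_dim) * 2           # Q and output projections
    attn += hidden * (kv_heads * head_dim) * 2         # K and V projections (GQA)
    mlp = 3 * hidden * intermediate                     # gate, up, and down projections
    return layers * (attn + mlp) + 2 * vocab * hidden   # plus embedding and LM head

print(f"original: {llama_params(80) / 1e9:.1f}B")                      # ~69B
print(f"pruned:   {llama_params(32, intermediate=14336) / 1e9:.1f}B")  # ~16.6B
```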
Although model accuracy falls slightly below the 99% threshold, the model is significantly smaller, enabling much higher offline throughput of 11,189 tokens/s, almost 3x the throughput achieved in the Closed division.
MLPerf Inference v4.1, Data Center, Open Division. Result from entry 4.1-0089. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
Mixtral 8x7B
A new LLM workload was added in this round, based on the Mixtral 8x7B model, developed by Mistral AI. Mixtral 8x7B employs a sparse mixture of experts (MoE) architecture with eight experts, 46.7B total parameters, with two experts and 12.9B parameters used per token.
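The defining property of this architecture is that only two of the eight expert MLPs run for each token, which is why roughly 12.9B of the 46.7B parameters are active per token. The following is a generic top-2 MoE routing layer in PyTorch that illustrates the idea; it is not Mixtral's exact implementation, and the dimensions are placeholders.

```python
# Generic top-2 mixture-of-experts layer: a router picks 2 of 8 expert MLPs
# per token, so only a fraction of the total parameters is used per token.
# Conceptual illustration only, not Mixtral's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    def __init__(self, hidden=4096, ffn=14336, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, ffn), nn.SiLU(), nn.Linear(ffn, hidden))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):                              # x: [tokens, hidden]
        logits = self.router(x)                        # [tokens, num_experts]
        scores, expert_idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(scores, dim=-1)            # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # run only the selected experts
            for e in range(len(self.experts)):
                mask = expert_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = Top2MoE()
print(moe(torch.randn(4, 4096)).shape)  # torch.Size([4, 4096])
```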
NVIDIA submitted Mixtral 8x7B results using both H100 and H200 GPUs, running TensorRT-LLM software and making extensive use of FP8 precision.
| MLPerf Mixtral 8x7B benchmark | Server tokens/s | Offline tokens/s |
|---|---|---|
| 8 H200 | 57,177 | 59,022 |
| 8 H100 | 50,796 | 52,416 |
| H200 advantage | 1.13x | 1.13x |
MLPerf Inference v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.1-0043, 4.1-0048. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
Only systems using NVIDIA GPUs submitted Mixtral 8x7B results. NVIDIA continues to submit great results on workloads as they are added to the MLPerf benchmark suite, showing that the NVIDIA platform delivers high performance and exceptional versatility for the large and expanding universe of AI models.
Stable Diffusion XL
This round, H200 performance on Stable Diffusion XL improved to about two images generated per second, a gain of up to 27% compared to the prior round and a new record for the benchmark.
| MLPerf Stable Diffusion XL improvements since v4.0 | Server | Offline |
|---|---|---|
| 8 H200 (700 W) | 1.22x | 1.27x |
| 8 H100 (700 W) | 1.17x | 1.25x |
MLPerf Inference v4.0 and v4.1 Closed, Data Center. Results retrieved from www.mlperf.org on August 28, 2024. All results using eight GPUs and retrieved from the following entries: 4.0-0062, 4.0-0070, 4.1-0043, 4.1-0048. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
These gains were due primarily to several key optimizations to the NVIDIA software stack, including:
- UNet FP8 support: By using TensorRT Model Optimizer, the NVIDIA submission this round used FP8 precision while meeting accuracy requirements. This represented the largest portion of the round-to-round performance gain on Hopper GPUs.
- VAE INT8 support: The NVIDIA submission this round was able to quantize certain layers to INT8 and others to FP16, compared to use of FP32 in the prior round. This improved VAE performance by 70%, translating into about a 4% end-to-end speedup.
- Variational autoencoder (VAE) batch splitting: The VAE portion of the SDXL pipeline has a very large memory footprint. By employing batch splitting, the NVIDIA submission this round was able to increase the batch size from 8 to 64, improving performance (see the sketch after this list).
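Below is a minimal sketch of the batch-splitting idea, assuming a generic VAE decoder object with a `decode` method that returns an image tensor: the large batch of denoised latents is decoded in small chunks so that peak activation memory stays bounded while the rest of the pipeline runs at the larger batch size.

```python
# Illustrative batch splitting for a memory-hungry VAE decode stage: decode a
# large batch of SDXL latents in smaller chunks to cap peak activation memory.
# `vae` is assumed to be any decoder whose .decode(latents) returns a tensor.
import torch

def decode_in_chunks(vae, latents: torch.Tensor, chunk_size: int = 8) -> torch.Tensor:
    images = []
    with torch.no_grad():
        for chunk in torch.split(latents, chunk_size, dim=0):
            images.append(vae.decode(chunk))  # only `chunk_size` samples resident at once
    return torch.cat(images, dim=0)

# Usage: the pipeline can run the UNet at batch 64 while the VAE decodes 8 at a time.
# images = decode_in_chunks(pipeline.vae, latents, chunk_size=8)
```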
Additionally, NVIDIA submitted SDXL results in the Open division, combining these optimizations with a Latent Consistency Model (LCM) to reach offline throughput of 11 samples/s on H200, almost 5x the Closed division result. This showcases the further performance gains possible from more extensive model-level optimizations for diffusion models.
A giant generative AI leap on Jetson AGX Orin
Jetson AGX Orin offers high AI compute performance, large unified memory, and comprehensive software for generative AI at the edge.
Through extensive software optimization, the NVIDIA Jetson AGX Orin 64 GB delivers a giant leap for generative AI models at the edge, with up to 6.2x more throughput and 2.4x better latency on the GPT-J 6B-parameter LLM benchmark. Generative AI models at the edge can transform sensor data, such as images and videos, into real-time, actionable insights with strong contextual awareness.
Backed by the NVIDIA software stack, Jetson AGX Orin is uniquely positioned as the leading platform for running transformer models like GPT-J, vision transformers, and Stable Diffusion at the edge. Developers can take advantage of other platform services, like Jetson Generative AI Lab and Jetson Platform Services, to bring great solutions to life.
| GPT-J (Edge) | Single stream latency (ms) | Offline tokens/s |
|---|---|---|
| Jetson AGX Orin 64 GB v4.1 | 4,176 | 64.47 |
| Jetson AGX Orin 64 GB v4.0 | 10,132 | 10.35 |
MLPerf Inference v4.0 and v4.1 Closed, Edge. Results retrieved from www.mlperf.org on August 28, 2024, from the following entries: 4.0-0072, 4.1-0051. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
This performance boost was made possible through numerous software optimizations to TensorRT-LLM, including the use of in-flight batching, as well as the application of INT4 Activation-aware Weight Quantization (AWQ). AWQ keeps the 1% of "salient" weights in higher-precision FP16 and quantizes the remaining weights to four-bit integer (INT4) precision. This technique significantly reduces the memory footprint, enabling larger batches to be processed at once and dramatically increasing inference throughput.
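Conceptually, the weight-only scheme described above can be pictured as in the sketch below: the most activation-salient input channels stay in FP16 while the rest are quantized to 4-bit integers with per-channel scales. This is a simplified illustration of the idea, not the TensorRT-LLM or AWQ reference implementation.

```python
# Conceptual sketch of mixed-precision weight-only quantization in the spirit
# of AWQ: keep ~1% "salient" input channels in FP16 and quantize the rest to
# INT4 with per-output-channel scales. Simplified illustration, not real kernels.
import numpy as np

def quantize_int4(w):
    # Symmetric per-output-channel INT4 quantization: integer values in [-8, 7].
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def awq_style_quantize(weight, act_magnitude, salient_frac=0.01):
    # weight: [out, in]; act_magnitude: mean |activation| per input channel.
    n_salient = max(1, int(salient_frac * weight.shape[1]))
    salient = np.argsort(act_magnitude)[-n_salient:]       # most active channels
    keep_fp16 = weight[:, salient].astype(np.float16)      # stays high precision
    rest = np.delete(weight, salient, axis=1)
    q, scale = quantize_int4(rest)                          # 4-bit for everything else
    return keep_fp16, q, scale, salient

w = np.random.randn(4096, 4096).astype(np.float32)
act = np.abs(np.random.randn(4096)).astype(np.float32)
fp16_cols, int4_w, scales, idx = awq_style_quantize(w, act)
print(fp16_cols.shape, int4_w.shape)  # (4096, 40) (4096, 4056)
```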
NVIDIA also submitted results for the demanding Llama 2 70B model running on Jetson AGX Orin in the Open division, demonstrating the possibilities of more extensive model optimization techniques. The submitted model was the same 16B depth- and width-pruned model used in the H200 Open division submission. INT4 AWQ, which was used in the Closed division GPT-J submission on Jetson AGX Orin, was also applied here. Together, parameter pruning and INT4 quantization shrink the memory footprint of the Llama 2 70B model weights to only around 8 GB.
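The roughly 8 GB figure follows from the pruned parameter count and 4-bit weights; a quick estimate (ignoring the small fraction of weights kept at higher precision) looks like this:

```python
# Back-of-the-envelope weight footprint for the pruned Llama 2 70B model,
# ignoring the small fraction of salient weights kept in FP16.
pruned_params = 16e9      # ~16B parameters after depth and width pruning
bits_per_weight = 4       # INT4 AWQ

print(f"{pruned_params * bits_per_weight / 8 / 1e9:.0f} GB")  # ~8 GB
```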
Conclusion
In its debut submission, NVIDIA Blackwell delivered outstanding performance: up to 4x more than H100 on Llama 2 70B. Among available solutions, Hopper GPUs delivered the highest multi-GPU generative AI performance and the highest performance per accelerator across all workloads, and they continue to benefit from ongoing software optimization. NVIDIA Triton Inference Server also achieved great results this round, delivering similar performance to the bare-metal submissions. For edge and embedded AI, Jetson AGX Orin and the rich NVIDIA software stack enable running capable models, like GPT-J 6B, with performance improving by up to 6.2x in just one round.
NVIDIA continues to innovate rapidly across the full technology stack to deliver world-class inference performance on today’s models as well as tomorrow’s, from the largest AI factories to compact, low-power edge devices.
Acknowledgments
The work of many NVIDIA employees made these outstanding results happen. We would like to acknowledge the tireless efforts of Chen-Han Yu, Kai Xu, Justin Xin, Asma Kuriparambil Thekkumpate, Linnan Wang, Wei-Ming Chen, Kaiyu Xie, Shobhit Verma, and Viraat Chandra, among many others.