Boosting Llama 3.1 405B Performance up to 1.44x with NVIDIA TensorRT Model Optimizer on NVIDIA H200 GPUs

The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. 

With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To deliver both low latency to optimize the user experience and high throughput to optimize cost, a high-performance, full-stack platform is required.

This post shows how the FP8 quantization recipe in NVIDIA TensorRT Model Optimizer, combined with NVIDIA TensorRT-LLM, delivers up to 1.44x more throughput than the model's launch-day performance with Meta's official Llama 3.1 FP8 quantization recipe. In addition, TensorRT Model Optimizer can fit Llama 3.1 405B on only two GPUs using INT4 AWQ quantization. These TensorRT Model Optimizer improvements will be available in the next release, v0.17, in early September. Learn more about advancements coming to low-latency Llama 3.1 on NVIDIA GPUs.

Outstanding Llama 3.1 405B inference throughput with TensorRT-LLM 

TensorRT-LLM already delivered outstanding Llama 3.1 405B inference throughput on the day the model was released. This was the result of many optimizations, including in-flight batching, KV caching, and optimized attention kernels, as well as lower-precision compute to accelerate inference performance.

TensorRT-LLM added support for the official Llama FP8 quantization recipe at row-wise granularity. This involves calculating a static scaling factor for each output weight channel (before execution) and a dynamic scaling factor for each token (during execution) to preserve maximum accuracy.
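To make this concrete, the minimal PyTorch sketch below simulates the two kinds of scaling factors: a static per-output-channel scale for the weights, computed once ahead of time, and a dynamic per-token scale for the activations, computed at run time. The shapes and simple max-abs calibration are illustrative assumptions, not the TensorRT-LLM kernels themselves.

```python
# Minimal sketch of row-wise FP8 scaling: a static per-output-channel scale
# for the weights (computed once, before execution) and a dynamic per-token
# scale for the activations (computed during execution).
import torch

FP8_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def to_fp8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

# Static weight scale: one value per output channel (row of the weight matrix).
weight = torch.randn(4096, 4096)                 # [out_features, in_features]
w_scale = weight.abs().amax(dim=1, keepdim=True) / FP8_MAX
w_fp8 = to_fp8(weight, w_scale)                  # done offline

# Dynamic activation scale: one value per token, computed at run time.
acts = torch.randn(8, 4096)                      # [num_tokens, in_features]
a_scale = acts.abs().amax(dim=1, keepdim=True) / FP8_MAX
a_fp8 = to_fp8(acts, a_scale)

# Dequantize-and-matmul for clarity; real kernels keep the GEMM in FP8
# and fold both scales into the epilogue.
out = (a_fp8.float() * a_scale) @ (w_fp8.float() * w_scale).t()
```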

TensorRT-LLM also supports user-defined kernels, such as the FBGEMM matrix multiplications used by the Llama 3.1 models, through plug-ins that are explicitly inserted into the network graph definition at compile time to accelerate execution.

Boosting performance up to 1.44x with TensorRT Model Optimizer 

To boost performance further, NVIDIA developed a custom FP8 post-training quantization (PTQ) recipe. This recipe, available through the TensorRT Model Optimizer library, enables higher Llama 3.1 405B throughput and lower latency while delivering the same accuracy. This means that developers can now run the model more cost effectively. 

This custom quantization recipe incorporates FP8 KV cache quantization, as well as self-attention static quantization. The latter technique pre-computes scaling factors at compile time, rather than at run time, reducing inference compute overhead. This scaling is applied at per-tensor granularity.
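As a rough illustration, the sketch below follows TensorRT Model Optimizer's documented post-training quantization pattern: insert quantizers with mtq.quantize and run a small calibration loop so the static scaling factors are collected ahead of time. The FP8_DEFAULT_CFG config, checkpoint name, and calibration_prompts list are assumptions for illustration; the exact custom recipe described above is not reproduced here.

```python
# Minimal sketch of FP8 post-training quantization with TensorRT Model Optimizer.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "meta-llama/Llama-3.1-405B-Instruct"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

def forward_loop(model):
    # Run a small calibration set through the model so static (per-tensor)
    # scaling factors are pre-computed instead of being derived at run time.
    for prompt in calibration_prompts:  # assumed list of calibration strings
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# Insert quantizers and calibrate; weights, activations, and (optionally)
# the KV cache can be quantized to FP8 through the chosen config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```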

Table 1 shows maximum throughput performance, representing “offline” use cases, across a range of input and output sequence lengths, running on a system based on an 8-GPU HGX H200. This system features eight NVIDIA H200 Tensor Core GPUs, each with 141 GB of fast HBM3e memory, and four NVLink Switches providing 900 GB/s of GPU-to-GPU bandwidth.

Maximum Throughput Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8        463.1          320.1             71.5
Official Llama FP8 Recipe           399.9          230.8             49.6
Speedup                             1.16x          1.39x             1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

TensorRT Model Optimizer recipe data measured on 8/24/2024. Output tokens/second is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency). DGX H200, TP8, FP8 batch size tuned for maximum node throughput, TensorRT-LLM version 0.13.0.dev2024082000, TensorRT Model Optimizer v0.17.0a (pre-release).
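As a quick sanity check, the speedup row in Table 1 is simply the ratio of the two throughput rows at each sequence-length configuration:

```python
# Verify the Table 1 speedups: throughput is output tokens per second
# (total generated tokens / total latency), and speedup is the ratio of the
# TensorRT Model Optimizer FP8 row to the official Llama FP8 row.
modelopt_fp8 = [463.1, 320.1, 71.5]   # tokens/s at ISL|OSL 2,048|128, 32,768|2,048, 120,000|2,048
official_fp8 = [399.9, 230.8, 49.6]

for new, base in zip(modelopt_fp8, official_fp8):
    print(f"{new / base:.2f}x")       # -> 1.16x, 1.39x, 1.44x
```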

Table 2 shows minimum latency performance using the same input and output sequence lengths. 

Batch Size = 1 Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8        49.6           44.2              27.2
Official Llama FP8 Recipe           37.4           33.1              22.8
Speedup                             1.33x          1.33x             1.19x

Table 2. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

TensorRT Model Optimizer recipe data measured on 8/24/2024. Output tokens/second is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency). DGX H200, TP8, FP8 batch size = 1, TensorRT-LLM version 0.13.0.dev2024082000, TensorRT Model Optimizer v0.17.0a (pre-release).

As these results show, H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer software are delivering notably better performance on Llama 3.1 405B, in both latency-optimized and throughput-optimized scenarios. 

Additionally, the TensorRT Model Optimizer FP8 recipe achieved accuracy on par with the official Llama 3.1 FP8 recipe on both the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Accuracy Benchmarks

                                        MT-Bench    MMLU
Official Llama FP8 Recipe               9.14        0.86
TensorRT Model Optimizer FP8 Recipe     9.18        0.86
TensorRT Model Optimizer INT4 AWQ       9.12        0.86

Table 3. Inference accuracy results of Llama 3.1 405B using MMLU and MT-Bench

Measured with TensorRT-LLM, the new PTQ recipe achieves an MT-Bench score of 9.18 and an MMLU score of 0.86, compared to 9.14 and 0.86, respectively, with Meta's official FP8 recipe.

Fitting Llama 3.1 405B on just two H200 GPUs with INT4 AWQ

In addition to the improved FP8 recipe, developers with hardware resource constraints can use INT4 AWQ in TensorRT Model Optimizer to further compress the model. The INT4 AWQ technique reduces the required memory footprint significantly, enabling a very large LLM like Llama 3.1 405B to fit on just two H200 GPUs. 
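A rough back-of-the-envelope estimate, which ignores the KV cache, activations, and scaling-factor overhead, illustrates why the precision change matters for fitting the model on two GPUs:

```python
# Rough weight-memory estimate for Llama 3.1 405B at different precisions.
# This counts weights only, so the real footprint (KV cache, activations,
# per-group scaling factors) is somewhat larger.
params = 405e9
gpu_mem_gb = 2 * 141  # two H200 GPUs with 141 GB each

for precision, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weights_gb = params * bytes_per_param / 1e9
    verdict = "fits" if weights_gb < gpu_mem_gb else "does not fit"
    print(f"{precision}: ~{weights_gb:.0f} GB of weights -> {verdict} in {gpu_mem_gb} GB")
```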

This works by compressing the model weights in the linear layers (matmuls) down to 4-bit integers, while the activations are kept in FP16. The activation-aware weight quantization (AWQ) method reduces the error of low-bit, weight-only quantization by preserving the most important (salient) weights through scaling of the corresponding weight channels.
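The sketch below illustrates the idea in simplified form: group-wise 4-bit weight quantization combined with a per-input-channel scale that boosts salient channels before quantization and is folded back into the FP16 activations. The scale heuristic and shapes are illustrative assumptions; the actual AWQ algorithm searches for the scales using activation statistics.

```python
# Conceptual sketch of AWQ-style INT4 weight-only quantization.
import torch

def int4_quantize(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Symmetric group-wise INT4 fake-quantization of a weight matrix."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True) / 7          # INT4 range [-7, 7]
    return (torch.round(wg / scale).clamp(-7, 7) * scale).reshape(out_f, in_f)

weight = torch.randn(4096, 4096)   # [out_features, in_features]
acts = torch.randn(8, 4096)        # [num_tokens, in_features], kept in FP16

# Per-input-channel scale derived from activation magnitudes (simplified;
# AWQ searches over candidate scales rather than using them directly).
s = acts.abs().mean(dim=0).clamp(min=1e-5) ** 0.5

w_q = int4_quantize(weight * s)    # quantize the scaled (salient-boosted) weights
out = (acts / s) @ w_q.t()         # fold the inverse scale into the activations
```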

To execute at this lower precision with high performance and reduced memory usage, TensorRT-LLM provides custom kernels. Tables 4 and 5 show the maximum throughput and minimum latency performance measurements. Measured with TensorRT-LLM, INT4 AWQ achieves an MT-Bench score of 9.12 and an MMLU score of 0.86, comparable to the 9.14 and 0.86, respectively, of Meta's official Llama 3.1 FP8 recipe.

Maximum Throughput Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ   75.6           28.7              16.2

Table 4. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

TensorRT Model Optimizer recipe data measured on 8/24/2024. Output tokens/second is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency). DGX H200, TP2, INT4 AWQ batch size tuned for maximum node throughput, TensorRT-LLM version 0.13.0.dev2024082000, TensorRT Model Optimizer v0.17.0a (pre-release).

Batch Size = 1 Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ   21.6           18.7              12.8

Table 5. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

TensorRT Model Optimizer recipe data measured on 8/24/2024. Output tokens/second is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency). DGX H200, TP2, INT4 AWQ batch size = 1, TensorRT-LLM version 0.13.0.dev2024082000, TensorRT Model Optimizer v0.17.0a (pre-release).

Get started

With the NVIDIA accelerated computing platform, you can build models and supercharge your applications with the most performant Llama 3.1 models on any platform, from the data center and cloud to local workstations. Enterprises seeking the fastest time to value can use NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, which offers optimized inference on Llama 3.1 models from NVIDIA and its partner ecosystem.

NVIDIA is committed to advancing, optimizing, and contributing to open-source software and models. Learn more about NVIDIA TensorRT-LLM and NVIDIA TensorRT Model Optimizer. The quantized FP8 and INT4 AWQ checkpoints from Model Optimizer will soon be available for download on Hugging Face.

Acknowledgments

We would like to thank Chenjie Luo, Lalit Vaidya, and Jie-Fang Zhang for their efforts in supporting this post.
