Boosting Llama 3.1 405B Performance up to 1.44x with NVIDIA TensorRT Model Optimizer on NVIDIA H200 GPUs

The Llama 3.1 405B large language model (LLM), developed by Meta, is an open-source community model that delivers state-of-the-art performance and supports a variety of use cases. 

With 405 billion parameters and support for context lengths of up to 128K tokens, Llama 3.1 405B is also one of the most demanding LLMs to run. To deliver both low latency to optimize the user experience and high throughput to optimize cost, a high-performance, full-stack platform is required.

This post shows how the FP8 quantization recipe in NVIDIA TensorRT Model Optimizer, combined with NVIDIA TensorRT-LLM, delivers up to 1.44x more throughput than the model's launch-day performance with Meta's official Llama 3.1 FP8 quantization recipe. In addition, TensorRT Model Optimizer can fit Llama 3.1 405B on only two GPUs using INT4 AWQ quantization. These TensorRT Model Optimizer improvements will be available in the next release, v0.17, in early September. Learn more about advancements coming to low-latency Llama 3.1 on NVIDIA GPUs.

Outstanding Llama 3.1 405B inference throughput with TensorRT-LLM 

TensorRT-LLM already delivered outstanding Llama 3.1 405B inference throughput on the day the model was released. This was the result of many optimizations, including in-flight batching, KV caching, and optimized attention kernels, as well as lower-precision compute to accelerate inference performance.

TensorRT-LLM added support for the official Llama FP8 quantization recipe at row-wise granularity. This involves calculating a static scaling factor for each output weight channel (before execution) and a dynamic scaling factor for each token (during execution) to preserve maximum accuracy.
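To make this concrete, the minimal PyTorch sketch below simulates the two kinds of scaling factors: a static per-output-channel scale for the weights, computed once ahead of time, and a dynamic per-token scale for the activations, computed at run time. The shapes and simple max-abs calibration are illustrative assumptions, not the TensorRT-LLM kernels themselves.

```python
# Minimal sketch of row-wise FP8 scaling: a static per-output-channel scale
# for the weights (computed once, before execution) and a dynamic per-token
# scale for the activations (computed during execution).
import torch

FP8_MAX = 448.0  # largest representable magnitude in FP8 E4M3

def to_fp8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)

# Static weight scale: one value per output channel (row of the weight matrix).
weight = torch.randn(4096, 4096)                 # [out_features, in_features]
w_scale = weight.abs().amax(dim=1, keepdim=True) / FP8_MAX
w_fp8 = to_fp8(weight, w_scale)                  # done offline

# Dynamic activation scale: one value per token, computed at run time.
acts = torch.randn(8, 4096)                      # [num_tokens, in_features]
a_scale = acts.abs().amax(dim=1, keepdim=True) / FP8_MAX
a_fp8 = to_fp8(acts, a_scale)

# Dequantize-and-matmul for clarity; real kernels keep the GEMM in FP8
# and fold both scales into the epilogue.
out = (a_fp8.float() * a_scale) @ (w_fp8.float() * w_scale).t()
```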

TensorRT-LLM also supports user-defined kernels, such as the FBGEMM matrix multiplications used by the Llama 3.1 models, through plug-ins that are explicitly inserted into the network graph definition at compile time to accelerate execution.

Boosting performance up to 1.44x with TensorRT Model Optimizer 

To boost performance further, NVIDIA developed a custom FP8 post-training quantization (PTQ) recipe. This recipe, available through the TensorRT Model Optimizer library, enables higher Llama 3.1 405B throughput and lower latency while delivering the same accuracy. This means that developers can now run the model more cost effectively. 

This custom quantization recipe incorporates FP8 KV cache quantization, as well as self-attention static quantization. The latter technique pre-computes scaling factors at compile time, rather than at run time, reducing inference compute overhead. This scaling is applied at per-tensor granularity.
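As a rough illustration, the sketch below follows TensorRT Model Optimizer's documented post-training quantization pattern: insert quantizers with mtq.quantize and run a small calibration loop so the static scaling factors are collected ahead of time. The FP8_DEFAULT_CFG config, checkpoint name, and calibration_prompts list are assumptions for illustration; the exact custom recipe described above is not reproduced here.

```python
# Minimal sketch of FP8 post-training quantization with TensorRT Model Optimizer.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "meta-llama/Llama-3.1-405B-Instruct"  # assumed checkpoint name
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

def forward_loop(model):
    # Run a small calibration set through the model so static (per-tensor)
    # scaling factors are pre-computed instead of being derived at run time.
    for prompt in calibration_prompts:  # assumed list of calibration strings
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# Insert quantizers and calibrate; weights, activations, and (optionally)
# the KV cache can be quantized to FP8 through the chosen config.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```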

Table 1 shows maximum throughput performance, representing “offline” use cases, across a range of input and output sequence lengths, running on a system based on an 8-GPU HGX H200. This system features eight NVIDIA H200 Tensor Core GPUs, each with 141 GB of fast HBM3e memory, and four NVLink Switches providing 900 GB/s of GPU-to-GPU bandwidth.

Maximum Throughput Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8        463.1          320.1             71.5
Official Llama FP8 Recipe           399.9          230.8             49.6
Speedup                             1.16x          1.39x             1.44x

Table 1. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

TensorRT Model Optimizer recipe data measured on 8/24/2024. Output tokens/second is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency). DGX H200, TP8, FP8 batch size tuned for maximum node throughput, TensorRT-LLM version 0.13.0.dev2024082000, TensorRT Model Optimizer v0.17.0a (pre-release).
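As a quick sanity check, the speedup row in Table 1 is simply the ratio of the two throughput rows at each sequence-length configuration:

```python
# Verify the Table 1 speedups: throughput is output tokens per second
# (total generated tokens / total latency), and speedup is the ratio of the
# TensorRT Model Optimizer FP8 row to the official Llama FP8 row.
modelopt_fp8 = [463.1, 320.1, 71.5]   # tokens/s at ISL|OSL 2,048|128, 32,768|2,048, 120,000|2,048
official_fp8 = [399.9, 230.8, 49.6]

for new, base in zip(modelopt_fp8, official_fp8):
    print(f"{new / base:.2f}x")       # -> 1.16x, 1.39x, 1.44x
```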

Table 2 shows minimum latency performance using the same input and output sequence lengths. 

Batch Size = 1 Performance – Output Tokens/Second
8 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    120,000 | 2,048
TensorRT Model Optimizer FP8        49.6           44.2              27.2
Official Llama FP8 Recipe           37.4           33.1              22.8
Speedup                             1.33x          1.33x             1.19x

Table 2. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

TensorRT Model Optimizer recipe data measured on 8/24/2024. Output tokens/second is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency). DGX H200, TP8, FP8 batch size = 1, TensorRT-LLM version 0.13.0.dev2024082000, TensorRT Model Optimizer v0.17.0a (pre-release).

As these results show, H200 GPUs with TensorRT-LLM and TensorRT Model Optimizer software are delivering notably better performance on Llama 3.1 405B, in both latency-optimized and throughput-optimized scenarios. 

Additionally, the TensorRT Model Optimizer FP8 recipe achieved accuracy on par with the official Llama 3.1 FP8 recipe on both the Massive Multitask Language Understanding (MMLU) and MT-Bench benchmarks.

Accuracy Benchmarks

                                        MT-Bench    MMLU
Official Llama FP8 Recipe               9.14        0.86
TensorRT Model Optimizer FP8 Recipe     9.18        0.86
TensorRT Model Optimizer INT4 AWQ       9.12        0.86

Table 3. Inference accuracy results of Llama 3.1 405B using MMLU and MT-Bench

Measured with TensorRT-LLM, the new PTQ recipe achieves an MT-Bench score of 9.18 and an MMLU score of 0.86, compared to 9.14 and 0.86, respectively, with Meta's official FP8 recipe.

Fitting Llama 3.1 405B on just two H200 GPUs with INT4 AWQ

In addition to the improved FP8 recipe, developers with hardware resource constraints can use INT4 AWQ in TensorRT Model Optimizer to further compress the model. The INT4 AWQ technique reduces the required memory footprint significantly, enabling a very large LLM like Llama 3.1 405B to fit on just two H200 GPUs. 
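A rough back-of-the-envelope estimate, which ignores the KV cache, activations, and scaling-factor overhead, illustrates why the precision change matters for fitting the model on two GPUs:

```python
# Rough weight-memory estimate for Llama 3.1 405B at different precisions.
# This counts weights only, so the real footprint (KV cache, activations,
# per-group scaling factors) is somewhat larger.
params = 405e9
gpu_mem_gb = 2 * 141  # two H200 GPUs with 141 GB each

for precision, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weights_gb = params * bytes_per_param / 1e9
    verdict = "fits" if weights_gb < gpu_mem_gb else "does not fit"
    print(f"{precision}: ~{weights_gb:.0f} GB of weights -> {verdict} in {gpu_mem_gb} GB")
```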

This works by compressing the model weights in the linear layers (matmuls) down to 4-bit integers, while the activations are kept in FP16. The activation-aware weight quantization (AWQ) method reduces the error of low-bit, weight-only quantization by preserving the most important (salient) weights through scaling of the corresponding weight channels.
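The sketch below illustrates the idea in simplified form: group-wise 4-bit weight quantization combined with a per-input-channel scale that boosts salient channels before quantization and is folded back into the FP16 activations. The scale heuristic and shapes are illustrative assumptions; the actual AWQ algorithm searches for the scales using activation statistics.

```python
# Conceptual sketch of AWQ-style INT4 weight-only quantization.
import torch

def int4_quantize(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Symmetric group-wise INT4 fake-quantization of a weight matrix."""
    out_f, in_f = w.shape
    wg = w.reshape(out_f, in_f // group_size, group_size)
    scale = wg.abs().amax(dim=-1, keepdim=True) / 7          # INT4 range [-7, 7]
    return (torch.round(wg / scale).clamp(-7, 7) * scale).reshape(out_f, in_f)

weight = torch.randn(4096, 4096)   # [out_features, in_features]
acts = torch.randn(8, 4096)        # [num_tokens, in_features], kept in FP16

# Per-input-channel scale derived from activation magnitudes (simplified;
# AWQ searches over candidate scales rather than using them directly).
s = acts.abs().mean(dim=0).clamp(min=1e-5) ** 0.5

w_q = int4_quantize(weight * s)    # quantize the scaled (salient-boosted) weights
out = (acts / s) @ w_q.t()         # fold the inverse scale into the activations
```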

To execute at this lower precision with high performance and reduced memory usage, TensorRT-LLM provides custom kernels. Tables 4 and 5 show the maximum throughput and minimum latency performance measurements. Measured with TensorRT-LLM, INT4 AWQ achieves an MT-Bench score of 9.12 and an MMLU score of 0.86, comparable to the 9.14 and 0.86, respectively, of Meta's official Llama 3.1 FP8 recipe.

Maximum Throughput Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ   75.6           28.7              16.2

Table 4. Maximum throughput performance of Llama 3.1 405B with NVIDIA internal measurements

TensorRT Model Optimizer recipe data measured on 8/24/2024. Output tokens/second is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency). DGX H200, TP2, INT4 AWQ batch size tuned for maximum node throughput, TensorRT-LLM version 0.13.0.dev2024082000, TensorRT Model Optimizer v0.17.0a (pre-release).

Batch Size = 1 Performance – Output Tokens/Second
2 NVIDIA H200 Tensor Core GPUs

Input | Output Sequence Lengths     2,048 | 128    32,768 | 2,048    60,000 | 2,048
TensorRT Model Optimizer INT4 AWQ   21.6           18.7              12.8

Table 5. Minimum latency performance of Llama 3.1 405B with NVIDIA internal measurements

TensorRT Model Optimizer recipe data measured on 8/24/2024. Output tokens/second is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency). DGX H200, TP2, INT4 AWQ batch size = 1, TensorRT-LLM version 0.13.0.dev2024082000, TensorRT Model Optimizer v0.17.0a (pre-release).

Get started

With the NVIDIA accelerated computing platform, you can build models and supercharge your applications with the most performant Llama 3.1 models on any platform, from the data center and cloud to local workstations. Enterprises seeking the fastest time to value can use NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, which offers optimized inference on Llama 3.1 models from NVIDIA and its partner ecosystem.

NVIDIA is committed to advancing, optimizing, and contributing to open-source software and models. Learn more about NVIDIA TensorRT-LLM and NVIDIA TensorRT Model Optimizer. The quantized FP8 and INT4 AWQ checkpoints from Model Optimizer will soon be available for download on Hugging Face.

Acknowledgments

We would like to thank Chenjie Luo, Lalit Vaidya, and Jie-Fang Zhang for their efforts in supporting this post.
