Top Stories

Llama 3.2 Full-Stack Optimizations Unlock High Performance on NVIDIA GPUs

Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B-parameter and 90B-parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for high-performance, cost-efficient serving across millions of GPUs worldwide, from the most powerful data center and cloud GPUs to local NVIDIA RTX workstations and low-power edge devices with NVIDIA Jetson.

Llama 3.2 VLMs support long context lengths of up to 128K text tokens, as well as a single image input at a resolution of 1120 x 1120 pixels. To deliver the low-latency responses that make for great user experiences, while also providing the high throughput needed for cost-efficient serving of these models, the NVIDIA platform is optimized at every layer of the technology stack.

Similarly, the Llama 3.2 SLMs have been optimized to run well on the millions of NVIDIA RTX PCs and workstations worldwide. They have also been quantized to allow for local deployment on edge devices with NVIDIA Jetson. For more information, see Deploying Accelerated Llama 3.2 from the Edge to the Cloud.

This post describes the full-stack optimizations that enable high throughput and low latency serving of Llama 3.2 models.

Accelerating Llama 3.2 AI inference throughput

The Llama 3.2 11B and Llama 3.2 90B models pair a vision encoder with a text decoder. The vision encoder is optimized for high-performance inference using the NVIDIA TensorRT library, and the text decoder is optimized using the NVIDIA TensorRT-LLM library.

The visual information from the vision encoder is fused into the Llama text decoder through a cross-attention mechanism that is supported in TensorRT-LLM. This enables the Llama 3.2 VLMs to efficiently generate text that accounts for visual reasoning and understanding in the context of the text input.
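
To make the mechanism concrete, here is a minimal PyTorch sketch of cross-attention between text and vision features. It illustrates the general technique only, not TensorRT-LLM's actual implementation; the class name, dimensions, and layer structure are assumptions for the example.

```python
import torch
import torch.nn as nn

class VisionTextCrossAttention(nn.Module):
    """Illustrative sketch: fuse vision features into a text decoder layer.

    Text hidden states act as queries; vision encoder outputs act as
    keys/values. Dimensions are placeholders, not Llama 3.2's config.
    """
    def __init__(self, hidden_dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_states: torch.Tensor, vision_states: torch.Tensor) -> torch.Tensor:
        # Each text position attends over every image patch embedding.
        fused, _ = self.attn(query=text_states, key=vision_states, value=vision_states)
        # Residual connection keeps the text-only pathway intact.
        return self.norm(text_states + fused)

# Toy shapes: batch of 1, 16 text tokens, 100 image patch embeddings.
layer = VisionTextCrossAttention(hidden_dim=512, num_heads=8)
text = torch.randn(1, 16, 512)
vision = torch.randn(1, 100, 512)
out = layer(text, vision)  # shape: (1, 16, 512)
```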

TensorRT supports the vision encoder in the BF16 data format. TensorRT-LLM supports the text decoder in both FP16 and BF16 formats. The official recipe released by Meta uses a BF16 text decoder, which is also used for our baseline performance measurements. To boost performance further, NVIDIA developed a custom FP8 post-training quantization (PTQ) recipe that leverages the FP8 support of the fourth-generation Tensor Cores in the NVIDIA Hopper architecture.

This recipe, available through the TensorRT Model Optimizer library, enables higher Llama 3.2 throughput and lower latency while delivering the same accuracy across numerous benchmarks including ScienceQA, OCRBench, TextVQA, and MMMU. This means that developers can now run the model more cost-effectively. 
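
As a rough sketch of what an FP8 PTQ flow with TensorRT Model Optimizer looks like: the quantization call below follows the library's documented API, but the model checkpoint and calibration data are illustrative placeholders, and the exact recipe NVIDIA used for Llama 3.2 may differ.

```python
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID; substitute your local checkpoint.
model_id = "meta-llama/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A handful of prompts stands in for a real calibration dataset.
calib_texts = ["The quick brown fox jumps over the lazy dog."] * 8

def forward_loop(model):
    # Run calibration samples through the model so ModelOpt can collect
    # activation statistics and compute FP8 scaling factors.
    for text in calib_texts:
        inputs = tokenizer(text, return_tensors="pt")
        model(**inputs)

# Apply the built-in FP8 post-training quantization configuration.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```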

Optimizations from TensorRT, TensorRT-LLM, and TensorRT Model Optimizer libraries are combined and available through production-ready deployments using NVIDIA NIM microservices. 
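
Once a NIM container for the model is running, it exposes an OpenAI-compatible HTTP endpoint. The sketch below shows what a multimodal request might look like; the endpoint URL, model identifier, and inline-image format are assumptions based on NVIDIA's published NIM examples and may differ by deployment and version.

```python
import base64
import requests

# Assumed: a NIM container serving Llama 3.2 vision locally on port 8000.
url = "http://localhost:8000/v1/chat/completions"
model = "meta/llama-3.2-90b-vision-instruct"  # assumed model identifier

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": model,
    "messages": [{
        "role": "user",
        # Inline base64 image format used in NVIDIA's vision NIM examples.
        "content": f'Describe this image. <img src="data:image/jpeg;base64,{image_b64}" />',
    }],
    "max_tokens": 256,
}

response = requests.post(url, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```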

Delivering high throughput and low latency   

Table 1 shows maximum throughput performance, representing offline use cases, across a range of input and output sequence lengths with a single input image at the maximum supported resolution of 1120 x 1120 pixels. Using a system based on the NVIDIA HGX H200 platform, we run the Llama 3.2 90B model on eight NVIDIA H200 Tensor Core GPUs, each with 141 GB of fast HBM3e memory, connected through NVLink and NVLink Switch with 900 GB/s of GPU-to-GPU bandwidth.

Maximum Throughput Performance – Output Tokens/Second
Eight NVIDIA H200 Tensor Core GPUs

Input Sequence Length | Output Sequence Length | Image Size | BF16 Encoder with FP8 Decoder
8,000 | 2,000 | 1120×1120 | 2,646
20,000 | 2,000 | 1120×1120 | 1,417
60,000 | 2,000 | 1120×1120 | 480

Table 1. Maximum throughput performance with NVIDIA internal measurements

TensorRT optimized NIM for VLMs version 1.1.0 recipe. The NIM server was restarted between each ISL/OSL configuration to set an optimal KV cache split. Data measured on 11/14/2024. Output tokens/second is inclusive of time to generate the first token: tok/s = total generated tokens / total latency. DGX H200, normalized to eight GPUs (by taking the TP profile maximizing throughput per GPU and multiplying that value by 8 to simulate a replica-parallel setup), batch size tuned for maximum node throughput, TensorRT Model Optimizer version 0.21 (pre-release), TensorRT-LLM version 0.16.0.dev, TensorRT version 10.4.0.
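
Expressed in code, the throughput metric and the eight-GPU normalization described above amount to the following (illustrative only; the numbers in the example are not from the measurements in this post):

```python
def output_tokens_per_second(total_generated_tokens: int, total_latency_s: float) -> float:
    # Inclusive of time to generate the first token:
    # tok/s = total generated tokens / total latency.
    return total_generated_tokens / total_latency_s

def normalize_to_node(per_gpu_tok_s: float, num_gpus: int = 8) -> float:
    # Take the tensor-parallel profile that maximizes per-GPU throughput and
    # multiply by the GPU count to simulate a replica-parallel setup.
    return per_gpu_tok_s * num_gpus

# Illustrative numbers only.
print(normalize_to_node(output_tokens_per_second(16_000, 48.4)))
```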

Table 2 shows minimum latency performance using the same input and output sequence lengths and input image size. 

Minimum Latency Performance – Output Tokens/Second
Eight NVIDIA H200 Tensor Core GPUs

Input Sequence Length | Output Sequence Length | Image Size | BF16 Encoder with FP8 Decoder
8,000 | 2,000 | 1120×1120 | 64
20,000 | 2,000 | 1120×1120 | 63
60,000 | 2,000 | 1120×1120 | 55

Table 2. Minimum latency performance with NVIDIA internal measurements

TensorRT optimized NIM for VLMs version 1.1.0 recipe. Data measured on 11/4/2024. Output tokens/second is inclusive of time to generate the first token: tok/s = total generated tokens / total latency. DGX H200, TP8, batch size = 1, TensorRT Model Optimizer version 0.21 (pre-release), TensorRT-LLM version 0.16.0.dev, TensorRT version 10.4.0.

As these results show, NVIDIA H200 GPUs with TensorRT-optimized software deliver exceptional performance on the Llama 3.2 90B VLM in both latency-optimized and throughput-optimized scenarios.

Throughput performance of GeForce RTX 4090 with ONNX Runtime

For Windows deployments, NVIDIA has optimized the Llama 3.2 SLMs to run efficiently using the ONNX Runtime Generative API with a DirectML backend. Performance measurements were made using the model checkpoint available in the NGC catalog: a version of the Llama 3.2 3B Instruct model quantized to AWQ INT4 using AutoAWQ and converted to ONNX using the ONNX Runtime Generative API.
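
A minimal generation loop with the ONNX Runtime Generative API looks roughly like the following, based on the library's published Python examples. The checkpoint path is a placeholder, and the exact API surface varies slightly across releases.

```python
import onnxruntime_genai as og

# Placeholder path: the AWQ INT4 ONNX checkpoint from the NGC catalog.
model = og.Model("path/to/llama-3.2-3b-instruct-onnx-int4")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
# For brevity this skips the Llama chat template; apply it for real use.
params.input_ids = tokenizer.encode("Explain FP8 quantization in one paragraph.")

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    # Stream each token to stdout as it is produced.
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```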

Maximum Throughput Performance – Output Tokens/Second
NVIDIA GeForce RTX 4090 GPU, ONNX Runtime Generative API with DirectML

Input Sequence Length | Output Sequence Length | Batch Size 1 | Batch Size 4
100 | 100 | 253 | 615
2,000 | 100 | 203 | 374
4,000 | 100 | 165 | 251

Table 3. Maximum throughput performance with NVIDIA internal measurements

ONNX Runtime Generative API with DirectML data measured on 10/07/2024. Output tokens/second is inclusive of time to generate the first token: tok/s = total generated tokens / total latency. GeForce RTX 4090 GPU.

Better performance on Llama 3.2 across platforms

With the NVIDIA accelerated computing platform, you can build models and supercharge your applications with the most performant Llama 3.2 models on any platform—from the data center and cloud to local workstations. Enterprises seeking the fastest time to value can use NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, which offers NVIDIA TensorRT optimized inference on Llama 3.2 and other models from NVIDIA and its partner ecosystem.

Acknowledgments

We would like to thank George Yuan, Alex Settle, and Chenjie Luo for their efforts in supporting this post.
