Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and cost-efficient serving across millions of GPUs worldwide – from our most powerful data center and cloud GPUs to local NVIDIA RTX workstations and even low-power edge devices with NVIDIA Jetson.
Llama 3.2 VLMs support long context lengths of up to 128K text tokens as well as a single image input at a resolution of 1120 x 1120 pixels. To enable low latency responses for great user experiences, while also providing high throughput for cost-efficient serving of these models, the NVIDIA platform is optimized at every layer of the technology stack.
Similarly, the Llama 3.2 SLMs have been optimized to run well on the millions of NVIDIA RTX PCs and workstations worldwide. They have also been quantized to allow for local deployment on edge devices with NVIDIA Jetson. For more information, see Deploying Accelerated Llama 3.2 from the Edge to the Cloud.
This post describes the full-stack optimizations that enable high throughput and low latency serving of Llama 3.2 models.
Accelerating Llama 3.2 AI inference throughput
The Llama 3.2 11B and Llama 3.2 90B models pair a vision encoder with a text decoder. The vision encoder is optimized for high-performance inference using the NVIDIA TensorRT library, and the text decoder is optimized using the NVIDIA TensorRT-LLM library.
The visual information from the vision encoder is fused into the Llama text decoder with a cross-attention mechanism that is supported in TensorRT-LLM. This enables the Llama 3.2 VLMs to efficiently generate text by taking into account visual reasoning and understanding in context with the text input.
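At a high level, this fusion works as in the sketch below: the text decoder's hidden states act as queries that attend to the projected vision-encoder output. This is a minimal, PyTorch-style illustration with hypothetical module names and dimensions, not the TensorRT-LLM implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention block: text hidden states attend to
    vision-encoder embeddings (names and sizes are hypothetical)."""
    def __init__(self, hidden_dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_hidden: torch.Tensor, vision_embeds: torch.Tensor) -> torch.Tensor:
        # Queries come from the text decoder; keys/values come from the projected
        # vision-encoder output, so generated text can condition on image features.
        attn_out, _ = self.cross_attn(query=text_hidden, key=vision_embeds, value=vision_embeds)
        return self.norm(text_hidden + attn_out)

# Example shapes: batch of 1, 32 text tokens, 1,601 vision patch embeddings (illustrative).
text_hidden = torch.randn(1, 32, 4096)
vision_embeds = torch.randn(1, 1601, 4096)
fused = CrossAttentionFusion()(text_hidden, vision_embeds)
print(fused.shape)  # torch.Size([1, 32, 4096])
```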
TensorRT supports the vision encoder in the BF16 data format. TensorRT-LLM supports the text decoder in both FP16 and BF16 formats. The official recipe released by Meta uses a BF16 text decoder, which is also used for our baseline performance measurements. To boost performance further, NVIDIA developed a custom FP8 post-training quantization (PTQ) recipe that leverages the FP8 support in the fourth-generation Tensor Cores of the NVIDIA Hopper architecture.
This recipe, available through the TensorRT Model Optimizer library, enables higher Llama 3.2 throughput and lower latency while delivering the same accuracy across numerous benchmarks including ScienceQA, OCRBench, TextVQA, and MMMU. This means that developers can now run the model more cost-effectively.
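For reference, a minimal sketch of FP8 PTQ with the TensorRT Model Optimizer (modelopt) Python API is shown below. The checkpoint path, calibration prompts, and configuration choices are illustrative assumptions and do not reproduce the exact recipe used for the published benchmarks.

```python
# Minimal FP8 post-training quantization sketch with TensorRT Model Optimizer (modelopt).
# The checkpoint path and calibration prompts are placeholders, not the published recipe.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/path/to/llama-text-decoder"  # assumed local Llama-family text decoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="cuda"
)

calib_prompts = [
    "Describe the contents of the attached chart.",
    "Summarize the key findings of the report.",
]  # tiny illustrative calibration set; a real recipe uses a representative dataset

def forward_loop(quant_model):
    # Run calibration prompts so modelopt can collect FP8 activation scaling factors.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            quant_model(**inputs)

# Apply the default FP8 PTQ configuration to weights and activations.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```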
Optimizations from TensorRT, TensorRT-LLM, and TensorRT Model Optimizer libraries are combined and available through production-ready deployments using NVIDIA NIM microservices.
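Once a Llama 3.2 vision NIM is running, it exposes an OpenAI-compatible endpoint. The snippet below is an illustrative client call; the base URL, image URL, and model identifier are assumptions that depend on how the microservice is deployed.

```python
# Illustrative request to a locally hosted Llama 3.2 vision NIM through its
# OpenAI-compatible endpoint; URL and model name depend on the deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.2-90b-vision-instruct",  # assumed NIM model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```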
Delivering high throughput and low latency
Table 1 shows maximum throughput performance, representing offline use cases, across a range of input and output sequence lengths with a single input image at the maximum supported resolution of 1120 x 1120 pixels. Using a system based on the NVIDIA HGX H200 platform, we run the Llama 3.2 90B model on eight NVIDIA H200 Tensor Core GPUs, each with 141 GB of fast HBM3e memory, connected through NVLink and NVLink Switch, which provide 900 GB/s of GPU-to-GPU bandwidth.
Maximum Throughput Performance – Output Tokens/Second, Eight NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths / Image Size | 8,000 / 2,000 / 1120×1120 | 20,000 / 2,000 / 1120×1120 | 60,000 / 2,000 / 1120×1120 |
|---|---|---|---|
| BF16 Encoder with FP8 Decoder | 2,646 | 1,417 | 480 |
TensorRT optimized NIM for VLMs version 1.1.0 recipe. NIM server restarted between each ISL/OSL configuration to set an optimal KV cache split. Data measured on 11/14/2024. Output tokens/second is inclusive of time to generate the first token – tok/s = total generated tokens / total latency. DGX H200, normalized to 8 GPUs (by taking the TP profile maximizing throughput per GPU and multiplying that value by 8 to simulate a replica-parallel setup), batch size tuned for maximum node throughput, TensorRT Model Optimizer version 0.21 (pre-release), TensorRT-LLM version 0.16.0.dev, TensorRT version 10.4.0.
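As a worked example of the throughput metric and 8-GPU normalization described above, the snippet below uses placeholder numbers rather than measured data.

```python
# Placeholder arithmetic for the throughput metric used in the tables.
# Output tokens/second includes time to generate the first token, and per-GPU
# throughput is scaled to eight GPUs to model a replica-parallel deployment.
total_generated_tokens = 16 * 2_000   # e.g. 16 completed requests x 2,000 output tokens (illustrative)
total_latency_s = 120.0               # illustrative wall-clock time for the whole batch, in seconds

tokens_per_second = total_generated_tokens / total_latency_s
tp_size = 4                           # illustrative tensor-parallel size of the best-throughput profile
per_gpu = tokens_per_second / tp_size
normalized_to_8_gpus = per_gpu * 8    # replica-parallel normalization to eight GPUs
print(f"{tokens_per_second:.1f} tok/s, {normalized_to_8_gpus:.1f} tok/s normalized to 8 GPUs")
```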
Table 2 shows minimum latency performance using the same input and output sequence lengths and input image size.
Minimum Latency Performance – Output Tokens/Second, Eight NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths / Image Size | 8,000 / 2,000 / 1120×1120 | 20,000 / 2,000 / 1120×1120 | 60,000 / 2,000 / 1120×1120 |
|---|---|---|---|
| BF16 Encoder with FP8 Decoder | 64 | 63 | 55 |
TensorRT optimized NIM for VLMs version 1.1.0 recipe. Data measured on 11/4/2024. Output tokens/second is inclusive of time to generate the first token – tok/s = total generated tokens / total latency. DGX H200, TP8, batch size = 1, TensorRT Model Optimizer version 0.21 (pre-release), TensorRT-LLM version 0.16.0.dev, TensorRT version 10.4.0.
As these results show, NVIDIA H200 GPUs with TensorRT-optimized software deliver exceptional performance on the Llama 3.2 90B VLM in both latency-optimized and throughput-optimized scenarios.
Throughput performance of GeForce RTX 4090 with ONNX Runtime on NVIDIA RTX
For Windows deployments, NVIDIA has optimized the Llama 3.2 SLMs to run efficiently using the ONNX Runtime Generative API with a DirectML backend. Performance measurements are made using the model checkpoint available on the NGC catalog: a version of the Llama 3.2 3B Instruct model quantized to AWQ INT4 using AutoAWQ and converted to ONNX with the ONNX Runtime Generative API.
Maximum Throughput Performance – Output Tokens/Second, NVIDIA GeForce RTX 4090 GPU

| Input / Output Sequence Lengths | 100 / 100 | 2,000 / 100 | 4,000 / 100 |
|---|---|---|---|
| ONNX Runtime GenAI with DirectML, BS=1 | 253 | 203 | 165 |
| ONNX Runtime GenAI with DirectML, BS=4 | 615 | 374 | 251 |
ONNX Runtime Generative API with DirectML; data measured on 10/07/2024. Output tokens/second is inclusive of time to generate the first token – tok/s = total generated tokens / total latency. GeForce RTX 4090 GPU.
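For local experimentation, a quantized ONNX checkpoint like this can be driven with the onnxruntime-genai Python bindings, as in the sketch below. The model folder path and prompt are placeholders, and the generation loop follows the pattern of recent library examples; exact API details vary between onnxruntime-genai releases.

```python
# Illustrative generation loop with the ONNX Runtime Generative API (onnxruntime-genai).
# The model folder is a placeholder for a downloaded INT4 Llama 3.2 3B ONNX checkpoint;
# API details may differ slightly across library versions.
import onnxruntime_genai as og

model = og.Model("/path/to/llama-3.2-3b-instruct-onnx-int4")  # assumed local checkpoint path
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Explain what DirectML is in one paragraph."))

# Decode and print tokens as they are produced.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```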
Better performance on Llama 3.2 across platforms
With the NVIDIA accelerated computing platform, you can build models and supercharge your applications with the most performant Llama 3.2 models on any platform—from the data center and cloud to local workstations. Enterprises seeking the fastest time to value can use NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, which offers NVIDIA TensorRT optimized inference on Llama 3.2 and other models from NVIDIA and its partner ecosystem.
Acknowledgments
We would like to thank George Yuan, Alex Settle, and Chenjie Luo for their efforts in supporting this post.