Meta recently released its Llama 3.2 series of vision language models (VLMs), which come in 11B parameter and 90B parameter variants. These models are multimodal, supporting both text and image inputs. In addition, Meta has launched text-only small language model (SLM) variants of Llama 3.2 with 1B and 3B parameters. NVIDIA has optimized the Llama 3.2 collection of models for great performance and cost-efficient serving across millions of GPUs worldwide – from our most powerful data center and cloud GPUs to local NVIDIA RTX workstations and even low-power edge devices with NVIDIA Jetson.
Llama 3.2 VLMs support long context lengths of up to 128K text tokens as well as a single image input at a resolution of 1120 x 1120 pixels. To enable low latency responses for great user experiences, while also providing high throughput for cost-efficient serving of these models, the NVIDIA platform is optimized at every layer of the technology stack.
Similarly, the Llama 3.2 SLMs have been optimized to run well on the millions of NVIDIA RTX PCs and workstations worldwide. They have also been quantized to allow for local deployment on edge devices with NVIDIA Jetson. For more information, see Deploying Accelerated Llama 3.2 from the Edge to the Cloud.
This post describes the full-stack optimizations that enable high throughput and low latency serving of Llama 3.2 models.
Accelerating Llama 3.2 AI inference throughput
The Llama 3.2 11B and Llama 3.2 90B models pair a vision encoder with a text decoder. The vision encoder is optimized for high-performance inference using the NVIDIA TensorRT library, and the text decoder is optimized using the NVIDIA TensorRT-LLM library.
The visual information from the vision encoder is fused into the Llama text decoder with a cross-attention mechanism that is supported in TensorRT-LLM. This enables the Llama 3.2 VLMs to efficiently generate text by taking into account visual reasoning and understanding in context with the text input.
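At a high level, this fusion works as in the sketch below: the text decoder's hidden states act as queries that attend to the projected vision-encoder output. This is a minimal, PyTorch-style illustration with hypothetical module names and dimensions, not the TensorRT-LLM implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative cross-attention block: text hidden states attend to
    vision-encoder embeddings (names and sizes are hypothetical)."""
    def __init__(self, hidden_dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, text_hidden: torch.Tensor, vision_embeds: torch.Tensor) -> torch.Tensor:
        # Queries come from the text decoder; keys/values come from the projected
        # vision-encoder output, so generated text can condition on image features.
        attn_out, _ = self.cross_attn(query=text_hidden, key=vision_embeds, value=vision_embeds)
        return self.norm(text_hidden + attn_out)

# Example shapes: batch of 1, 32 text tokens, 1,601 vision patch embeddings (illustrative).
text_hidden = torch.randn(1, 32, 4096)
vision_embeds = torch.randn(1, 1601, 4096)
fused = CrossAttentionFusion()(text_hidden, vision_embeds)
print(fused.shape)  # torch.Size([1, 32, 4096])
```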
TensorRT supports the vision encoder in the BF16 data format. TensorRT-LLM supports the text decoder in both FP16 and BF16 formats. The official recipe released by Meta uses a BF16 text decoder, which is also used for our baseline performance measurements. To boost performance further, NVIDIA developed a custom FP8 post-training quantization (PTQ) recipe that leverages the FP8 support in the fourth-generation Tensor Cores of the NVIDIA Hopper architecture.
This recipe, available through the TensorRT Model Optimizer library, enables higher Llama 3.2 throughput and lower latency while delivering the same accuracy across numerous benchmarks including ScienceQA, OCRBench, TextVQA, and MMMU. This means that developers can now run the model more cost-effectively.
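For reference, a minimal sketch of FP8 PTQ with the TensorRT Model Optimizer (modelopt) Python API is shown below. The checkpoint path, calibration prompts, and configuration choices are illustrative assumptions and do not reproduce the exact recipe used for the published benchmarks.

```python
# Minimal FP8 post-training quantization sketch with TensorRT Model Optimizer (modelopt).
# The checkpoint path and calibration prompts are placeholders, not the published recipe.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/path/to/llama-text-decoder"  # assumed local Llama-family text decoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="cuda"
)

calib_prompts = [
    "Describe the contents of the attached chart.",
    "Summarize the key findings of the report.",
]  # tiny illustrative calibration set; a real recipe uses a representative dataset

def forward_loop(quant_model):
    # Run calibration prompts so modelopt can collect FP8 activation scaling factors.
    for prompt in calib_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            quant_model(**inputs)

# Apply the default FP8 PTQ configuration to weights and activations.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```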
Optimizations from TensorRT, TensorRT-LLM, and TensorRT Model Optimizer libraries are combined and available through production-ready deployments using NVIDIA NIM microservices.
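Once a Llama 3.2 vision NIM is running, it exposes an OpenAI-compatible endpoint. The snippet below is an illustrative client call; the base URL, image URL, and model identifier are assumptions that depend on how the microservice is deployed.

```python
# Illustrative request to a locally hosted Llama 3.2 vision NIM through its
# OpenAI-compatible endpoint; URL and model name depend on the deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta/llama-3.2-90b-vision-instruct",  # assumed NIM model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```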
Delivering high throughput and low latency
Table 1 shows maximum throughput performance, representing offline use cases, across a range of input and output sequence lengths with a single input image at the maximum supported resolution of 1120 x 1120 pixels. Using a system based on the NVIDIA HGX H200 platform, we run the Llama 3.2 90B model on eight NVIDIA H200 Tensor Core GPUs, each with 141 GB of fast HBM3e memory, connected through NVLink and NVLink Switch, which provide 900 GB/s of GPU-to-GPU bandwidth.
Maximum Throughput Performance – Output Tokens/Second, Eight NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths / Image Size | 8,000 / 2,000 / 1120×1120 | 20,000 / 2,000 / 1120×1120 | 60,000 / 2,000 / 1120×1120 |
|---|---|---|---|
| BF16 Encoder with FP8 Decoder | 2,646 | 1,417 | 480 |
TensorRT optimized NIM for VLMs version 1.1.0 recipe. NIM server restarted between each ISL/OSL configuration to set an optimal KV cache split. Data measured on 11/14/2024. Output tokens/second is inclusive of time to generate the first token – tok/s = total generated tokens / total latency. DGX H200, normalized to 8 GPUs (by taking the TP profile maximizing throughput per GPU and multiplying that value by 8 to simulate a replica-parallel setup), batch size tuned for maximum node throughput, TensorRT Model Optimizer version 0.21 (pre-release), TensorRT-LLM version 0.16.0.dev, TensorRT version 10.4.0.
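As a worked example of the throughput metric and 8-GPU normalization described above, the snippet below uses placeholder numbers rather than measured data.

```python
# Placeholder arithmetic for the throughput metric used in the tables.
# Output tokens/second includes time to generate the first token, and per-GPU
# throughput is scaled to eight GPUs to model a replica-parallel deployment.
total_generated_tokens = 16 * 2_000   # e.g. 16 completed requests x 2,000 output tokens (illustrative)
total_latency_s = 120.0               # illustrative wall-clock time for the whole batch, in seconds

tokens_per_second = total_generated_tokens / total_latency_s
tp_size = 4                           # illustrative tensor-parallel size of the best-throughput profile
per_gpu = tokens_per_second / tp_size
normalized_to_8_gpus = per_gpu * 8    # replica-parallel normalization to eight GPUs
print(f"{tokens_per_second:.1f} tok/s, {normalized_to_8_gpus:.1f} tok/s normalized to 8 GPUs")
```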
Table 2 shows minimum latency performance using the same input and output sequence lengths and input image size.
Minimum Latency Performance – Output Tokens/Second, Eight NVIDIA H200 Tensor Core GPUs

| Input / Output Sequence Lengths / Image Size | 8,000 / 2,000 / 1120×1120 | 20,000 / 2,000 / 1120×1120 | 60,000 / 2,000 / 1120×1120 |
|---|---|---|---|
| BF16 Encoder with FP8 Decoder | 64 | 63 | 55 |
TensorRT optimized NIM for VLMs version 1.1.0 recipe. Data measured on 11/4/2024. Output tokens/second is inclusive of time to generate the first token – tok/s = total generated tokens / total latency. DGX H200, TP8, batch size = 1, TensorRT Model Optimizer version 0.21 (pre-release), TensorRT-LLM version 0.16.0.dev, TensorRT version 10.4.0.
As these results show, NVIDIA H200 GPUs with TensorRT-optimized software deliver exceptional performance on the Llama 3.2 90B VLM in both latency-optimized and throughput-optimized scenarios.
Throughput performance of GeForce RTX 4090 with ONNX Runtime on NVIDIA RTX
For Windows deployments, NVIDIA has optimized the Llama 3.2 SLMs to run efficiently using the ONNX Runtime Generative API with a DirectML backend. Performance measurements are made using the model checkpoint available on the NGC catalog: a version of the Llama 3.2 3B Instruct model quantized to AWQ INT4 using AutoAWQ and converted to ONNX with the ONNX Runtime Generative API.
Maximum Throughput Performance – Output Tokens/Second, NVIDIA GeForce RTX 4090 GPU

| Input / Output Sequence Lengths | 100 / 100 | 2,000 / 100 | 4,000 / 100 |
|---|---|---|---|
| ONNX Runtime GenAI with DirectML, BS=1 | 253 | 203 | 165 |
| ONNX Runtime GenAI with DirectML, BS=4 | 615 | 374 | 251 |
ONNX Runtime Generative API with DirectML; data measured on 10/07/2024. Output tokens/second is inclusive of time to generate the first token – tok/s = total generated tokens / total latency. GeForce RTX 4090 GPU.
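For local experimentation, a quantized ONNX checkpoint like this can be driven with the onnxruntime-genai Python bindings, as in the sketch below. The model folder path and prompt are placeholders, and the generation loop follows the pattern of recent library examples; exact API details vary between onnxruntime-genai releases.

```python
# Illustrative generation loop with the ONNX Runtime Generative API (onnxruntime-genai).
# The model folder is a placeholder for a downloaded INT4 Llama 3.2 3B ONNX checkpoint;
# API details may differ slightly across library versions.
import onnxruntime_genai as og

model = og.Model("/path/to/llama-3.2-3b-instruct-onnx-int4")  # assumed local checkpoint path
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Explain what DirectML is in one paragraph."))

# Decode and print tokens as they are produced.
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```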
Better performance on Llama 3.2 across platforms
With the NVIDIA accelerated computing platform, you can build models and supercharge your applications with the most performant Llama 3.2 models on any platform—from the data center and cloud to local workstations. Enterprises seeking the fastest time to value can use NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, which offers NVIDIA TensorRT optimized inference on Llama 3.2 and other models from NVIDIA and its partner ecosystem.
Acknowledgments
We would like to thank George Yuan, Alex Settle, and Chenjie Luo for their efforts in supporting this post.