As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand.
Performance depends both on the ability of the combined GPUs to process requests as “one mighty GPU” with ultra-fast GPU-to-GPU communication, and on advanced software that can take full advantage of the multiple GPUs. By splitting the calculations of each model layer across the available GPUs using a technique called tensor parallelism, in tandem with advanced algorithms like speculative decoding, token generation latency can be reduced, delivering an interactive user experience.
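As a minimal, CPU-only sketch of the idea behind tensor parallelism (NumPy stands in for GPU math; a real deployment shards the weights across GPUs and combines partial results over NVLink with NCCL collectives), a layer's weight matrix can be split column-wise so each GPU computes only its slice of the output:

```python
# Conceptual sketch of tensor parallelism, not NVIDIA's implementation:
# a layer's weight matrix is split column-wise across "GPUs", each computes
# a partial result, and the shards are concatenated (an all-gather in a
# real multi-GPU setup).
import numpy as np

num_gpus = 8
hidden, ffn = 1024, 4096

x = np.random.randn(1, hidden)          # one token's activations
W = np.random.randn(hidden, ffn)        # full layer weight

# Each GPU holds only a 1/num_gpus column slice of W.
shards = np.split(W, num_gpus, axis=1)

# Each GPU multiplies the same activations by its own shard.
partials = [x @ shard for shard in shards]

# Concatenating the partial outputs recovers the full result.
y_parallel = np.concatenate(partials, axis=1)
assert np.allclose(y_parallel, x @ W)
```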
For very low latency Llama 3.1 serving, cloud services can use a full NVIDIA HGX H200 server, which incorporates eight H200 Tensor Core GPUs and four all-to-all NVLink Switch chips. Each GPU within the server can communicate with any other GPU at the full 900 GB/s of NVLink bandwidth via NVLink Switch. High GPU-to-GPU fabric bandwidth is required to keep multi-GPU communication from becoming the bottleneck in interactive use cases.
To efficiently implement these optimization algorithms on NVIDIA HGX H200 systems, NVIDIA TensorRT-LLM is used. TensorRT-LLM is an open-source TensorRT library that delivers state-of-the-art inference performance on the latest LLMs using a variety of techniques, including tensor parallelism and speculative decoding.
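As a hedged sketch of serving a model across all eight GPUs with TensorRT-LLM's high-level Python LLM API (parameter names such as tensor_parallel_size and max_tokens follow that API but may differ between versions; the model name below is illustrative):

```python
# Sketch: shard Llama 3.1 70B across the eight H200 GPUs in an HGX server
# using TensorRT-LLM's high-level LLM API. Exact argument names may vary
# by TensorRT-LLM version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model path
    tensor_parallel_size=8,                     # split each layer across 8 GPUs
)

params = SamplingParams(max_tokens=128)
for output in llm.generate(["What is speculative decoding?"], params):
    print(output.outputs[0].text)
```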
Upcoming TensorRT-LLM optimizations, including improvements to a speculative decoding algorithm called Medusa, deliver outstanding low latency performance on HGX H200: 268 tokens/second/user on Llama 3.1 70B and 108 tokens/second/user on Llama 3.1 405B.
Medusa boosts token generation by up to 1.9x on NVIDIA HGX H200
Transformer-based LLMs are auto-regressive, meaning that tokens must be generated sequentially, limiting throughput to just one token per generation step. Typically, during LLM inference, the rate at which a single token is generated depends on how quickly model weights can be read from GPU memory. The workload is therefore memory-bandwidth bound and can leave the substantial Tensor Core compute capabilities of H200 GPUs underutilized.
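As an illustrative back-of-the-envelope estimate (assuming roughly 70 GB of FP8 weights for Llama 3.1 70B, the 4.8 TB/s of HBM3e bandwidth per H200, and ignoring KV cache and communication traffic), each decode step must stream every weight byte from memory:

$$
t_{\text{step}} \gtrsim \frac{70\,\text{GB}}{8 \times 4.8\,\text{TB/s}} \approx 1.8\,\text{ms}
\quad\Rightarrow\quad
\lesssim 550\ \text{tokens/s for a single user without speculation.}
$$

In other words, the GPUs spend each step moving weights rather than doing math, which is exactly the headroom that speculative decoding exploits by producing several tokens per weight pass.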
Speculative decoding is a technique that increases token generation throughput per generation step by using a “draft model” to predict multiple candidate tokens beyond the next token. The target LLM then batches these candidates and validates them in parallel alongside the next token, making more effective use of available GPU compute resources. If the target LLM accepts any of the candidate sequences, multiple tokens are produced in a single generation step, accelerating token generation.
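A minimal, CPU-only sketch of the greedy acceptance rule is below; draft and target here are placeholder callables standing in for real models, and a production system would verify all candidates with one batched forward pass (and use probabilistic acceptance when sampling) rather than calling the target per position.

```python
# Toy sketch of greedy speculative decoding with a separate draft model.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # maps a token sequence to the next token

def speculative_step(seq: List[Token], draft: Model, target: Model, k: int = 4) -> List[Token]:
    # 1. Draft model proposes k candidate tokens autoregressively (cheap).
    candidates: List[Token] = []
    for _ in range(k):
        candidates.append(draft(seq + candidates))

    # 2. Target model checks each proposed position; on a GPU this is a
    #    single parallel pass over the whole candidate sequence.
    accepted: List[Token] = []
    for i in range(k):
        expected = target(seq + candidates[:i])
        if expected == candidates[i]:
            accepted.append(candidates[i])     # draft matched the target
        else:
            accepted.append(expected)          # correction token, then stop
            return seq + accepted

    # 3. All candidates accepted: the target's pass also yields one bonus token.
    accepted.append(target(seq + candidates))
    return seq + accepted
```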
Medusa, described in this paper, is a speculative decoding algorithm that uses the original model as its own draft model, avoiding the system complexity and distribution discrepancy of a separate draft model. The technique adds extra decoding “heads”, called Medusa heads, that predict candidate tokens beyond the next token: each Medusa head generates a distribution of tokens one position beyond the previous one. A tree-based attention mechanism then samples a set of candidate sequences for the original model to validate. The number of parallel candidate sequences is called the draft length, and the average number of tokens accepted per generation step is the acceptance rate; a higher acceptance rate increases overall token generation throughput.
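The toy NumPy sketch below illustrates how Medusa heads generate the candidate tree: each head is a small projection on the base model's last hidden state that predicts a token further ahead, and combining each head's top-k tokens yields a tree of candidate continuations for the base model to verify in one step. The real algorithm uses trained heads and tree attention, and prunes the tree rather than enumerating the full Cartesian product shown here.

```python
# Toy sketch of Medusa-style candidate generation (random weights, for
# illustration only): each "Medusa head" projects the base model's last
# hidden state and proposes its top-k tokens for a later position.
import itertools
import numpy as np

hidden_size, vocab_size = 512, 1000
num_heads, topk = 3, 2                       # 3 Medusa heads, top-2 per head

rng = np.random.default_rng(0)
hidden = rng.standard_normal(hidden_size)    # last hidden state of the base model
heads = [rng.standard_normal((hidden_size, vocab_size)) for _ in range(num_heads)]

# Each head proposes its top-k tokens for positions +1, +2, +3 ...
proposals = [np.argsort(hidden @ W)[-topk:] for W in heads]

# Candidate sequences are paths through the proposal tree; here the full
# Cartesian product gives topk ** num_heads = 8 draft sequences.
candidates = list(itertools.product(*proposals))
print(len(candidates), "candidate continuations to verify in one step")
```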
With Medusa, an HGX H200 can produce 268 tokens per second per user for Llama 3.1 70B and 108 tokens per second per user for Llama 3.1 405B. This is over 1.5x faster on Llama 3.1 70B and over 1.9x faster on Llama 3.1 405B than without Medusa. Although the Medusa acceptance rate varies between tasks depending on how the heads are fine-tuned, its performance gains generalize across a wide range of tasks.
Medusa heads for both Llama 3.1 70B and Llama 3.1 405B were trained using the NVIDIA TensorRT Model Optimizer integration with the NVIDIA NeMo framework. The Medusa heads were trained with a frozen backbone, ensuring that using Medusa yields accuracy identical to the base model.
NVIDIA full-stack innovation never stops
NVIDIA HGX H200 with NVLink Switch and TensorRT-LLM already delivers excellent real-time inference performance on the most popular and demanding community models. To continue improving user experiences and reducing inference cost, we relentlessly innovate across every layer of the technology stack: chips, systems, software libraries, algorithms, and more.
We look forward to sharing future updates on our low latency inference performance as both our platform and the LLM ecosystem advance.
This blog is part of a series; see Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance.