As large language models (LLMs) continue to grow in size and complexity, multi-GPU compute is a must-have to deliver the low latency and high throughput that real-time generative AI applications demand.
Performance depends both on the ability of the combined GPUs to process requests as “one mighty GPU” with ultra-fast GPU-to-GPU communication, and on advanced software that can take full advantage of the multiple GPUs. By splitting the calculations of each model layer across the available GPUs using a technique called tensor parallelism, in tandem with advanced algorithms like speculative decoding, token generation latency can be reduced, delivering an interactive user experience.
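As a minimal, CPU-only sketch of the idea behind tensor parallelism (NumPy stands in for GPU math; a real deployment shards the weights across GPUs and combines partial results over NVLink with NCCL collectives), a layer's weight matrix can be split column-wise so each GPU computes only its slice of the output:

```python
# Conceptual sketch of tensor parallelism, not NVIDIA's implementation:
# a layer's weight matrix is split column-wise across "GPUs", each computes
# a partial result, and the shards are concatenated (an all-gather in a
# real multi-GPU setup).
import numpy as np

num_gpus = 8
hidden, ffn = 1024, 4096

x = np.random.randn(1, hidden)          # one token's activations
W = np.random.randn(hidden, ffn)        # full layer weight

# Each GPU holds only a 1/num_gpus column slice of W.
shards = np.split(W, num_gpus, axis=1)

# Each GPU multiplies the same activations by its own shard.
partials = [x @ shard for shard in shards]

# Concatenating the partial outputs recovers the full result.
y_parallel = np.concatenate(partials, axis=1)
assert np.allclose(y_parallel, x @ W)
```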
For very low latency Llama 3.1 serving, cloud services can use a full NVIDIA HGX H200 server, which incorporates eight H200 Tensor Core GPUs and four all-to-all NVLink Switch chips. Each GPU within the server can communicate with any other GPU at the full 900 GB/s of NVLink bandwidth via NVLink Switch. High GPU-to-GPU fabric bandwidth is required to keep multi-GPU communication from becoming the bottleneck in interactive use cases.
To efficiently implement these optimization algorithms on NVIDIA HGX H200 systems, NVIDIA TensorRT-LLM is used. TensorRT-LLM is an open-source TensorRT library that delivers state-of-the-art inference performance on the latest LLMs using a variety of techniques, including tensor parallelism and speculative decoding.
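As a hedged sketch of serving a model across all eight GPUs with TensorRT-LLM's high-level Python LLM API (parameter names such as tensor_parallel_size and max_tokens follow that API but may differ between versions; the model name below is illustrative):

```python
# Sketch: shard Llama 3.1 70B across the eight H200 GPUs in an HGX server
# using TensorRT-LLM's high-level LLM API. Exact argument names may vary
# by TensorRT-LLM version.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model path
    tensor_parallel_size=8,                     # split each layer across 8 GPUs
)

params = SamplingParams(max_tokens=128)
for output in llm.generate(["What is speculative decoding?"], params):
    print(output.outputs[0].text)
```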
Upcoming TensorRT-LLM optimizations, including improvements to a speculative decoding algorithm called Medusa, deliver outstanding low latency performance on HGX H200: 268 tokens/second/user on Llama 3.1 70B and 108 tokens/second/user on Llama 3.1 405B.
Medusa boosts token generation by up to 1.9x on NVIDIA HGX H200
Transformer-based LLMs are auto-regressive, meaning that tokens must be generated sequentially, limiting throughput to just one token per generation step. Typically, during LLM inference, the rate at which a single token is generated depends on how quickly model weights can be read from GPU memory. The workload is therefore memory-bandwidth bound and can leave the substantial Tensor Core compute capabilities of H200 GPUs underutilized.
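As an illustrative back-of-the-envelope estimate (assuming roughly 70 GB of FP8 weights for Llama 3.1 70B, the 4.8 TB/s of HBM3e bandwidth per H200, and ignoring KV cache and communication traffic), each decode step must stream every weight byte from memory:

$$
t_{\text{step}} \gtrsim \frac{70\,\text{GB}}{8 \times 4.8\,\text{TB/s}} \approx 1.8\,\text{ms}
\quad\Rightarrow\quad
\lesssim 550\ \text{tokens/s for a single user without speculation.}
$$

In other words, the GPUs spend each step moving weights rather than doing math, which is exactly the headroom that speculative decoding exploits by producing several tokens per weight pass.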
Speculative decoding is a technique that increases token generation throughput per generation step by using a “draft model” to predict multiple candidate tokens beyond the next token. The target LLM then batches these candidates and validates them in parallel alongside the next token, making more effective use of available GPU compute resources. If the target LLM accepts any of the candidate sequences, multiple tokens are produced in a single generation step, accelerating token generation.
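A minimal, CPU-only sketch of the greedy acceptance rule is below; draft and target here are placeholder callables standing in for real models, and a production system would verify all candidates with one batched forward pass (and use probabilistic acceptance when sampling) rather than calling the target per position.

```python
# Toy sketch of greedy speculative decoding with a separate draft model.
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # maps a token sequence to the next token

def speculative_step(seq: List[Token], draft: Model, target: Model, k: int = 4) -> List[Token]:
    # 1. Draft model proposes k candidate tokens autoregressively (cheap).
    candidates: List[Token] = []
    for _ in range(k):
        candidates.append(draft(seq + candidates))

    # 2. Target model checks each proposed position; on a GPU this is a
    #    single parallel pass over the whole candidate sequence.
    accepted: List[Token] = []
    for i in range(k):
        expected = target(seq + candidates[:i])
        if expected == candidates[i]:
            accepted.append(candidates[i])     # draft matched the target
        else:
            accepted.append(expected)          # correction token, then stop
            return seq + accepted

    # 3. All candidates accepted: the target's pass also yields one bonus token.
    accepted.append(target(seq + candidates))
    return seq + accepted
```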
Medusa, described in this paper, is a speculative decoding algorithm that uses the original model as its own draft model, avoiding the system complexity and distribution discrepancy of a separate draft model. The technique adds extra decoding “heads”, called Medusa heads, that predict candidate tokens beyond the next token: each Medusa head generates a distribution of tokens one position beyond the previous one. A tree-based attention mechanism then samples a set of candidate sequences for the original model to validate. The number of parallel candidate sequences is called the draft length, and the average number of tokens accepted per generation step is the acceptance rate; a higher acceptance rate increases overall token generation throughput.
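The toy NumPy sketch below illustrates how Medusa heads generate the candidate tree: each head is a small projection on the base model's last hidden state that predicts a token further ahead, and combining each head's top-k tokens yields a tree of candidate continuations for the base model to verify in one step. The real algorithm uses trained heads and tree attention, and prunes the tree rather than enumerating the full Cartesian product shown here.

```python
# Toy sketch of Medusa-style candidate generation (random weights, for
# illustration only): each "Medusa head" projects the base model's last
# hidden state and proposes its top-k tokens for a later position.
import itertools
import numpy as np

hidden_size, vocab_size = 512, 1000
num_heads, topk = 3, 2                       # 3 Medusa heads, top-2 per head

rng = np.random.default_rng(0)
hidden = rng.standard_normal(hidden_size)    # last hidden state of the base model
heads = [rng.standard_normal((hidden_size, vocab_size)) for _ in range(num_heads)]

# Each head proposes its top-k tokens for positions +1, +2, +3 ...
proposals = [np.argsort(hidden @ W)[-topk:] for W in heads]

# Candidate sequences are paths through the proposal tree; here the full
# Cartesian product gives topk ** num_heads = 8 draft sequences.
candidates = list(itertools.product(*proposals))
print(len(candidates), "candidate continuations to verify in one step")
```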
With Medusa, an HGX H200 can produce 268 tokens per second per user for Llama 3.1 70B and 108 tokens per second per user for Llama 3.1 405B. This is over 1.5x faster on Llama 3.1 70B and over 1.9x faster on Llama 3.1 405B than without Medusa. Although the Medusa acceptance rate varies between tasks depending on how the heads are fine-tuned, its performance gains generalize across a wide range of tasks.
Medusa heads for both Llama 3.1 70B and Llama 3.1 405B were trained using the NVIDIA TensorRT Model Optimizer integration with the NVIDIA NeMo framework. The Medusa heads were trained with a frozen backbone, ensuring that using Medusa yields accuracy identical to the base model.
NVIDIA full-stack innovation never stops
NVIDIA HGX H200 with NVLink Switch and TensorRT-LLM already delivers excellent real-time inference performance on the most popular and demanding community models. To continue improving user experiences and reducing inference cost, we relentlessly innovate across every layer of the technology stack: chips, systems, software libraries, algorithms, and more.
We look forward to sharing future updates on our low latency inference performance as both our platform and the LLM ecosystem advance.
This blog is part of a series; see Low Latency Inference Chapter 2: Blackwell is Coming. NVIDIA GH200 NVL32 with NVLink Switch Gives Signs of Big Leap in Time to First Token Performance.