As large language models (LLMs) continue to grow in size and complexity, so do the performance requirements for serving them quickly and cost-effectively. Delivering high LLM inference performance requires an efficient parallel computing architecture and a flexible, highly optimized software stack.
Recently, NVIDIA Hopper GPUs running NVIDIA TensorRT-LLM inference software set new LLM performance records on industry-standard, peer-reviewed MLPerf Inference v4.0 benchmarks, demonstrating the capabilities of the NVIDIA full-stack inference platform.
LLMs based on a mixture-of-experts (MoE) architecture have also emerged, offering potential advantages in model capacity, training cost, and first-token serving latency compared to LLMs that use dense architectures. The popular Mixtral 8x7B open-weights model, developed by Mistral AI, employs an MoE architecture and has shown impressive capabilities.
In this post, we show how NVIDIA H100 Tensor Core GPUs, based on the NVIDIA Hopper GPU architecture, and TensorRT-LLM software deliver outstanding performance on Mixtral 8x7B.
Mixtral 8x7B performance with NVIDIA H100 GPUs and TensorRT-LLM
When deploying LLMs at scale, cloud services commonly set query response time targets and then seek to maximize the number of user queries that can be served in parallel within those constraints by grouping queries into batches. TensorRT-LLM supports in-flight batching, which replaces completed requests with new requests during LLM serving rather than waiting for the entire batch to finish, improving GPU utilization and throughput.
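As a rough illustration (not TensorRT-LLM's actual scheduler), the following Python sketch simulates the core idea of in-flight batching: finished requests are evicted from the batch and immediately backfilled from the queue, so batch slots stay occupied even when output lengths vary. The request lengths and batch size are made up for the example.

```python
# Minimal, illustrative sketch of in-flight (continuous) batching.
# This is not TensorRT-LLM's scheduler; it only shows the core idea:
# finished requests are evicted and replaced every decoding step,
# instead of waiting for the whole batch to drain.
from collections import deque

def inflight_batching(requests, max_batch_size):
    """requests: list of output lengths (tokens to generate) per request."""
    queue = deque(requests)
    active = {}            # request id -> tokens remaining
    next_id = 0
    steps = 0
    while queue or active:
        # Refill free batch slots with waiting requests.
        while queue and len(active) < max_batch_size:
            active[next_id] = queue.popleft()
            next_id += 1
        # One decoding step produces one token for every active request.
        steps += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # evict immediately; slot is reused next step
    return steps

# Example: mixed output lengths keep all slots busy under in-flight batching.
print(inflight_batching([8, 32, 4, 16, 64, 8], max_batch_size=4))
```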
Selecting a response time budget requires carefully balancing throughput and user interactivity, as increases in one translate into reductions in the other. Plots of throughput versus latency are helpful tools for selecting the optimal deployment scenario.
Often, there's a steep part of the throughput-latency curve where large improvements in throughput can be had for only small increases in response time. For production deployments, choosing latency targets within this window can yield a great user experience at a relatively low deployment cost.
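If you have measured (latency, throughput) points for a deployment, a simple heuristic can locate the end of that steep region. The sketch below is illustrative only: the data points and the `min_gain_per_ms` threshold are hypothetical, not taken from the figures in this post.

```python
# Hedged sketch: given measured (latency, throughput) points for a deployment,
# pick the last point where throughput is still growing "steeply" with latency.
# The data below is made up for illustration; real curves come from benchmarking.

def pick_operating_point(points, min_gain_per_ms=0.05):
    """points: list of (latency_ms, requests_per_sec), sorted by latency."""
    best = points[0]
    for (l0, t0), (l1, t1) in zip(points, points[1:]):
        marginal_gain = (t1 - t0) / (l1 - l0)   # extra req/s per extra ms
        if marginal_gain < min_gain_per_ms:
            break                               # past the steep part of the curve
        best = (l1, t1)
    return best

latency_throughput = [(300, 5.0), (400, 15.0), (500, 23.0), (700, 25.0), (1000, 26.0)]
print(pick_operating_point(latency_throughput))  # returns (500, 23.0): gains flatten after that
```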
H100 throughput vs. response latency
Figure 1 plots throughput, measured in total requests processed per second, against the time to generate a response to each request, using two H100 SXM GPUs running TensorRT-LLM in both FP16 and FP8 precision.
Figure 1. Mixtral 8x7B results measured with tensor parallelism = 2, running TensorRT-LLM v0.10 and CUDA compilation tools release 12.4 (v12.4.131), on two NVIDIA H100 SXM GPUs. Average input sequence length (ISL) = 573; average output sequence length (OSL) = 50.
The NVIDIA Hopper architecture is equipped with fourth-generation Tensor Cores, which support FP8 data type at twice the peak computational rate compared to either FP16 or BF16. TensorRT-LLM software provides support for FP8 quantization, enabling you to convert model weights into FP8 and automatically use highly tuned FP8 kernels.
The performance benefits of FP8 are significant, enabling the H100 GPU to deliver nearly 50% more throughput within a response latency limit of 0.5 seconds. With the additional performance provided by FP8, you can increase throughput and reduce cost while maintaining the same user experience. Or, for a given throughput, you can deliver even faster response times, improving the user experience at about the same cost.
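For intuition about what FP8 conversion involves, here is a hedged NumPy sketch of per-tensor E4M3-style quantization: compute a scale from the weight range, round onto a coarse FP8-like grid, and scale back. This is not TensorRT-LLM's implementation, which uses calibrated scales and fused FP8 kernels; it only illustrates the basic scale-and-round idea.

```python
# Hedged sketch of per-tensor FP8 (E4M3) weight quantization in plain NumPy.
# TensorRT-LLM's actual FP8 path uses calibrated scales and fused FP8 kernels;
# this only illustrates the scale / round / rescale idea behind the conversion.
import numpy as np

E4M3_MAX = 448.0   # largest finite magnitude representable in FP8 E4M3

def round_to_e4m3(x):
    """Crude E4M3 rounding: clamp to range, keep ~3 mantissa bits.
    Ignores subnormals and exponent limits; for illustration only."""
    x = np.clip(x, -E4M3_MAX, E4M3_MAX)
    exp = np.floor(np.log2(np.maximum(np.abs(x), 1e-12)))
    step = 2.0 ** (exp - 3)                # 3 mantissa bits -> 8 steps per binade
    return np.round(x / step) * step

def fp8_quantize_dequantize(weights):
    scale = np.abs(weights).max() / E4M3_MAX   # per-tensor scale factor
    return round_to_e4m3(weights / scale) * scale, scale

w = np.random.randn(4096, 4096).astype(np.float32)
w_fp8, scale = fp8_quantize_dequantize(w)
print("scale:", scale, "max abs error:", np.abs(w - w_fp8).max())
```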
H100 throughput vs. mean time per output token
Figure 2 shows the performance of H100 GPUs and TensorRT-LLM when running in streaming mode. In this use case, rather than waiting for the full inference request to complete before reporting total latency, results are reported back as soon as each output token is produced. This makes it possible to record the time taken per output token rather than the time required for the entire request.
Figure 2. Mixtral 8x7B results measured with tensor parallelism = 2, running TensorRT-LLM v0.10 and CUDA compilation tools release 12.4 (v12.4.131), on two NVIDIA H100 SXM GPUs. Average ISL = 573; average OSL = 50.
NVIDIA H100 GPUs and TensorRT-LLM software also deliver great performance in streaming mode, achieving high throughput even with a low average time per output token. At a mean time per output token of just 0.016 seconds—or more than 60 tokens per second flying across the screen for each user—a pair of H100 GPUs running TensorRT-LLM with FP8 precision achieves a high throughput of 38.4 requests per second.
Again, by using FP8, you can either improve responsiveness for a given deployment cost or increase the number of users that can be served at a given level of responsiveness, reducing cost.
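For reference, mean time per output token in streaming mode can be computed from per-token arrival timestamps on the client side. The sketch below uses simulated timestamps spaced 16 ms apart (matching the figure above); a real client would record the timestamps as tokens stream in.

```python
# Hedged sketch: how mean time per output token (TPOT) can be derived from
# per-token timestamps of a streaming response. The token stream below is a
# stand-in; a real client would record timestamps as tokens arrive from the server.
import time

def mean_time_per_output_token(token_timestamps):
    """token_timestamps: arrival times (seconds) of each streamed token,
    starting with the first generated token (so time to first token is excluded)."""
    if len(token_timestamps) < 2:
        return float("nan")
    deltas = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    return sum(deltas) / len(deltas)

# Simulated stream: pretend tokens arrive every ~16 ms after the first token.
start = time.perf_counter()
timestamps = [start + 0.016 * i for i in range(50)]
print(f"mean time per output token: {mean_time_per_output_token(timestamps):.3f} s")
# At 0.016 s/token, each user sees more than 60 tokens/s (1 / 0.016 ≈ 62.5).
```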
H100 throughput without latency constraints
Finally, we measured inference throughput without any latency constraints. While online scenarios are more common for real-time use cases, offline scenarios, such as data labeling, sentiment analysis, or summarization, are a good measure of the peak achievable throughput of a platform.
Figure 3 shows offline throughput at various batch sizes, using input and output sequence lengths of 128.
Figure 3. Mixtral 8x7B results measured with tensor parallelism = 2, running TensorRT-LLM v0.10 and CUDA compilation tools release 12.4 (v12.4.131), on two NVIDIA H100 SXM GPUs, in FP16 and FP8 precisions. ISL = 128; OSL = 128.
As the batch size increases, the workload becomes increasingly compute-intensive, amplifying the benefits of the greater FP8 throughput of the NVIDIA Hopper architecture. The use of FP8 also reduces memory footprint, enabling even larger batches to be processed. At a batch size of 1,024, inference throughput reaches nearly 21,000 tokens/second with FP8.
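As a sanity check on how such numbers are typically derived, offline throughput is simply total generated tokens divided by wall-clock time. The batch completion time in the sketch below is hypothetical, back-derived from the quoted throughput rather than measured.

```python
# Hedged sketch of how offline (latency-unconstrained) throughput is typically
# computed: total generated tokens divided by wall-clock time for the batch.
# The timing value below is hypothetical; real numbers come from benchmarking.

def offline_tokens_per_second(batch_size, output_len, batch_seconds):
    return batch_size * output_len / batch_seconds

# Example: a batch of 1,024 requests, 128 output tokens each.
# If the batch completed in ~6.3 s, throughput would be near 21,000 tokens/s,
# the ballpark of the FP8 result quoted above (the 6.3 s figure is hypothetical).
print(offline_tokens_per_second(batch_size=1024, output_len=128, batch_seconds=6.3))
```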
NVIDIA TensorRT-LLM plus Mixtral 8x7B
TensorRT-LLM is an open-source library for optimizing inference for LLMs such as Mixtral. It provides the latest performance optimizations for the most popular LLMs through a simple, open-source Python API. These include general LLM optimizations such as optimized attention kernels, KV caching techniques, and FP8 or INT4 AWQ quantization, all without sacrificing accuracy.
Mixtral deployed with TensorRT-LLM also benefits from techniques built specifically for MoE models, including expert parallelism (EP) and optimized expert kernels. For maximum GPU utilization and workload balancing, TensorRT-LLM additionally supports a hybrid of expert and tensor parallelism for the Mixtral MoE model.
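To make the parallelism options concrete, here is a hedged sketch of how eight experts might be laid out across two GPUs under pure expert parallelism versus pure tensor parallelism. This mirrors the general concepts only, not TensorRT-LLM's actual sharding code.

```python
# Hedged sketch of how MoE weights could be laid out across 2 GPUs.
# This mirrors the general idea of expert parallelism (EP) vs. tensor
# parallelism (TP), not TensorRT-LLM's actual sharding implementation.

NUM_EXPERTS, NUM_GPUS = 8, 2

def expert_parallel_layout():
    """Pure EP: each GPU owns whole experts; tokens are routed between GPUs."""
    return {gpu: [e for e in range(NUM_EXPERTS) if e % NUM_GPUS == gpu]
            for gpu in range(NUM_GPUS)}

def tensor_parallel_layout():
    """Pure TP: every GPU holds a slice (half here) of every expert's weights."""
    return {gpu: [(e, f"shard {gpu}/{NUM_GPUS}") for e in range(NUM_EXPERTS)]
            for gpu in range(NUM_GPUS)}

print("EP:", expert_parallel_layout())
print("TP:", tensor_parallel_layout())
# A hybrid scheme mixes both: groups of experts are assigned to GPU groups (EP),
# and each expert inside a group is sharded across that group's GPUs (TP),
# which helps balance load when some experts receive more tokens than others.
```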
Mixtral with TensorRT-LLM can be hosted with NVIDIA Triton Inference Server software. For more information about supported models, features, and optimizations, see the TensorRT-LLM GitHub repo.
MoE improves accuracy, generalization, and scalability
MoE combines the outputs of multiple specialized experts, which are smaller sub-networks within a single model. This combination of experts improves accuracy, generalization, and scalability.
Expert accuracy
Each expert is trained on a specific dataset and skill. Prompt tokens are dynamically routed, or gated, to specific experts. Training each expert on a specific dataset or subset of the data increases accuracy in that domain. For example, one expert can focus on code completion, another on mathematics, and yet another on grammar and language semantics. Experts can be made as specialized as the training data allows.
Ensembled generalization
Combining expert knowledge helps improve generalization. Each expert can provide its own response to a prompt, contributing its own strengths and specificity. The MoE architecture weights each expert's output by its relevance to the prompt, and the weighted outputs are combined to produce the final result. Having specialized experts improves accuracy and fit, while weighting and combining the experts improves generalization.
Sparsity for scalability
Mixtral is a sparse MoE (SMoE) LLM. In an SMoE, only a subset of the experts is activated for each input token. Activating fewer experts per token reduces the compute required and increases efficiency. TensorRT-LLM further reduces latency and increases throughput, as shown at the beginning of this post.
In summary, specialized experts improve accuracy, weight-averaging the experts improves generalization, and sparsely selecting experts improves scalability. The sketch below ties these three ideas together.
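Here is a minimal NumPy sketch of a sparse MoE feed-forward layer: route each token, keep only the top-k experts, and combine their outputs with renormalized router weights. The shapes and single-matrix "experts" are illustrative, not Mixtral's actual FFN blocks.

```python
# Minimal sketch of a sparse MoE feed-forward layer in NumPy, combining
# routing, top-k expert selection, and weighted combination of expert outputs.
# The tiny shapes and one-matrix "experts" are illustrative only.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_moe_layer(tokens, router_w, expert_ws, top_k=2):
    """tokens: [n_tokens, d]; router_w: [d, n_experts]; expert_ws: list of [d, d]."""
    logits = tokens @ router_w                         # routing scores per token
    probs = softmax(logits)
    outputs = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        top = np.argsort(probs[t])[-top_k:]            # indices of top-k experts
        weights = probs[t, top] / probs[t, top].sum()  # renormalize over top-k
        for w, e in zip(weights, top):
            outputs[t] += w * (tokens[t] @ expert_ws[e])  # only k experts run
    return outputs

d, n_experts, n_tokens = 16, 8, 4
rng = np.random.default_rng(0)
out = sparse_moe_layer(
    tokens=rng.normal(size=(n_tokens, d)),
    router_w=rng.normal(size=(d, n_experts)),
    expert_ws=[rng.normal(size=(d, d)) for _ in range(n_experts)],
)
print(out.shape)   # (4, 16): each token touched only 2 of the 8 experts
```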
To quote the Mixtral 8x7B paper:
“We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e., experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference.”
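A back-of-the-envelope calculation reproduces those parameter counts. The dimensions below (hidden size 4096, FFN size 14336, 32 layers, 8 experts, top-2 routing, 32K vocabulary) are taken from the publicly released Mixtral 8x7B configuration; the attention and embedding terms are rough estimates, so treat the result as an approximation.

```python
# Back-of-the-envelope check of the 47B total vs. 13B active parameter figures,
# using dimensions from the publicly released Mixtral 8x7B configuration.
# Attention and embedding terms are rough estimates; norms and the router
# (which are comparatively tiny) are ignored.

hidden, ffn, layers, experts, active_experts, vocab = 4096, 14336, 32, 8, 2, 32000

expert_params = 3 * hidden * ffn                        # gate, up, down projections
attn_params = hidden * (2 * hidden + 2 * hidden // 4)   # Q,O full; K,V grouped (GQA)
per_layer_shared = attn_params                          # non-expert weights per layer
embeddings = 2 * vocab * hidden                         # input + output embeddings

total = layers * (experts * expert_params + per_layer_shared) + embeddings
active = layers * (active_experts * expert_params + per_layer_shared) + embeddings

print(f"total  ≈ {total / 1e9:.1f}B parameters")   # ≈ 46.7B, rounded to 47B in the paper
print(f"active ≈ {active / 1e9:.1f}B parameters")  # ≈ 12.9B, rounded to 13B in the paper
```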
Key takeaways
NVIDIA Hopper GPUs, running TensorRT-LLM, deliver outstanding inference performance for the latest LLMs, including MoE models like Mixtral 8x7B. NVIDIA also continues to optimize its software stack, delivering both continuous performance gains as well as rapid support for the latest models. This helps to minimize the total cost of ownership and increase return on investment.
In addition to continuous software advances, NVIDIA is quickly innovating across silicon and systems, providing customers with even more performance. Products based on the groundbreaking NVIDIA Blackwell architecture will be available from partners later this year. NVIDIA GB200 NVL72, which combines 36 NVIDIA Grace CPUs with 72 NVIDIA Blackwell GPUs in a rack-scale architecture, will deliver large speedups for real-time 1.8T-parameter MoE LLM inference.
Try Mixtral with NVIDIA TensorRT-LLM
For a sample Python script to try Mixtral with TensorRT-LLM, see the /NVIDIA/TensorRT-LLM GitHub repo. The example covers downloading the weights and building the engine, and showcases TensorRT-LLM features such as parallelism, normalization, quantization, and FP8 post-training quantization.
Acknowledgments
We would like to thank Bryce Long and Flora Tasse of the inference benchmarking team, who contributed to this post.