Large language models (LLMs) offer incredible new capabilities, expanding the frontier of what is possible with AI. However, their large size and unique execution characteristics can make them difficult to use in cost-effective ways.
NVIDIA has been working closely with leading companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML (now a part of Databricks), OctoML, Perplexity, Tabnine, and Together AI, to accelerate and optimize LLM inference.
Those innovations have been integrated into the open-source NVIDIA TensorRT-LLM software, available for NVIDIA Ampere, NVIDIA Ada Lovelace, and NVIDIA Hopper GPUs. TensorRT-LLM consists of the TensorRT deep learning compiler and includes optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives for groundbreaking performance on NVIDIA GPUs. It enables you to experiment with new LLMs, offering peak performance and quick customization capabilities, without requiring a deep knowledge of C++ or NVIDIA CUDA.
TensorRT-LLM improves ease of use and extensibility through an open-source modular Python API for defining, optimizing, and executing new architectures and enhancements as LLMs evolve, and can be customized easily.
For example, MosaicML has seamlessly added specific features that it needs on top of TensorRT-LLM and integrated them into inference serving. Naveen Rao, vice president of engineering at Databricks, said, “It has been an absolute breeze.”
“TensorRT-LLM is easy to use, feature-packed with streaming of tokens, in-flight batching, paged-attention, quantization, and more, and is efficient,” Rao said. “It delivers state-of-the-art performance for LLM serving using NVIDIA GPUs and allows us to pass on the cost savings to our customers.”
Summarizing articles is just one of the many applications of LLMs. The following benchmarks show performance improvements brought by TensorRT-LLM on the latest NVIDIA Hopper architecture.
The following figures reflect article summarization using NVIDIA A100 and NVIDIA H100 GPUs with CNN/Daily Mail, a well-known dataset for evaluating summarization performance.
In Figure 1, the NVIDIA H100 GPU alone is 4x faster than the A100 GPU. Adding TensorRT-LLM and its benefits, including in-flight batching, results in an 8x total increase to deliver the highest throughput.
Figure 1. Text summarization throughput, variable I/O length, CNN/Daily Mail dataset: A100 FP16 (PyTorch eager mode) | H100 FP8 | H100 FP8 with in-flight batching and TensorRT-LLM
On Llama 2—a popular language model released recently by Meta and used widely by organizations looking to incorporate generative AI—TensorRT-LLM can accelerate inference performance by 4.6x compared to A100 GPUs.
Figure 2. Llama 2 text summarization throughput, variable I/O length, CNN/Daily Mail dataset: A100 FP16 (PyTorch eager mode) | H100 FP8 | H100 FP8 with in-flight batching and TensorRT-LLM
TCO and energy efficiency improvements
Minimizing total cost of ownership (TCO) and energy consumption in the data center are key goals for customers adopting AI and LLMs in particular, given their explosive increase in computational requirements. Customers don’t just look at the cost of a single server when it comes to AI platform expenditures. Rather, they have to look at aggregate capital and operational costs:
- Cost of GPU servers
- Management head nodes (CPU servers to coordinate all the GPU servers)
- Networking equipment (fabric, Ethernet, and cabling)
- Data center IT staff and software
- Equipment maintenance
- Data center rent and electricity
Viewed holistically, across the actual costs incurred by a data center, significant performance speedups reduce equipment and maintenance requirements, leading to sizable capital and operational expense savings.
Figure 3 shows that an 8x performance speedup on small language models like GPT-J 6B leads to a 5.3x reduction in TCO and a 5.6x reduction in energy (electricity bill savings) over the A100 GPU baseline.
Similarly, on state-of-the-art LLMs like Llama 2, even with 70B parameters, you can realize a 4.6x performance speedup, which results in a 3x reduction in TCO and a 3.2x reduction in energy consumed compared to the A100 baseline.
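The relationship between a raw speedup and the smaller TCO reduction can be illustrated with a simplified fleet-level cost model. This is a sketch under stated assumptions, not NVIDIA's actual methodology: the 1.5x per-server cost factor below is a hypothetical value chosen only to show why an 8x speedup yields roughly a 5.3x TCO improvement rather than a full 8x.

```python
def tco_reduction(speedup, cost_per_server_new, cost_per_server_old, servers_old=100):
    """Ratio of old fleet cost to new fleet cost at equal total throughput.

    Hypothetical model: the server count needed scales inversely with the
    per-server speedup, and each cost figure is an aggregate per-server
    cost (capex + power + maintenance + rack space).
    """
    servers_new = servers_old / speedup
    return (servers_old * cost_per_server_old) / (servers_new * cost_per_server_new)

# With an 8x speedup and a hypothetical new server costing ~1.5x the old
# one per unit, the fleet-level TCO still improves by ~5.3x:
print(round(tco_reduction(8.0, 1.5, 1.0), 1))  # 5.3
```

The point of the sketch is that a speedup compounds across every per-server cost in the list above, so even when the faster server carries a unit-cost premium, the aggregate savings remain large.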
In addition to TCO, there are substantial labor costs associated with software development. Investments made by NVIDIA in TensorRT, TensorRT-LLM, Triton Inference Server, and the NVIDIA NeMo framework save significant development time and reduce time to market. You must factor in these labor costs, which can easily exceed capital and operational costs, to develop a true picture of your aggregate AI expenditures.
LLM ecosystem explosion
The ecosystem is innovating rapidly, developing new and diverse model architectures. Larger models unleash new capabilities and use cases. Some of the largest, most advanced language models, like Meta’s 70B-parameter Llama 2, require multiple GPUs working in concert to deliver responses in real time. Previously, developers looking to achieve the best performance for LLM inference had to rewrite and manually split the AI model into fragments and coordinate execution across GPUs.
TensorRT-LLM uses tensor parallelism, a type of model parallelism in which individual weight matrices are split across devices. This enables efficient inference at scale—with each model running in parallel across multiple GPUs connected through NVLink and across multiple servers—without developer intervention or model changes.
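The core idea behind splitting weight matrices across devices can be shown in a few lines of plain Python. This is an illustrative sketch of column-wise tensor parallelism, not TensorRT-LLM internals: each "device" multiplies the input by its column shard of the weight matrix, and gathering the partial outputs reproduces the full result.

```python
def matmul(x, w):
    """Multiply a vector x (list) by a matrix w (list of rows)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]

def split_columns(w, parts):
    """Split a weight matrix into `parts` column blocks, one per device."""
    cols = len(w[0])
    step = cols // parts
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(parts)]

x = [1.0, 2.0, 3.0]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12]]

# Each "device" multiplies by its own shard; an all-gather (modeled here
# as list concatenation) reassembles the full layer output.
shards = split_columns(w, 2)
partial = [matmul(x, shard) for shard in shards]
gathered = partial[0] + partial[1]

assert gathered == matmul(x, w)
print(gathered)  # [38.0, 44.0, 50.0, 56.0]
```

In a real deployment, each shard lives in a different GPU's memory and the gather runs over NVLink, which is what lets a model too large for one GPU serve requests in real time.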
As new models and model architectures are introduced, you can optimize models with the latest NVIDIA AI kernels available as open source in TensorRT-LLM. The supported kernel fusions include cutting-edge implementations of
FlashAttention and masked multi-head attention for the context and generation phases of GPT model execution, along with many others.
TensorRT-LLM also includes fully optimized, ready-to-run versions of many LLMs widely used in production today, all of which can be implemented with the simple-to-use TensorRT-LLM Python API:
- Meta Llama 2
- OpenAI GPT-2 and GPT-3
- Mosaic MPT
- …and a dozen others
These capabilities help you create customized LLMs faster and more accurately to meet the needs of virtually any industry.
Today’s LLMs are extremely versatile. A single model can be used simultaneously for a variety of different tasks. From a simple question-and-answer response in a chatbot to the summarization of a document or the generation of a long chunk of code, workloads are highly dynamic, with outputs varying in size by several orders of magnitude.
This versatility can make it difficult to batch requests and execute them in parallel effectively, a common optimization for serving neural networks: with naive batching, some requests finish much earlier than others, leaving GPU capacity idle until the longest request in the batch completes.
To manage these dynamic loads, TensorRT-LLM includes an optimized scheduling technique called in-flight batching. This takes advantage of the fact that the overall text generation process for an LLM can be broken down into multiple iterations of execution on the model.
With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the TensorRT-LLM runtime immediately evicts finished sequences from the batch. It then begins executing new requests while other requests are still in flight.
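The scheduling difference can be sketched with a toy simulator. This is an illustrative model, not the actual TensorRT-LLM runtime, and the request lengths are hypothetical: static batching pays for the longest sequence in every batch, while iteration-level scheduling refills freed slots immediately.

```python
def static_batching(lengths, batch_size):
    """Iterations needed when the runtime waits for a whole batch to finish."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def in_flight_batching(lengths, batch_size):
    """Iterations needed when finished sequences are evicted immediately
    and waiting requests take their slots (iteration-level scheduling)."""
    queue = list(lengths)
    slots = []          # remaining generation steps for each active sequence
    iterations = 0
    while queue or slots:
        # Refill any free slots from the waiting queue.
        while queue and len(slots) < batch_size:
            slots.append(queue.pop(0))
        iterations += 1
        slots = [s - 1 for s in slots if s - 1 > 0]
    return iterations

# Output lengths that vary widely across requests, as in real traffic.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching(lengths, 4), in_flight_batching(lengths, 4))  # 200 105
```

With this toy workload, in-flight batching serves the same requests in roughly half the iterations, consistent with the throughput gains described above.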
In-flight batching and the additional kernel-level optimizations improve GPU utilization and at least double the throughput on a benchmark of real-world LLM requests on NVIDIA H100 Tensor Core GPUs, helping to reduce energy costs and minimize TCO.
H100 Transformer Engine with FP8
LLMs contain billions of model weights and activations, typically trained and represented with 16-bit floating point (FP16 or BF16) values where each value occupies 16 bits of memory. At inference time, however, most models can be effectively represented at lower precision, like 8-bit or even 4-bit integers (INT8 or INT4), using modern quantization techniques.
Quantization is the process of reducing the precision of a model’s weights and activations while preserving accuracy as much as possible. Using lower precision means that each parameter is smaller, so the model takes up less space in GPU memory. This enables inference on larger models with the same hardware while spending less time on memory operations during execution.
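A minimal sketch of symmetric 8-bit quantization makes the trade-off concrete. This is illustrative only; production pipelines such as TensorRT-LLM's use calibration and finer-grained scaling, but the principle is the same: one scale factor maps floats onto the int8 range, halving memory relative to FP16 at the cost of a small per-element rounding error.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to int8 with a
    single scale factor (a simplified sketch, not a calibration pipeline)."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.02, -1.27, 0.64, 0.003, -0.5]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each value now needs 1 byte instead of 2 (FP16), halving weight
# memory, at the cost of a small rounding error per element:
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)                  # [2, -127, 64, 0, -50]
print(round(max_err, 4))  # 0.003
```

Note how the smallest weight rounds to zero: the rounding error is bounded by half the scale factor, which is why well-chosen scales (and the wider dynamic range of FP8) matter for preserving accuracy.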
NVIDIA H100 GPUs with TensorRT-LLM give you the ability to convert model weights into a new FP8 format easily and compile models to take advantage of optimized FP8 kernels automatically. This is made possible through NVIDIA Hopper Transformer Engine technology and done without having to change any model code.
The FP8 data format introduced by the H100 enables you to quantize your models and radically reduce memory consumption without degrading model accuracy. FP8 quantization retains higher accuracy compared to other data formats like INT8 or INT4 while achieving the fastest performance and offering the simplest implementation.
LLMs are advancing rapidly. Diverse model architectures are being developed daily and contribute to a growing ecosystem. In turn, larger models unleash new capabilities and use cases, driving adoption across all industries.
LLM inference is also reshaping the data center. Higher performance with increased accuracy yields better TCO for enterprises. Model innovations enable better customer experiences, translating into higher revenue and earnings.
When planning inference deployment projects, there are still many other considerations to achieve peak performance using state-of-the-art LLMs. Optimization rarely happens automatically. You must consider fine-tuning factors such as parallelism, end-to-end pipelines, and advanced scheduling techniques. Those factors require a computing platform that can handle mixed precision without diminishing accuracy.
Get started with TensorRT-LLM
NVIDIA TensorRT-LLM is now available as the open-source library /NVIDIA/TensorRT-LLM on GitHub and through the NVIDIA NeMo framework, part of NVIDIA AI Enterprise, an enterprise-grade AI software platform with security, stability, manageability, and support.