NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures, including the following:
- Decoder-only models, such as Llama 3.1
- Mixture-of-experts (MoE) models, such as Mixtral
- Selective state-space models (SSM), such as Mamba
- Multimodal models for vision-language and video-language applications
The addition of encoder-decoder model support further expands TensorRT-LLM capabilities, providing highly optimized inference for an even broader range of generative AI applications on NVIDIA GPUs.
TensorRT-LLM uses the NVIDIA TensorRT deep learning compiler. It includes the latest optimized kernels for cutting-edge implementations of different attention mechanisms for LLM execution, along with pre- and post-processing steps and multi-GPU/multi-node communication primitives, exposed through a simple, open-source API for groundbreaking LLM inference performance on GPUs.
To address the nuanced differences across encoder-decoder model families such as T5, mT5, Flan-T5, BART, mBART, FairSeq NMT, UL2, and Flan-UL2, TensorRT-LLM abstracts their common and derivative components and provides generic support for encoder-decoder models. It also supports multi-GPU/multi-node inference for these models through full tensor parallelism (TP), pipeline parallelism (PP), and a hybrid of the two.
For more information about the supported models, optimizations, and multi-GPU execution, see Encoder-Decoder Model Support.
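As a quick illustration, the following sketch runs an encoder-decoder checkpoint through the high-level TensorRT-LLM LLM API. It is a minimal sketch, assuming a recent release in which that API accepts encoder-decoder checkpoints directly; in older releases, the encoder and decoder engines are built through the examples/enc_dec scripts instead. The checkpoint name is only an example.

```python
# Minimal sketch: encoder-decoder inference through the high-level LLM API.
# Assumes a TensorRT-LLM release that accepts encoder-decoder checkpoints here;
# the model name below is illustrative.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="google/flan-t5-small")          # example encoder-decoder checkpoint
params = SamplingParams(max_tokens=64)

outputs = llm.generate(
    ["Translate English to German: The house is wonderful."],
    params,
)
for output in outputs:
    print(output.outputs[0].text)
```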
In-flight batching for encoder-decoder architectures
Encoder-decoder models, like multimodal models, have a different runtime pattern compared to decoder-only models. They use more than one engine (commonly two): the first (encoder) engine is executed only once per request with simpler input/output buffers, while the second (decoder) engine is executed auto-regressively with more complex logic for key-value (KV) cache management and batch management to deliver high throughput at low latency.
There are several key extensions that enable in-flight batching (IFB, also called continuous batching) and KV cache management for encoder-decoder architectures; a conceptual sketch of this flow follows the list:
- Runtime support for encoder models (text, audio, or other modalities) that includes the setup of input/output buffers and model execution.
- Dual-paged KV cache management for the decoder’s self-attention cache as well as the decoder’s cross-attention cache computed from the encoder’s output.
- Data passing from encoder to decoder, controlled at the LLM request level. When decoder requests are batched in-flight, each request's encoder-stage output must be gathered and batched in-flight as well.
- Decoupled batching strategy for the encoder and decoder. Because the encoder and decoder can have different sizes and compute properties, requests at each stage are batched independently and asynchronously.
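The toy Python sketch below illustrates this decoupled flow conceptually: a one-shot encoder stage whose output travels with each request, followed by a decoder stage that advances all in-flight requests step by step. This is not TensorRT-LLM code; every class and function name is invented for illustration.

```python
# Conceptual sketch of decoupled encoder/decoder batching; not TensorRT-LLM code.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    encoder_output: list | None = None        # stands in for the encoder hidden states
    tokens: list = field(default_factory=list)
    done: bool = False


def run_encoder(batch):
    """Encoder engine: executed once per request with simple input/output buffers."""
    for req in batch:
        req.encoder_output = [f"enc({tok})" for tok in req.prompt.split()]


def run_decoder_step(batch):
    """Decoder engine: executed auto-regressively. Each in-flight request keeps its
    own self-attention KV cache plus a cross-attention cache built from its encoder
    output (the dual paged KV cache in the real runtime)."""
    for req in batch:
        req.tokens.append(f"tok{len(req.tokens)}")
        if len(req.tokens) >= 4:              # toy stopping criterion
            req.done = True


def serve(requests):
    pending_encode = list(requests)
    in_flight = []
    while pending_encode or in_flight:
        # Encoder and decoder batches are formed independently and asynchronously.
        if pending_encode:
            run_encoder(pending_encode)
            in_flight.extend(pending_encode)  # encoder output travels with the request
            pending_encode = []
        run_decoder_step(in_flight)
        in_flight = [req for req in in_flight if not req.done]


serve([Request("translate this sentence"), Request("summarize that article")])
```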
TensorRT-LLM encoder-decoder models are also supported in the NVIDIA Triton TensorRT-LLM backend for production-ready deployments. NVIDIA Triton Inference Server is open-source inference serving software that streamlines AI inferencing.
With the Triton TensorRT-LLM backend, you can take advantage of features such as in-flight batching and paged KV cache management to enhance the performance and functionality of your encoder-decoder models.
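As a client-side illustration, the sketch below queries a running Triton server over its HTTP generate endpoint. It assumes a model repository laid out as in the tensorrtllm_backend examples, where the entry-point model is named ensemble and exposes text_input, max_tokens, and text_output fields; those names depend on your deployment configuration.

```python
# Minimal client sketch for a Triton server running the TensorRT-LLM backend.
# The model name and request/response fields assume the default ensemble layout
# from the tensorrtllm_backend examples and may differ in your deployment.
import requests

payload = {
    "text_input": "Summarize: TensorRT-LLM now supports encoder-decoder models.",
    "max_tokens": 64,
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",   # Triton HTTP generate endpoint
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text_output"])
```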
Low-rank adaptation support
Low-rank adaptation (LoRA) is a powerful parameter-efficient fine-tuning (PEFT) technique that enables the customization of LLMs while maintaining impressive performance and minimal resource usage. Instead of updating all model parameters during fine-tuning, LoRA adds small trainable rank decomposition matrices to the model, significantly reducing memory requirements and computational costs.
These LoRA adapters are tuned for specific downstream applications and can be used to improve model accuracy on those tasks.
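As a back-of-the-envelope illustration (plain NumPy, not TensorRT-LLM code), the sketch below shows why a rank-r decomposition trains far fewer parameters than a full weight update; the layer shape and rank are arbitrary.

```python
# Toy NumPy illustration of LoRA's parameter savings; not TensorRT-LLM code.
import numpy as np

d, k, r = 1024, 1024, 8                  # hypothetical layer shape and LoRA rank
W = np.random.randn(d, k)                # frozen pretrained weight
A = np.random.randn(r, k) * 0.01         # trainable low-rank factor
B = np.zeros((d, r))                     # starts at zero so the initial update is zero

delta_W = B @ A                          # the adapter's effective weight update
y = (W + delta_W) @ np.random.randn(k)   # adapted forward pass for one input vector

full_params = d * k                      # parameters a full fine-tune would update
lora_params = d * r + r * k              # parameters LoRA actually trains
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {full_params // lora_params}x")
```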
TensorRT-LLM LoRA support for BART efficiently handles the low-rank matrices that characterize LoRA adaptations (a toy sketch of the batching idea follows this list). This enables the following benefits:
- Efficient serving of multiple LoRA adapters within a single batch
- Reduced memory footprint through the dynamic loading of LoRA adapters
- Seamless integration with existing BART model deployments
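The NumPy toy below sketches the core idea behind serving several adapters in one batch: one shared GEMM against the frozen base weight, plus a small per-request low-rank update chosen by that request's adapter. The real runtime implements this with optimized kernels and dynamically loaded adapter weights; the code here is illustrative only, and the adapter names are made up.

```python
# Toy sketch of multi-LoRA batching: shared base GEMM + per-request low-rank update.
import numpy as np

d_in, d_out, r, batch = 512, 512, 8, 4
W = np.random.randn(d_in, d_out)                  # shared frozen base weight
adapters = {                                      # two resident adapters (hypothetical)
    "summarize": (np.random.randn(d_in, r), np.random.randn(r, d_out) * 0.01),
    "translate": (np.random.randn(d_in, r), np.random.randn(r, d_out) * 0.01),
}

x = np.random.randn(batch, d_in)                  # one row per in-flight request
task = ["summarize", "translate", "summarize", "translate"]   # adapter per request

base = x @ W                                      # one shared GEMM for the whole batch
out = np.stack([
    base[i] + x[i] @ adapters[task[i]][0] @ adapters[task[i]][1]
    for i in range(batch)
])                                                # plus each request's low-rank update
print(out.shape)                                  # (4, 512)
```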
Summary
NVIDIA TensorRT-LLM continues to expand its capabilities for optimizing and efficiently running LLMs across different architectures. Upcoming enhancements to encoder-decoder models include FP8 quantization, enabling further improvements in latency and throughput. For production deployments, NVIDIA Triton Inference Server provides the ideal platform for serving these models.
Enterprises seeking the fastest time to value can use NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, which offers optimized inference on popular models from NVIDIA and its partner ecosystem.