NVIDIA recently announced that NVIDIA TensorRT-LLM now accelerates encoder-decoder model architectures. TensorRT-LLM is an open-source library that optimizes inference for diverse model architectures, including the following:
- Decoder-only models, such as Llama 3.1
- Mixture-of-experts (MoE) models, such as Mixtral
- Selective state-space models (SSM), such as Mamba
- Multimodal models for vision-language and video-language applications
The addition of encoder-decoder model support further expands TensorRT-LLM capabilities, providing highly optimized inference for an even broader range of generative AI applications on NVIDIA GPUs.
TensorRT-LLM uses the NVIDIA TensorRT deep learning compiler. It includes the latest optimized kernels for cutting-edge implementations of different attention mechanisms for LLM execution, along with pre- and post-processing steps and multi-GPU/multi-node communication primitives, exposed through a simple, open-source API for groundbreaking LLM inference performance on GPUs.
To address the nuanced differences across encoder-decoder model families such as T5, mT5, Flan-T5, BART, mBART, FairSeq NMT, UL2, and Flan-UL2, TensorRT-LLM abstracts their common and derivative components and provides generic support for encoder-decoder models. It also supports multi-GPU/multi-node inference for these models through full tensor parallelism (TP), pipeline parallelism (PP), and a hybrid of the two.
For more information about the supported models, optimizations, and multi-GPU execution, see Encoder-Decoder Model Support.
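As a quick illustration, the following sketch runs an encoder-decoder checkpoint through the high-level TensorRT-LLM LLM API. It is a minimal sketch, assuming a recent release in which that API accepts encoder-decoder checkpoints directly; in older releases, the encoder and decoder engines are built through the examples/enc_dec scripts instead. The checkpoint name is only an example.

```python
# Minimal sketch: encoder-decoder inference through the high-level LLM API.
# Assumes a TensorRT-LLM release that accepts encoder-decoder checkpoints here;
# the model name below is illustrative.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="google/flan-t5-small")          # example encoder-decoder checkpoint
params = SamplingParams(max_tokens=64)

outputs = llm.generate(
    ["Translate English to German: The house is wonderful."],
    params,
)
for output in outputs:
    print(output.outputs[0].text)
```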
In-flight batching for encoder-decoder architectures
Encoder-decoder models, like multimodal models, have a different runtime pattern compared to decoder-only models. They use more than one engine (commonly two): the first (encoder) engine is executed only once per request with simpler input/output buffers, while the second (decoder) engine is executed auto-regressively with more complex logic for key-value (KV) cache management and batch management to deliver high throughput at low latency.
There are several key extensions that enable in-flight batching (IFB, also called continuous batching) and KV cache management for encoder-decoder architectures; a conceptual sketch of this flow follows the list:
- Runtime support for encoder models (text, audio, or other modalities) that includes the setup of input/output buffers and model execution.
- Dual-paged KV cache management for the decoder’s self-attention cache as well as the decoder’s cross-attention cache computed from the encoder’s output.
- Data passing from encoder to decoder, controlled at the LLM request level. When decoder requests are batched in-flight, each request's encoder-stage output must be gathered and batched in-flight as well.
- Decoupled batching strategy for the encoder and decoder. Because the encoder and decoder can have different sizes and compute properties, requests at each stage are batched independently and asynchronously.
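The toy Python sketch below illustrates this decoupled flow conceptually: a one-shot encoder stage whose output travels with each request, followed by a decoder stage that advances all in-flight requests step by step. This is not TensorRT-LLM code; every class and function name is invented for illustration.

```python
# Conceptual sketch of decoupled encoder/decoder batching; not TensorRT-LLM code.
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    encoder_output: list | None = None        # stands in for the encoder hidden states
    tokens: list = field(default_factory=list)
    done: bool = False


def run_encoder(batch):
    """Encoder engine: executed once per request with simple input/output buffers."""
    for req in batch:
        req.encoder_output = [f"enc({tok})" for tok in req.prompt.split()]


def run_decoder_step(batch):
    """Decoder engine: executed auto-regressively. Each in-flight request keeps its
    own self-attention KV cache plus a cross-attention cache built from its encoder
    output (the dual paged KV cache in the real runtime)."""
    for req in batch:
        req.tokens.append(f"tok{len(req.tokens)}")
        if len(req.tokens) >= 4:              # toy stopping criterion
            req.done = True


def serve(requests):
    pending_encode = list(requests)
    in_flight = []
    while pending_encode or in_flight:
        # Encoder and decoder batches are formed independently and asynchronously.
        if pending_encode:
            run_encoder(pending_encode)
            in_flight.extend(pending_encode)  # encoder output travels with the request
            pending_encode = []
        run_decoder_step(in_flight)
        in_flight = [req for req in in_flight if not req.done]


serve([Request("translate this sentence"), Request("summarize that article")])
```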
TensorRT-LLM encoder-decoder models are also supported in the NVIDIA Triton TensorRT-LLM backend for production-ready deployments. NVIDIA Triton Inference Server is open-source inference serving software that streamlines AI inferencing.
With the Triton TensorRT-LLM backend, you can take advantage of features such as in-flight batching and paged KV cache management to enhance the performance and functionality of your encoder-decoder models.
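As a client-side illustration, the sketch below queries a running Triton server over its HTTP generate endpoint. It assumes a model repository laid out as in the tensorrtllm_backend examples, where the entry-point model is named ensemble and exposes text_input, max_tokens, and text_output fields; those names depend on your deployment configuration.

```python
# Minimal client sketch for a Triton server running the TensorRT-LLM backend.
# The model name and request/response fields assume the default ensemble layout
# from the tensorrtllm_backend examples and may differ in your deployment.
import requests

payload = {
    "text_input": "Summarize: TensorRT-LLM now supports encoder-decoder models.",
    "max_tokens": 64,
}
resp = requests.post(
    "http://localhost:8000/v2/models/ensemble/generate",   # Triton HTTP generate endpoint
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text_output"])
```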
Low-rank adaptation support
Low-rank adaptation (LoRA) is a powerful parameter-efficient fine-tuning (PEFT) technique that enables the customization of LLMs while maintaining impressive performance and minimal resource usage. Instead of updating all model parameters during fine-tuning, LoRA adds small trainable rank decomposition matrices to the model, significantly reducing memory requirements and computational costs.
These LoRA adapters are tuned for specific downstream applications and can be used to improve model accuracy on those tasks.
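As a back-of-the-envelope illustration (plain NumPy, not TensorRT-LLM code), the sketch below shows why a rank-r decomposition trains far fewer parameters than a full weight update; the layer shape and rank are arbitrary.

```python
# Toy NumPy illustration of LoRA's parameter savings; not TensorRT-LLM code.
import numpy as np

d, k, r = 1024, 1024, 8                  # hypothetical layer shape and LoRA rank
W = np.random.randn(d, k)                # frozen pretrained weight
A = np.random.randn(r, k) * 0.01         # trainable low-rank factor
B = np.zeros((d, r))                     # starts at zero so the initial update is zero

delta_W = B @ A                          # the adapter's effective weight update
y = (W + delta_W) @ np.random.randn(k)   # adapted forward pass for one input vector

full_params = d * k                      # parameters a full fine-tune would update
lora_params = d * r + r * k              # parameters LoRA actually trains
print(f"full: {full_params:,}  lora: {lora_params:,}  ratio: {full_params // lora_params}x")
```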
TensorRT-LLM LoRA support for BART efficiently handles the low-rank matrices that characterize LoRA adaptations (a toy sketch of the batching idea follows this list). This enables the following benefits:
- Efficient serving of multiple LoRA adapters within a single batch
- Reduced memory footprint through the dynamic loading of LoRA adapters
- Seamless integration with existing BART model deployments
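The NumPy toy below sketches the core idea behind serving several adapters in one batch: one shared GEMM against the frozen base weight, plus a small per-request low-rank update chosen by that request's adapter. The real runtime implements this with optimized kernels and dynamically loaded adapter weights; the code here is illustrative only, and the adapter names are made up.

```python
# Toy sketch of multi-LoRA batching: shared base GEMM + per-request low-rank update.
import numpy as np

d_in, d_out, r, batch = 512, 512, 8, 4
W = np.random.randn(d_in, d_out)                  # shared frozen base weight
adapters = {                                      # two resident adapters (hypothetical)
    "summarize": (np.random.randn(d_in, r), np.random.randn(r, d_out) * 0.01),
    "translate": (np.random.randn(d_in, r), np.random.randn(r, d_out) * 0.01),
}

x = np.random.randn(batch, d_in)                  # one row per in-flight request
task = ["summarize", "translate", "summarize", "translate"]   # adapter per request

base = x @ W                                      # one shared GEMM for the whole batch
out = np.stack([
    base[i] + x[i] @ adapters[task[i]][0] @ adapters[task[i]][1]
    for i in range(batch)
])                                                # plus each request's low-rank update
print(out.shape)                                  # (4, 512)
```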
Summary
NVIDIA TensorRT-LLM continues to expand its capabilities for optimizing and efficiently running LLMs across different architectures. Upcoming enhancements to encoder-decoder models include FP8 quantization, enabling further improvements in latency and throughput. For production deployments, NVIDIA Triton Inference Server provides the ideal platform for serving these models.
Enterprises seeking the fastest time to value can use NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, which offers optimized inference on popular models from NVIDIA and its partner ecosystem.