NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture traditionally requires significant manual effort. To address this challenge, today we are announcing the availability of AutoDeploy as a beta feature in TensorRT LLM.
AutoDeploy compiles off-the-shelf PyTorch models into inference-optimized graphs. This avoids the need to bake inference-specific optimizations directly into model code, reducing LLM deployment time. AutoDeploy enables the shift from manually reimplementing and optimizing each model toward a compiler-driven workflow that separates model authoring from inference optimization.
This post introduces the AutoDeploy architecture and capabilities, and shows how it enabled support for recent NVIDIA Nemotron models at launch.
What is AutoDeploy?
Every new LLM architecture comes with its own inference challenges, from transformer models to hybrid vision language models (VLMs) to state space models (SSMs). Turning a reference implementation into a high-performance inference engine typically requires adding KV cache management, sharding weights across GPUs, fusing operations, and tuning the execution graph for specific hardware.
AutoDeploy shifts this workflow toward a compiler-driven approach. Instead of requiring model authors to manually reimplement inference logic, AutoDeploy automatically extracts a computation graph from an off-the-shelf PyTorch model and applies a series of automated transformations to produce an inference-optimized TensorRT LLM graph. This enables you to describe the model once in PyTorch and delegate inference-specific concerns—such as caching, sharding, kernel selection, and runtime integration—to the compiler and runtime.
This approach is particularly well-suited for the long tail of models, including new research architectures, internal variants, and fast-moving open source models, where manual reimplementation is often impractical or unjustified. AutoDeploy enables deployment at launch with competitive baseline performance, while preserving a clear path to incremental optimization as models mature.
AutoDeploy provides:
- Seamless model translation: Automatically converts Hugging Face models into TensorRT LLM graphs without manual rewrites
- Single source of truth: Keeps the original PyTorch model as the canonical definition
- Inference optimization: Applies sharding, quantization, KV cache insertion, attention fusion, CUDA Graphs optimization, and more
- Deployment at launch: Enables immediate deployment with ongoing performance improvements over time
- Turnkey setup: Ships as part of TensorRT LLM with examples and documentation
AutoDeploy can be used for:
- New or experimental architectures: Rapidly deploy research models, hybrid designs, or novel token mixing (attention) mechanisms
- Long-tail model support: Serve internal, fine-tuned, or less common models without bespoke inference implementations
- Fast performance bring-up: Reach competitive baseline performance quickly, then optimize incrementally
- Unified training-to-inference workflow: Keep PyTorch as the model definition while relying on TensorRT LLM for runtime integration
AutoDeploy currently supports more than 100 text-to-text LLMs, offers early support for VLMs and SSMs, and provides performance-optimized support for models such as the Llama model family and NVIDIA Nemotron 3 Nano.
AutoDeploy technical background
AutoDeploy sits between the original Hugging Face model and the TensorRT LLM runtime. The LLM API accepts a model name or checkpoint directory and returns a high-level LLM object. Under the hood, that object can use either the automated AutoDeploy backend or a manual backend.
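As a minimal sketch of this flow (assuming a Hugging Face model name), the snippet below calls the high-level LLM API and generates text. The specific option or import that selects the AutoDeploy backend rather than the manual one varies by release, so it is omitted here; see the AutoDeploy documentation and examples for the exact setting.

```python
from tensorrt_llm import LLM, SamplingParams

# Point the LLM API at a Hugging Face model name or a local checkpoint directory.
# How the AutoDeploy backend (versus the manual backend) is selected is left out
# here on purpose; refer to the AutoDeploy documentation for the exact option.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# The returned LLM object exposes the same generate() interface regardless of
# which backend produced the inference-optimized graph.
outputs = llm.generate(
    ["Explain what an inference-optimized graph is."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```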
As Figure 1 shows, the AutoDeploy path automatically extracts a graph, applies optimizations, and generates an inference‑optimized graph. The manual path requires engineers to rewrite the model (adding KV cache logic, attention kernels, sharding, kernel fusion, and more) before running it through the same runtime.

Graph capture and pattern matching
AutoDeploy uses the torch.export API to capture the model as a standardized Torch graph consisting of core ATen operations and custom (user- or AutoDeploy-provided) operations. The exported graph then undergoes a series of automated transformations to pattern-match and canonicalize the graph representation of common building blocks.
In this initial step, AutoDeploy ensures that common building blocks such as mixture of experts (MoE), attention, RoPE, or state-space layers are expressed through reference implementations that appear as custom ops, each a single node in the graph.
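To make this concrete, the toy snippet below (plain torch.export with a made-up TinyMLP module, not an AutoDeploy-specific API) captures a small module into a standardized graph of core ATen operations; this exported graph is the representation that the subsequent pattern-matching and canonicalization passes operate on.

```python
import torch
from torch.export import export

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.up = torch.nn.Linear(64, 256)
        self.down = torch.nn.Linear(256, 64)

    def forward(self, x):
        return self.down(torch.nn.functional.silu(self.up(x)))

# Capture the module as an ExportedProgram: a flat graph of core ATen operations
# (linear/silu nodes or their decompositions) that downstream passes can inspect
# and rewrite.
exported = export(TinyMLP(), (torch.randn(2, 8, 64),))
print(exported.graph_module.graph)
```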
Figure 2 provides an example of how attention is represented across all models as a single, easy-to-interpret custom operator in PyTorch.

This approach makes model onboarding seamless and keeps it decoupled from performance optimization and runtime integration.
Moreover, model onboarding sits on a sliding scale between fully automated onboarding through pattern matching and full manual rewrites, ensuring the final graph can execute the model end to end. The model author can inject custom kernels into the model graph by decorating the relevant operations as PyTorch custom operators; the AutoDeploy compiler leaves these operators unmodified (Figure 3).
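The sketch below illustrates that escape hatch with a hypothetical hand-written normalization: wrapping it with torch.library.custom_op makes it a single opaque node in the exported graph, which the compiler passes leave untouched. The operator name, signature, and implementation are illustrative only, not part of AutoDeploy.

```python
import torch

# Hypothetical author-provided kernel wrapped as a PyTorch custom operator so that
# graph passes see it as one opaque node they must not rewrite. A real model might
# dispatch to a hand-tuned CUDA kernel inside this body.
@torch.library.custom_op("my_model::fused_rmsnorm", mutates_args=())
def fused_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

# A "fake" (meta) implementation lets torch.export propagate shapes and dtypes
# through the op without running the real kernel.
@fused_rmsnorm.register_fake
def _(x, weight, eps):
    return torch.empty_like(x)

# The model author simply calls the op; it appears as a single node in the
# exported graph and is left untouched by AutoDeploy's transformations.
y = fused_rmsnorm(torch.randn(2, 16), torch.ones(16), 1e-6)
```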

Sharding, fusion, and performance optimization
In the next stages, AutoDeploy automatically applies performance optimizations through compiler-like passes that combine operator fusion, performance-tuned recipes, and the insertion of optimized kernels into the graph representation. During this stage, the model is also sharded for multi-GPU inference based on available heuristics or prespecified sharding hints, reusing the sharding hints provided by Hugging Face.
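As a toy illustration of the mechanics of such a pass (generic PyTorch FX here, not an actual AutoDeploy pass), the snippet below rewrites a hand-written x * sigmoid(x) pattern into a single silu node; AutoDeploy's fusion and kernel-insertion passes follow the same principle of matching canonical patterns and swapping in optimized implementations.

```python
import torch
import torch.fx as fx
from torch.fx import subgraph_rewriter

class MLPBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 16)

    def forward(self, x):
        h = self.proj(x)
        return h * torch.sigmoid(h)  # SiLU spelled out "by hand"

# The pattern to find in the graph and the canonical/fused form to swap in.
def pattern(x):
    return x * torch.sigmoid(x)

def replacement(x):
    return torch.nn.functional.silu(x)

gm = fx.symbolic_trace(MLPBlock())
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
gm.recompile()
# The forward now calls silu as a single node; a production pass would insert an
# optimized or hardware-specific kernel instead.
print(gm.code)
```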
Flexible attention and caching support
During graph capture and pattern matching, AutoDeploy represents token mixing (for example, attention) operators as simple prefill-only operations expressed as canonicalized AutoDeploy reference operators. This is depicted in Figure 3 for the example of softmax attention.
The system then automatically swaps in performance-optimized attention kernels and integrates the caching mechanisms of token mixing operators into the TensorRT LLM optimized cache manager. Currently, AutoDeploy can handle models that are arbitrarily composed of softmax attention, state-space layers (Mamba2), linear attention (DeltaNet), and causal convolution.
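For intuition, a prefill-only softmax attention reference might look like the following sketch (hypothetical and simplified, not the actual AutoDeploy operator): it attends causally over the full prompt with no cache, and the later transformation stages replace it with an optimized, cache-aware kernel wired into the TensorRT LLM cache manager.

```python
import torch

def prefill_softmax_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Hypothetical prefill-only reference; q, k, v are [batch, heads, seq, head_dim].

    There is no KV cache or paging here: causal attention is computed over the whole
    sequence. Later transformation stages replace this node with an optimized,
    cache-integrated kernel managed by the TensorRT LLM cache manager.
    """
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```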
Adding support for other cached operators is straightforward thanks to a strict, extensible interface.
Compilation tooling
AutoDeploy integrates with common off-the-shelf tooling to compile and lower the model further, including torch.compile, CUDA Graphs for decode-only batches at fixed batch sizes, multistream optimizations, and more.
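As a small generic example of that tooling (not AutoDeploy-specific code), torch.compile with mode set to reduce-overhead can capture a fixed-shape, decode-style step with CUDA Graphs to cut per-step launch overhead:

```python
import torch

# Generic illustration: compile a fixed-shape, decode-style step so that, on GPU,
# mode="reduce-overhead" can capture and replay it with CUDA Graphs.
@torch.compile(mode="reduce-overhead")
def decode_step(hidden: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.linear(hidden, weight)

if torch.cuda.is_available():
    hidden = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
    weight = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    out = decode_step(hidden, weight)  # first calls warm up/capture; later calls replay
```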
Runtime integration
AutoDeploy handles all aspects of integrating the model into the optimized TensorRT LLM runtime, including features such as the overlap scheduler, chunked prefill, speculative decoding, and cache and state management, without burdening the model author with the intertwined dependencies between the model and the runtime.
AutoDeploy performance example: Nemotron 3 Nano
To gauge AutoDeploy capabilities, the team onboarded NVIDIA Nemotron 3 Nano, a hybrid MoE model. While hand-tuning such a model for inference would typically take weeks, AutoDeploy enabled onboarding within days, followed by incremental optimizations that brought performance in line with a manually tuned baseline.
On a single NVIDIA Blackwell B200 GPU, AutoDeploy performed on par with the manually optimized baseline in TensorRT LLM (Figure 4). It delivered up to 350 tokens per second per user for latency-focused applications and up to 13,000 output tokens per second for high-throughput applications.

Data was collected for ISL/OSL 1k/1k, TP=1, on an NVIDIA DGX B200 using TensorRT LLM v1.3.0rc1, trtllm-serve, and the AIPerf benchmarking tool.
To reproduce the results yourself, follow the steps outlined in the NVIDIA Nemotron 3 Nano Checkpoint.
Model onboarding example: Nemotron-Flash
Nemotron-Flash is a representative example of the type of architecture that can be difficult to support using a purely manual inference workflow. This hybrid research model combines multiple token mixers—including state space layers, softmax attention, and linear attention—and would require significant engineering effort to reimplement, optimize, and maintain by hand.
With AutoDeploy, existing optimization passes could be reused out of the box for Nemotron-Flash layers, without any model-specific engineering. New layer types, such as the DeltaNet update rule, were integrated as incremental extensions rather than full rewrites and can be reused for future model onboarding work.
As a result, Nemotron-Flash was onboarded and performance-optimized within days and is now supported out-of-the-box. This highlights the core strength of AutoDeploy: once optimizations are expressed as reusable compiler passes, new and unconventional architectures can immediately benefit from the full optimization stack, dramatically reducing time-to-deployment while maintaining high inference performance.
The team used TensorRT LLM AutoDeploy to benchmark Nemotron Flash 3B Instruct against Qwen2.5 3B Instruct, a widely adopted, heavily hand-tuned model in a similar size range. For the benchmarking scenario in Figure 5 (ISL/OSL=8k/16k), Nemotron-Flash outperforms Qwen2.5, highlighting how novel model architectures can be quickly onboarded to achieve production-ready performance.

Data was collected for ISL/OSL 8k/16k, TP=1, on an NVIDIA DGX H100 using TensorRT LLM v1.3.0rc1, trtllm-serve, and the AIPerf benchmarking tool.
Get started with TensorRT LLM AutoDeploy
TensorRT LLM AutoDeploy marks a shift toward approaching inference optimization as a compiler and runtime responsibility rather than a burden on the model author. This approach enables faster experimentation, broader model coverage, and a cleaner separation between model design and deployment.
Instead of hand-tuning each model, you can describe the architecture once and let the system apply graph transformations and optimized kernels. Early successes such as Nemotron 3 Nano and Nemotron-Flash demonstrate that deployment at model launch with peak performance is achievable across diverse model architectures.
TensorRT LLM AutoDeploy is rapidly evolving. If you’re interested in experimenting with this feature or contributing to its development, check out the AutoDeploy documentation and example scripts.
Acknowledgments
We’d like to thank those who have contributed to AutoDeploy, including Ajinkya Rasane, Bala Marimuthu, Chenghao Zhang, Chenjie Luo, Eran Geva, Frida Hou, Gal Hubara Agam, Govind Ramnarayan, Grzegorz Kwasniewski, Hao Guo, Jingyu Xin, Joyjit Daw, Karthik Vetrivel, Lucas Liebenwein, Neta Zmora, Suguna Varshini Velury, Suyog Gupta, Tal Cherckez, Taylor Lee, Wanli Jiang, Wei-Ming Chen, William Zhang, and Yoco Xiao.