NVIDIA TensorRT LLM enables developers to build high-performance inference engines for large language models (LLMs), but deploying a new architecture traditionally requires significant manual effort. To address this challenge, today we are announcing the availability of AutoDeploy as a beta feature in TensorRT LLM.
AutoDeploy compiles off-the-shelf PyTorch models into inference-optimized graphs. This avoids the need to bake inference-specific optimizations directly into model code, reducing LLM deployment time. AutoDeploy enables the shift from manually reimplementing and optimizing each model toward a compiler-driven workflow that separates model authoring from inference optimization.
This post introduces the AutoDeploy architecture and capabilities, and shows how it enabled support for recent NVIDIA Nemotron models at launch.
What is AutoDeploy?
Every new LLM architecture comes with its own inference challenges, from transformer models to hybrid vision language models (VLMs) to state space models (SSMs). Turning a reference implementation into a high-performance inference engine typically requires adding KV cache management, sharding weights across GPUs, fusing operations, and tuning the execution graph for specific hardware.
AutoDeploy shifts this workflow toward a compiler-driven approach. Instead of requiring model authors to manually reimplement inference logic, AutoDeploy automatically extracts a computation graph from an off-the-shelf PyTorch model and applies a series of automated transformations to produce an inference-optimized TensorRT LLM graph. This enables you to describe the model once in PyTorch and delegate inference-specific concerns—such as caching, sharding, kernel selection, and runtime integration—to the compiler and runtime.
This approach is particularly well-suited for the long tail of models, including new research architectures, internal variants, and fast-moving open source models, where manual reimplementation is often impractical or unjustified. AutoDeploy enables deployment at launch with competitive baseline performance, while preserving a clear path to incremental optimization as models mature.
AutoDeploy provides:
- Seamless model translation: Automatically converts Hugging Face models into TensorRT LLM graphs without manual rewrites
- Single source of truth: Keeps the original PyTorch model as the canonical definition
- Inference optimization: Applies sharding, quantization, KV cache insertion, attention fusion, CUDA Graphs optimization, and more
- Deployment at launch: Enables immediate deployment with ongoing performance improvements over time
- Turnkey setup: Ships as part of TensorRT LLM with examples and documentation
AutoDeploy can be used for:
- New or experimental architectures: Rapidly deploy research models, hybrid designs, or novel token mixing (attention) mechanisms
- Long-tail model support: Serve internal, fine-tuned, or less common models without bespoke inference implementations
- Fast performance bring-up: Reach competitive baseline performance quickly, then optimize incrementally
- Unified training-to-inference workflow: Keep PyTorch as the model definition while relying on TensorRT LLM for runtime integration
AutoDeploy currently supports more than 100 text-to-text LLMs, offers early support for VLMs and SSMs, and provides performance-optimized support for models such as the Llama model family and NVIDIA Nemotron 3 Nano.
AutoDeploy technical background
AutoDeploy sits between the original Hugging Face model and the TensorRT LLM runtime. The LLM API accepts a model name or checkpoint directory and returns a high-level LLM object. Under the hood, that object can use either the automated AutoDeploy backend or a manual backend.
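As a minimal sketch of this flow (assuming a Hugging Face model name), the snippet below calls the high-level LLM API and generates text. The specific option or import that selects the AutoDeploy backend rather than the manual one varies by release, so it is omitted here; see the AutoDeploy documentation and examples for the exact setting.

```python
from tensorrt_llm import LLM, SamplingParams

# Point the LLM API at a Hugging Face model name or a local checkpoint directory.
# How the AutoDeploy backend (versus the manual backend) is selected is left out
# here on purpose; refer to the AutoDeploy documentation for the exact option.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# The returned LLM object exposes the same generate() interface regardless of
# which backend produced the inference-optimized graph.
outputs = llm.generate(
    ["Explain what an inference-optimized graph is."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```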
As Figure 1 shows, the AutoDeploy path automatically extracts a graph, applies optimizations, and generates an inference‑optimized graph. The manual path requires engineers to rewrite the model (adding KV cache logic, attention kernels, sharding, kernel fusion, and more) before running it through the same runtime.

Graph capture and pattern matching
AutoDeploy uses the torch.export API to capture the model as a standardized Torch graph consisting of core ATen operations and custom (user- or AutoDeploy-provided) operations. The exported graph then undergoes a series of automated transformations to pattern-match and canonicalize the graph representation of common building blocks.
In this initial step, AutoDeploy ensures that common building blocks such as mixture of experts (MoE), attention, RoPE, or state-space layers are expressed through reference implementations that appear as custom ops, each a single node in the graph.
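To make this concrete, the toy snippet below (plain torch.export with a made-up TinyMLP module, not an AutoDeploy-specific API) captures a small module into a standardized graph of core ATen operations; this exported graph is the representation that the subsequent pattern-matching and canonicalization passes operate on.

```python
import torch
from torch.export import export

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.up = torch.nn.Linear(64, 256)
        self.down = torch.nn.Linear(256, 64)

    def forward(self, x):
        return self.down(torch.nn.functional.silu(self.up(x)))

# Capture the module as an ExportedProgram: a flat graph of core ATen operations
# (linear/silu nodes or their decompositions) that downstream passes can inspect
# and rewrite.
exported = export(TinyMLP(), (torch.randn(2, 8, 64),))
print(exported.graph_module.graph)
```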
Figure 2 provides an example of how attention is represented across all models as a single, easy-to-interpret custom operator in PyTorch.

This approach makes model onboarding seamless and keeps it decoupled from performance optimization and runtime integration.
Moreover, model onboarding sits on a sliding scale between fully automated onboarding through pattern matching and full manual rewrites, ensuring the final graph can execute the model end to end. The model author can inject custom kernels into the model graph by decorating the relevant operations as PyTorch custom operators; the AutoDeploy compiler leaves these operators unmodified (Figure 3).
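The sketch below illustrates that escape hatch with a hypothetical hand-written normalization: wrapping it with torch.library.custom_op makes it a single opaque node in the exported graph, which the compiler passes leave untouched. The operator name, signature, and implementation are illustrative only, not part of AutoDeploy.

```python
import torch

# Hypothetical author-provided kernel wrapped as a PyTorch custom operator so that
# graph passes see it as one opaque node they must not rewrite. A real model might
# dispatch to a hand-tuned CUDA kernel inside this body.
@torch.library.custom_op("my_model::fused_rmsnorm", mutates_args=())
def fused_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    variance = x.float().pow(2).mean(-1, keepdim=True)
    return (x.float() * torch.rsqrt(variance + eps)).to(x.dtype) * weight

# A "fake" (meta) implementation lets torch.export propagate shapes and dtypes
# through the op without running the real kernel.
@fused_rmsnorm.register_fake
def _(x, weight, eps):
    return torch.empty_like(x)

# The model author simply calls the op; it appears as a single node in the
# exported graph and is left untouched by AutoDeploy's transformations.
y = fused_rmsnorm(torch.randn(2, 16), torch.ones(16), 1e-6)
```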

Sharding, fusion, and performance optimization
In the next stages, AutoDeploy automatically applies performance optimizations through compiler-like passes that combine operator fusion, performance-tuned recipes, and the insertion of optimized kernels into the graph representation. During this stage, the model is also sharded for multi-GPU inference based on available heuristics or prespecified sharding hints, reusing the sharding hints provided by Hugging Face.
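As a toy illustration of the mechanics of such a pass (generic PyTorch FX here, not an actual AutoDeploy pass), the snippet below rewrites a hand-written x * sigmoid(x) pattern into a single silu node; AutoDeploy's fusion and kernel-insertion passes follow the same principle of matching canonical patterns and swapping in optimized implementations.

```python
import torch
import torch.fx as fx
from torch.fx import subgraph_rewriter

class MLPBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 16)

    def forward(self, x):
        h = self.proj(x)
        return h * torch.sigmoid(h)  # SiLU spelled out "by hand"

# The pattern to find in the graph and the canonical/fused form to swap in.
def pattern(x):
    return x * torch.sigmoid(x)

def replacement(x):
    return torch.nn.functional.silu(x)

gm = fx.symbolic_trace(MLPBlock())
subgraph_rewriter.replace_pattern(gm, pattern, replacement)
gm.recompile()
# The forward now calls silu as a single node; a production pass would insert an
# optimized or hardware-specific kernel instead.
print(gm.code)
```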
Flexible attention and caching support
During graph capture and pattern matching, AutoDeploy represents token mixing (for example, attention) operators as simple prefill-only operations expressed as canonicalized AutoDeploy reference operators. This is depicted in Figure 3 for the example of softmax attention.
The system then automatically swaps in performance-optimized attention kernels and integrates the caching mechanisms of token mixing operators into the TensorRT LLM optimized cache manager. Currently, AutoDeploy can handle models that are arbitrarily composed of softmax attention, state-space layers (Mamba2), linear attention (DeltaNet), and causal convolution.
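For intuition, a prefill-only softmax attention reference might look like the following sketch (hypothetical and simplified, not the actual AutoDeploy operator): it attends causally over the full prompt with no cache, and the later transformation stages replace it with an optimized, cache-aware kernel wired into the TensorRT LLM cache manager.

```python
import torch

def prefill_softmax_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Hypothetical prefill-only reference; q, k, v are [batch, heads, seq, head_dim].

    There is no KV cache or paging here: causal attention is computed over the whole
    sequence. Later transformation stages replace this node with an optimized,
    cache-integrated kernel managed by the TensorRT LLM cache manager.
    """
    return torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
```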
Adding support for other cached operators is straightforward thanks to a strict, extensible interface.
Compilation tooling
AutoDeploy integrates with common off-the-shelf tooling to compile and lower the model further, including torch.compile, CUDA Graphs for decode-only batches at fixed batch sizes, multistream optimizations, and more.
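As a small generic example of that tooling (not AutoDeploy-specific code), torch.compile with mode set to reduce-overhead can capture a fixed-shape, decode-style step with CUDA Graphs to cut per-step launch overhead:

```python
import torch

# Generic illustration: compile a fixed-shape, decode-style step so that, on GPU,
# mode="reduce-overhead" can capture and replay it with CUDA Graphs.
@torch.compile(mode="reduce-overhead")
def decode_step(hidden: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.linear(hidden, weight)

if torch.cuda.is_available():
    hidden = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
    weight = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
    out = decode_step(hidden, weight)  # first calls warm up/capture; later calls replay
```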
Runtime integration
AutoDeploy handles all aspects of integrating the model into the optimized TensorRT LLM runtime, including features such as the overlap scheduler, chunked prefill, speculative decoding, and cache and state management, without burdening the model author with the intertwined dependencies between the model and the runtime.
AutoDeploy performance example: Nemotron 3 Nano
To gauge AutoDeploy capabilities, the team onboarded NVIDIA Nemotron 3 Nano, a hybrid MoE model. While hand-tuning such a model for inference would typically take weeks, AutoDeploy enabled onboarding within days, followed by incremental optimizations that brought performance in line with a manually tuned baseline.
On a single NVIDIA Blackwell B200 GPU, AutoDeploy performed on par with the manually optimized baseline in TensorRT LLM (Figure 4). It delivered up to 350 tokens per second per user for latency-focused applications and up to 13,000 output tokens per second for high-throughput applications.

Data was collected for ISL/OSL 1k/1k, TP=1, on an NVIDIA DGX B200 using TensorRT LLM v1.3.0rc1, trtllm-serve, and the AIPerf benchmarking tool.
To reproduce the results yourself, follow the steps outlined in the NVIDIA Nemotron 3 Nano Checkpoint.
Model onboarding example: Nemotron-Flash
Nemotron-Flash is a representative example of the type of architecture that can be difficult to support using a purely manual inference workflow. This hybrid research model combines multiple token mixers—including state space layers, softmax attention, and linear attention—and would require significant engineering effort to reimplement, optimize, and maintain by hand.
With AutoDeploy, existing optimization passes could be reused out of the box for Nemotron-Flash layers, without any model-specific engineering. New layer types, such as the DeltaNet update rule, were integrated as incremental extensions rather than full rewrites and can be reused for future model onboarding work.
As a result, Nemotron-Flash was onboarded and performance-optimized within days and is now supported out-of-the-box. This highlights the core strength of AutoDeploy: once optimizations are expressed as reusable compiler passes, new and unconventional architectures can immediately benefit from the full optimization stack, dramatically reducing time-to-deployment while maintaining high inference performance.
The team used TensorRT LLM AutoDeploy to benchmark Nemotron Flash 3B Instruct against Qwen2.5 3B Instruct, a widely adopted, heavily hand-tuned model in a similar size range. For the benchmarking scenario in Figure 5 (ISL/OSL=8k/16k), Nemotron-Flash outperforms Qwen2.5, highlighting how novel model architectures can be quickly onboarded to achieve production-ready performance.

Data was collected for ISL/OSL 8k/16k, TP=1, on an NVIDIA DGX H100 using TensorRT LLM v1.3.0rc1, trtllm-serve, and the AIPerf benchmarking tool.
Get started with TensorRT LLM AutoDeploy
TensorRT LLM AutoDeploy marks a shift toward approaching inference optimization as a compiler and runtime responsibility rather than a burden on the model author. This approach enables faster experimentation, broader model coverage, and a cleaner separation between model design and deployment.
Instead of hand-tuning each model, you can describe the architecture once and let the system apply graph transformations and optimized kernels. Early successes such as Nemotron 3 Nano and Nemotron-Flash demonstrate that deployment at model launch with peak performance is achievable across diverse model architectures.
TensorRT LLM AutoDeploy is rapidly evolving. If you’re interested in experimenting with this feature or contributing to its development, check out the AutoDeploy documentation and example scripts.
Acknowledgments
We’d like to thank those who have contributed to AutoDeploy, including Ajinkya Rasane, Bala Marimuthu, Chenghao Zhang, Chenjie Luo, Eran Geva, Frida Hou, Gal Hubara Agam, Govind Ramnarayan, Grzegorz Kwasniewski, Hao Guo, Jingyu Xin, Joyjit Daw, Karthik Vetrivel, Lucas Liebenwein, Neta Zmora, Suguna Varshini Velury, Suyog Gupta, Tal Cherckez, Taylor Lee, Wanli Jiang, Wei-Ming Chen, William Zhang, and Yoco Xiao.