NVIDIA TensorRT LLM
NVIDIA TensorRT™ LLM is an open-source library built to deliver high-performance, real-time inference optimization for large language models (LLMs) on NVIDIA GPUs—whether on a desktop or in a data center. It includes a modular Python runtime, PyTorch-native model authoring, and a stable production API. Specifically customized for NVIDIA platforms, TensorRT LLM helps developers maximize inference performance to serve more users in parallel, while minimizing operational costs and delivering blazingly fast experiences.
Download at NVIDIA NGC | Download on GitHub | Read the Quick-Start Guide
How TensorRT LLM Works
The latest TensorRT LLM architecture is purpose-built to streamline the developer experience—enabling faster iteration and smoother deployment without sacrificing its industry-leading inference performance. The architecture provides easy-to-use Python APIs, a simple CLI, PyTorch model authorship, and an extensible Python framework to enable innovation.
Optimized for peak performance on NVIDIA platforms, TensorRT LLM leverages deep hardware-software integration to deliver unmatched efficiency and speed for LLM inference. Kernels designed specifically for NVIDIA hardware achieve peak performance for common LLM inference operations, while runtime optimizations maximize GPU utilization and shorten end-user response times. Key optimizations include FP8 and NVFP4 quantization, disaggregated serving, parallelization techniques such as wide expert parallelism (EP), and advanced speculative decoding methods such as EAGLE-3 and multi-token prediction.
Get Started with TensorRT LLM
Installation Guides
Install TensorRT LLM on Linux using pip, or build it from source.
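As a quick check that a pip-based install succeeded, importing the package and printing its version is usually enough. A minimal sketch, assuming TensorRT LLM was installed with pip on a supported Linux and CUDA setup:

```python
# Post-install sanity check. Assumes TensorRT LLM was installed via pip
# (e.g., `pip install tensorrt_llm`) on a supported Linux + CUDA system.
import tensorrt_llm

# Confirm the package imports cleanly and report the installed version.
print(tensorrt_llm.__version__)
```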
Get TensorRT LLM Containers
Containers freely available on NVIDIA NGC™ make it easy to build with TensorRT LLM in a cloud environment.
API Quick-Start Guide
Quickly get set up and start optimizing inference with the LLM API.
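In practice, a minimal LLM API script follows the pattern below. This is a sketch based on the published quick-start examples; the model ID is illustrative, and any Hugging Face ID or local checkpoint path supported by TensorRT LLM can be substituted.

```python
from tensorrt_llm import LLM, SamplingParams

# Load a model by Hugging Face ID or local path (example model shown).
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate completions; each result carries the generated text.
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```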
Key Features
Modular Runtime Built in Python
TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also predefined and can be customized using native PyTorch code, making it easy to adapt the system to specific needs.
PyTorch-Based Model Authoring for Stable LLM API
Architected on PyTorch, TensorRT LLM provides a high-level Python LLM API that supports a wide range of inference setups—from single-GPU to multi-GPU and multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA Dynamo.
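Scaling from one GPU to several is largely a configuration change. A minimal sketch, assuming a single node with two GPUs; tensor_parallel_size is the LLM API's tensor-parallelism knob, and the documentation covers pipeline and expert parallelism as well as multi-node setups:

```python
from tensorrt_llm import LLM

# Shard the model across two GPUs with tensor parallelism.
# Example model shown; substitute your own checkpoint.
llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    tensor_parallel_size=2,
)
```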
State-of-the-Art Optimizations
TensorRT LLM provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged key-value (KV) caching, quantization (FP8, FP4, INT4 AWQ, INT8 SmoothQuant), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
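As one illustration, quantization can be requested declaratively through the LLM API rather than by hand-tuning kernels. The sketch below assumes the QuantConfig and QuantAlgo names exposed by recent TensorRT LLM releases; verify against the documentation for your installed version.

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo  # names per recent releases

# Request FP8 quantization when the engine is built (assumed API surface).
quant_config = QuantConfig(quant_algo=QuantAlgo.FP8)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # example model
    quant_config=quant_config,
)
```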
Starter Kits
Accelerated Computing Hub
Optimizing LLM Serving With the New NVIDIA TensorRT LLM Container on Google Vertex AI (Google Cloud Article)
Benchmark and Performance Tune LLMs
Performance Tuning Guide (GitHub)
trtllm-bench documentation on the TensorRT LLM Benchmarking page (GitHub)
Performance Analysis (GitHub): use NVIDIA Nsight™ Systems to profile model execution
How to Run TensorRT LLM Tests (GitHub)
Optimize and Deploy Custom LLMs
Building From Source Code on Linux (GitHub)
Ecosystem
TensorRT LLM is widely adopted across industries.
Ethical AI
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.
Get started with TensorRT LLM today.