NVIDIA TensorRT LLM

NVIDIA TensorRT™ LLM is an open-source library built to deliver high-performance, real-time inference optimization for large language models (LLMs) on NVIDIA GPUs—whether on a desktop or in a data center. It includes a modular Python runtime, PyTorch-native model authoring, and a stable production API. Specifically customized for NVIDIA platforms, TensorRT LLM helps developers maximize inference performance to serve more users in parallel, while minimizing operational costs and delivering blazingly fast experiences.

Download at NVIDIA NGC | Download on GitHub | Read the Quick-Start Guide


How TensorRT LLM Works

The latest TensorRT LLM architecture is purpose-built to streamline the developer experience—enabling faster iteration and smoother deployment without sacrificing its industry-leading inference performance. The architecture provides easy-to-use Python APIs, a simple CLI, PyTorch model authorship, and an extensible Python framework to enable innovation.

Optimized for peak performance on NVIDIA platforms, TensorRT LLM leverages deep hardware-software integration to deliver unmatched efficiency and speed for LLM inference. Kernels specially designed for NVIDIA hardware achieve peak performance for common LLM inference operations, and runtime optimizations drive GPU utilization and end-user response speeds. Key optimizations include FP8 and NVFP4 quantization, disaggregated serving, parallelization techniques such as wide expert parallelism (EP), and advanced speculative decoding methods such as EAGLE-3 and multi-token prediction (MTP).
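
As an illustration of how these optimizations surface to developers, the sketch below enables FP8 quantization for weights and the KV cache through the Python LLM API. It is a minimal, hedged example: the QuantConfig and QuantAlgo helpers follow the LLM API documentation, the checkpoint name is only a placeholder, and the exact interface should be verified against the release you install.

```python
# Hedged sketch: enabling FP8 quantization through the TensorRT LLM Python LLM API.
# QuantConfig/QuantAlgo names follow the LLM API docs; the model name is a placeholder.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Quantize both the weights and the KV cache to FP8 (requires an FP8-capable GPU).
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    quant_config=quant_config,
)

for out in llm.generate(["FP8 inference reduces memory use while"],
                        SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```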


Get Started with TensorRT LLM

Installation Guides

Install TensorRT LLM on Linux with pip or by building from source.

Get TensorRT LLM Containers

Containers freely available on NVIDIA NGC™ make it easy to build with TensorRT LLM in a cloud environment. 

API Quick-Start Guide

Quickly get set up and start optimizing inference with the LLM API. 
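
A minimal, hedged example of that first run with the LLM API is sketched below. It assumes TensorRT LLM is already installed via pip, follows the quick-start pattern from the documentation, and uses a small Hugging Face checkpoint purely as a placeholder.

```python
# Minimal sketch of the TensorRT LLM Python LLM API (interface per the quick-start
# docs; the checkpoint name is only a placeholder).
from tensorrt_llm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Loads the model and prepares it for optimized inference on the local GPU.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```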


Key Features

Modular Runtime Built in Python

TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also predefined and can be customized using native PyTorch code, making it easy to adapt the system to specific needs.

PyTorch-Based Model Authoring for Stable LLM API

Architected on PyTorch, TensorRT LLM provides a high-level Python LLM API that supports a wide range of inference setups—from single-GPU to multi-GPU and multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA Dynamo.
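
For example, scaling the same workflow from one GPU to several is, under the documented interface, largely a matter of parallelism arguments. The sketch below assumes two GPUs and uses tensor parallelism; the tensor_parallel_size parameter name is taken from the LLM API reference and the model is a placeholder.

```python
# Hedged sketch: multi-GPU inference with tensor parallelism via the LLM API.
# tensor_parallel_size follows the documented LLM API; the model is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=2,                     # shard each layer across 2 GPUs
)

for out in llm.generate(["Tensor parallelism splits each layer across GPUs so that"],
                        SamplingParams(max_tokens=48)):
    print(out.outputs[0].text)
```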

State-of-the-Art Optimizations

TensorRT LLM provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged key-value (KV) caching, quantization (FP8, FP4, INT4 AWQ, INT8 SmoothQuant), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
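
As a concrete example, the paged KV cache can be tuned from the same Python API. The sketch below is based on the KvCacheConfig options described in the documentation (block reuse and the free GPU memory fraction); the values and model name are illustrative only.

```python
# Hedged sketch: tuning the paged KV cache through the LLM API.
# KvCacheConfig fields follow the documented options; values are illustrative.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,       # reuse cached blocks across shared prompt prefixes
    free_gpu_memory_fraction=0.9,  # fraction of free GPU memory reserved for the KV cache
)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder checkpoint
    kv_cache_config=kv_cache_config,
)
```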


Starter Kits


Learning Library

Documentation

TensorRT Documentation

Explore the quick-start guide, installation guide, release notes, support matrix, and more for TensorRT.

Tech Blog

NVIDIA TensorRT Developer Guide

TensorRT

See how to get started with TensorRT in this step-by-step developer and API reference guide.

Sample App

TensorRT LLM GitHub Repository

Access TensorRT LLM, an easy-to-use Python API for defining LLMs and building TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

Modal

TensorRT LLM Develop

TensorRT LLM provides users with an easy-to-use Python API to define LLMs and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

Tech Blog

Post-Training Quantization of LLMs With NVIDIA NeMo and NVIDIA TensorRT Model Optimizer

NeMo | TensorRT | TensorRT LLM

As LLMs are becoming even bigger, it's increasingly important to provide easy-to-use and…

Tech Blog

LLM Inference Benchmarking: Performance Tuning With TensorRT LLM

This is the third post in the large language model latency-throughput benchmarking series, which aims to instruct developers on how to benchmark LLM inference.

Sample App

RAG Chatbot on Windows Reference Project

A developer reference project for creating retrieval-augmented generation (RAG) chatbots on Windows using TensorRT LLM.

Modal

Dynamo TensorRT-LLM gpt-oss

This container image delivers a ready-to-deploy runtime for Dynamo’s distributed inference framework, purpose-built for OpenAI-compatible models (gpt-oss).

Modal

Llama 2 7B Chat (TensorRT LLM)

Llama 2 is a collection of large language models capable of generating text and code in response to prompts.

Modal

Gemma 2B Instruct (TensorRT LLM)

Gemma-2B is a 2.5 billion-parameter model from Google’s Gemma family of models. It has been instruction-tuned so it can respond to prompts in a conversational manner.

Modal

Mistral 7B Instruct (TensorRT LLM)

Mistral-7B-Instruct is a language model that can follow instructions, complete requests, and generate creative text formats.

Modal

Phi-2 (TensorRT LLM)

Phi-2 is a 2.7 billion-parameter language model developed by Microsoft Research. It is best suited for prompts in question-answer (QA), chat, and code formats.


Ecosystem

TensorRT LLM is being widely adopted across industries.

Ecosystem partners include AWS, Baseten, Deci, DeepInfra, Grammarly, Google Cloud, Microsoft, OctoML, and Tabnine.

More Resources

NVIDIA Tech Blog

Read Blogs

NVIDIA Training and Certification

Get Training and Certification

Explore Features and Bug Fixes on GitHub

NVIDIA Developer Newsletter

Sign up for CUDA Developer Newsletter

NVIDIA TensorRT LLM FAQ

Read the FAQ

Join the NVIDIA Developer Program


Ethical AI

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using a model in accordance with our terms of service, developers should work with their supporting model team to ensure the model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Get started with TensorRT LLM today.

Download Now