NVIDIA TensorRT LLM

NVIDIA TensorRT™ LLM is an open-source library built to deliver high-performance, real-time inference optimization for large language models (LLMs) on NVIDIA GPUs—whether on a desktop or in a data center. It includes a modular Python runtime, PyTorch-native model authoring, and a stable production API. Specifically customized for NVIDIA platforms, TensorRT LLM helps developers maximize inference performance to serve more users in parallel, while minimizing operational costs and delivering blazingly fast experiences.

Download at NVIDIA NGC | Download on GitHub | Read the Quick-Start Guide


How TensorRT LLM Works

The latest TensorRT LLM architecture is purpose-built to streamline the developer experience—enabling faster iteration and smoother deployment without sacrificing its industry-leading inference performance. The architecture provides easy-to-use Python APIs, a simple CLI, PyTorch model authorship, and an extensible Python framework to enable innovation.

Optimized for peak performance on NVIDIA platforms, TensorRT LLM leverages deep hardware-software integration to deliver unmatched efficiency and speed for LLM inference. Kernels specially designed for NVIDIA hardware achieve peak performance for common LLM inference operations, and runtime optimizations drive GPU utilization and end-user response speeds. Key optimizations include FP8 and NVFP4 quantization, disaggregated serving, parallelization techniques such as wide expert parallelism (EP), and advanced speculative decoding methods such as EAGLE-3 and multi-token prediction (MTP).
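
As an illustration of how these optimizations surface to developers, the sketch below enables FP8 quantization for weights and the KV cache through the Python LLM API. It is a minimal, hedged example: the QuantConfig and QuantAlgo helpers follow the LLM API documentation, the checkpoint name is only a placeholder, and the exact interface should be verified against the release you install.

```python
# Hedged sketch: enabling FP8 quantization through the TensorRT LLM Python LLM API.
# QuantConfig/QuantAlgo names follow the LLM API docs; the model name is a placeholder.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import QuantConfig, QuantAlgo

# Quantize both the weights and the KV cache to FP8 (requires an FP8-capable GPU).
quant_config = QuantConfig(
    quant_algo=QuantAlgo.FP8,
    kv_cache_quant_algo=QuantAlgo.FP8,
)

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder checkpoint
    quant_config=quant_config,
)

for out in llm.generate(["FP8 inference reduces memory use while"],
                        SamplingParams(max_tokens=32)):
    print(out.outputs[0].text)
```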


Get Started with TensorRT LLM

Installation Guides

Install TensorRT LLM on Linux with pip or by building from source.

Get TensorRT LLM Containers

Containers freely available on NVIDIA NGC™ make it easy to build with TensorRT LLM in a cloud environment. 

API Quick-Start Guide

Quickly get set up and start optimizing inference with the LLM API. 
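
A minimal, hedged example of that first run with the LLM API is sketched below. It assumes TensorRT LLM is already installed via pip, follows the quick-start pattern from the documentation, and uses a small Hugging Face checkpoint purely as a placeholder.

```python
# Minimal sketch of the TensorRT LLM Python LLM API (interface per the quick-start
# docs; the checkpoint name is only a placeholder).
from tensorrt_llm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Loads the model and prepares it for optimized inference on the local GPU.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")
```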


Key Features

Modular Runtime Built in Python

TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also predefined and can be customized using native PyTorch code, making it easy to adapt the system to specific needs.

PyTorch-Based Model Authoring for Stable LLM API

Architected on PyTorch, TensorRT LLM provides a high-level Python LLM API that supports a wide range of inference setups—from single-GPU to multi-GPU and multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA Dynamo.
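
For example, scaling the same workflow from one GPU to several is, under the documented interface, largely a matter of parallelism arguments. The sketch below assumes two GPUs and uses tensor parallelism; the tensor_parallel_size parameter name is taken from the LLM API reference and the model is a placeholder.

```python
# Hedged sketch: multi-GPU inference with tensor parallelism via the LLM API.
# tensor_parallel_size follows the documented LLM API; the model is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=2,                     # shard each layer across 2 GPUs
)

for out in llm.generate(["Tensor parallelism splits each layer across GPUs so that"],
                        SamplingParams(max_tokens=48)):
    print(out.outputs[0].text)
```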

State-of-the-Art Optimizations

TensorRT LLM provides state-of-the-art optimizations, including custom attention kernels, in-flight batching, paged key-value (KV) caching, quantization (FP8, FP4, INT4 AWQ, INT8 SmoothQuant), speculative decoding, and much more, to perform inference efficiently on NVIDIA GPUs.
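
As a concrete example, the paged KV cache can be tuned from the same Python API. The sketch below is based on the KvCacheConfig options described in the documentation (block reuse and the free GPU memory fraction); the values and model name are illustrative only.

```python
# Hedged sketch: tuning the paged KV cache through the LLM API.
# KvCacheConfig fields follow the documented options; values are illustrative.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=True,       # reuse cached blocks across shared prompt prefixes
    free_gpu_memory_fraction=0.9,  # fraction of free GPU memory reserved for the KV cache
)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder checkpoint
    kv_cache_config=kv_cache_config,
)
```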


Starter Kits


Learning Library

Documentation

TensorRT Documentation

Explore the quick-start guide, installation guide, release notes, support matrix, and more for TensorRT.

Tech Blog

NVIDIA TensorRT Developer Guide

TensorRT

See how to get started with TensorRT in this step-by-step developer and API reference guide.

Sample App

TensorRT LLM GitHub Repository

Access TensorRT LLM, an easy-to-use Python API for defining LLMs and building TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

Modal

TensorRT LLM Develop

TensorRT LLM provides users with an easy-to-use Python API to define LLMs and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.

Tech Blog

Post-Training Quantization of LLMs With NVIDIA NeMo and NVIDIA TensorRT Model Optimizer

NeMo | TensorRT | TensorRT LLM

As LLMs are becoming even bigger, it's increasingly important to provide easy-to-use and…

Tech Blog

LLM Inference Benchmarking: Performance Tuning With TensorRT LLM

This is the third post in the large language model latency-throughput benchmarking series, which aims to instruct developers on how to benchmark LLM inference.

Sample App

RAG Chatbot on Windows Reference Project

A developer reference project for creating retrieval-augmented generation (RAG) chatbots on Windows using TensorRT LLM.

Modal

Dynamo TensorRT-LLM gpt-oss

This container image delivers a ready-to-deploy runtime for Dynamo’s distributed inference framework, purpose-built for OpenAI-compatible models (gpt-oss).

Modal

Llama 2 7B Chat (TensorRT LLM)

Llama 2 is a collection of large language models capable of generating text and code in response to prompts.

Modal

Gemma 2B Instruct (TensorRT LLM)

Gemma-2B is a 2.5 billion-parameter model from Google’s Gemma family of models. It has been instruction-tuned so it can respond to prompts in a conversational manner.

Modal

Mistral 7B Instruct (TensorRT LLM)

Mistral-7B-Instruct is a language model that can follow instructions, complete requests, and generate creative text formats.

Modal

Phi-2 (TensorRT LLM)

Phi-2 is a 2.7 billion-parameter language model developed by Microsoft Research. It is best suited for prompts in question-answer (QA), chat, and code formats.


Ecosystem

TensorRT LLM is being widely adopted across industries.

Ecosystem partners include AWS, Baseten, Deci, DeepInfra, Grammarly, Google Cloud, Microsoft, OctoML, and Tabnine.

More Resources

NVIDIA Tech Blog

Read Blogs

NVIDIA Training and Certification

Get Training and Certification

Explore Features and Bug Fixes on GitHub

NVIDIA Developer Newsletter

Sign up for CUDA Developer Newsletter

NVIDIA TensorRT LLM FAQ

Read the FAQ

Join the NVIDIA Developer Program


Ethical AI

NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using a model in accordance with our terms of service, developers should work with their supporting model team to ensure the model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Get started with TensorRT LLM today.

Download Now