NVIDIA TensorRT

NVIDIA® TensorRT™ is an ecosystem of tools for developers to achieve high-performance deep learning inference. TensorRT includes inference compilers, runtimes, and model optimizations that deliver low latency and high throughput for production applications. The TensorRT ecosystem includes the TensorRT compiler, TensorRT-LLM, TensorRT Model Optimizer, TensorRT for RTX, and TensorRT Cloud.


How TensorRT Works

Speed up inference by 36X compared to CPU-only platforms.

Built on the NVIDIA® CUDA® parallel programming model, TensorRT includes libraries that optimize neural network models trained on all major frameworks, calibrate them for lower precision with high accuracy, and deploy them to hyperscale data centers, workstations, laptops, and edge devices. TensorRT optimizes inference using quantization, layer and tensor fusion, and kernel tuning techniques.
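As an illustration of what layer fusion means, the sketch below folds a batch-normalization step into the linear layer that precedes it, so that two operations become one at inference time. It is a minimal pure-Python example written for this page, not TensorRT code; the function names (`linear`, `batch_norm`, `fuse_linear_bn`) and the sample values are hypothetical.

```python
import math

def linear(x, weights, bias):
    """Compute y[j] = sum_i weights[j][i] * x[i] + bias[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Apply inference-time batch normalization per output channel."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]

def fuse_linear_bn(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm statistics into the linear layer's weights and bias."""
    fused_w, fused_b = [], []
    for row, b, g, bt, m, v in zip(weights, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # per-channel multiplier
        fused_w.append([scale * w for w in row])
        fused_b.append(scale * (b - m) + bt)
    return fused_w, fused_b

# Hypothetical sample layer: 2 inputs, 2 outputs.
weights = [[1.0, 2.0], [-0.5, 0.25]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -0.1], [4.0, 0.25]
x = [3.0, -1.0]

unfused = batch_norm(linear(x, weights, bias), gamma, beta, mean, var)
fused_w, fused_b = fuse_linear_bn(weights, bias, gamma, beta, mean, var)
fused = linear(x, fused_w, fused_b)  # one layer instead of two
```

The fused layer produces the same outputs as the original pair (up to floating-point rounding) while removing one pass over the data, which is the general idea behind fusing adjacent layers.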

NVIDIA TensorRT Model Optimizer provides easy-to-use quantization techniques, including post-training quantization and quantization-aware training, to compress your models. It supports FP8, FP4, INT8, and INT4, as well as advanced techniques such as AWQ. Quantized inference significantly reduces latency and memory-bandwidth requirements, which is essential for many real-time services and for autonomous and embedded applications.
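To make post-training quantization concrete, here is a minimal sketch of symmetric INT8 quantization with max calibration, written in plain Python. It is an illustration of the general technique only and does not use or represent the TensorRT Model Optimizer API; the function names and sample values are hypothetical.

```python
def calibrate_scale(samples, num_bits=8):
    """Pick a scale so the largest observed magnitude maps to the largest integer."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for INT8
    amax = max(abs(v) for v in samples)
    return amax / qmax if amax > 0 else 1.0

def quantize(values, scale, num_bits=8):
    """Round each value to the nearest integer step and clamp to the signed range."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map integers back to approximate real values."""
    return [q * scale for q in q_values]

# Hypothetical weights used both for calibration and quantization.
weights = [0.5, -1.27, 0.0, 1.0]
scale = calibrate_scale(weights)
q_weights = quantize(weights, scale)
restored = dequantize(q_weights, scale)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is the accuracy cost paid for storing one byte per value instead of four.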

Read the Introductory TensorRT Blog

Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.

Read Blog

Watch On-Demand TensorRT Sessions From GTC

Learn more about TensorRT and its features from a curated list of webinars at GTC.

Watch Sessions

Get the Complete Developer Guide

See how to get started with TensorRT in this step-by-step developer and API reference guide.

Read Guide

Navigate AI Infrastructure and Performance

Learn how to lower your cost per token and get the most out of your AI models with our ebook.

View Ebook

Key Features

Large Language Model Inference

NVIDIA TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of large language models (LLMs) on the NVIDIA AI platform with a simplified Python API.
Developers can accelerate LLM performance on NVIDIA GPUs in the data center or on workstation GPUs.

Compile in the Cloud

NVIDIA TensorRT Cloud is a developer-focused service for generating hyper-optimized engines for given constraints and KPIs. Given an LLM and inference throughput/latency requirements, a developer can invoke TensorRT Cloud service using a command-line interface to hyper-optimize a TensorRT-LLM engine for a target GPU. The cloud service will automatically determine the best engine configuration that meets the requirements. Developers can also use the service to build optimized TensorRT engines from ONNX models on a variety of NVIDIA RTX, GeForce, Quadro®, or Tesla®-class GPUs.

TensorRT Cloud is available with limited access to select partners. Apply for access, subject to approval.

Optimize Neural Networks

TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques, including quantization, pruning, speculation, sparsity, and distillation. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM, TensorRT, vLLM, and SGLang to efficiently optimize inference on NVIDIA GPUs. TensorRT Model Optimizer also supports training for inference techniques such as Speculative Decoding Module Training, Pruning/Distillation, and Quantization Aware Training through NeMo and Hugging Face frameworks.

Major Framework Integrations

TensorRT integrates directly into PyTorch and Hugging Face to achieve 6X faster inference with a single line of code. TensorRT provides an ONNX parser to import ONNX models from popular frameworks into TensorRT. MATLAB is integrated with TensorRT through GPU Coder to automatically generate high-performance inference engines for NVIDIA Jetson™, NVIDIA DRIVE®, and data center platforms.

Deploy, Run, and Scale With Dynamo-Triton

TensorRT-optimized models are deployed, run, and scaled with NVIDIA Dynamo Triton inference-serving software that includes TensorRT as a backend. The advantages of using Triton include high throughput with dynamic batching, concurrent model execution, model ensembling, and streaming audio and video inputs.
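The idea behind dynamic batching can be sketched in a few lines: requests that arrive close together are grouped into one batch, bounded by a maximum batch size and a maximum waiting time. The example below is a simplified simulation written for illustration, not Triton's scheduler; `form_batches` and its parameters are hypothetical names.

```python
def form_batches(arrivals, max_batch_size, max_delay):
    """Group requests into batches.

    arrivals: list of (request_id, arrival_time), sorted by arrival_time.
    A batch is closed when it is full or when the next request arrives more
    than max_delay after the first request in the batch.
    """
    batches = []
    current = []
    window_start = None
    for req_id, t in arrivals:
        if current and (len(current) == max_batch_size
                        or t - window_start > max_delay):
            batches.append(current)
            current = []
        if not current:
            window_start = t                    # first request opens a new window
        current.append(req_id)
    if current:
        batches.append(current)                 # flush the final partial batch
    return batches

# Hypothetical request trace: (id, arrival time in milliseconds).
arrivals = [("a", 0.0), ("b", 1.0), ("c", 2.0), ("d", 10.0), ("e", 10.5)]
batches = form_batches(arrivals, max_batch_size=2, max_delay=5.0)
```

Larger batches generally improve GPU utilization and throughput, while the delay bound caps how long any single request waits, which is the trade-off a batching scheduler has to balance.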

Simplify AI Deployment on RTX

TensorRT for RTX offers an optimized inference deployment solution for NVIDIA RTX GPUs. Engines build in 15 to 30 seconds, so apps can build inference engines directly on target RTX PCs during installation or on first run. The total library footprint is under 200 MB. Engines built with TensorRT for RTX are portable across operating systems and GPUs, enabling a build-once, deploy-anywhere workflow.

Accelerate Every Inference Platform

TensorRT can optimize models for applications across the edge, laptops, desktops, and data centers. It powers key NVIDIA solutions—such as NVIDIA TAO, NVIDIA DRIVE, NVIDIA Clara™, and NVIDIA JetPack™—and is integrated with application-specific SDKs, such as NVIDIA NIM™, NVIDIA DeepStream, NVIDIA Riva, NVIDIA Merlin™, NVIDIA Maxine™, NVIDIA Morpheus, and NVIDIA Broadcast Engine.

TensorRT provides developers a unified path to deploy intelligent video analytics, speech AI, recommender systems, video conferencing, AI-based cybersecurity, and streaming apps in production.

Get Started With TensorRT

TensorRT is an ecosystem of APIs for building and deploying high-performance deep learning inference. It offers a variety of inference solutions for different developer requirements.


Use case	Deployment platform	Solution
Inference for LLMs	Data center GPUs such as GB100, H100, and A100	TensorRT-LLM: available for free on GitHub. Download (GitHub); Documentation
Inference for non-LLMs such as CNNs, diffusion models, and transformers; safety-compliant, high-performance inference for automotive embedded; inference for non-LLMs in robotics and edge applications	Data center GPUs, embedded, and edge platforms; automotive platform: NVIDIA DRIVE AGX; edge platforms: Jetson, NVIDIA IGX, etc.	TensorRT: the TensorRT inference library provides a general-purpose AI compiler and an inference runtime that delivers low latency and high throughput for production applications. Download SDK; Download Container
AI model inference on RTX PCs	NVIDIA GeForce RTX and RTX Pro GPUs in laptops and desktops	TensorRT for RTX: a dedicated inference deployment solution for RTX GPUs. Download SDK; Documentation
Model optimizations such as quantization, distillation, and sparsity	Data center GPUs such as GB100 and H100	TensorRT Model Optimizer: free on NVIDIA PyPI, with examples and recipes on GitHub. Download (GitHub); Documentation

Get Started With TensorRT Frameworks

TensorRT Frameworks add TensorRT compiler functionality to frameworks like PyTorch.

Download ONNX and Torch-TensorRT

The TensorRT inference library provides a general-purpose AI compiler and an inference runtime that delivers low latency and high throughput for production applications.

ONNX:

Documentation

Torch-TensorRT:

Download Container

Documentation

Experience Tripy: Pythonic Inference With TensorRT

Experience high-performance inference and excellent usability with Tripy. Expect intuitive APIs, easy debugging with eager mode, clear error messages, and top-notch documentation to streamline your deep learning deployment.

Documentation

Examples

Contribute

Deploy

Get a free license to try NVIDIA AI Enterprise in production for 90 days using your existing infrastructure.

Request a 90-Day License

World-Leading Inference Performance

TensorRT was behind NVIDIA’s wins across all inference performance tests in the industry-standard MLPerf Inference benchmark. TensorRT-LLM accelerates the latest large language models for generative AI, delivering up to 8X more performance, 5.3X better TCO, and nearly 6X lower energy consumption.

See All Benchmarks

8X Increase in GPT-J 6B Inference Performance

4X Higher Llama2 Inference Performance

Total Cost of Ownership (lower is better)

Energy Use (lower is better)

Starter Kits

Beginner Guide to TensorRT

View Quick-Start Guide
View Quick-Start Notebooks
Read Blog: Speeding Up Deep Learning Inference Using NVIDIA TensorRT
Read Blog: Optimizing and Serving Models With TensorRT and Triton
Watch Video: Getting Started With NVIDIA TensorRT

Beginner Guide to TensorRT-LLM


Beginner Guide to TensorRT Model Optimizer

Beginner Guide to Torch-TensorRT

Watch Video: Getting Started With NVIDIA Torch-TensorRT
Read Blog: Accelerate Inference up to 6X in PyTorch
Download Notebook: Object Detection With SSD (Jupyter Notebook)

Beginner Guide to TensorRT Pythonic Frontend: Tripy

Beginner Guide to TensorRT for RTX

TensorRT Learning Library

OSS (GitHub)

Quantization Quickstart

NVIDIA TensorRT-LLM

The PyTorch backend supports FP8 and NVFP4 quantization. See GitHub for how to pass in quantized models from the Hugging Face model hub that were generated by TensorRT Model Optimizer.

Link to GitHub
Link to PyTorch Documentation

OSS (GitHub)

Adding a New Model in PyTorch Backend

This guide provides a step-by-step process for adding a new model in PyTorch Backend.

Link to GitHub

OSS (GitHub)

Using TensorRT Model Optimizer for Speculative Decoding

ModelOpt’s Speculative Decoding module enables your model to generate multiple tokens in each generation step. This can be useful for reducing the latency of your model and speeding up inference.
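One common way to produce several tokens per step is draft-and-verify speculative decoding, sketched below with greedy decoding and toy stand-in models. This is a generic illustration, not ModelOpt's implementation or API; `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and in a real system the verification loop would be a single batched forward pass of the target model.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """Return the tokens accepted in one speculative decoding step (greedy)."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed token in order.
    accepted = []
    ctx = list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)           # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All proposals matched: the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted

# Toy stand-in models over integer tokens (hypothetical).
def target_next(ctx):
    return (ctx[-1] + 1) % 10

def draft_next(ctx):
    # Agrees with the target except right after token 3.
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

partial_accept = speculative_step([1], draft_next, target_next, k=4)
full_accept = speculative_step([5], draft_next, target_next, k=2)
```

In both cases the accepted tokens are exactly what the target model would have generated on its own, so output quality is unchanged; the speedup comes from emitting up to `k + 1` tokens per target-model step when the draft guesses well.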

Link to GitHub

TensorRT Ecosystem

Widely Adopted Across Industries

More Resources

Explore the Community

Get Training and Certification

Read Top Stories and Blogs

Ethical AI

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Get started with TensorRT today, and use the right inference tools to develop AI for any application on any platform.

Download Now
