NVIDIA TensorRT

NVIDIA® TensorRT, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.

Get Started
Inference Pipeline with NVIDIA TensorRT

NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling developers to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms.

TensorRT, built on the NVIDIA CUDA® parallel programming model, enables developers to optimize inference by leveraging libraries, development tools, and technologies in CUDA-X for AI, autonomous machines, high performance computing, and graphics. With new NVIDIA Ampere Architecture GPUs, TensorRT also uses sparse tensor cores for an additional performance boost.

TensorRT provides INT8 optimizations, using quantization-aware training and post-training quantization, as well as FP16 optimizations for production deployments of deep learning inference applications, such as video streaming, speech recognition, recommendation, fraud detection, text generation, and natural language processing. Reduced-precision inference significantly reduces application latency, which is a requirement for many real-time services, as well as autonomous and embedded applications.
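As a rough illustration, here is a minimal sketch of how FP16 and INT8 can be requested through TensorRT's Python builder API; the commented-out calibrator is a hypothetical IInt8EntropyCalibrator2 subclass, not something defined on this page.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

# FP16: let the builder pick FP16 kernels wherever they are faster and accurate enough.
config.set_flag(trt.BuilderFlag.FP16)

# INT8 (post-training quantization): also set the INT8 flag and attach a calibrator ...
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator  # hypothetical IInt8EntropyCalibrator2 subclass
# ... or skip the calibrator for quantization-aware-trained models whose graphs
# already carry Q/DQ nodes.
```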

With TensorRT, developers can focus on creating novel AI-powered applications rather than on inference optimization. TensorRT-optimized models can then be deployed with NVIDIA Triton™, an open-source inference serving software that includes TensorRT as one of its backends.
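For context, a minimal sketch of querying a TensorRT model already served by Triton from Python; the model name "resnet50_trt", the tensor names "input" and "output", and the input shape are all hypothetical placeholders:

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server running locally on its default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request; names, shape, and dtype must match the deployed model's config.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", list(batch.shape), "FP32")]
inputs[0].set_data_from_numpy(batch)

# Run inference and read back the (hypothetical) output tensor.
result = client.infer(model_name="resnet50_trt", inputs=inputs)
print(result.as_numpy("output").shape)
```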

TensorRT Features

1. Reduced Precision

Maximizes throughput with FP16 or INT8 by quantizing models while preserving accuracy

2. Layer and Tensor Fusion

Optimizes use of GPU memory and bandwidth by fusing multiple nodes into a single kernel

3. Kernel Auto-Tuning

Selects the best data layouts and algorithms for the target GPU platform

4. Dynamic Tensor Memory

Minimizes memory footprint and reuses memory for tensors efficiently

5. Multi-Stream Execution

Uses a scalable design to process multiple input streams in parallel (see the sketch after this list)

6. Time Fusion

Optimizes recurrent neural networks over time steps with dynamically generated kernels
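To illustrate feature 5 above, here is a minimal sketch of the multi-stream pattern, assuming the TensorRT 8.x-style bindings API, PyCUDA, a pre-built engine file named model.plan, and a model with one input binding (index 0) and one output binding (index 1); all of these names and shapes are hypothetical:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("model.plan", "rb") as f:  # hypothetical serialized engine
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

# One execution context and one CUDA stream per concurrent request.
num_streams = 2
contexts = [engine.create_execution_context() for _ in range(num_streams)]
streams = [cuda.Stream() for _ in range(num_streams)]

def enqueue(context, stream, host_input):
    """Enqueue H2D copy, inference, and D2H copy on one stream without blocking."""
    d_input = cuda.mem_alloc(host_input.nbytes)
    host_output = cuda.pagelocked_empty(
        trt.volume(context.get_binding_shape(1)),
        dtype=trt.nptype(engine.get_binding_dtype(1)))
    d_output = cuda.mem_alloc(host_output.nbytes)
    cuda.memcpy_htod_async(d_input, host_input, stream)
    context.execute_async_v2(bindings=[int(d_input), int(d_output)],
                             stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(host_output, d_output, stream)
    return host_output

# Two independent requests run concurrently, each on its own stream.
batch = np.ascontiguousarray(
    np.random.rand(1, 3, 224, 224).astype(np.float32))  # hypothetical input shape
outputs = [enqueue(ctx, s, batch) for ctx, s in zip(contexts, streams)]
for s in streams:
    s.synchronize()
```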


Key Products

TensorRT

Torch-TensorRT

TensorFlow-TensorRT


World-Leading Inference Performance

TensorRT powered NVIDIA’s wins across all performance tests in the industry-standard MLPerf Inference benchmark. It also accelerates every model across the data center and edge in computer vision, speech-to-text, natural language understanding (BERT), and recommender systems.

Conversational AI

Computer Vision

Recommender Systems


Accelerates Every Inference Platform

TensorRT can optimize and deploy applications to the data center, as well as embedded and automotive environments. It powers key NVIDIA solutions such as NVIDIA TAO, NVIDIA DRIVE™, NVIDIA Clara™, and NVIDIA JetPack™.

TensorRT is also integrated with application-specific SDKs, such as NVIDIA DeepStream, NVIDIA Riva, NVIDIA Merlin™, NVIDIA Maxine™, NVIDIA Modulus, NVIDIA Morpheus, and Broadcast Engine, to provide developers with a unified path to deploy intelligent video analytics, speech AI, recommender systems, video conferencing, AI-based cybersecurity, and streaming apps in production.


Accelerate your deep learning inference today with NVIDIA TensorRT.

Get started

Supports All Major Frameworks

TensorRT is integrated with PyTorch and TensorFlow, so you can achieve 6X faster inference with one line of code. If you are performing deep learning training in a proprietary or custom framework, use the TensorRT C++ API to import and accelerate your models. Read more in the TensorRT documentation.

Below are a few integrations with information on how to get started.

Torch-TensorRT

Accelerate PyTorch models using the new Torch-TensorRT integration with just one line of code. Get 6X faster inference using the TensorRT optimizations in a familiar PyTorch environment.
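For illustration, a minimal sketch of that one-line compile step, assuming a standard torchvision ResNet-50 and a fixed 1x3x224x224 input; any traceable PyTorch module would be handled the same way:

```python
import torch
import torch_tensorrt
import torchvision.models as models

# Any eval-mode CUDA model works; ResNet-50 is just a stand-in here.
model = models.resnet50().eval().cuda()

# The single compile call: TensorRT optimizations behind a regular PyTorch module.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},  # allow FP16 kernels inside the engine
)

x = torch.randn(1, 3, 224, 224, device="cuda")  # FP32 input, as declared above
print(trt_model(x).shape)
```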

LEARN MORE
TensorFlow-TensorRT

TensorRT and TensorFlow are tightly integrated, so you get the flexibility of TensorFlow with the powerful optimizations of TensorRT, such as 6X the performance with one line of code.
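As a sketch, this is roughly what the TensorFlow-TensorRT (TF-TRT) conversion of a SavedModel looks like in TensorFlow 2.x; the saved_model and saved_model_trt directories are hypothetical paths:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Point the converter at an existing SavedModel and ask for FP16 engines.
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="saved_model",
    precision_mode=trt.TrtPrecisionMode.FP16,
)
converter.convert()                 # replaces supported subgraphs with TensorRT engine ops
converter.save("saved_model_trt")   # reload later with tf.saved_model.load() for inference
```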

LEARN MORE
ONNX

TensorRT provides an ONNX parser so you can easily import ONNX models from popular frameworks into TensorRT. It’s also integrated with ONNX Runtime, providing an easy way to achieve high-performance inference in the ONNX format.
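A minimal sketch of the ONNX import path, assuming the TensorRT 8.x Python API and a hypothetical model.onnx file on disk:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Import the ONNX graph into the TensorRT network definition.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

# Build a serialized engine; fusion and kernel auto-tuning happen during this step.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # optional reduced precision
engine_bytes = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(engine_bytes)
```

With ONNX Runtime, the equivalent path is to create an InferenceSession with the TensorrtExecutionProvider listed first in its providers argument.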

LEARN MORE
MATLAB

MATLAB is integrated with TensorRT through GPU Coder, so you can automatically generate high-performance inference engines for NVIDIA Jetson™, NVIDIA DRIVE®, and data center platforms.

LEARN MORE

Success Stories

Amazon

Discover how Amazon improved customer satisfaction by making its inference 5X faster.

LEARN MORE
American Express

American Express improves fraud prevention by analyzing tens of millions of daily transactions 50X faster. Find out how.

LEARN MORE
Zoox

Explore how Zoox, a robotaxi startup, accelerated its perception stack by 19X using TensorRT for real-time inference on autonomous vehicles.

LEARN MORE

Widely Adopted Across Industries

NVIDIA TensorRT is widely adopted by top companies across industries.

Introductory Resources

Introductory Blog

Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.

Read blog

Introductory Webinar

Learn more about TensorRT 8.4 features and tools that simplify the inference workflow.

Watch webinar

Introductory Developer Guide

See how to get started with TensorRT in this step-by-step developer guide and API reference.

Read guide

NVIDIA TensorRT is a free download from NGC for NVIDIA Developer Program members. Open-source samples and parsers are available from GitHub.

Get Started