
NVIDIA TensorRT

NVIDIA® TensorRT, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.

Inference pipeline with NVIDIA TensorRT

What is NVIDIA TensorRT?


Speed up inference by 36X.

NVIDIA TensorRT-based applications perform up to 36X faster than CPU-only platforms during inference, enabling you to optimize neural network models trained on all major frameworks, calibrate for lower precision with high accuracy, and deploy to hyperscale data centers, embedded platforms, or automotive product platforms.


Optimize inference performance.

TensorRT, built on the NVIDIA CUDA® parallel programming model, enables you to optimize inference by leveraging libraries, development tools, and technologies in NVIDIA AI, autonomous machines, high-performance computing, and graphics. With NVIDIA Hopper™ and NVIDIA Ampere Architecture GPUs, TensorRT also uses sparse Tensor Cores for an additional performance boost.


Accelerate every workload.

TensorRT provides INT8 optimizations, using quantization-aware training and post-training quantization, as well as FP16 optimizations for deploying deep learning inference applications such as video streaming, recommendations, fraud detection, and natural language processing. Reduced-precision inference significantly reduces latency, which is required for many real-time services and for autonomous and embedded applications.
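As an illustration, here is a minimal sketch of enabling reduced precision when building an engine with the TensorRT Python API. It assumes TensorRT 8.x; the "model.onnx" and "model.plan" paths are placeholders, not files referenced on this page.

```python
import tensorrt as trt

# Minimal sketch (TensorRT 8.x Python API): parse an ONNX model and build an
# engine with FP16 enabled. "model.onnx" / "model.plan" are placeholder paths.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    parser.parse(f.read())  # in real code, check the return value and parser.errors

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels
# For INT8, also set trt.BuilderFlag.INT8 and attach a calibrator,
# or import a network already quantized with quantization-aware training.

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)
```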


Deploy, run, and scale with Triton.

TensorRT-optimized models can be deployed, run, and scaled with NVIDIA Triton™, an open-source inference serving software that includes TensorRT as one of its backends. The advantage of using Triton is high throughput with dynamic batching and concurrent model execution and use of features like model ensembles, streaming audio/video inputs, and more.
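As an illustration, here is a minimal sketch of querying a Triton-served model with the tritonclient Python package. The model name "my_model" and the tensor names "input__0"/"output__0" are placeholders and must match the model's configuration in your Triton model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Minimal sketch: send one inference request to a Triton server on localhost.
# "my_model", "input__0", and "output__0" are hypothetical names.
client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

response = client.infer(model_name="my_model", inputs=[infer_input])
output = response.as_numpy("output__0")
print(output.shape)
```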

World-Leading Inference Performance.

TensorRT was behind NVIDIA’s wins across all performance tests in the industry-standard MLPerf Inference benchmark. It also accelerates every workload across the data center and edge, including computer vision, automatic speech recognition, natural language understanding (BERT), text-to-speech, and recommender systems.

Performance benchmark categories: Conversational AI, Computer Vision, Recommender Systems


Supports All Major Frameworks.

TensorRT is integrated with PyTorch and TensorFlow so you can achieve 6X faster inference with a single line of code. If you’re performing deep learning training in a proprietary or custom framework, use the TensorRT C++ API to import and accelerate your models. Read more in the TensorRT documentation.

Below are a few integrations with information on how to get started.

PyTorch

Accelerate PyTorch models using the Torch-TensorRT integration with just one line of code. Get 6X faster inference using the TensorRT optimizations in a familiar PyTorch environment.
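A minimal sketch of that workflow, assuming Torch-TensorRT is installed and using torchvision's ResNet-50 purely as a stand-in model:

```python
import torch
import torch_tensorrt
import torchvision.models as models

# Minimal sketch: compile a ResNet-50 (placeholder model) with Torch-TensorRT.
model = models.resnet50(weights=None).eval().cuda()

trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.float32)],
    enabled_precisions={torch.half},  # let TensorRT use FP16 kernels
)

x = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    y = trt_model(x)
```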

Learn More

TensorFlow

TensorRT and TensorFlow are tightly integrated, so you get the flexibility of TensorFlow with the powerful optimizations of TensorRT, such as 6X faster performance with one line of code.
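For example, here is a minimal sketch using the TensorFlow-TensorRT (TF-TRT) converter in TensorFlow 2; the SavedModel paths are placeholders:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Minimal sketch (TensorFlow 2 TF-TRT API): convert a SavedModel so supported
# subgraphs run through TensorRT. Both paths below are hypothetical.
params = trt.TrtConversionParams(precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="resnet50_saved_model",
    conversion_params=params,
)
converter.convert()
converter.save("resnet50_trt_saved_model")
```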

Learn More

ONNX

TensorRT provides an ONNX parser so you can easily import ONNX models from popular frameworks into TensorRT. It’s also integrated with ONNX Runtime, providing an easy way to achieve high-performance inference in the ONNX format.
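A minimal sketch of the ONNX Runtime route, assuming an onnxruntime-gpu build with TensorRT support and a placeholder model.onnx:

```python
import numpy as np
import onnxruntime as ort

# Minimal sketch: run an ONNX model through ONNX Runtime, preferring the TensorRT
# execution provider and falling back to CUDA. "model.onnx" and the input shape
# are placeholders.
session = ort.InferenceSession(
    "model.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```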

Learn More

MATLAB

MATLAB is integrated with TensorRT through GPU Coder so you can automatically generate high-performance inference engines for NVIDIA Jetson™, NVIDIA DRIVE®, and data center platforms.

Learn More

Accelerates Every Inference Platform.

TensorRT can optimize and deploy applications to the data center, as well as embedded and automotive environments. It powers key NVIDIA solutions such as NVIDIA TAO, NVIDIA DRIVE™, NVIDIA Clara™, and NVIDIA JetPack™.

TensorRT is also integrated with application-specific SDKs, such as NVIDIA DeepStream, NVIDIA Riva, NVIDIA Merlin™, NVIDIA Maxine™, NVIDIA Modulus, NVIDIA Morpheus, and Broadcast Engine, to provide developers with a unified path to deploy intelligent video analytics, speech AI, recommender systems, video conferencing, AI-based cybersecurity, and streaming apps in production.



Read Success Stories.

Amazon

Discover how Amazon improved customer satisfaction by making its inference 5X faster.

Learn More
American Express

American Express improves fraud detection by analyzing tens of millions of daily transactions 50X faster. Find out how.

Learn More
Zoox

Explore how Zoox, a robotaxi startup, accelerated its perception stack 19X using TensorRT for real-time inference on autonomous vehicles.

Learn More

Widely Adopted Across Industries.

NVIDIA TensorRT is widely adopted by top companies across industries

Explore Introductory Resources.

Read the introductory TensorRT blog

Learn how to apply TensorRT optimizations and deploy a PyTorch model to GPUs.

Read blog

Watch on-demand TensorRT sessions from GTC

Learn more about TensorRT and its new features with a curated list of webinars from GTC 2022.

Watch Sessions

Get the introductory developer guide

See how to get started with NVIDIA TensorRT in this step-by-step developer and API reference guide.

Read Guide

Need enterprise support? NVIDIA global support is available for TensorRT with the NVIDIA AI software suite.