NVIDIA TensorRT
NVIDIA TensorRT™ is an SDK for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.
TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy to hyperscale data centers, embedded, or automotive product platforms.
TensorRT is built on CUDA, NVIDIA’s parallel programming model, and enables you to optimize inference for all deep learning frameworks by leveraging libraries, development tools, and technologies in CUDA-X for artificial intelligence, autonomous machines, high-performance computing, and graphics.
TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommendation, fraud detection, and natural language processing. Reduced-precision inference significantly reduces application latency, which is a requirement for many real-time services and for automotive and embedded applications.
You can import trained models from every deep learning framework into TensorRT. After applying optimizations, TensorRT selects platform-specific kernels to maximize performance on Tesla GPUs in the data center, Jetson embedded platforms, and NVIDIA DRIVE autonomous driving platforms.
With TensorRT, developers can focus on creating novel AI-powered applications rather than on performance tuning for inference deployment.
TensorRT Optimizations and Performance

Weight & Activation Precision Calibration
Maximizes throughput by quantizing models to INT8 while preserving accuracy (a minimal calibration sketch follows this list)

Layer & Tensor Fusion
Optimizes use of GPU memory and bandwidth by fusing nodes in a kernel

Kernel Auto-Tuning
Selects the best data layouts and algorithms for the target GPU platform

Dynamic Tensor Memory
Minimizes memory footprint and reuses memory for tensors efficiently

Multi-Stream Execution
Scalable design to process multiple input streams in parallel
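
To make the precision-calibration item above concrete, here is a minimal, hypothetical sketch of how INT8 calibration can be wired up with the TensorRT Python API (assuming TensorRT 7.x). The class name, batch source, and device buffer are illustrative placeholders, and the host-to-device copy is left as a comment.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Hypothetical calibrator: feeds representative input batches to TensorRT so it
# can choose INT8 scale factors that preserve accuracy.
class ExampleEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, device_input, batch_size=8):
        super().__init__()
        self.batches = iter(batches)      # iterable of host-side NumPy batches
        self.device_input = device_input  # pre-allocated GPU buffer (e.g. via PyCUDA)
        self.batch_size = batch_size

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = next(self.batches)
            # Copy `batch` into self.device_input here (e.g. cuda.memcpy_htod).
            return [int(self.device_input)]
        except StopIteration:
            return None                   # no more data: calibration is finished

    def read_calibration_cache(self):
        return None                       # no cached scales on the first run

    def write_calibration_cache(self, cache):
        pass                              # optionally persist `cache` to disk

# Enable INT8 mode and attach the calibrator to the builder configuration.
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = ExampleEntropyCalibrator(calib_batches, device_buffer)
```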
TensorRT dramatically accelerates deep learning inference performance on NVIDIA GPUs. See how it can power your inference needs across multiple networks with high throughput and ultra-low latency.
Widely Adopted

Integrated with All Major Frameworks
NVIDIA works closely with deep learning framework developers to achieve optimized performance for inference on AI platforms using TensorRT. If your trained models are in the ONNX format or come from popular frameworks such as TensorFlow and MATLAB, there are easy ways to import them into TensorRT for inference. Below are a few integrations with information on how to get started.

TensorRT and TensorFlow are tightly integrated so you get the flexibility of TensorFlow with the powerful optimizations of TensorRT. Learn more in the TensorRT integrated with TensorFlow blog post.
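
As a rough illustration of that workflow, here is a minimal sketch assuming TensorFlow 2.x with TF-TRT available; the SavedModel paths are placeholders, and precision and other settings can be tuned through the converter's conversion parameters.

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a TensorFlow SavedModel so that TensorRT-compatible subgraphs are
# replaced with optimized TensorRT ops (paths below are placeholders).
converter = trt.TrtGraphConverterV2(input_saved_model_dir="resnet50_saved_model")
converter.convert()
converter.save("resnet50_saved_model_trt")
```

The converted SavedModel is loaded and served like any other TensorFlow model, with TensorRT handling the optimized subgraphs at run time.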

MATLAB is integrated with TensorRT through GPU Coder so that engineers and scientists using MATLAB can automatically generate high-performance inference engines for Jetson, DRIVE, and Tesla platforms. Learn more in this webinar.

TensorRT provides an ONNX parser so you can easily import ONNX models from frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, Chainer, and PyTorch into TensorRT. Learn more about ONNX support in TensorRT here.
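
For reference, a minimal sketch of importing an ONNX model and building an engine with the TensorRT Python API (assuming TensorRT 7.x); the model path is a placeholder and the FP16 flag is optional.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
# ONNX models are imported into networks with explicit batch dimensions.
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)

with open("model.onnx", "rb") as f:              # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30              # scratch space for tactic selection
config.set_flag(trt.BuilderFlag.FP16)            # optional reduced-precision mode
engine = builder.build_engine(network, config)   # ready to serialize and reuse
```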
TensorRT is also integrated with ONNX Runtime, providing an easy way to achieve high-performance inference for machine learning models in the ONNX format. Learn more about ONNX Runtime - TensorRT integration here.
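
A minimal sketch of the ONNX Runtime path, assuming an onnxruntime-gpu build with TensorRT support; the model path, input name, and input shape are placeholders.

```python
import numpy as np
import onnxruntime as ort

# Ask ONNX Runtime to execute TensorRT-compatible subgraphs with TensorRT and
# fall back to the CUDA provider for anything it cannot handle.
session = ort.InferenceSession(
    "model.onnx",                                             # placeholder path
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)

dummy_input = np.zeros((1, 3, 224, 224), dtype=np.float32)    # placeholder shape
outputs = session.run(None, {"input": dummy_input})           # placeholder input name
```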
If you are performing deep learning training in a proprietary or custom framework, use the TensorRT C++ API to import and accelerate your models. Read more in the TensorRT documentation.
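
That path uses TensorRT's network-definition API; as a hedged sketch of the same idea (shown here with the Python API, which parallels the C++ interface), you define the graph layer by layer and supply the weights exported from your framework. The layer shapes and weights below are made up for illustration.

```python
import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Define the network layer by layer, supplying weights exported from the
# custom framework (random values stand in for them here).
inp = network.add_input(name="input", dtype=trt.float32, shape=(1, 3, 224, 224))
conv_w = np.random.rand(16, 3, 3, 3).astype(np.float32)
conv_b = np.zeros(16, dtype=np.float32)
conv = network.add_convolution(input=inp, num_output_maps=16,
                               kernel_shape=(3, 3), kernel=conv_w, bias=conv_b)
relu = network.add_activation(input=conv.get_output(0),
                              type=trt.ActivationType.RELU)
network.mark_output(relu.get_output(0))
# From here, create a builder config and build the engine exactly as in the
# ONNX example above.
```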
“In our evaluation of TensorRT running our deep learning-based recommendation application on NVIDIA Tesla V100 GPUs, we experienced a 45x increase in inference speed and throughput compared with a CPU-based platform. We believe TensorRT could dramatically improve productivity for our enterprise customers.”
— Markus Noga, Head of Machine Learning at SAP
“By using tensor cores on the V100, the most recently optimized CUDA libraries and the TF-TRT backend we were able to speed up our already fast DL network by a factor of 4x”
— Kris Bhaskar, KLA Senior Fellow, VP AI initiatives, KLA
“Criteo uses NVIDIA's TensorRT on T4 cards to optimize its deep-learning models for faster inference on GPUs. Now, removing inappropriate images from among billions of them is 4 times faster. It also consumes half as much energy.”
— Suju Rajan, SVP Research, Criteo
Announcing TensorRT 7.2: What's New
TensorRT 7.2 is packed with new optimizations that accelerate video-based workloads such as web conferencing and content streaming. The new optimizations deliver 30x higher performance vs. CPUs, making it possible to run high-quality effects such as super-resolution, noise removal, and virtual backgrounds live. TensorRT 7.2 also contains optimizations for RNNs that can speed up applications such as fraud and anomaly detection by 2x versus earlier versions.
TensorRT 7.2 is available now.
Additional Resources

- TensorRT Container, Models and Scripts in NGC
- “Hello World” For TensorRT (Sample Code)
- “Hello World” For TensorRT From ONNX (Sample Code)
- Performing Inference In INT8 Using Custom Calibration (Sample Code)
- Introduction to TensorRT (Webinar)
- 8-Bit Inference with TensorRT (Webinar)

- Real-Time Natural Language Understanding with BERT Using TensorRT (Blog)
- Automatic Speech Recognition with TensorRT (Notebook)
- Accelerating Real-Time Text-to-Speech with TensorRT (Blog)
- NLU with BERT (Notebook)
- Real Time Text-to-Speech (Sample)
- Neural Machine Translation (NMT) Using A Sequence To Sequence (seq2seq) Model (Sample Code)
- Building An RNN Network Layer By Layer (Sample Code)

- Accelerating Wide and Deep with TensorRT (Blog)
- Movie Recommendation Using Neural Collaborative Filter (NCF) (Sample Code)
- Deep Recommender (Sample Code)
- Intro to Recommenders in TensorRT (Video)

- Real-time object detection on GPUs in 10 mins (Blog)
- How to perform inference for common applications (Webinar)
- Creating object detection pipelines on GPUs (Blog)
- Object detection with SSD network (Python Code Sample)
- Object detection with SSD, Faster R-CNN networks (C++ Code Samples)
You can find additional resources at https://devblogs.nvidia.com/tag/tensorrt/ and interact with the TensorRT developer community on the TensorRT Forum.
Get Started With Hands-On Training
The NVIDIA Deep Learning Institute (DLI) offers hands-on training for developers, data scientists, and researchers in AI and accelerated computing. Get hands-on experience with TensorRT in the self-paced elective Optimization and Deployment of TensorFlow Models with TensorRT today.
Availability
TensorRT is freely available to members of the NVIDIA Developer Program from the TensorRT product page for development and deployment. The latest versions of the plugins, parsers, and samples are also available as open source from the TensorRT GitHub repository.
Developers can also get TensorRT in the TensorRT Container from the NGC container registry.
TensorRT is included in:
- NVIDIA JetPack for the Jetson TX1 and TX2 embedded platforms
- NVIDIA DeepStream SDK for real-time streaming analytics in computer vision and Intelligent Video Analytics (IVA) applications
- NVIDIA Isaac SDK, a robotic AI development platform for simulation, navigation, and manipulation
- NVIDIA DriveInstall for the NVIDIA DRIVE PX 2 autonomous driving platform