Programmable Inference Accelerator

NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. With TensorRT, you can optimize neural network models, calibrate for lower precision with high accuracy, and deploy the models to hyperscale data centers, embedded platforms, or automotive product platforms. TensorRT-based applications on GPUs perform up to 100x faster than CPU-only platforms during inference for models trained in all major frameworks.

TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommenders and natural language processing. Reduced-precision inference significantly lowers application latency, which is a requirement for many real-time services as well as automotive and embedded applications.


You can import trained models from all major deep learning frameworks into TensorRT. After applying optimizations, TensorRT selects platform-specific kernels to maximize performance on Tesla GPUs in the data center, Jetson embedded platforms, and NVIDIA DRIVE autonomous driving platforms. NVIDIA recommends Tesla V100, P100, P4, and P40 GPUs for production deployment.

With TensorRT, developers can focus on creating novel AI-powered applications rather than on performance tuning for inference deployment.

TensorRT Optimizations

Weight & Activation Precision Calibration

Maximizes throughput by quantizing models to INT8 while preserving accuracy
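To make the idea concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization. This is an illustration of the general technique only: TensorRT's actual calibrator builds activation histograms and chooses scales by minimizing information loss, which is more sophisticated than the simple max-based scale assumed below.

```python
# Sketch of symmetric INT8 quantization (illustrative, not TensorRT's
# actual calibrator). A calibration pass picks a scale that maps FP32
# values into the signed 8-bit range [-127, 127].

def calibrate_scale(samples):
    """Derive a scale from the largest absolute value seen during calibration."""
    return max(abs(v) for v in samples) / 127.0

def quantize(x, scale):
    """Round to the nearest INT8 step and clamp to the representable range."""
    return max(-127, min(127, round(x / scale)))

def dequantize(q, scale):
    return q * scale

activations = [0.5, -1.2, 3.4, -2.7]          # pretend calibration batch
scale = calibrate_scale(activations)
q = [quantize(v, scale) for v in activations]
restored = [dequantize(v, scale) for v in q]   # close to the originals
```

Each restored value differs from the original by at most one quantization step, which is why well-calibrated INT8 inference can preserve accuracy while storing and computing on a quarter of the bits.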

Layer & Tensor Fusion

Optimizes use of GPU memory and bandwidth by fusing nodes into a single kernel
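One common form of vertical fusion is folding a following element-wise scale-and-shift layer (for example, a frozen batch norm) into the preceding layer's weights, so two kernel launches become one. The sketch below illustrates the arithmetic behind that idea in plain Python; it is a stand-in for the concept, not TensorRT's fusion machinery.

```python
# Sketch of vertical layer fusion: fold a scale-and-shift layer into a
# linear layer so one pass computes what previously took two.

def linear(x, w, b):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def fuse_scale_shift(w, b, scale, shift):
    """Return (w', b') so linear(x, w', b') == scale * linear(x, w, b) + shift."""
    fused_w = [[s * wi for wi in row] for row, s in zip(w, scale)]
    fused_b = [s * bi + sh for bi, s, sh in zip(b, scale, shift)]
    return fused_w, fused_b

w = [[1.0, 2.0], [3.0, 4.0]]
b = [0.5, -0.5]
scale, shift = [2.0, 0.5], [1.0, -1.0]
fw, fb = fuse_scale_shift(w, b, scale, shift)

x = [1.0, 1.0]
two_step = [s * y + sh for y, s, sh in zip(linear(x, w, b), scale, shift)]
one_step = linear(x, fw, fb)       # identical result, one layer instead of two
```

Because the fused layer produces bit-identical results here, the transformation is free in accuracy terms while halving the memory traffic between the two original layers.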

Kernel Auto-Tuning

Selects the best data layouts and algorithms based on the target GPU platform
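Auto-tuning boils down to benchmarking candidate implementations of the same operation on the actual target hardware and keeping the fastest. TensorRT does this over real GPU kernels; the sketch below races two pure-Python variants of a dot product as a hypothetical stand-in for that selection loop.

```python
# Sketch of kernel auto-tuning: time each candidate on the target
# machine and select the one with the lowest measured cost.
import timeit

def dot_loop(a, b):
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_sum(a, b):
    return sum(x * y for x, y in zip(a, b))

def autotune(candidates, a, b, repeats=50):
    """Return the name of the fastest candidate plus all timings."""
    timings = {f.__name__: timeit.timeit(lambda: f(a, b), number=repeats)
               for f in candidates}
    return min(timings, key=timings.get), timings

a = list(range(1000))
b = list(range(1000))
best, timings = autotune([dot_loop, dot_sum], a, b)
```

The key property is that the winner is decided empirically per platform, which is why the same network can compile to different kernels on, say, a V100 versus a Jetson module.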

Dynamic Tensor Memory

Minimizes memory footprint and efficiently reuses memory for tensors
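The reuse opportunity comes from tensor lifetimes: if two intermediate tensors are never alive at the same step of execution, they can share one allocation. The greedy planner below is a simplified illustration of that lifetime analysis, assuming equal-sized tensors; TensorRT's actual allocator is internal and more involved.

```python
# Sketch of memory reuse via tensor lifetime analysis (illustrative).

def assign_buffers(lifetimes):
    """Greedily map tensors to shared buffers.

    lifetimes: {tensor_name: (first_use_step, last_use_step)}
    Returns {tensor_name: buffer_id}.
    """
    assignment = {}
    buffer_free_at = []                    # buffer_free_at[i] = step buffer i frees up
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for buf, free_at in enumerate(buffer_free_at):
            if free_at < start:            # previous tenant is dead: reuse this buffer
                buffer_free_at[buf] = end
                assignment[name] = buf
                break
        else:                              # every buffer still live: allocate a new one
            buffer_free_at.append(end)
            assignment[name] = len(buffer_free_at) - 1
    return assignment

# Four intermediate tensors, but at most two are live at any moment,
# so the plan needs only two physical buffers.
plan = assign_buffers({"t0": (0, 1), "t1": (1, 2), "t2": (2, 3), "t3": (3, 4)})
```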

Multi-Stream Execution

Scalable design to process multiple input streams in parallel
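On the GPU this parallelism is expressed with CUDA streams; the sketch below uses a thread pool to show the same shape of the problem, several independent input streams whose frames are processed concurrently rather than serially. The `infer` function is a hypothetical stand-in for running one input through an engine.

```python
# Sketch of multi-stream execution: process several independent input
# streams in parallel (threads here; CUDA streams in TensorRT).
from concurrent.futures import ThreadPoolExecutor

def infer(frame):
    """Stand-in for running one input through an inference engine."""
    return sum(frame) / len(frame)          # toy per-frame result

streams = {                                  # e.g., three camera feeds
    "cam0": [[1, 2, 3], [4, 5, 6]],
    "cam1": [[7, 8, 9]],
    "cam2": [[0, 0, 3]],
}

with ThreadPoolExecutor(max_workers=len(streams)) as pool:
    results = {name: list(pool.map(infer, frames))
               for name, frames in streams.items()}
```

Frames within a stream keep their order (`pool.map` preserves it), while the streams themselves overlap, which is what lets one engine serve multiple video feeds without serializing them.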

“In our evaluation of TensorRT running our deep learning-based recommendation application on NVIDIA Tesla V100 GPUs, we experienced a 45x increase in inference speed and throughput compared with a CPU-based platform. We believe TensorRT could dramatically improve productivity for our enterprise customers.”

— Markus Noga, Head of Machine Learning at SAP

Framework Integrations

NVIDIA works closely with deep learning framework developers to achieve optimized performance for inference on AI platforms using TensorRT. If your trained models are in the ONNX format or come from popular frameworks such as TensorFlow and MATLAB, there are easy ways to import them into TensorRT for inference. Below are a few integrations with information on how to get started.


TensorRT and TensorFlow are tightly integrated so you get the flexibility of TensorFlow with the powerful optimizations of TensorRT. Learn more in the TensorRT integrated with TensorFlow blog post.


TensorRT 4 provides an ONNX parser so you can easily import ONNX models from frameworks such as Caffe2, Chainer, Microsoft Cognitive Toolkit, MXNet and PyTorch into TensorRT. Learn more about ONNX support in TensorRT here.


MATLAB is integrated with TensorRT through GPU Coder so that engineers and scientists using MATLAB can automatically generate high-performance inference engines for Jetson, DRIVE and Tesla platforms. Learn more in this webinar.

If you are performing deep learning training in a proprietary or custom framework, use the TensorRT C++ API to import and accelerate your models. Read more in the TensorRT documentation.

NVIDIA TensorRT Performance Guide

The benefit of TensorRT is its accelerated performance on NVIDIA GPUs. See how it can power your deep learning inference needs across multiple networks with high throughput and ultra-low latency.


TensorRT 4: What’s New

TensorRT 4 now provides capabilities to accelerate speech recognition, neural machine translation and recommender systems. The native ONNX parser in TensorRT 4 provides an easy path to import models from frameworks such as PyTorch, Caffe2, MXNet, CNTK and Chainer.

Highlights include:

  • 45x higher throughput vs. CPU with new layers for Multilayer Perceptrons (MLP) and Recurrent Neural Networks (RNN)
  • 50x faster inference performance on V100 vs. CPU-only platforms for ONNX models imported with the ONNX parser in TensorRT
  • Support for NVIDIA DRIVE™ Xavier - AI Computer for Autonomous Vehicles
  • 3x inference speedup for FP16 custom layers with APIs for running on Volta Tensor Cores

See the TensorRT 4 Accelerates Neural Machine Translation, Recommenders and Speech developer blog to learn more about the new capabilities. You can get TensorRT as a container from NVIDIA GPU Cloud (NGC) or as a package from the button below.



TensorRT is freely available to members of the NVIDIA Developer Program from the TensorRT product page for development and deployment.

TensorRT is also available in the NVIDIA GPU Cloud (NGC) TensorRT container for deployment in the cloud.

TensorRT is included in: