NVIDIA TensorRT

Programmable Inference Accelerator

NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference for deep learning applications. TensorRT can be used to rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data centers, embedded platforms, or automotive product platforms.


Developers can use TensorRT to deliver fast inference with INT8 or FP16 optimized precision, significantly reducing latency for real-time services such as streaming video categorization in the cloud or object detection and segmentation on embedded and automotive platforms.

With TensorRT, developers can focus on developing novel AI-powered applications rather than on performance tuning for inference deployment. The TensorRT runtime delivers inference performance that meets the most demanding latency and throughput requirements.

TensorRT can be deployed to Tesla GPUs in the data center, Jetson embedded platforms, and NVIDIA DRIVE autonomous driving platforms.

What's New in TensorRT 3?

TensorRT 3 is the key to unlocking optimal inference performance on Volta GPUs. It delivers up to 40x higher throughput at under 7 ms real-time latency versus CPU-only inference.
Highlights from this release include:

  • Up to 3.7x faster inference on Tesla V100 than on Tesla P100 at under 7 ms real-time latency
  • TensorFlow model optimization and deployment up to 18x faster than TensorFlow framework inference on Tesla V100
  • Improved productivity with an easy-to-use Python API (see the Python sketch after this list)
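
As a rough illustration of the Python workflow, the sketch below defines a trivial network through the Network Definition API, builds a deployment-ready engine with FP16 enabled where the GPU supports it, and deserializes it for inference. The class and method names follow a later TensorRT release than 3.0, so treat the exact calls as assumptions rather than the TensorRT 3 API.

    # Minimal sketch of the TensorRT Python API workflow. Class/method names
    # follow a later TensorRT release than 3.0; treat exact calls as assumptions.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)

    # 1. Describe a trivial network with the Network Definition API.
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    inp = network.add_input("input", trt.float32, (1, 3, 224, 224))
    relu = network.add_activation(inp, trt.ActivationType.RELU)
    network.mark_output(relu.get_output(0))

    # 2. Configure the builder; enable FP16 kernels where the GPU supports them.
    config = builder.create_builder_config()
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)

    # 3. Build a serialized, deployment-ready engine.
    serialized_engine = builder.build_serialized_network(network, config)

    # 4. Deserialize the engine and create an execution context for inference.
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(serialized_engine)
    context = engine.create_execution_context()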

Learn more about how to get started with TensorRT 3 in the technical blog posts listed under Learn More below.



TensorRT Optimizations

Weight & Activation Precision Calibration

Significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss
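
Under the hood, INT8 quantization is driven by a calibrator that feeds representative input batches to the builder, which then chooses per-tensor scale factors that keep accuracy loss small. The sketch below outlines an entropy calibrator against the Python API of a later TensorRT release (the IInt8EntropyCalibrator2 interface), using PyCUDA for device buffers; the calibration data iterable, cache-file path, and calibration_batches() helper are illustrative assumptions.

    # Sketch of INT8 calibration. Interface names come from a later TensorRT
    # release; the calibration data source and cache path are placeholders.
    import os
    import numpy as np
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt


    class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
        def __init__(self, batches, batch_size, cache_file="int8.cache"):
            super().__init__()                    # required by the TensorRT base class
            self.batches = iter(batches)          # iterable of NCHW float32 arrays
            self.batch_size = batch_size
            self.cache_file = cache_file
            self.device_input = None

        def get_batch_size(self):
            return self.batch_size

        def get_batch(self, names):
            try:
                batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
            except StopIteration:
                return None                       # no more data: calibration is done
            if self.device_input is None:
                self.device_input = cuda.mem_alloc(batch.nbytes)
            cuda.memcpy_htod(self.device_input, batch)
            return [int(self.device_input)]

        def read_calibration_cache(self):
            if os.path.exists(self.cache_file):
                with open(self.cache_file, "rb") as f:
                    return f.read()
            return None

        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)


    # Attach the calibrator when building (see the builder sketch above); the
    # builder runs calibration batches through the network and picks per-tensor
    # INT8 scale factors. calibration_batches() is an assumed data source.
    # config.set_flag(trt.BuilderFlag.INT8)
    # config.int8_calibrator = EntropyCalibrator(calibration_batches(), batch_size=8)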


Layer & Tensor Fusion

Improves GPU utilization and optimizes memory storage and bandwidth by fusing successive nodes into a single node that executes as one kernel


Kernel Auto-Tuning

Optimizes execution time by choosing the best data layers and parallel algorithms for the target Jetson, Tesla, or DRIVE PX GPU platform


Dynamic Tensor Memory

Reduces memory footprint and improves memory re-use by allocating memory for each tensor only for the duration of its usage
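
One practical consequence of this scheme is that an engine's activation ("device") memory is a fixed, queryable size that is only needed while a context is actually executing, so a single scratch allocation can be shared by contexts that never run concurrently. A small sketch, assuming an already-deserialized engine and the attribute names of a later TensorRT release:

    # Sketch: query the engine's activation-memory requirement and supply one
    # shared scratch allocation to two contexts (names from a later TensorRT
    # release; `engine` is assumed to be an already-deserialized ICudaEngine).
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda

    scratch = cuda.mem_alloc(engine.device_memory_size)

    # These contexts own no activation memory of their own; assign the shared
    # block before use. Contexts sharing a block must not run concurrently.
    ctx_a = engine.create_execution_context_without_device_memory()
    ctx_b = engine.create_execution_context_without_device_memory()
    ctx_a.device_memory = int(scratch)
    ctx_b.device_memory = int(scratch)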


Multi-Stream Execution

Scales to multiple input streams by processing them in parallel using the same model and weights
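
In the API this typically means giving each input stream its own execution context and CUDA stream over one shared engine, so the weights live in memory only once. The sketch below assumes an already-deserialized engine with static input shapes; the allocate_bindings() helper is illustrative, and the async-execution names follow a later TensorRT release than 3.0.

    # Sketch: serve two input streams concurrently from one engine (shared
    # weights). `engine` is assumed to be an already-deserialized ICudaEngine
    # with static shapes; API names follow a later TensorRT release than 3.0.
    import numpy as np
    import pycuda.autoinit  # creates a CUDA context
    import pycuda.driver as cuda
    import tensorrt as trt


    def allocate_bindings(engine):
        """Allocate one device buffer per engine binding; return pointer list."""
        bindings = []
        for i in range(engine.num_bindings):
            shape = engine.get_binding_shape(i)
            dtype = trt.nptype(engine.get_binding_dtype(i))
            nbytes = trt.volume(shape) * np.dtype(dtype).itemsize
            bindings.append(int(cuda.mem_alloc(nbytes)))
        return bindings


    # One execution context, one CUDA stream, and one buffer set per input stream.
    contexts = [engine.create_execution_context() for _ in range(2)]
    streams = [cuda.Stream() for _ in range(2)]
    buffers = [allocate_bindings(engine) for _ in range(2)]

    # Enqueue both requests; the GPU overlaps them where resources allow.
    for ctx, stream, bindings in zip(contexts, streams, buffers):
        ctx.execute_async_v2(bindings=bindings, stream_handle=stream.handle)

    for stream in streams:
        stream.synchronize()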


Key Features

  • Generate optimized, deployment-ready runtime engines for low latency inference
  • Optimize frequently used neural network layers such as convolution, fully connected, LRN, pooling, activation, softmax, concat, and deconvolution layers
  • Import models trained using Caffe and TensorFlow or specify network description using the Network Definition API
  • Optimize, validate, and deploy models using the Python API
  • Deploy neural networks in full (FP32) or reduced precision (INT8, FP16)
  • Define and implement unique functionality using the custom layer API

Learn More

8-Bit Inference with TensorRT, GTC talk by Szymon Migacz (NVIDIA)

Deploying Deep Neural Networks with NVIDIA TensorRT, Parallel Forall technical blog post