NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference for deep learning applications. TensorRT can be used to rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data centers, embedded systems, or automotive product platforms.

Developers can use TensorRT to deliver fast inference with INT8- or FP16-optimized precision, significantly reducing latency as demanded by real-time services such as streaming video categorization in the cloud or object detection and segmentation on embedded and automotive platforms. With TensorRT, developers can focus on developing novel AI-powered applications rather than on performance tuning for inference deployment. The TensorRT runtime delivers optimal inference performance that meets the most demanding latency and throughput requirements.

What's New in TensorRT 3?

TensorRT 3 is the key to unlocking optimal inference performance on Volta GPUs. It delivers up to 40x higher throughput at under 7 ms real-time latency compared with CPU-only inference.
Highlights from this release include:

  • Up to 3.7x faster inference on Tesla V100 vs. Tesla P100 at under 7 ms real-time latency
  • Optimize and deploy TensorFlow models up to 18x faster compared to TensorFlow framework inference on Tesla V100
  • Improved productivity with an easy-to-use Python API

The TensorRT 3 release candidate for Tesla GPUs (P4, P100, V100) and Jetson embedded platforms is now available as a free download to members of the NVIDIA Developer Program.

TensorRT 2

The TensorRT 2 production release is now available as a free download to members of the NVIDIA Developer Program.

  • Deliver up to 45x faster inference at under 7 ms real-time latency with INT8 precision
  • Integrate novel user-defined layers as plugins using the Custom Layer API
  • Deploy sequence-based models for image captioning, language translation, and other applications using LSTM and GRU Recurrent Neural Network (RNN) layers

TensorRT Optimizations

Weight & Activation Precision Calibration

Significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss
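The idea behind INT8 calibration can be sketched in a few lines of NumPy. This is an illustrative max-absolute-value scaling scheme, not TensorRT's actual calibration algorithm (which chooses the quantization range to minimize information loss and may clip outliers):

```python
import numpy as np

def calibrate_scale(activations):
    # Illustrative calibration: map the largest observed magnitude onto the
    # symmetric INT8 range [-127, 127]. (TensorRT's real calibrator instead
    # picks the range that minimizes information loss over a calibration set.)
    return float(np.abs(activations).max()) / 127.0

def quantize_int8(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# "Calibrate" on representative activations, then quantize new data.
calib = np.linspace(-4.0, 4.0, 1001).astype(np.float32)
scale = calibrate_scale(calib)
x = np.array([0.5, -1.2, 2.0], dtype=np.float32)
x_hat = dequantize(quantize_int8(x, scale), scale)
# x_hat approximates x to within half a quantization step
```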

Layer & Tensor Fusion

Improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution
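As a rough illustration of why fusion helps, compare a scale → bias → ReLU sequence executed as three passes over a tensor versus one. This NumPy sketch only models the arithmetic; on a GPU the unfused version additionally costs extra kernel launches and DRAM round trips for the intermediate tensors:

```python
import numpy as np

def unfused(x, scale, bias):
    # Three separate "kernels": each one reads the full tensor and writes
    # an intermediate result back to memory.
    y = x * scale              # scale layer
    y = y + bias               # bias layer
    return np.maximum(y, 0.0)  # ReLU layer

def fused(x, scale, bias):
    # One "kernel": a single pass computes the same result with one read
    # and one write, saving memory bandwidth.
    return np.maximum(x * scale + bias, 0.0)

x = np.random.randn(4, 8).astype(np.float32)
y_unfused = unfused(x, 1.5, 0.1)
y_fused = fused(x, 1.5, 0.1)  # identical output, fewer memory passes
```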

Kernel Auto-Tuning

Optimizes execution time by choosing the best data layout and best parallel algorithms for the target Jetson, Tesla, or DRIVE PX GPU platform
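The auto-tuning idea — benchmark candidate implementations on the actual shapes and hardware, then keep the fastest — can be sketched in plain Python. The candidate kernels here are illustrative stand-ins, not TensorRT internals:

```python
import timeit
import numpy as np

def matmul_naive(a, b):
    # Deliberately slow reference implementation: one dot product per cell.
    n, _ = a.shape
    _, m = b.shape
    out = np.zeros((n, m), dtype=a.dtype)
    for i in range(n):
        for j in range(m):
            out[i, j] = np.dot(a[i, :], b[:, j])
    return out

def matmul_blas(a, b):
    # Delegates to the optimized BLAS backend.
    return a @ b

def autotune(candidates, a, b, repeats=3):
    # Time each candidate on the real inputs and keep the fastest, the way
    # an auto-tuner benchmarks kernel variants for the target platform.
    timings = {f.__name__: min(timeit.repeat(lambda: f(a, b),
                                             number=1, repeat=repeats))
               for f in candidates}
    return min(timings, key=timings.get)

a = np.random.randn(32, 32)
b = np.random.randn(32, 32)
best = autotune([matmul_naive, matmul_blas], a, b)
```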

Dynamic Tensor Memory

Reduces memory footprint and improves memory re-use by allocating memory for each tensor only for the duration of its usage
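A simplified version of this lifetime-based reuse can be sketched as greedy buffer assignment: tensors whose lifetimes do not overlap share a buffer. The sketch assumes lifetimes expressed as (first_use, last_use) steps in a fixed execution order, and ignores tensor sizes, which a real allocator must also account for:

```python
def assign_buffers(lifetimes):
    """Greedily pack tensors into reusable buffers.

    lifetimes: dict mapping tensor name -> (first_use, last_use) step.
    Returns (assignment, buffer_count)."""
    buffers = []      # buffers[i] = step at which its current tensor dies
    assignment = {}
    for name, (start, end) in sorted(lifetimes.items(),
                                     key=lambda kv: kv[1][0]):
        for i, free_at in enumerate(buffers):
            if free_at < start:       # occupant is dead; reuse this buffer
                buffers[i] = end
                assignment[name] = i
                break
        else:                         # no free buffer; allocate a new one
            buffers.append(end)
            assignment[name] = len(buffers) - 1
    return assignment, len(buffers)

# A simple layer chain: each tensor is consumed by the next layer.
lifetimes = {"a": (0, 1), "b": (1, 2), "c": (2, 3), "d": (3, 4)}
assignment, n_buffers = assign_buffers(lifetimes)
# Four tensors fit in two buffers: "a"/"c" share one, "b"/"d" the other.
```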

Multi-Stream Execution

Scales to multiple input streams by processing them in parallel using the same model and weights
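Conceptually, this is parallel inference over independent inputs against a single shared set of weights. The thread-based NumPy sketch below illustrates only the idea; TensorRT itself overlaps work on the GPU using CUDA streams:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# One shared model: every stream uses the same weights (no per-stream copy).
weights = np.random.randn(16, 16).astype(np.float32)

def infer(batch):
    # A stand-in "model": one dense layer followed by ReLU.
    return np.maximum(batch @ weights, 0.0)

# Four independent input streams processed in parallel.
streams = [np.random.randn(8, 16).astype(np.float32) for _ in range(4)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(infer, streams))
```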

Key Features

  • Generate optimized, deployment-ready runtime engines for low latency inference
  • Optimize frequently used neural network layers such as convolutional, fully connected, LRN, pooling, activation, softmax, concat, and deconvolution layers
  • Import models trained using Caffe and TensorFlow or specify network description using the Network Definition API
  • Optimize, validate, and deploy models using the Python API
  • Deploy neural networks in full (FP32) or reduced precision (INT8, FP16)
  • Define and implement unique functionality using the custom layer API

Learn More

8-Bit Inference with TensorRT, GTC talk by Szymon Migacz (NVIDIA)

Deploying Deep Neural Networks with NVIDIA TensorRT, Parallel Forall technical blog post