NVIDIA TensorRT

Programmable Inference Accelerator

NVIDIA TensorRT™ is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for deep learning inference applications. TensorRT-based applications perform up to 40x faster than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy to hyperscale data centers, embedded platforms, or automotive product platforms.

TensorRT provides INT8 and FP16 optimizations for production deployments of deep learning inference applications such as video streaming, speech recognition, recommender systems, and natural language processing. Reduced-precision inference significantly reduces application latency, which is a requirement for many real-time services and for automotive and embedded applications.


You can import trained models from every major deep learning framework into TensorRT. After applying optimizations, TensorRT selects platform-specific kernels to maximize performance on Tesla GPUs in the data center, on Jetson embedded platforms, and on NVIDIA DRIVE autonomous driving platforms.

For AI models in data center production, the TensorRT inference server is a containerized microservice that maximizes GPU utilization and runs multiple models from different frameworks concurrently on a single node. It leverages Docker and Kubernetes to integrate seamlessly into DevOps architectures.

With TensorRT, developers can focus on creating novel AI-powered applications rather than on performance tuning for inference deployment.


TensorRT Optimizations

Weight & Activation Precision Calibration

Maximizes throughput by quantizing models to INT8 while preserving accuracy (see the builder sketch after this list)


Layer & Tensor Fusion

Optimizes use of GPU memory and bandwidth by fusing nodes in a kernel


Kernel Auto-Tuning

Selects the best data layouts and algorithms based on the target GPU platform


Dynamic Tensor Memory

Minimizes memory footprint and re-uses memory for tensors efficiently


Multi-Stream Execution

Scalable design to process multiple input streams in parallel
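
As a rough illustration of how the precision-calibration settings above are exposed to applications, here is a minimal sketch assuming the TensorRT 5-era Python builder API. The populate_network helper and the calibrator object are placeholders you would supply for your own model and calibration data.

  import tensorrt as trt

  TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

  def build_engine(populate_network, calibrator=None):
      # Build an optimized engine with reduced-precision kernels enabled where supported.
      with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
          populate_network(network)                  # fill in the network, e.g. via a parser
          builder.max_workspace_size = 1 << 30       # scratch space available to TensorRT tactics
          builder.fp16_mode = True                   # allow FP16 kernels on GPUs that support them
          if calibrator is not None:
              builder.int8_mode = True               # allow INT8 kernels
              builder.int8_calibrator = calibrator   # calibration data keeps INT8 accuracy high
          return builder.build_cuda_engine(network)  # kernel auto-tuning happens here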



Widely Adopted


Integrated with All Major Frameworks

NVIDIA works closely with deep learning framework developers to achieve optimized inference performance on NVIDIA AI platforms using TensorRT. If your trained models are in the ONNX format or come from popular frameworks such as TensorFlow and MATLAB, there are easy ways to import them into TensorRT for inference. Below are a few integrations with information on how to get started.


TensorRT and TensorFlow are tightly integrated so you get the flexibility of TensorFlow with the powerful optimizations of TensorRT. Learn more in the TensorRT integrated with TensorFlow blog post.
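
As a rough sketch of what that integration looks like in practice, the snippet below uses the TensorFlow 1.x tensorflow.contrib.tensorrt module to convert a frozen graph; the file path, output node names, and precision settings are illustrative, so adapt them to your model.

  import tensorflow as tf
  import tensorflow.contrib.tensorrt as trt_convert

  # Load a frozen TensorFlow graph (the path is illustrative).
  with tf.gfile.GFile("frozen_model.pb", "rb") as f:
      frozen_graph = tf.GraphDef()
      frozen_graph.ParseFromString(f.read())

  # Replace TensorRT-compatible subgraphs with optimized TensorRT ops;
  # unsupported ops keep running in TensorFlow.
  trt_graph = trt_convert.create_inference_graph(
      input_graph_def=frozen_graph,
      outputs=["logits"],                  # output node names are illustrative
      max_batch_size=8,
      max_workspace_size_bytes=1 << 30,
      precision_mode="FP16")

  # trt_graph can now be imported into a tf.Session and run like any other GraphDef.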


MATLAB is integrated with TensorRT through GPU Coder, so engineers and scientists using MATLAB can automatically generate high-performance inference engines for Jetson, DRIVE, and Tesla platforms. Learn more in this webinar.


TensorRT provides an ONNX parser so you can easily import ONNX models from frameworks such as Caffe2, Chainer, Microsoft Cognitive Toolkit, MXNet, and PyTorch into TensorRT. Learn more about ONNX support in TensorRT here.
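
As a minimal sketch, assuming the TensorRT 5-era Python bindings, importing an ONNX model and building an engine looks roughly like this; the model path is illustrative.

  import tensorrt as trt

  TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

  with trt.Builder(TRT_LOGGER) as builder, \
       builder.create_network() as network, \
       trt.OnnxParser(network, TRT_LOGGER) as parser:
      builder.max_batch_size = 8
      builder.max_workspace_size = 1 << 30
      with open("model.onnx", "rb") as f:           # path is illustrative
          if not parser.parse(f.read()):            # returns False if parsing failed
              for i in range(parser.num_errors):
                  print(parser.get_error(i))
      engine = builder.build_cuda_engine(network)   # ready to serialize or run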

If you are performing deep learning training in a proprietary or custom framework, use the TensorRT C++ API to import and accelerate your models. Read more in the TensorRT documentation.
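
The same network-definition interface is also exposed through the TensorRT Python bindings, which can make a first experiment quicker. The sketch below defines a toy one-layer network directly; the shapes, names, and weights are purely illustrative stand-ins for values exported from your own framework.

  import numpy as np
  import tensorrt as trt

  TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

  with trt.Builder(TRT_LOGGER) as builder, builder.create_network() as network:
      # Declare the input tensor (CHW shape; the batch dimension is implicit).
      data = network.add_input("data", trt.float32, (1, 28, 28))

      # Weights normally come from your own framework; random values stand in here.
      kernel = np.random.randn(10, 1 * 28 * 28).astype(np.float32)
      bias = np.zeros(10, dtype=np.float32)

      fc = network.add_fully_connected(data, 10, trt.Weights(kernel), trt.Weights(bias))
      network.mark_output(fc.get_output(0))

      builder.max_workspace_size = 1 << 20
      engine = builder.build_cuda_engine(network)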


TensorRT Inference Server

The NVIDIA TensorRT inference server makes state-of-the-art AI-driven experiences possible in real time. It’s a containerized inference microservice for data center production that maximizes GPU utilization and, through its Docker and Kubernetes integration, fits seamlessly into DevOps deployments.

The TensorRT inference server:

  • Maximizes utilization by enabling inference for multiple models on one or more GPUs
  • Supports all popular AI frameworks
  • Dynamically batches requests to increase throughput
  • Provides metrics for orchestration and load balancing
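
As a rough illustration of how an application might probe a running server, the sketch below polls its HTTP endpoints with the requests package; the host, port, and endpoint paths are assumptions based on the server's HTTP API, so check the inference server documentation for the exact interface.

  import requests

  SERVER = "http://localhost:8000"   # default HTTP port is an assumption

  # Readiness probe, suitable for a Kubernetes liveness/readiness check.
  ready = requests.get(SERVER + "/api/health/ready")
  print("server ready:", ready.status_code == 200)

  # Server status, including which models are loaded and their available versions.
  status = requests.get(SERVER + "/api/status")
  print(status.text)

  # For actual inference requests, the provided C++ and Python client libraries
  # wrap the HTTP/gRPC wire format, so applications rarely build requests by hand.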

With the NVIDIA TensorRT inference server, there’s now a common solution for AI inference, allowing researchers to focus on creating high-quality trained models, DevOps engineers to focus on deployment, and developers to focus on their applications, without needing to reinvent the plumbing for each AI-powered application.

Learn more in NVIDIA’s TensorRT Inference Server blog post.

The TensorRT inference server is freely available to members of the NVIDIA Developer Program. An early version of the TensorRT inference server container is available for download from the NVIDIA GPU Cloud (NGC) container registry.


NVIDIA TensorRT Performance Guide

The benefit of TensorRT is its accelerated performance on NVIDIA GPUs. See how it can power your deep learning inference needs across multiple networks with high throughput and ultra-low latency.


“In our evaluation of TensorRT running our deep learning-based recommendation application on NVIDIA Tesla V100 GPUs, we experienced a 45x increase in inference speed and throughput compared with a CPU-based platform. We believe TensorRT could dramatically improve productivity for our enterprise customers.”

— Markus Noga, Head of Machine Learning at SAP


What's New in TensorRT 5 and the TensorRT Inference Server

TensorRT 5 delivers up to 40x faster inference over CPU-only platforms through support for Turing GPUs, new INT8 APIs and optimizations. It uses multi-precision compute to dramatically speed up recommenders, neural machine translation, speech and natural language processing. With TensorRT 5, you can:

  • Speed up inference by 40x over CPUs for models such as translation using mixed precision on Turing Tensor Cores
  • Optimize inference models with new INT8 APIs and optimizations
  • Deploy applications to Xavier-based NVIDIA DRIVE platforms and the NVIDIA DLA accelerator (FP16 only)

In addition, TensorRT 5 adds support for the Windows and CentOS operating systems. TensorRT 5 RC is available for download now to members of the NVIDIA Developer Program.


The NVIDIA TensorRT inference server is a containerized inference microservice that maximizes GPU utilization and seamlessly integrates into DevOps deployments with Docker and Kubernetes. A beta version of the TensorRT inference server is available on NVIDIA GPU Cloud (NGC) for you to experiment with today.


Additional Resources

  • Overview
  • Natural Language Processing
  • Recommenders

You can find additional resources at https://devblogs.nvidia.com/tag/tensorrt/ and interact with the TensorRT developer community on the TensorRT Forum.


Availability

TensorRT is freely available to members of the NVIDIA Developer Program from the TensorRT product page for development and deployment.

Developers can also get TensorRT in the TensorRT Container on NVIDIA GPU Cloud (NGC).

The TensorRT inference server is available in a ready-to-run, standalone container from NVIDIA GPU Cloud (NGC).

TensorRT is included in: