NVIDIA TensorRT™ is a high-performance deep learning inference optimizer and runtime that delivers low-latency, high-throughput inference for deep learning applications. TensorRT can be used to rapidly optimize, validate, and deploy trained neural networks for inference to hyperscale data centers, embedded platforms, or automotive product platforms.
Developers can use TensorRT to deliver fast inference using INT8 or FP16 optimized precision, which significantly reduces latency, as demanded by real-time services such as streaming video categorization in the cloud or object detection and segmentation on embedded and automotive platforms.
With TensorRT, developers can focus on developing novel AI-powered applications rather than on performance tuning for inference deployment. The TensorRT runtime delivers inference performance that meets the most demanding latency and throughput requirements.
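To make the workflow concrete, here is a minimal sketch of building an FP16-optimized engine with the TensorRT Python API. It is an illustration rather than NVIDIA's reference code: it assumes a more recent TensorRT release than the TensorRT 3 described on this page (the Builder, BuilderConfig, and ONNX parser interfaces of TensorRT 8.x), and the file names model.onnx and model.plan are placeholders.

# Minimal sketch: build a TensorRT engine with FP16 precision enabled.
# Assumes a TensorRT 8.x-style Python API and an ONNX model file;
# "model.onnx" and "model.plan" are placeholder names.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse the trained network from an ONNX file.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse the ONNX model")

# Ask TensorRT to use FP16 kernels where the hardware supports them;
# layer fusion, kernel auto-tuning, and memory planning happen at build time.
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)

serialized_engine = builder.build_serialized_network(network, config)
with open("model.plan", "wb") as f:
    f.write(serialized_engine)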
TensorRT can be deployed to Tesla GPUs in the data center, Jetson embedded platforms, and NVIDIA DRIVE autonomous driving platforms.
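Deploying to one of these platforms then amounts to copying the serialized engine (the plan file) to the target and loading it with the TensorRT runtime. The sketch below shows one way to do this with the Python API and PyCUDA for buffer management; it again assumes a TensorRT 8.x-style API, static input shapes, and a single-input, single-output model, so treat it as a starting point rather than a complete deployment.

# Minimal sketch: load a serialized TensorRT engine and run inference.
# Assumes a TensorRT 8.x-style Python API plus PyCUDA; "model.plan",
# static shapes, and the binding layout are placeholder assumptions.
import numpy as np
import pycuda.autoinit          # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate one host/device buffer pair per engine binding (input or output).
host_bufs, dev_bufs = [], []
for i in range(engine.num_bindings):
    shape = tuple(engine.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.empty(shape, dtype=dtype)
    host_bufs.append(host)
    dev_bufs.append(cuda.mem_alloc(host.nbytes))

# Assume binding 0 is the input and the last binding is the output
# (true for many single-input, single-output models; verify in practice).
host_bufs[0][...] = np.random.rand(*host_bufs[0].shape).astype(host_bufs[0].dtype)
cuda.memcpy_htod(dev_bufs[0], host_bufs[0])
context.execute_v2([int(d) for d in dev_bufs])
cuda.memcpy_dtoh(host_bufs[-1], dev_bufs[-1])
print("output:", host_bufs[-1].ravel()[:5])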
TensorRT 3 is the key to unlocking optimal inference performance on Volta GPUs. It delivers up to 40x higher throughput than CPU-only inference while keeping real-time latency under 7 ms.
Learn more about the highlights of this release and how to get started with TensorRT 3 in the GTC talks and technical blog posts listed at the end of this section.
TensorRT is also available on the following NVIDIA GPU platforms: Jetson, Tesla, and DRIVE PX.
Weight & Activation Precision Calibration
Significantly improves inference performance of models trained in FP32 full precision by quantizing them to INT8, while minimizing accuracy loss
Layer & Tensor Fusion
Improves GPU utilization and optimizes memory storage and bandwidth by combining successive nodes into a single node, for single kernel execution
Optimizes execution time by choosing the best data layer and best parallel algorithms for the target Jetson, Tesla, or DrivePX GPU platform
Dynamic Tensor Memory
Reduces memory footprint and improves memory re-use by allocating memory for each tensor only for the duration of its usage
Scales to multiple input streams, by processing them in parallel using the same model and weights
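The INT8 precision calibration listed above needs a small set of representative inputs so TensorRT can choose per-tensor scaling factors. The following sketch shows one common way to provide them through the Python API; it assumes a TensorRT 8.x-style interface (IInt8EntropyCalibrator2 and BuilderConfig, which postdate TensorRT 3), and calibration_batches is a placeholder for a real data loader.

# Minimal sketch: INT8 calibration with the TensorRT Python API.
# Assumes a TensorRT 8.x-style API and PyCUDA; `calibration_batches`
# (an iterable of NumPy arrays) is a placeholder for real data.
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="int8.cache"):
        super().__init__()
        self.batches = iter(batches)
        self.cache_file = cache_file
        self.device_input = None

    def get_batch_size(self):
        return 1  # placeholder batch size

    def get_batch(self, names):
        try:
            batch = next(self.batches).astype(np.float32)
        except StopIteration:
            return None  # no more data: calibration is finished
        if self.device_input is None:
            self.device_input = cuda.mem_alloc(batch.nbytes)
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]

    def read_calibration_cache(self):
        return None  # always recalibrate in this sketch

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# During engine building, enable INT8 and attach the calibrator;
# TensorRT then picks scaling factors that minimize the accuracy
# loss of quantizing an FP32 network to INT8.
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = EntropyCalibrator(calibration_batches)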
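Multi-stream execution maps naturally onto one execution context per input stream, each driven by its own CUDA stream, while the engine and its weights are shared. The sketch below illustrates that pattern under the same assumptions as the earlier examples (TensorRT 8.x-style Python API, PyCUDA, and a static-shape engine stored in model.plan).

# Minimal sketch: serve several input streams in parallel from one engine.
# Each request gets its own execution context and CUDA stream, while the
# engine (model + weights) is shared across all of them.
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

NUM_STREAMS = 4
contexts = [engine.create_execution_context() for _ in range(NUM_STREAMS)]
streams = [cuda.Stream() for _ in range(NUM_STREAMS)]

# One set of device buffers per stream so copies and kernels from
# different streams can overlap on the GPU.
all_bindings = []
for _ in range(NUM_STREAMS):
    bindings = []
    for i in range(engine.num_bindings):
        dtype = trt.nptype(engine.get_binding_dtype(i))
        nbytes = trt.volume(engine.get_binding_shape(i)) * np.dtype(dtype).itemsize
        bindings.append(cuda.mem_alloc(nbytes))
    all_bindings.append(bindings)

# Launch all requests asynchronously, then wait for them to finish.
for ctx, stream, bindings in zip(contexts, streams, all_bindings):
    ctx.execute_async_v2([int(b) for b in bindings], stream.handle)
for stream in streams:
    stream.synchronize()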
GTC talks and technical blog posts:
8-Bit Inference with TensorRT, GTC talk by Szymon Migacz (NVIDIA)
Deploying Unique DL Networks as Micro-Services with TensorRT, User-Extensible Layers, and GPU REST Engines, GTC talk by Chris Gottbrath (NVIDIA)
Deploying Deep Neural Networks with NVIDIA TensorRT, Parallel ForAll technical blog post