At GTC Silicon Valley in San Jose, NVIDIA released TensorRT 5.1, which includes 20+ new operators and layers, integration with TensorFlow 2.0 and ONNX Runtime, and TensorRT Inference Server 1.0.
TensorRT 5.1 includes support for 20+ new TensorFlow and ONNX operations, the ability to update model weights in engines quickly, and a new padding mode that matches native framework formats for higher performance. With this new version, applications perform up to 40x faster during inference using mixed precision on Turing GPUs for image/video, translation, and speech applications.
- Optimize models such as DenseNet and TinyYOLO with support for 20+ new layers, activations, and operations in TensorFlow and ONNX
- Update model weights in an existing engine without rebuilding it
- Deploy applications in INT8 precision on Xavier-based NVIDIA AGX platforms using the NVIDIA DLA accelerator
In addition, TensorRT 5.1 includes new samples, new debugging capabilities through support for the NVTX format, and bug fixes.
NVIDIA Developer Program members can get access to TensorRT 5.1 Release Candidate here.
NVIDIA TensorRT Integrated with TensorFlow 2.0 and ONNX Runtime
TensorFlow 2.0 was released at the TensorFlow Dev Summit in March 2019 with many exciting new features, including new and simpler APIs that let developers go from data ingestion, transformation, model building, training, and saving to deployment much more easily.
TensorFlow 2.0 is tightly integrated with and includes NVIDIA TensorRT offering high-performance optimizations with the newly introduced APIs. Developers can continue to get the powerful optimizations of TensorRT with minimal changes to their workflow. Learn more about how to get started with TensorFlow 2.0 here.
Since its release last year, the TensorFlow-TensorRT integration has been expanded to support a wide set of new networks for image classification, object detection and neural collaborative filtering. Find a complete list of supported operations here: [LINK].
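With the TensorFlow 2.0 APIs, TF-TRT conversion operates on SavedModels. A minimal sketch of the workflow, where the directory names are placeholders for your own model paths:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def convert_to_trt(saved_model_dir, output_dir, precision_mode="FP16"):
    """Optimize a TF 2.0 SavedModel with TF-TRT.

    Supported subgraphs are replaced by TensorRT engine ops; everything
    else continues to run in native TensorFlow, so unsupported layers
    do not block the conversion.
    """
    params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
        precision_mode=precision_mode)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=saved_model_dir,
        conversion_params=params)
    converter.convert()          # build the TF-TRT optimized graph
    converter.save(output_dir)   # write out the optimized SavedModel
```

The optimized SavedModel can then be loaded and served exactly like any other SavedModel, which is what keeps the workflow change minimal.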
ONNX Runtime integration with NVIDIA TensorRT in preview
Microsoft released an open source preview of NVIDIA TensorRT integration with ONNX Runtime. With this release, Microsoft offers another step towards open and interoperable AI by enabling developers to easily leverage industry-leading GPU acceleration regardless of their choice of framework. Developers can now tap into the power of TensorRT through ONNX Runtime to accelerate inferencing of ONNX models, which can be exported or converted from any popular framework.
TensorRT Inference Server 1.0 GA
Version 1.0.0 of TensorRT Inference Server is available. It includes:
- The new Audio Streaming API: it adds support for “stateful” sequence models that take a series of inputs to perform inference on. The sequence batcher inside TensorRT Inference Server loads the sequence inputs from the queues and sends them to the appropriate instance of the model for execution. With this Audio Streaming API, use cases such as automatic speech recognition and translation are now supported.
- Bug fixes and enhancements
- All future versions will be backward compatible with this version.
TensorRT Inference Server increases inference throughput and GPU utilization, and simplifies deploying inference in production data centers, through:
- Concurrent model execution – multiple models may execute on a single GPU simultaneously
- Custom backends – users can provide their own implementation of an execution engine through a shared library
- Dynamic batching – inference requests can be batched by the inference server
- Multiple model format support – TensorFlow GraphDef/SavedModel, TF-TRT, TensorRT plans, and Caffe2 NetDef
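Several of these features are enabled per model in the model repository's configuration file. A hypothetical config.pbtxt for a TensorRT plan, sketching dynamic batching and two concurrent instances on one GPU (all names and dimensions are placeholders):

```
# Hypothetical model configuration (config.pbtxt) for TensorRT Inference Server
name: "resnet50_plan"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "probs", data_type: TYPE_FP32, dims: [ 1000 ] }
]
# Two instances of the model share one GPU (concurrent model execution).
instance_group [
  { count: 2, kind: KIND_GPU }
]
# Let the server group individual requests into larger batches.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}
```

Dynamic batching trades a small, bounded queueing delay for larger batches, which is usually the dominant factor in GPU inference throughput.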
Download TensorRT Inference Server from the NVIDIA GPU Cloud container registry or from GitHub.
Learn how to use TensorRT Inference Server in this user guide.