At GTC Silicon Valley in San Jose, NVIDIA released TensorRT 5.1, which includes 20+ new operators and layers, integration with TensorFlow 2.0 and ONNX Runtime, and TensorRT Inference Server 1.0.
TensorRT 5.1 includes support for 20+ new TensorFlow and ONNX operations, the ability to update model weights in engines quickly, and a new padding mode that matches native framework formats for higher performance. With this new version, applications perform up to 40x faster during inference using mixed precision on Turing GPUs for image/video, translation, and speech applications.
- Optimize models such as DenseNet and TinyYOLO with support for 20+ new layers, activations, and operations in TensorFlow and ONNX
- Update model weights in an existing engine without rebuilding it
- Deploy applications in INT8 precision on Xavier-based NVIDIA AGX platforms using the NVIDIA DLA accelerator
In addition, TensorRT 5.1 includes new samples, new debugging capabilities through support for the NVTX format, and bug fixes.
NVIDIA Developer Program members can get access to TensorRT 5.1 Release Candidate here.
NVIDIA TensorRT Integrated with TensorFlow 2.0 and ONNX Runtime
TensorFlow 2.0 was released at the TensorFlow Dev Summit in March 2019 with many exciting new features, including new and simpler APIs that let developers go from data ingestion, transformation, model building, training, and saving to deployment much more easily.
TensorFlow 2.0 is tightly integrated with and includes NVIDIA TensorRT offering high-performance optimizations with the newly introduced APIs. Developers can continue to get the powerful optimizations of TensorRT with minimal changes to their workflow. Learn more about how to get started with TensorFlow 2.0 here.
Since its release last year, the TensorFlow-TensorRT integration has been expanded to support a wide set of new networks for image classification, object detection and neural collaborative filtering. Find a complete list of supported operations here: [LINK].
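With the TensorFlow 2.0 APIs, TF-TRT conversion operates on SavedModels. A minimal sketch of the workflow, where the directory names are placeholders for your own model paths:

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

def convert_to_trt(saved_model_dir, output_dir, precision_mode="FP16"):
    """Optimize a TF 2.0 SavedModel with TF-TRT.

    Supported subgraphs are replaced by TensorRT engine ops; everything
    else continues to run in native TensorFlow, so unsupported layers
    do not block the conversion.
    """
    params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
        precision_mode=precision_mode)
    converter = trt.TrtGraphConverterV2(
        input_saved_model_dir=saved_model_dir,
        conversion_params=params)
    converter.convert()          # build the TF-TRT optimized graph
    converter.save(output_dir)   # write out the optimized SavedModel
```

The optimized SavedModel can then be loaded and served exactly like any other SavedModel, which is what keeps the workflow change minimal.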
ONNX Runtime integration with NVIDIA TensorRT in preview
Microsoft released an open source preview of NVIDIA TensorRT integration with ONNX Runtime. With this release, Microsoft offers another step towards open and interoperable AI by enabling developers to easily leverage industry-leading GPU acceleration regardless of their choice of framework. Developers can now tap into the power of TensorRT through ONNX Runtime to accelerate inferencing of ONNX models, which can be exported or converted from any popular framework.
TensorRT Inference Server 1.0 GA
Version 1.0.0 of TensorRT Inference Server is available. It includes:
- The new Audio Streaming API: it adds support for “stateful” sequence models that take a series of inputs to perform inference on. The sequence batcher inside TensorRT Inference Server loads the sequence inputs from the queues and sends them to the appropriate instance of the model for execution. With this Audio Streaming API, use cases such as automatic speech recognition and translation are now supported.
- Bug fixes and enhancements
- All future versions will be backward compatible with this version.
TensorRT Inference Server increases inference throughput and GPU utilization, and simplifies deploying inference in production data centers, through:
- Concurrent model execution – multiple models may execute on a single GPU simultaneously
- Custom backends – users can provide their own implementation of an execution engine through a shared library
- Dynamic batching – inference requests can be batched by the inference server
- Multiple model format support – TensorFlow GraphDef/SavedModel, TF-TRT, TensorRT plans, and Caffe2 NetDef
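Several of these features are enabled per model in the model repository's configuration file. A hypothetical config.pbtxt for a TensorRT plan, sketching dynamic batching and two concurrent instances on one GPU (all names and dimensions are placeholders):

```
# Hypothetical model configuration (config.pbtxt) for TensorRT Inference Server
name: "resnet50_plan"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  { name: "input", data_type: TYPE_FP32, dims: [ 3, 224, 224 ] }
]
output [
  { name: "probs", data_type: TYPE_FP32, dims: [ 1000 ] }
]
# Two instances of the model share one GPU (concurrent model execution).
instance_group [
  { count: 2, kind: KIND_GPU }
]
# Let the server group individual requests into larger batches.
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}
```

Dynamic batching trades a small, bounded queueing delay for larger batches, which is usually the dominant factor in GPU inference throughput.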
Download TensorRT Inference Server from the NVIDIA GPU Cloud container registry or from GitHub.
Learn how to use TensorRT Inference Server in this user guide.