Today, NVIDIA announced TensorRT 8.0, which brings BERT-Large inference latency down to 1.2 ms with new optimizations. This version also delivers 2x the accuracy for INT8 precision with Quantization Aware Training, and significantly higher performance through support for Sparsity, which was introduced in Ampere GPUs.
TensorRT is an SDK for high-performance deep learning inference. It includes an inference optimizer and a runtime that together deliver low latency and high throughput. TensorRT is used across industries such as Healthcare, Automotive, Manufacturing, Internet/Telecom services, Financial Services, and Energy, and has been downloaded nearly 2.5 million times.
Several new kinds of transformer-based models are now used across conversational AI. New generalized optimizations in TensorRT can accelerate all such models, cutting inference time in half compared to TensorRT 7.
Highlights from this version include:
- BERT Inference in 1.2 ms with new transformer optimizations
- Achieve accuracy equivalent to FP32 with INT8 precision using Quantization Aware Training
- Introducing Sparsity support for faster inference on Ampere GPUs
You can learn more about Sparsity here.
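To make the Sparsity highlight more concrete: the structured sparsity that Ampere GPUs accelerate follows a 2:4 pattern, meaning at most two non-zero values in every group of four consecutive weights. The sketch below only illustrates that pattern on a plain Python list. The function name `prune_2_of_4` is hypothetical, it is not part of the TensorRT API, and keeping the two largest-magnitude values per group is an assumption about how the pattern might be imposed; a real workflow would typically fine-tune the network after pruning to recover accuracy.

```python
def prune_2_of_4(weights):
    """Zero the two smallest-magnitude values in each group of four.

    Illustrative only: a hypothetical helper, not a TensorRT function.
    """
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        # Keep those two values, replace the other two with zero.
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Every group of four in `sparse_row` now holds at most two non-zero values, which is the regular layout that sparse hardware can exploit by skipping the zeroed positions.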
WeChat, one of the biggest social media platforms in China, accelerates its search with TensorRT, serving 500M users a month.
“We have implemented TensorRT-and-INT8 QAT-based model inference acceleration to accelerate core tasks of WeChat Search such as Query Understanding and Results Ranking. The conventional limitation of NLP model complexity has been broken-through by our solution with GPU + TensorRT, and BERT/Transformer can be fully integrated in our solution. In addition, we have achieved significant reduction (70%) in allocated computational resources using superb performance optimization methods.” – Huili/Raccoonliu/Dickzhu, WeChat Search
NVIDIA TensorRT is freely available to members of the NVIDIA Developer Program. To learn more, visit the TensorRT product page.
To learn more about TensorRT 8 and its features:
- Real-Time Natural Language Understanding with BERT Using TensorRT
- Achieving FP32 Accuracy for INT8 Inference using Quantization Aware Training with TensorRT
- Accelerating Inference with Sparsity using Ampere Architecture and TensorRT
- Speeding Up Deep Learning Inference Using TensorRT
- Importing models from TensorFlow and ONNX
- TensorRT Quick Start Guide
- Notebook: Optimize Object Detection with EfficientDet and TensorRT
- Notebook: BERT with QAT and Sparsity
Watch these GTC sessions to get familiar with the technologies:
- GTC Session S31876: Accelerate Deep Learning Inference with TensorRT 8.0
- GTC Session S31552: Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture
- GTC Session S31653: Quantization Aware Training in PyTorch with TensorRT 8.0
- GTC Session S32224: Accelerating Deep Learning Inference with OnnxRuntime-TensorRT
- GTC Session S31732: Inference with TensorFlow 2 Integrated with TensorRT
- GTC Session S31828: TensorRT Quick Start Guide