NVIDIA Announces TensorRT 8 Slashing BERT-Large Inference Down to 1 Millisecond


Today, NVIDIA announced TensorRT 8.0, which brings BERT-Large inference latency down to 1.2 ms with new optimizations. This version also delivers 2x the accuracy for INT8 precision with Quantization Aware Training, and significantly higher performance through support for Sparsity, introduced with the NVIDIA Ampere architecture.

TensorRT is an SDK for high-performance deep learning inference that includes an inference optimizer and a runtime delivering low latency and high throughput. TensorRT is used across industries such as healthcare, automotive, manufacturing, internet/telecom services, financial services, and energy, and has been downloaded nearly 2.5 million times.

Several new kinds of transformer-based models are now used across conversational AI. New generalized optimizations in TensorRT accelerate all such models, cutting inference time to half that of TensorRT 7.

Highlights from this version include:

  • BERT Inference in 1.2 ms with new transformer optimizations
  • Achieve accuracy equivalent to FP32 with INT8 precision using Quantization Aware Training
  • Introducing Sparsity support for faster inference on Ampere GPUs
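To see why Quantization Aware Training preserves accuracy, note that it inserts a quantize-then-dequantize ("fake quantization") step into the forward pass during training, so the network learns weights that survive the INT8 rounding. The sketch below is a minimal, framework-free illustration of that step using symmetric per-tensor scaling (one of the schemes TensorRT supports); it is not TensorRT's own API.

```python
def fake_quantize(values, num_bits=8):
    """Simulate quantize -> dequantize, the core op that QAT inserts
    into the forward pass so the model learns INT8-friendly weights.
    Symmetric, per-tensor scaling: illustrative only."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for INT8
    scale = max(abs(v) for v in values) / qmax     # map observed range onto [-127, 127]
    # quantize to integers, clamp to the INT8 range, then dequantize
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

weights = [0.02, -1.27, 0.635, 0.9]
print(fake_quantize(weights))   # each value moves by at most half a quantization step
```

During training the loss is computed on these dequantized values, so gradient descent compensates for the rounding error before the model is ever deployed in INT8.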

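Sparse Tensor Cores on Ampere GPUs accelerate matrices that follow a 2:4 structured-sparsity pattern: in every group of four consecutive weights, at most two are nonzero. The sketch below shows a common magnitude-based way to prune weights into that pattern; the exact pruning recipe and the TensorRT builder flag that enables sparse kernels are assumptions here, not quoted from this announcement.

```python
def prune_2_4(weights):
    """Prune a flat weight list into the 2:4 structured-sparsity pattern
    used by Ampere Sparse Tensor Cores: in every group of 4 weights,
    keep the 2 with the largest magnitude and zero the rest."""
    assert len(weights) % 4 == 0
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        pruned.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return pruned

print(prune_2_4([0.1, -0.9, 0.4, 0.05, 0.7, 0.2, -0.3, 0.8]))
# -> [0.0, -0.9, 0.4, 0.0, 0.7, 0.0, 0.0, 0.8]
```

Because the pattern is fixed (2 of every 4), the hardware can skip the zeroed multiplications with a compact metadata encoding instead of a general sparse format, which is what makes the speedup predictable.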

WeChat, one of the biggest social media platforms in China, accelerates its search with TensorRT, serving 500 million users a month.

“We have implemented TensorRT- and INT8 QAT-based model inference acceleration to accelerate core tasks of WeChat Search such as Query Understanding and Results Ranking. The conventional limitation of NLP model complexity has been broken through by our solution with GPU + TensorRT, and BERT/Transformer can be fully integrated in our solution. In addition, we have achieved a significant reduction (70%) in allocated computational resources using superb performance optimization methods.” – Huili/Raccoonliu/Dickzhu, WeChat Search

Figure 1. Leading adopters across all verticals.

NVIDIA TensorRT is freely available to members of the NVIDIA Developer Program. To learn more, visit the TensorRT product page.

To learn more about TensorRT 8 and its features:

Follow these GTC sessions to familiarize yourself with the technologies: