Toward INT8 Inference: Deploying Quantization-Aware Trained Networks using TensorRT
Dheeraj Peri, NVIDIA | Jhalak Patel, NVIDIA
We'll describe how TensorRT can optimize quantization ops and demonstrate an end-to-end workflow for running quantized networks. Accelerating deep neural network (DNN) inference is a critical step in realizing the benefits of AI for real-world use cases. The need to improve DNN inference latency has sparked interest in lower-precision formats such as FP16 and INT8, which offer faster inference. Two prevalent techniques for converting FP32 DNNs to INT8 are post-training quantization and quantization-aware training (QAT). TensorRT, a platform for high-performance deep learning inference, supports post-training quantization by performing calibration on the trained model, which quantizes the weights and activations. However, in some cases post-training quantization can degrade accuracy when converting an FP32 model to its INT8 counterpart. QAT achieves higher accuracy by inserting quantization ops into the network, simulating lower-precision quantization during training.
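The quantization op that QAT inserts is often called a fake-quantize (quantize-dequantize) op: it rounds values to the INT8 grid and maps them back to floating point, so training sees the same rounding error that real INT8 inference would introduce. A minimal NumPy sketch, assuming symmetric per-tensor scaling (an illustrative choice, not TensorRT's exact implementation):

```python
import numpy as np

def fake_quantize(x, scale, num_bits=8):
    """Simulate INT8 quantization (quantize-dequantize).

    Values are divided by the scale, rounded, clamped to the signed
    integer range, then multiplied back by the scale, so downstream
    computation experiences the quantization error.
    """
    qmin = -(2 ** (num_bits - 1))      # -128 for INT8
    qmax = 2 ** (num_bits - 1) - 1     #  127 for INT8
    q = np.clip(np.round(x / scale), qmin, qmax)
    return q * scale

# Symmetric per-tensor scale chosen from the tensor's dynamic range.
x = np.array([-1.0, -0.5, 0.0, 0.25, 0.7], dtype=np.float32)
scale = np.abs(x).max() / 127.0
print(fake_quantize(x, scale))
```

Because the round and clamp are applied in the forward pass only (with a straight-through gradient during training), the network's weights can adapt to the quantization error, which is what lets QAT recover accuracy that post-training quantization may lose.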