GTC 2020: Integer Quantization for DNN Inference Acceleration
Patrick Judd, NVIDIA
While neural networks are typically trained in floating-point formats, inference can often run in integer arithmetic once the network has been quantized. Quantized inference reduces memory requirements and enables faster math pipelines. For example, NVIDIA's Tensor Cores provide int8, int4, and int1 math units with 4x, 8x, and 32x more math bandwidth than fp32, respectively. We'll detail various options for quantizing a neural network for inference while maintaining model accuracy. We'll review results for networks trained on various tasks (computer vision, language, speech) and with varying architectures (CNNs, RNNs, Transformers).
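To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, the basic building block behind integer inference. This is an illustrative example, not the specific scheme presented in the talk: the max-calibration rule and function names below are assumptions for the sketch.

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric quantization: round(x / scale), clamped to the int8 range [-127, 127]."""
    q = np.round(x / scale)
    return np.clip(q, -127, 127).astype(np.int8)

def dequantize(q, scale):
    """Map int8 values back to approximate float values."""
    return q.astype(np.float32) * scale

# Calibrate the scale from the tensor's maximum absolute value
# (one simple calibration choice; others use percentiles or entropy).
x = np.array([-1.5, -0.3, 0.0, 0.7, 2.1], dtype=np.float32)
scale = np.abs(x).max() / 127.0

q = quantize_int8(x, scale)       # int8 representation used by integer math units
x_hat = dequantize(q, scale)      # reconstruction; error is bounded by scale / 2
```

Once weights and activations are in int8, matrix multiplies can run on integer math units, with the scales folded back in afterward; the accuracy question the talk addresses is how to choose these scales (and when to fine-tune) so the rounding error above does not degrade the model.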