Integer Quantization for DNN Inference Acceleration
Patrick Judd, NVIDIA
GTC 2020
While neural networks are typically trained in floating-point formats, inference can often use integer arithmetic once the network has been quantized. Benefits of quantized inference include reduced memory requirements and access to faster math pipelines. For example, NVIDIA's Tensor Cores provide int8, int4, and int1 math units, which have 4x, 8x, and 32x more math bandwidth than fp32. We'll detail various options for quantizing a neural network for inference while maintaining model accuracy. We'll review results for networks trained for various tasks (computer vision, language, speech) and various architectures (CNNs, RNNs, Transformers).
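To make the idea concrete, below is a minimal NumPy sketch of symmetric per-tensor int8 quantization, one common option of the kind the talk surveys. The function names, the max-magnitude calibration, and the clip to [-127, 127] are illustrative choices for this sketch, not NVIDIA's specific recipe.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map floats to int8 in [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0  # calibrate scale from the tensor's max magnitude
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from its int8 representation."""
    return q.astype(np.float32) * scale

# Example: quantize weights and activations, multiply in integer
# arithmetic (int8 inputs, int32 accumulation, as on integer Tensor
# Core pipelines), then rescale the result back to float.
x = np.random.randn(4, 8).astype(np.float32)   # float activations
w = np.random.randn(8, 3).astype(np.float32)   # float weights

qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)

acc = qx.astype(np.int32) @ qw.astype(np.int32)  # integer matmul
y = acc.astype(np.float32) * (sx * sw)           # rescale to float

print("max abs error vs. fp32 matmul:", np.max(np.abs(y - x @ w)))
```

The key accuracy lever is how the scale is chosen: max-magnitude calibration as above is the simplest option, and the error it introduces is what the quantization techniques discussed in the talk aim to keep small.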