Integer Quantization for DNN Inference Acceleration
Patrick Judd, NVIDIA
GTC 2020
While neural networks are typically trained in floating-point formats, inference can often use integer arithmetic once the network has been quantized. Benefits of quantized inference include reduced memory requirements and access to faster math pipelines. For example, NVIDIA's Tensor Cores provide int8, int4, and int1 math units, which have 4x, 8x, and 32x more math bandwidth than fp32. We'll detail various options for quantizing a neural network for inference while maintaining model accuracy. We'll review results for networks trained for various tasks (computer vision, language, speech) and various architectures (CNNs, RNNs, Transformers).
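To make the idea concrete, below is a minimal NumPy sketch of symmetric per-tensor int8 quantization, one common option of the kind the talk surveys. The function names, the max-magnitude calibration, and the clip to [-127, 127] are illustrative choices for this sketch, not NVIDIA's specific recipe.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor quantization: map floats to int8 in [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0  # calibrate scale from the tensor's max magnitude
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from its int8 representation."""
    return q.astype(np.float32) * scale

# Example: quantize weights and activations, multiply in integer
# arithmetic (int8 inputs, int32 accumulation, as on integer Tensor
# Core pipelines), then rescale the result back to float.
x = np.random.randn(4, 8).astype(np.float32)   # float activations
w = np.random.randn(8, 3).astype(np.float32)   # float weights

qx, sx = quantize_int8(x)
qw, sw = quantize_int8(w)

acc = qx.astype(np.int32) @ qw.astype(np.int32)  # integer matmul
y = acc.astype(np.float32) * (sx * sw)           # rescale to float

print("max abs error vs. fp32 matmul:", np.max(np.abs(y - x @ w)))
```

The key accuracy lever is how the scale is chosen: max-magnitude calibration as above is the simplest option, and the error it introduces is what the quantization techniques discussed in the talk aim to keep small.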