
GTC Silicon Valley 2019, ID S9659: Inference at Reduced Precision on GPUs

Although neural network training is typically done in 32- or 16-bit floating-point formats, inference can run at even lower precision, which reduces both memory footprint and elapsed time. We'll describe quantizing neural network models for various image tasks (classification, detection, segmentation) and natural language processing tasks. In addition to convolutional feed-forward networks, we will cover quantization of recurrent models. The discussion will examine both floating-point and integer quantization, targeting features in Volta and Turing GPUs.
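As a rough illustration of what integer quantization means here, the sketch below maps a list of floating-point weights to signed 8-bit integers with a single scale factor, then maps them back to measure the rounding error. The abstract does not say which scheme the session uses, so the symmetric per-tensor scheme, the function names `quantize_int8` and `dequantize`, and the sample weights are all assumptions made for this example.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to signed 8-bit integers (assumed scheme)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric range [-127, 127].
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floating-point values from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only to show the round trip.
weights = [0.5, -1.27, 0.013, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte instead of four, and for values inside the clamped range the round-trip error is bounded by half the scale. This is the basic trade between memory footprint and accuracy that reduced-precision inference relies on.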
