GTC Silicon Valley-2019: Inference at Reduced Precision on GPUs
Session ID: S9659
Although neural network training is typically done in 32- or 16-bit floating-point formats, inference can run at even lower precision, which reduces memory footprint and elapsed time. We'll describe quantizing neural network models for various image tasks (classification, detection, segmentation) and natural language processing tasks. In addition to convolutional feedforward networks, we will cover quantization of recurrent models. The discussion will examine both floating-point and integer quantization, targeting features in Volta and Turing GPUs.
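
For readers unfamiliar with integer quantization, the following is a minimal sketch of one common approach: symmetric per-tensor INT8 quantization with a scale derived from the largest absolute value. This scheme, and the names `quantize_int8` and `dequantize`, are illustrative assumptions and are not taken from the session itself, which may present different calibration methods.

```python
# Illustrative sketch only: symmetric per-tensor INT8 quantization.
# The scheme and function names are assumptions, not the session's method.

def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    amax = max(abs(v) for v in values)          # largest magnitude in the tensor
    scale = amax / 127.0 if amax > 0 else 1.0   # avoid division by zero
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the quantized integers."""
    return [x * scale for x in q]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per element is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_err)
```

Each value is stored in 8 bits instead of 32, at the cost of a rounding error of at most half the scale per element. Real deployments typically choose the scale from calibration data rather than from the raw maximum.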