GTC 2020: Performance Optimization on Quantized Deep-Learning Models
Zhenyu Gu, Alibaba Group
Quantization to 8-bit integers has been broadly adopted in computer-vision deep-learning models for better inference performance. We present a set of techniques to speed up inference on quantized (8-bit) models. At the graph level, we propose a quantization-aware global layout transformation and graph optimization that minimize data-layout conversions between 32-bit float and 8-bit integer. At the kernel level, we propose an algorithm that fuses IM2COL with GEMM, saving both the GPU memory and the CUDA launch time otherwise spent materializing the IM2COL matrix. In addition, we propose a double-buffering technique that improves concurrency and reduces data dependencies. Compared with cuDNN 7.1, our method achieves up to a 5x performance improvement.
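For context on the kernel-level optimization, here is a minimal NumPy sketch of the *unfused* IM2COL-plus-GEMM baseline: convolution is lowered to a matrix multiply by first materializing an explicit patch matrix. The talk's contribution is to fuse these two steps so that matrix is never written to GPU memory; the function names, shapes, and float dtype below are illustrative assumptions, not code from the talk.

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) input into a (C*kh*kw, out_h*out_w) patch matrix
    (stride 1, no padding). This explicit matrix is exactly what the fused
    approach avoids materializing."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                cols[row] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv_im2col(x, weight):
    """Convolution as one GEMM: (K, C*kh*kw) @ (C*kh*kw, out_h*out_w)."""
    k, c, kh, kw = weight.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    cols = im2col(x, kh, kw)            # separate kernel launch + extra memory
    out = weight.reshape(k, -1) @ cols  # single GEMM over the patch matrix
    return out.reshape(k, out_h, out_w)
```

In a fused (implicit-GEMM) kernel, each GEMM tile computes the needed patch indices on the fly while loading from the original input tensor, which removes both the IM2COL buffer and its launch overhead.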