Optimizing Recommendation System Inference Performance Based on GPU
Xiaowei Shen, Alibaba | Wei Zhang, Alibaba Group
GTC 2020
We'll present a GPU-based system for speeding up recommendation system inference. Neural network-based recommendation models are widely applied to personalization and recommendation tasks at large internet companies, such as e-commerce and social media companies. Alibaba's recommendation system deploys wide and deep learning models for product recommendation tasks. As the number of products and users the model must rank grows, so do the models' feature length and batch size. The computational cost grows as well, to the point that traditional CPU-based inference can no longer meet the queries-per-second (QPS) and latency requirements of recommendation tasks. With model quantization and graph transformation, we can speed up performance by 3.9x compared with a baseline GPU implementation.