Wide and Deep Recommender Optimized Inference Early Access
Recommendation systems drive engagement on many of the most popular online platforms. With the rapidly growing amount of data available to power these systems, data scientists are migrating from traditional machine learning methods to more expressive deep learning models in order to improve the quality of their recommendations.
The initial cost and latency induced by the complexity of deep learning models can be daunting for recommender inference applications operating under tight cost and latency budgets. We demonstrate that a mixed precision inference implementation, optimized for NVIDIA GPUs, drastically reduces latency while simultaneously improving cost per inference. This paves the way for fast, low-cost, scalable recommendation systems well suited to both online and offline deployment.
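As a rough illustration of one reason reduced precision helps (a sketch, not the optimized implementation described here): storing tensors in FP16 halves their memory footprint, which reduces the memory bandwidth consumed by bandwidth-bound layers. The snippet below only demonstrates the storage saving; the actual mixed precision execution happens inside TensorRT.

```python
import numpy as np

# A toy batch of dense features for a recommender model (hypothetical shape).
batch = np.random.rand(1024, 256).astype(np.float32)

# Casting to half precision halves the memory footprint, which is one
# reason mixed precision inference cuts latency on bandwidth-bound layers.
batch_fp16 = batch.astype(np.float16)

print(batch.nbytes)       # FP32: 4 bytes per element
print(batch_fp16.nbytes)  # FP16: 2 bytes per element, half the footprint
```

In practice TensorRT selects per-layer precisions automatically when FP16 mode is enabled, keeping numerically sensitive layers in FP32.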
We leverage the common Wide and Deep architecture together with NVIDIA's TensorRT inference engine inside the TensorRT Inference Server for a production-quality inference deployment. This solution provides low-latency serving via REST and gRPC APIs, using model concurrency and batching. By optimally leveraging the GPU, the implementation reaches new levels of performance for deep-learning-based recommender models.
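To give a feel for the REST serving path, here is a minimal sketch of assembling an inference request body in the KServe-style v2 HTTP protocol used by Triton Inference Server (the successor to the TensorRT Inference Server; the server named in this document may expose an earlier API). The model name and input tensor names are hypothetical and would have to match the deployed model configuration.

```python
import json

# Hypothetical model name -- must match the name in the server's model repository.
MODEL_NAME = "wide_and_deep"

def build_infer_request(wide_features, deep_features):
    """Assemble a JSON body for POST /v2/models/<model>/infer
    (Triton/KServe v2 HTTP inference protocol)."""
    return {
        "inputs": [
            {
                "name": "wide_input",  # hypothetical tensor name
                "shape": [1, len(wide_features)],
                "datatype": "FP32",
                "data": wide_features,
            },
            {
                "name": "deep_input",  # hypothetical tensor name
                "shape": [1, len(deep_features)],
                "datatype": "FP32",
                "data": deep_features,
            },
        ]
    }

body = build_infer_request([0.1, 0.7], [0.3, 0.9, 0.5])
payload = json.dumps(body)  # what an HTTP client would POST to the server
```

The same request can be issued over gRPC, and the server's dynamic batcher can combine many such single-row requests into larger GPU batches transparently.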