GTC-DC 2019: Using the TensorRT Inference Server to Cut GPU Costs and Simplify Model Deployment (Presented by CACI)

Kyle Pula, CACI

We’ll share lessons from our experience scaling AI from research to production with the TensorRT Inference Server (TRTIS), aimed at engineers and managers dissatisfied with the complexity or cost efficiency of their model deployment architectures. We’ll show how TRTIS deploys a library of models across a collection of GPUs, which reduces the amount of custom inference code that must be maintained and makes better use of existing GPU resources. TRTIS also provides options for batching, ensemble models, and custom backends. Research and production teams alike will benefit from offloading some of their inference complexity to TRTIS.