High-Performance Inferencing at Scale Using the Triton Inference Server
David Goodwin, NVIDIA
GTC 2020
A critical task when deploying an inferencing solution at scale is to optimize latency and throughput to meet the solution's service-level objectives. We'll discuss some of the capabilities provided by the NVIDIA Triton Inference Server that you can leverage to reach these performance objectives. These capabilities include:

• Dynamic TensorFlow and ONNX model optimization using TensorRT
• Inference compute optimization using advanced scheduling and batching techniques
• Model pipeline optimization that communicates intermediate results via GPU memory
• End-to-end solution optimization using system or CUDA shared memory to reduce network I/O

For all these techniques, we'll quantify the improvements by providing performance results using the latest NVIDIA GPUs.
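To make the client side of such a deployment concrete, below is a minimal sketch of sending an inference request to a Triton server using the tritonclient HTTP Python client. The model name ("resnet50"), tensor names ("INPUT__0", "OUTPUT__0"), and shapes are hypothetical placeholders; the server-side optimizations the talk covers (dynamic batching, TensorRT acceleration, ensembles) are enabled in each model's configuration rather than in client code, and concurrent requests issued like this can then be batched together by the server's scheduler.

```python
# Minimal sketch of a Triton HTTP client request.
# Assumptions: a Triton server on localhost:8000 serving a hypothetical
# model named "resnet50" with one FP32 image input and one output tensor.
import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton server's HTTP endpoint.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare a random input batch matching the (assumed) model input shape.
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("INPUT__0", list(input_data.shape), "FP32")]
inputs[0].set_data_from_numpy(input_data)

# Request the (assumed) output tensor by name.
outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]

# Run inference; server-side scheduling/batching is transparent to the client.
result = client.infer(model_name="resnet50", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT__0").shape)
```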