GTC Silicon Valley 2019, ID: S9422: An Automatic Batching API for High Performance RNN Inference
We will describe a new API that more effectively utilizes GPU hardware for multiple single-instance inference requests of the same RNN model. Many NLP applications have real-time requirements across multiple independent inference instances. Our proposed API accepts independent inference requests from an application and seamlessly combines them into a single large-batch execution. Time steps from independent inference tasks are batched together so that we achieve high performance while staying within the application's latency budget for a time step. We will also discuss functionality that allows the user to wait on the completion of a particular time step, which is possible because our implementation is composed mainly of non-blocking function calls. Finally, we will present performance data from the Turing architecture for an example RNN model with LSTM cells and projections.
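To make the idea concrete, the following is a minimal sketch of the batching pattern the abstract describes: independent requests enqueue their next time step through non-blocking calls, the batcher fuses pending steps into one large-batch execution, and each caller can wait on its own step's completion via a future. All names here (`AutoBatcher`, `submit_step`, `step_fn`) are hypothetical and not part of the actual API being presented.

```python
import threading
from concurrent.futures import Future


class AutoBatcher:
    """Hypothetical sketch of automatic batching for per-request time steps.

    Independent inference requests submit one time step at a time; pending
    steps are fused into a single batched call to ``step_fn`` (standing in
    for an RNN cell executed on the GPU).
    """

    def __init__(self, step_fn, max_batch=8):
        self.step_fn = step_fn      # executes one time step for a whole batch
        self.max_batch = max_batch  # flush threshold (latency/throughput knob)
        self.pending = []           # (state, Future) pairs awaiting execution
        self.lock = threading.Lock()

    def submit_step(self, state):
        """Non-blocking: enqueue one time step for one inference instance.

        Returns a Future the caller can wait on for that step's result,
        mirroring the wait-on-time-step functionality described above.
        """
        fut = Future()
        with self.lock:
            self.pending.append((state, fut))
            if len(self.pending) >= self.max_batch:
                self._flush_locked()
        return fut

    def flush(self):
        """Run whatever is pending even if the batch is not full."""
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        if not self.pending:
            return
        states, futures = zip(*self.pending)
        self.pending = []
        # One large-batch execution covering all pending time steps.
        results = self.step_fn(list(states))
        for fut, out in zip(futures, results):
            fut.set_result(out)


# Toy usage: the "RNN step" just doubles each per-instance state.
batch_sizes = []

def toy_step(batch):
    batch_sizes.append(len(batch))
    return [x * 2 for x in batch]

batcher = AutoBatcher(toy_step, max_batch=2)
f1 = batcher.submit_step(1)
f2 = batcher.submit_step(2)   # batch is full -> both steps run together
f3 = batcher.submit_step(3)
batcher.flush()               # run the leftover step
print(f1.result(), f2.result(), f3.result())  # 2 4 6
print(batch_sizes)                            # [2, 1]
```

In a real implementation the flush would also be driven by a timer so that a partially filled batch still executes within the per-time-step latency budget; `max_batch` here is the simplest stand-in for that trade-off.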