Accelerate and Autoscale Deep Learning Inference on GPUs with KFServing
David Goodwin, NVIDIA | Dan Sun, Bloomberg
GTC 2020
Large-scale language models, such as BERT and GPT-2, have brought about exciting leaps in state-of-the-art accuracy for many NLP tasks. Because of its multi-head attention network, BERT requires significant compute during inference, which poses challenges for real-time application performance. KFServing provides model serving interfaces for common ML frameworks such as TensorFlow, XGBoost, SKLearn, PyTorch, ONNX, and NVIDIA's TensorRT. Built on Kubernetes CRDs and Knative, KFServing enables hardware acceleration and autoscaling for Bloomberg's own BERT models, which are trained on a corpus of specialized financial news data. We'll discuss how the Bloomberg Data Science Platform uses KFServing to address latency and scalability in a production application. In addition to its scalability features, KFServing provides a standardized data plane across model frameworks and servers. We'll also present the community proposal for a v2 REST/gRPC data plane, along with its integration in Triton Inference Server.