Microsoft Open Sources Breakthrough Optimizations for Large-Scale BERT Models
Emma Ning, Microsoft | Nathan Yan, Microsoft
We'll talk about BERT (Bidirectional Encoder Representations from Transformers), a popular deep-learning model for natural language processing. Running BERT inference at large scale has traditionally been extremely costly, and may not even be possible under strict latency constraints. We'll share examples of how we've scaled BERT with NVIDIA GPUs, such as Bing optimizing BERT inference for its real-time service, serving more than 1 million BERT inferences per second within Bing's latency limits. We'll also discuss the open-source, enhanced versions of these optimizations that are now available in ONNX Runtime and have been extended to work on both GPU and CPU. With ONNX Runtime, AI developers can now easily put large transformer models into production with high performance across both CPU and GPU hardware, using the same technology Microsoft uses to serve its customers.
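As a rough illustration of the "both GPU and CPU" point, here is a minimal sketch of how one might open a model in ONNX Runtime so that it uses the GPU when available and falls back to the CPU otherwise. It is not taken from the session itself: the helper names `pick_providers` and `load_session` and any model path passed to them are hypothetical, and the sketch assumes the `onnxruntime` (or `onnxruntime-gpu`) Python package is installed.

```python
from typing import List, Sequence

# Preference order: try the CUDA (NVIDIA GPU) provider first, then the CPU.
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(available: Sequence[str]) -> List[str]:
    """Return the preferred providers that are actually available, in order."""
    chosen = [p for p in PREFERRED if p in available]
    if not chosen:
        raise RuntimeError("No supported ONNX Runtime execution provider found")
    return chosen


def load_session(model_path: str):
    """Open an ONNX model (hypothetical path) with GPU-then-CPU fallback."""
    # Imported here so the helper above can be used without onnxruntime installed.
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

The same exported model file can then be served on either kind of hardware, with only the provider list changing between deployments.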