Deep into Triton Inference Server: BERT Practical Deployment on NVIDIA GPU
Tianhao Xu, NVIDIA
GTC 2020
We'll give an overview of the TensorRT Hyperscale Inference Platform. We'll start with a deep dive into its current features and internal architecture, then cover deployment options in a generic deployment ecosystem. Next, we'll give a hands-on overview of NVIDIA BERT, FasterTransformer, and TRT-optimized BERT inference. Then we'll get into how to deploy a BERT TensorFlow model with a custom op, how to deploy a BERT TensorRT model with plugins, and how to benchmark both. We'll finish with other optimization techniques and an open discussion.
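To give a flavor of the client side of such a deployment, below is a minimal sketch of sending a request to a BERT model served by Triton, using the tritonclient Python package over HTTP. The model name ("bert_trt") and the tensor names and shapes ("input_ids", "segment_ids", "input_mask", "logits") are illustrative assumptions; the actual names depend on how the model was exported and declared in its config.pbtxt.

```python
# Minimal sketch: query a Triton-served BERT model over HTTP.
# Assumes a Triton server running locally on the default HTTP port 8000,
# with a model named "bert_trt"; all tensor names below are hypothetical.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

seq_len = 128
# Dummy tokenized input; a real client would run a BERT tokenizer first.
input_ids = np.zeros((1, seq_len), dtype=np.int32)
segment_ids = np.zeros((1, seq_len), dtype=np.int32)
input_mask = np.ones((1, seq_len), dtype=np.int32)

inputs = []
for name, data in [("input_ids", input_ids),
                   ("segment_ids", segment_ids),
                   ("input_mask", input_mask)]:
    inp = httpclient.InferInput(name, list(data.shape), "INT32")
    inp.set_data_from_numpy(data)
    inputs.append(inp)

result = client.infer(model_name="bert_trt", inputs=inputs)
logits = result.as_numpy("logits")
print(logits.shape)
```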