Accelerating GNMT Inference on GPU
Maxim Milakov, NVIDIA | Jeremy Appleyard, NVIDIA
GTC 2020
Google Neural Machine Translation (GNMT) is one of the benchmarks in the MLPerf inference benchmark suite, representing Seq2Seq models. The benchmark measures throughput under latency constraints. We'll go through the challenges we faced at NVIDIA when implementing the GNMT benchmark and how we solved them on NVIDIA GPUs using the optimized and customizable TensorRT library. You'll learn the tricks we used to optimize the GNMT model, many of which are applicable to other auto-regressive models and to DL inference on GPUs in general.
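To give a sense of the TensorRT workflow the talk builds on, the sketch below shows a minimal engine build from an ONNX export with FP16 enabled, using the TensorRT Python API. This is an illustration only, not the implementation presented in the session: the file name "gnmt_encoder.onnx" and the function name build_engine are hypothetical placeholders, and the actual MLPerf submission uses a more heavily customized TensorRT setup than a plain ONNX parse.

```python
# Minimal sketch (assumption: TensorRT 8.x Python API is available and the
# model has been exported to ONNX; "gnmt_encoder.onnx" is a placeholder path).
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str) -> trt.ICudaEngine:
    """Parse an ONNX file and build a TensorRT engine with FP16 enabled."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            # Surface parser errors before giving up.
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # reduced precision for higher throughput

    serialized = builder.build_serialized_network(network, config)
    runtime = trt.Runtime(TRT_LOGGER)
    return runtime.deserialize_cuda_engine(serialized)

if __name__ == "__main__":
    engine = build_engine("gnmt_encoder.onnx")  # hypothetical export of one GNMT component
    # Inference would then use engine.create_execution_context() with device buffers.
```

In an auto-regressive model like GNMT, the decoder step runs inside a loop, so per-step overhead and kernel launch latency matter far more than in a single-pass network; this is the kind of consideration the session's optimization tricks address.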