GTC 2020: Accelerating GNMT Inference on GPU
Maxim Milakov, NVIDIA | Jeremy Appleyard, NVIDIA
Google Neural Machine Translation (GNMT) is one of the benchmarks in the MLPerf inference benchmark suite, where it represents Seq2Seq models. The benchmark measures throughput under latency constraints. We'll go through the challenges we faced at NVIDIA when implementing the GNMT benchmark and how we solved them on NVIDIA GPUs using the optimized and customizable TensorRT library. You'll learn the tricks we used to optimize the GNMT model, many of which apply to other auto-regressive models and to DL inference on GPU in general.