Scaling the Transformer Model Implementation in PyTorch Across Multiple Nodes
Mohammad Zulfiqar, NVIDIA | Robert Knight, NVIDIA
GTC 2020
We'll dive deep behind the scenes of the Transformer model implementation in PyTorch to understand its performance weaknesses and work to make it scale across multiple nodes. We'll describe an analysis of system-level profiling data from an example Transformer workload spanning multiple DGX-2 systems. We'll present the tools, collection methods, and data-analytics recipes used to evaluate massive amounts of profiling data and pinpoint the specific GPU and step of the algorithm causing issues. The methodology we describe can be applied, in general, to iterative DL and HPC workloads to achieve significant scaling gains.
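For context, a minimal sketch of the kind of per-step instrumentation that makes this analysis possible: annotating each training iteration with NVTX ranges so a system-level profiler such as Nsight Systems can attribute GPU activity to a specific step and phase of the algorithm. This is illustrative only, not the talk's actual code; the model, data, and range names are assumptions.

```python
# Sketch: NVTX-annotated training loop for per-step profiling (requires CUDA).
import torch
import torch.cuda.nvtx as nvtx

model = torch.nn.Linear(512, 512).cuda()          # stand-in for a Transformer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()

for step in range(10):
    x = torch.randn(32, 512, device="cuda")
    target = torch.randn(32, 512, device="cuda")

    nvtx.range_push(f"step_{step}")               # one range per iteration

    nvtx.range_push("forward")
    loss = loss_fn(model(x), target)
    nvtx.range_pop()

    nvtx.range_push("backward")
    optimizer.zero_grad()
    loss.backward()
    nvtx.range_pop()

    nvtx.range_push("optimizer")
    optimizer.step()
    nvtx.range_pop()

    nvtx.range_pop()                              # close the per-step range
```

A per-node profile could then be captured with something like `nsys profile -o node0 python train.py` and the exported data analyzed offline across nodes; the exact collection flags and analytics recipes are what the session covers.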