Optimization Strategies for Large-Scale DL Training Workloads: Case Study with RN50 on DGX Clusters
Mohammad Zulfiqar, NVIDIA | Joshua Mora Acosta, NVIDIA
GTC 2020
This tutorial presents a set of optimizations for large-scale DL training workloads. We'll cover performance metrics and performance modeling of the deep-learning neural network as the run scales, details of execution at large scale, and the performance of the hardware subsystems and software layers. We'll pair these with profiling tools (NVPROF, NSYS), NVTX tagging, and considerations for logging, parsing, visualizing, and analyzing the profiled information (for example, tradeoffs), in order to identify opportunities to improve performance at large scale and to guide and prioritize the optimization efforts. We'll showcase these optimization strategies by training ResNet-50 (RN50) on large clusters of DGX-1 and DGX-2 machines with up to 1,500 GPUs, where they delivered a 2x performance improvement on the same amount of hardware. You should be familiar with HW, SW, clusters, MPI, NCCL, profiling, deep-learning training, HPC, and performance metrics.