GTC 2020: Optimization Strategies for Large-Scale DL Training Workloads: Case Study with RN50 on DGX Clusters
Mohammad Zulfiqar, NVIDIA | Joshua Mora Acosta, NVIDIA
Our tutorial presents a set of optimizations for large-scale DL training workloads. We'll cover performance metrics and performance modeling of the deep-learning neural network as the run scales, details of executions at large scale, and the performance of the hardware subsystems and software layers. These are paired with profiling tools (nvprof, Nsight Systems), NVTX tagging, profile-logging considerations, and parsing, visualizing, and analyzing the profiled information (for example, trade-offs) to identify opportunities to improve performance at large scale and to guide and prioritize the optimization efforts. We'll showcase these optimization strategies on training ResNet-50 (RN50) on large clusters of DGX-1 and DGX-2 machines with up to 1,500 GPUs, where they delivered a 2x performance improvement on the same amount of hardware. You should be familiar with hardware, software, clusters, MPI, NCCL, profiling, deep-learning training, HPC, and performance metrics.
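As a taste of the NVTX tagging the session describes: a common pattern is to wrap the phases of a training step in named NVTX ranges so they appear as labeled spans on the nvprof/Nsight Systems timeline. The sketch below is a minimal illustration, assuming PyTorch's `torch.cuda.nvtx` bindings; the helper falls back to a no-op when they are unavailable, and the phase names (`forward`, `backward`, `allreduce`) are illustrative, not taken from the talk.

```python
import contextlib

try:
    # Real NVTX bindings when PyTorch with CUDA support is installed
    from torch.cuda import nvtx
except ImportError:
    nvtx = None  # fall back to no-op ranges so the code still runs anywhere

@contextlib.contextmanager
def nvtx_range(name):
    """Tag a code region so it shows up as a named range in the profiler timeline."""
    if nvtx is not None:
        nvtx.range_push(name)
    try:
        yield
    finally:
        if nvtx is not None:
            nvtx.range_pop()

def train_step(step):
    # Nested ranges let the profiler attribute time to each phase of the step
    with nvtx_range(f"step_{step}"):
        with nvtx_range("forward"):
            pass  # model forward pass goes here
        with nvtx_range("backward"):
            pass  # backpropagation goes here
        with nvtx_range("allreduce"):
            pass  # NCCL gradient all-reduce goes here
```

A run instrumented this way would then be captured with something like `nsys profile python train.py`, and the named ranges make it straightforward to see whether compute, communication, or data loading dominates each step.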