GTC 2020: Optimization Strategies for Large-Scale DL Training Workloads: Case Study with RN50 on DGX Clusters

Mohammad Zulfiqar, NVIDIA | Joshua Mora Acosta, NVIDIA

GTC 2020

This tutorial presents a set of optimizations for large-scale DL training workloads. We cover performance metrics and performance modeling of the neural network as the run scales out, details of execution at large scale, and the performance of the hardware subsystems and software layers. We pair these with profiling tools (NVPROF, NSYS), NVTX tagging, and considerations for logging, parsing, visualizing, and analyzing profiles (for example, to weigh tradeoffs), in order to identify opportunities to improve performance at large scale and to guide and prioritize the optimization effort. We showcase these strategies by training RN50 (ResNet-50) on large clusters of DGX-1 and DGX-2 machines with up to 1,500 GPUs, where they delivered a 2x performance improvement on the same amount of hardware. Attendees should be familiar with hardware, software, clusters, MPI, NCCL, profiling, deep-learning training, HPC, and performance metrics.
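As a rough illustration of the kind of performance metric the abstract refers to, the sketch below computes scaling efficiency and speedup from measured throughput. The function names and the throughput figures are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the talk.

```python
def scaling_efficiency(throughput_1gpu, throughput_ngpu, n_gpus):
    """Fraction of ideal linear scaling achieved on n_gpus.

    Throughput is in any consistent unit, e.g. images per second.
    """
    if n_gpus <= 0 or throughput_1gpu <= 0:
        raise ValueError("n_gpus and throughput_1gpu must be positive")
    ideal = n_gpus * throughput_1gpu  # perfect linear scaling
    return throughput_ngpu / ideal


def speedup(baseline_throughput, optimized_throughput):
    """Improvement factor on the same hardware."""
    if baseline_throughput <= 0:
        raise ValueError("baseline_throughput must be positive")
    return optimized_throughput / baseline_throughput


# Made-up numbers for illustration only (not results from the talk).
eff = scaling_efficiency(throughput_1gpu=100.0, throughput_ngpu=800.0, n_gpus=16)
gain = speedup(baseline_throughput=1000.0, optimized_throughput=2000.0)
print(f"scaling efficiency: {eff:.2f}, speedup: {gain:.1f}x")
```

Tracking both numbers separately helps distinguish a run that scales poorly from one that is simply slow per GPU, which is the kind of tradeoff the profiling workflow is meant to expose.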
