GTC 2020: AdaSum: Adaptive Summation of Gradients for Deep Learning
Saeed Maleki, Microsoft
AdaSum is a new algorithm for parallelized gradient aggregation based on the notion of sound model combiners. It brings the accuracy of parallel/distributed Stochastic Gradient Descent closer to that of sequential Stochastic Gradient Descent, yielding faster convergence. We've added AdaSum as a new gradient aggregation option for Horovod and are currently working with the Horovod team on a pull request to upstream it. The algorithm chooses different communication strategies based on the underlying hardware configuration, such as GPU, NVLink, and network topologies, to maximize bandwidth utilization. We'll discuss AdaSum in detail, walk through an example with Horovod, and present performance benchmarks showing the total training-time savings for different deep-learning models.
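Since the abstract describes AdaSum as a gradient aggregation option in Horovod, here is a minimal sketch of what enabling it from PyTorch might look like, assuming the `op=hvd.Adasum` option mentioned in the talk is available in your Horovod build. The tiny model, learning rate, and the `adasum_pair` helper (which illustrates one plausible form of the pairwise adaptive summation) are illustrative assumptions, not code taken from the talk.

```python
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Illustrative model and optimizer; substitute your own training setup.
model = torch.nn.Linear(784, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Ask Horovod to aggregate gradients with AdaSum instead of averaging them.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    op=hvd.Adasum,
)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)

def adasum_pair(g1, g2, eps=1e-8):
    """Pairwise adaptive summation of two gradient tensors (illustrative).

    Each gradient is scaled down by the component it shares with the other
    before the two are added, so near-parallel gradients are effectively
    averaged while orthogonal gradients are summed in full. The exact
    scaling used inside Horovod's AdaSum op may differ.
    """
    dot = torch.dot(g1.flatten(), g2.flatten())
    scale1 = 1.0 - dot / (2.0 * g1.norm() ** 2 + eps)
    scale2 = 1.0 - dot / (2.0 * g2.norm() ** 2 + eps)
    return scale1 * g1 + scale2 * g2
```

At scale, a pairwise operation like this would presumably be applied repeatedly across ranks, which is where the hardware-aware communication strategies mentioned in the abstract (GPU, NVLink, and network topologies) come into play.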