AdaSum: Adaptive Summation of Gradients for Deep Learning
Saeed Maleki, Microsoft
GTC 2020
AdaSum is a new algorithm for parallelized gradient aggregation based on the notion of sound model combiners. It brings the accuracy of parallel/distributed Stochastic Gradient Descent (SGD) closer to that of sequential SGD, yielding faster convergence. We've added AdaSum as a new gradient aggregation option for Horovod and are currently working with the Horovod team to upstream it through a PR. The algorithm selects different communication strategies based on the underlying hardware configuration, such as the GPU, NVLink, and network topology, to maximize bandwidth utilization. We'll discuss AdaSum in detail, walk through an example with Horovod, and present performance benchmarks demonstrating the total training-time savings for different deep-learning models.
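For a concrete starting point, below is a minimal sketch (not taken from the talk) of how the AdaSum aggregation option can be selected through Horovod's PyTorch API. It assumes a Horovod release that exposes hvd.Adasum (0.19 and later) and uses a toy model and random data purely for illustration.

import torch
import torch.nn as nn
import horovod.torch as hvd

# Initialize Horovod and pin each process to its local GPU.
hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Toy model and optimizer, for illustration only.
model = nn.Linear(32, 1)
if torch.cuda.is_available():
    model.cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Passing op=hvd.Adasum tells Horovod to combine gradients with AdaSum
# instead of the default averaging allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    op=hvd.Adasum,
)

# Start every rank from the same model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

loss_fn = nn.MSELoss()
for step in range(10):
    x = torch.randn(64, 32)
    y = torch.randn(64, 1)
    if torch.cuda.is_available():
        x, y = x.cuda(), y.cuda()
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()

Launched with, for example, "horovodrun -np 4 python train.py", each rank computes local gradients and Horovod aggregates them with AdaSum before the optimizer step; everything else in the training loop is unchanged from single-GPU code.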