Distributed Deep Learning with Horovod
Travis Addair, Uber Technologies
GTC 2020
Although frameworks like TensorFlow and PyTorch simplify the design and training of deep learning models, difficulties usually arise when scaling a model to multiple GPUs in a server or multiple servers in a cluster. We'll show how to scale distributed training of TensorFlow, PyTorch, and MXNet models with Horovod, a library designed to make distributed training fast and easy to use. We'll explain how Horovod takes a model designed for a single GPU and trains it on a cluster of GPU servers with just a few additional lines of Python code. We'll also explore how Horovod has been used across the industry to scale training to hundreds of GPUs, and the techniques used to maximize training performance.
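As a rough sketch of the pattern the abstract describes (not the speaker's exact code), the handful of lines Horovod adds to a single-GPU PyTorch training script typically look like this; the tiny `Linear` model is a hypothetical stand-in for a real network:

```python
import torch
import horovod.torch as hvd

hvd.init()                                    # initialize Horovod (one process per GPU)
torch.cuda.set_device(hvd.local_rank())       # pin this process to its local GPU

model = torch.nn.Linear(10, 1).cuda()         # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01 * hvd.size())  # scale learning rate by worker count

# Wrap the optimizer so gradients are averaged across all workers via allreduce.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state so every worker starts identically.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

The script is then launched across workers with Horovod's launcher, e.g. `horovodrun -np 4 python train.py` for four GPUs on one machine; the training loop itself stays essentially unchanged.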