
GTC-DC 2019: Multi-Node Training of Large Scale Language Models

Purnendu Mukherjee, NVIDIA; Thor Johnsen, NVIDIA
We’ll explain how the training of large-scale language models such as BERT, GPT-2, and XLNet requires massively parallel computation to achieve convergence within a reasonable amount of time. GPU-enabled multi-node training is necessary to meet these computational demands. We’ll present the tools we used to scale out the pre-training of these language models without losing accuracy: distributed training frameworks, improved optimizers for large-batch training, automatic mixed precision, cluster managers, and GPU profiling tools such as NVIDIA Nsight. We’ll also discuss common bottlenecks and how to avoid them.
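As a rough illustration of the kind of setup the session covers, the sketch below combines multi-node data parallelism with automatic mixed precision in PyTorch, using DistributedDataParallel and torch.cuda.amp. It is a minimal, assumed example: the model, data, and hyperparameters are placeholders and do not reflect the exact tooling used in the talk.

```python
# Minimal sketch of GPU-enabled multi-node training with mixed precision.
# Launch one process per GPU across nodes, e.g. with `torchrun`.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL backend for GPU collectives; LOCAL_RANK is set by the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder network standing in for a large language model.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    scaler = torch.cuda.amp.GradScaler()  # loss scaling for automatic mixed precision

    for step in range(10):
        inputs = torch.randn(8, 1024, device="cuda")   # dummy batch
        targets = torch.randn(8, 1024, device="cuda")

        optimizer.zero_grad()
        with torch.cuda.amp.autocast():                # forward pass in mixed precision
            loss = nn.functional.mse_loss(model(inputs), targets)
        scaler.scale(loss).backward()                  # DDP all-reduces gradients across nodes
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```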
