Purnendu Mukherjee, NVIDIA; Thor Johnsen, NVIDIA
We’ll explain how training large-scale language models such as BERT, GPT-2, and XLNet requires massively parallel computation to reach convergence in a reasonable amount of time. GPU-enabled multi-node training is necessary to meet these computational demands. We’ll present the tools we used to scale out the pre-training of these language models without losing accuracy: distributed training frameworks, improved optimizers for large-batch training, automatic mixed precision, cluster managers, and GPU profiling tools such as Nsight. We’ll also discuss common bottlenecks and how to avoid them.
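As a flavor of one technique mentioned above, here is a minimal sketch of a mixed-precision training step in PyTorch. This is purely illustrative and not code from the talk: it uses a toy linear model, and it runs `torch.autocast` on CPU with bfloat16 so the snippet works without a GPU (on a GPU you would typically use `device_type="cuda"` with float16 plus a `GradScaler`, and wrap the model in `DistributedDataParallel` for multi-node training).

```python
import torch

# Toy model and data (illustrative only; real pre-training uses a
# Transformer and tokenized text batches).
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 16)
y = torch.randn(8, 4)

opt.zero_grad()
# Forward pass under autocast: eligible ops run in reduced precision.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)
# Backward and update in full precision (master weights stay fp32).
loss.backward()
opt.step()
```

On CUDA devices with float16, a `torch.cuda.amp.GradScaler` is normally added around `backward()` and `step()` to avoid gradient underflow; bfloat16's wider exponent range makes scaling unnecessary in this CPU sketch.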