GTC 2020: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mostofa Patwary, NVIDIA | Raul Puri, NVIDIA
We'll cover an efficient model-parallel approach that requires only a few targeted modifications to existing PyTorch transformer implementations. Training ever-larger neural language models has recently been the most effective way to advance the state of the art in NLP applications. However, for models beyond a billion parameters, a single GPU doesn't have enough memory to hold the model along with its training state, so model parallelism is needed to split the parameters across multiple GPUs. We'll showcase our approach by training an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer-based language model trained to date. This model establishes new state-of-the-art results on downstream tasks.
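
To make the 8-way by 64-way layout concrete, here is a minimal sketch of how 512 GPU ranks could be partitioned into model-parallel and data-parallel groups. The function name `build_groups` and the choice to place consecutive ranks in the same model-parallel group are assumptions made for illustration; the abstract does not specify the rank layout.

```python
def build_groups(world_size, model_parallel_size):
    """Partition ranks into model-parallel and data-parallel groups.

    Hypothetical helper: assumes consecutive ranks share one model replica.
    """
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by model_parallel_size")

    # Each model-parallel group holds one full copy of the model,
    # with the parameters split across its member ranks.
    model_groups = [
        list(range(start, start + model_parallel_size))
        for start in range(0, world_size, model_parallel_size)
    ]

    # Each data-parallel group contains the ranks that hold the same
    # parameter shard in different model replicas; gradients for that
    # shard would be averaged within the group.
    data_groups = [
        list(range(offset, world_size, model_parallel_size))
        for offset in range(model_parallel_size)
    ]
    return model_groups, data_groups


model_groups, data_groups = build_groups(world_size=512, model_parallel_size=8)

# 64 model replicas (64-way data parallelism), each spread over 8 GPUs.
num_replicas = len(model_groups)
gpus_per_replica = len(model_groups[0])

# Rough per-GPU parameter count, ignoring any parameters that an
# implementation might replicate rather than split.
approx_params_per_gpu = 8.3e9 / gpus_per_replica
```

Under these assumptions, each GPU holds on the order of one billion parameters, which is the scale the abstract describes as the practical limit for a single GPU.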