Multi-GPU Workflows for Training AI Models in Academic Research

By Sandra Skaff, Global Head of Academic Research Partnerships at NVIDIA

AI and deep learning have taken over today’s headlines with aspirations to solve some of humanity’s most complex and challenging problems. Scientists are looking to scale compute resources to reduce the time it takes to train neural networks and to produce results in real time. Taking a multi-GPU approach brings researchers closer to a breakthrough, because they can experiment more rapidly with different neural networks and algorithms.

Implementing a multi-GPU workflow is easier than you might anticipate. Both PyTorch and TensorFlow provide tools and libraries, built on a distributed backend, that offer scalable distributed training and performance optimization for research and enterprise use.
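To make the idea of distributed training concrete, here is a minimal, framework-free sketch of synchronous data parallelism, the most common multi-GPU strategy. It is an illustration only: the "GPUs" are simulated in plain Python, and the helper names (`shard`, `all_reduce_mean`, `data_parallel_step`) and the toy data are assumptions made for this example, not PyTorch or TensorFlow APIs.

```python
# Toy data-parallel SGD for a 1-D linear model y = w * x.
# No GPU or framework is required; each "worker" stands in for one GPU.

def gradient(w, batch):
    """Gradient with respect to w of the mean squared error over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def shard(batch, num_workers):
    """Split a batch into equal shards (assumes len(batch) % num_workers == 0)."""
    size = len(batch) // num_workers
    return [batch[i * size:(i + 1) * size] for i in range(num_workers)]

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    return sum(values) / len(values)

def data_parallel_step(w, batch, num_workers, lr):
    """One synchronous step: local gradients are averaged, then applied once."""
    local_grads = [gradient(w, s) for s in shard(batch, num_workers)]
    return w - lr * all_reduce_mean(local_grads)

# Hypothetical training data generated from y = 3x (8 examples).
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w_single = 0.0  # trained on the full batch by one worker
w_multi = 0.0   # trained with the batch sharded over 4 simulated workers
for _ in range(200):
    w_single = w_single - 0.01 * gradient(w_single, data)
    w_multi = data_parallel_step(w_multi, data, num_workers=4, lr=0.01)

print(w_single, w_multi)
```

Because the shards are the same size, averaging the per-worker gradients gives the full-batch gradient, so both runs follow the same trajectory. In a real framework the workers compute their gradients concurrently, which is where the time savings come from.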

How AI Researchers in Academia are Using Multi-GPU Workflows

Below we highlight work from a few members of NVAIL, NVIDIA’s academic research partners, who are leveraging multi-GPU training in their research.

National Taiwan University used multi-GPU training for their latest free-form video inpainting work, which was initially published at ICCV 2019, with a later version published at BMVC 2019. They proposed using a generator network with 3D gated convolutions to inpaint videos and a Temporal PatchGAN to enhance video quality and temporal consistency.

Multi-GPU training cut the training time in half, from 10-20 days to 5-10 days per model. The training was performed on 2 NVIDIA V100 GPUs with 16 GB of memory each. The authors anticipate needing more GPUs with more memory if they move to higher resolutions and longer frame sequences. The authors make the source code available on GitHub.
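The training times reported above can be turned into a speedup and a scaling efficiency with a few lines of arithmetic. The sketch below assumes the 10-20 day figure is a single-GPU baseline, which the text implies but does not state; the function names are illustrative.

```python
def speedup(baseline_time, multi_gpu_time):
    """How many times faster the multi-GPU run is than the baseline."""
    return baseline_time / multi_gpu_time

def scaling_efficiency(baseline_time, multi_gpu_time, num_gpus):
    """Fraction of ideal (linear) scaling achieved; 1.0 means perfect scaling."""
    return speedup(baseline_time, multi_gpu_time) / num_gpus

# (baseline days, 2-GPU days) at the low and high ends of the reported range.
reported = [(10, 5), (20, 10)]
results = []
for baseline_days, multi_days in reported:
    s = speedup(baseline_days, multi_days)
    e = scaling_efficiency(baseline_days, multi_days, num_gpus=2)
    results.append((s, e))
    print(f"{baseline_days} -> {multi_days} days: speedup {s:.1f}x, efficiency {e:.0%}")
```

Under that assumption, both ends of the range correspond to a 2x speedup on 2 GPUs, which is ideal linear scaling. Efficiency typically falls below 100% as more GPUs are added and communication overhead grows.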

NTU’s video inpainting results

Oxford University performed multi-GPU training of Bayesian deep learning models for semantic segmentation. Their research proposed three metrics for evaluating the uncertainty estimates of semantic class predictions obtained by Bayesian deep learning. They created two probabilistic versions of the DeepLab-v3+ network and trained them using model parallelism on a DGX-1 comprising 8 NVIDIA P100 GPUs. Training took 3 days.

This work can be extended to develop metrics that account for making safe and correct autonomous driving decisions, as well as for other applications. In addition to this work, Oxford continues to push the boundaries of several other deep learning research areas using multi-GPU training.
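Model parallelism, the strategy mentioned for the Oxford work, places different parts of one network on different GPUs and passes activations between them. The sketch below is a generic, framework-free illustration with simulated devices and scalar "layers". It does not reflect how the Oxford team actually partitioned DeepLab-v3+, which the text does not describe, and all names in it are hypothetical.

```python
class Device:
    """Simulated accelerator that holds a contiguous slice of the model's layers."""

    def __init__(self, name, weights):
        self.name = name
        self.weights = weights  # one scalar weight per layer stored on this device

    def forward(self, activation):
        # Each toy layer is a scalar multiply followed by a ReLU.
        for w in self.weights:
            activation = max(0.0, w * activation)
        return activation

def split_layers(weights, num_devices):
    """Assign contiguous chunks of layers to devices, in order."""
    per_device = -(-len(weights) // num_devices)  # ceiling division
    return [
        Device(f"gpu{i}", weights[i * per_device:(i + 1) * per_device])
        for i in range(num_devices)
    ]

def model_parallel_forward(devices, x):
    """Run the forward pass, handing the activation from one device to the next."""
    for device in devices:
        x = device.forward(x)
    return x

# Hypothetical 4-layer model split across 2 simulated GPUs.
layer_weights = [0.5, 2.0, 1.5, 4.0]
devices = split_layers(layer_weights, num_devices=2)
output = model_parallel_forward(devices, 1.0)

# The same layers on a single device give the same result.
reference = Device("single", layer_weights).forward(1.0)
print(output, reference)
```

The split changes where each layer lives, not what the network computes. The practical benefit is memory: each device stores only its own layers, so a model too large for one GPU can still be trained.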

New York University benefitted from multi-GPU training in its research on large-scale NLP models. These models used data parallelism to distribute the training of a single model over multiple GPUs.

In fact, one of their papers was motivated by the drawback of not being able to use distributed training to speed up convergence. The NYU team thus introduced non-autoregressive models, which allow forward and backward propagation in neural networks to be parallelized.

In their EMNLP 2018 paper, “Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement,” they showed that these models strike a good balance between performance and speed: they reach 90-95% of the performance of autoregressive models while being 2-3 times faster and more general. The code is available on GitHub.

In another paper, published at ICML 2019, the NYU team proposed a framework for training text generation models that operate in non-monotonic order. Such models can generate text without a pre-specified generation order and still achieve performance competitive with left-to-right generation. The models were trained on 4 NVIDIA TITAN-X GPUs for 48 hours and were tested on several tasks. In machine translation, for example, the model was shown to learn to translate in a way that preserves meaning but not word order. The code is available on GitHub.

Finally, the NYU team has a few research projects utilizing multi-GPU training underway, which will be published soon. In one project, they are training a language model on 128 NVIDIA V100 GPUs, using 86B amino acids across 250M sequences, in order to predict biological structures. In a second project, they are training large models on multiple GPUs that learn to generate molecular conformations given a molecular graph.

These teams and several other academic groups continue to push the envelope in these and other research areas using multi-GPU training.

Looking to get started with your own multi-GPU workflow for your next AI project? Download our containers from NVIDIA GPU Cloud (NGC) to get started now.