AI / Deep Learning

Scaling Language Model Training to a Trillion Parameters Using Megatron

Apr 12, 2021

By Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand and Bryan Catanzaro

Discuss (1)

AI-Generated Summary

Dislike

The training of large NLP models is challenging due to memory constraints and high computational requirements, but techniques like tensor and pipeline model parallelism can help.
By combining tensor model parallelism within a node and pipeline model parallelism across nodes with data parallelism, it's possible to scale up to models with a trillion parameters on NVIDIA's Selene supercomputer.
Using 8-way tensor parallelism and 8-way pipeline parallelism on 1024 A100 GPUs, a GPT-3 model with 175 billion parameters can be trained in just over a month, achieving an end-to-end per GPU throughput of 163 teraFLOPs.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Natural Language Processing (NLP) has seen rapid progress in recent years as computation at scale has become more available and datasets have become larger. At the same time, recent work has shown large language models to be effective few-shot learners, with high accuracy on many NLP datasets without additional finetuning. As a result, state-of-the-art NLP models have grown at an exponential rate (Figure 1). Training such models, however, is challenging for two reasons:

It is no longer possible to fit model parameters in the main memory of even the largest GPU.
Even if the model can be fitted in a single GPU (for example, by swapping parameters between host and device memory), the high number of compute operations required can result in unrealistically long training times without parallelization. For example, training a GPT-3 model with 175 billion parameters would take 36 years on eight V100 GPUs, or seven months with 512 V100 GPUs.

In our previous post on Megatron, we showed how tensor (intralayer) model parallelism can be used to overcome these limitations. Although this approach works well for models of sizes up to 20 billion parameters on DGX A100 servers (with eight A100 GPUs), it breaks down for larger models. Larger models need to be split across multiple DGX A100 servers, which leads to two problems:

The all-reduce communication required for tensor model parallelism needs to go through inter-server links, which are slower than the high-bandwidth NVLink links available within a DGX A100 server
A high degree of model parallelism can lead to small GEMMs, potentially decreasing GPU utilization.

To overcome these limitations, we combined tensor model parallelism with pipeline (interlayer) (model) parallelism. Pipeline parallelism was initially used in PipeDream and GPipe, and is now also available in systems such as DeepSpeed. We used tensor model parallelism inside a DGX A100 server and pipeline parallelism across DGX A100 servers. Figure 2 shows this combination of tensor and pipeline model parallelism. By combining these two forms of model parallelism with data parallelism, we can scale up to models with a trillion parameters on the NVIDIA Selene supercomputer (Figure 3). Models in this post are not trained to convergence. We only performed a few hundred iterations to measure time per iteration.

We saw an aggregate throughput improvement of 114x when moving from a ~1-billion-parameter model on 32 GPUs to a ~1-trillion-parameter model on 3072 A100 GPUs. Using 8-way tensor parallelism and 8-way pipeline parallelism on 1024 A100 GPUs, the GPT-3 model with 175 billion parameters can be trained in just over a month. On a GPT model with a trillion parameters, we achieved an end-to-end per GPU throughput of 163 teraFLOPs (including communication), which is 52% of peak device throughput (312 teraFLOPs), and an aggregate throughput of 502 petaFLOPs on 3072 A100 GPUs.

Our implementation is open source on the NVIDIA/Megatron-LM GitHub repository, and we encourage you to check it out! In this post, we describe the techniques that allowed us to achieve these results. For more information, see our paper, Efficient Large-Scale Language Model Training on GPU Clusters.

Pipeline parallelism

With pipeline parallelism, the layers of a model are partitioned across multiple devices. When used on repetitive transformer-based models, each device can be assigned an equal number of transformer layers. A batch is split into smaller microbatches; execution is then pipelined across microbatches. To retain vanilla optimizer semantics, we introduce periodic pipeline flushes so that optimizer steps are synchronized across devices. At the start and end of every batch, devices are idle. We call this idle time the pipeline bubble and want to make it as small as possible.

There are several possible ways of scheduling forward and backward microbatches across devices, and each approach offers different tradeoffs between pipeline bubble size, amount of communication, and memory footprint. We discuss two such approaches in this post.

Default schedule

GPipe proposes a schedule where the forward passes for all microbatches in a batch are first executed, followed by backward passes for all microbatches. Figure 4 shows a picture of this schedule. This approach has a high memory footprint as it requires stashed intermediate activations (or just input activations for each pipeline stage when using activation recomputation) for all microbatches in a batch. For the large batch sizes that are typically required to amortize away the cost of the pipeline bubble, this schedule is impractical due to its high memory footprint.

In Figure 4, for simplicity, we assumed that the backward pass takes twice as long as forward pass (wgrad, dgrad). The efficiency of the pipeline schedule is independent of this ratio. Each batch in this example consists of eight microbatches, and the number in each box is a unique identifier given to a microbatch. The optimizer is stepped, and weight parameters updated at the pipeline flush.

Instead, we used the PipeDream-Flush schedule. In this schedule, workers first enter a warm-up phase (Figure 5). This schedule limits the number of in-flight microbatches (the number of microbatches for which the backward pass is outstanding and activations need to be maintained) to the depth of the pipeline, instead of the number of microbatches in a batch. After the warm-up phase, each worker then enters steady state, where workers perform one forward pass followed by one backward pass (1F1B for short). Finally, at the end of a batch, workers complete backward passes for all remaining in-flight microbatches.

You can quantify the pipeline bubble size (\(t_{pb}\)). Denote the number of microbatches in a batch as \(m\), the number of pipeline stages (equal to the number of devices used for pipeline model parallelism) as \(p\), the ideal time per iteration as \(t_{id}\), and the time to execute a single microbatch’s forward and backward pass as \(t_f\) and \(t_b\). For the schedules in Figures 4 and 5, the pipeline bubble consists of \(p-1\) forward passes at the start of a batch, and \(p-1\) backward passes at the end. The total amount of time spent in the pipeline bubble is then:

\(t_{pb} = (p-1) \cdot (t_f + t_b)\)

The ideal processing time for a batch is:

\(t_{id} = m \cdot (t_f + t_b)\)

Therefore, the fraction of total time spent in the pipeline bubble can be computed as:

\(\text{bubble time fraction} = \dfrac{t_{pb}}{t_{id}} = \dfrac{p-1}{m}\)

For the bubble time fraction to be small, \(m\) must be much larger than \(p\). The number of outstanding forward passes in this schedule is at most the number of pipeline stages. As a result, this schedule requires activations to be stashed for \(p\) or fewer microbatches (compared to \(m\) microbatches for the schedule in Figure 4). When \(m \gg p\), the schedule in Figure 5 is consequently much more memory-efficient than the one in Figure 4. Both schedules have the same pipeline bubble size.

Interleaved schedule

To reduce the size of the pipeline bubble, each device can perform computation for multiple subsets of layers (called a model chunk), instead of a single contiguous set of layers. For example, four layers of computation on each device can be split into two model chunks, each with two layers, instead of each device having a contiguous set of layers (device 1 has layers 1-4, device 2 has layers 5-8, and so on). With this scheme, each device in the pipeline is assigned multiple stages.

As before, you can use an all-forward, all-backward version of this schedule, but this has a high memory footprint. Instead, we developed a more memory-efficient, 1F1B version of this interleaved schedule. This new schedule is shown in Figure 6 and requires the number of microbatches in a batch to be an integer multiple of the degree of pipeline parallelism (number of devices in the pipeline). For example, with four devices, the number of microbatches in a batch must be a multiple of 4.

As shown in Figure 6, the pipeline flush for the same batch size happens sooner in the new schedule. If each device has \(v\) stages (or model chunks), then the forward and backward time for a microbatch for each stage will now be \(t_f / v\) and \(t_b / v\), respectively. Thus, the pipeline bubble time reduces to:

\(\text{bubble time} = \dfrac{(p-1)\cdot(t_f + t_b)}{v}\)

This reduced pipeline bubble size, however, does not come for free: this schedule requires extra communication. Quantitatively, the amount of communication increases by \(v\) since the total number of stages also increases by \(v\). In the next section, we discuss how to use the eight InfiniBand networking cards in a DGX A100 node to reduce the impact of this extra communication.

Optimized internode communication using DGX A100 8 InfiniBand networking cards

In a pipelined setup, tensors must be sent and received in the forward and backward direction in parallel. Each DGX A100 is equipped with eight InfiniBand (IB) networking cards. Unfortunately, sends and receives are point-to-point, and only happen between a pair of GPUs on two servers. This makes it hard to leverage all eight cards in a single communication call.

However, you can make use of the fact that you want to use both tensor model parallelism and pipeline model parallelism to reduce the overhead of cross-node communication. The output of each transformer layer is replicated across the tensor-parallel ranks (Figure 4 from the Megatron paper). As a result, ranks in two consecutive pipeline stages that are performing tensor model parallelism send and receive the exact same set of tensors.

For large enough models, we used a tensor-model-parallel size of 8. This means that the same set of tensors are sent eight times between corresponding GPUs on adjacent multi-GPU servers. To reduce this redundancy, we instead split the tensor on the send side into equal-sized chunks and then only sent the chunk to the corresponding rank on the next node, using the rank’s dedicated IB card. With eight tensor-model-parallel ranks, each chunk is 8x smaller.

On the receiver side, to re-materialize the full tensor, we performed an all-gather over NVLink, which is much faster than the IB interconnect (Figure 7). We call this the scatter/gather communication optimization. This optimization helps better leverage the multiple IB cards on the DGX A100 servers and makes more communication-intensive schedules feasible, such as the interleaved one. The chunking optimization used here is like the activation partitioning technique to reduce memory footprint and pipeline-parallel communication from ZeRO and DeepSpeed.

Performance microbenchmarks for pipeline parallelism

In this section, we evaluated the computational performance of these pipeline-parallel schemes. This section does not use data parallelism, but we show results with both data and model parallelism later in this post.

Weak-scaling of pipeline parallelism

We first evaluated the weak-scaling performance of the default non-interleaved pipeline-parallel schedule using a GPT model with 128 attention heads, a hidden size of 20480, and a microbatch size of 1. As we increased the number of pipeline stages, we also increased the size of the model by proportionally increasing the number of layers in the model. For example, with a pipeline-parallel size of 1, we used a model with three transformer layers and ~15 billion parameters. With a pipeline-parallel size of 8, we used a model with 24 transformer layers and ~121 billion parameters. We used a tensor-parallel size of 8 for all configurations and varied the total number of A100 GPUs used from 8 to 64.

Figure 8 shows throughput per GPU for two different batch sizes. The peak device throughput of an A100 GPU is 312 teraFLOPs. As expected, the higher batch size scales better because the pipeline bubble is amortized over more microbatches (equal to batch size).

The number of floating point operations (numerator of throughput) is computed analytically based on the model architecture, taking activation recomputation into account.

Tensor vs. pipeline parallelism

We also evaluated the impact of parallelization configuration on performance for a given model and batch size. The empirical results in Figure 9 show the importance of using both tensor and pipeline model parallelism in conjunction to train a 161-billion-parameter GPT model (32 transformer layers, 128 attention heads, hidden size of 20480) with low communication overhead and high compute resource utilization. Tensor model parallelism is best within a node (DGX A100 server) due to its expensive all-reduce communication.

Pipeline model parallelism, on the other hand, uses much cheaper point-to-point communication that can be performed across nodes without bottlenecking the entire computation. However, pipeline parallelism can spend significant time in the pipeline bubble. The total number of pipeline stages should thus be limited so that the number of microbatches in the pipeline is a reasonable multiple of the number of pipeline stages. Consequently, we saw peak performance when the tensor-parallel size was equal to the number of GPUs in a single node (8 on Selene with DGX A100 nodes).

Scatter-gather optimization for communication

Figure 10 shows per-GPU throughput with and without (unoptimized) the scatter/gather communication optimization for a GPT model with 175 billion parameters (96 attention heads, hidden size of 12288, and 96 transformer layers). End-to-end throughput improves by up to 11% for communication-intensive schedules (large batch size with interleaving). This highlights the importance of the DGX A100 eight IB cards in achieving high training throughput for large models.

Interleaved vs. non-interleaved schedule

Figure 11 shows the per-GPU throughput for interleaved and non-interleaved schedules on the same GPT model with 175 billion parameters. The interleaved schedule with the scatter/gather communication optimization has higher computational performance than the non-interleaved (default) schedule. This gap closes as the batch size increases, due to two reasons:

As the batch size increases, the bubble size in the default schedule decreases.
The amount of point-to-point communication within the pipeline is proportional to the batch size. Consequently, the non-interleaved schedule catches up as the amount of communication increases.

Without the scatter/gather optimization, the default schedule performs better than the interleaved schedule at larger batch sizes (not shown).

End-to-end scaling using model and data parallelism

Consider the weak-scaling performance of Megatron on GPT models ranging from a billion to a trillion parameters. We used tensor, pipeline, data parallelism, and the interleaved pipeline schedule with the scatter/gather optimization enabled. All models used a vocabulary size of 51,200 (multiple of 1024) and a sequence length of 2048. We varied the hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increased, we also increased the batch size and the number of GPUs.

Table 1 shows the model configurations along with the achieved FLOPs per second (both per GPU and aggregate over all GPUs). We saw almost-perfect linear scaling to 3072 A100 GPUs (384 DGX A100 nodes), as shown in Figure 1. Throughput is measured for end-to-end training: all operations including data loading, optimization, and logging. The largest case achieves 52% of the theoretical peak FLOPs.

Model size	Hidden size	Number of layers	Model-parallel size	Number of GPUs	Batch size	Achieved teraFlOPs per GPU
1.7B	2304	24	1	32	512	137
3.6B	3072	30	2	64	512	138
7.5B	4096	36	4	128	512	142
18B	6144	40	8	256	1024	135
39B	8192	48	16	512	1536	138
76B	10240	60	32	1024	1792	140
145B	12288	80	64	1536	2304	148
310B	16384	96	128	1920	2160	155
530B	20480	105	280	2520	2520	163
1T	25600	128	512	3072	3072	163

Table 1. Weak-scaling throughput for GPT-3 models ranging from 1 billion to 1 trillion parameters.

Finally, based on the measured throughputs from Table 1, you can estimate the training time. The time required to train a GPT-based language model with \(P\) parameters using \(T\) tokens on \(N\) GPUs with per-GPU throughput of \(X\) can be estimated as follows:

\(\text{Training time (seconds)} \approx 8 \cdot \dfrac{TP}{NX}.\)

For the 1 trillion parameter model, assume that you need about 450 billion tokens to train the model. Using 3072 A100 GPUs with 163 teraFLOPs / GPU, you require:

\(\text{1T model training time} \approx 8 \cdot \dfrac{450 \times 10^9 \times 1008 \times 10^9}{3072 \times 163 \times 10^{12}} \approx 84 \text{ days},\)

This is less than three months. The GPT-3 model with 175 billion parameters requires just over a month to train using 1024 A100 GPUs. These results indicate that it is feasible to train such large models in a reasonable amount of time with this system. For more information, see Efficient Large-Scale Language Model Training on GPU Clusters.

Summary

In this post, we outlined various techniques that facilitate the training of NLP models with up to a trillion parameters, using a smart combination of different parallelization strategies:

Intranode tensor model parallelism
Internode pipeline model parallelism
Data parallelism

Going forward, we want to further optimize our pipelining schedules. For example, we expect a throughput improvement if we can compute data gradients before weight gradients, as tensors can be sent upstream earlier, leading to a cheaper pipeline flush. We also want to explore the tradeoffs associated with hyperparameters such as microbatch size, global batch size, and the degree of activation recomputation on throughput. Finally, we want to train models to convergence, and better understand the implications of using schedules without pipeline flushes, such as PipeDream-2BW, which has relaxed weight update semantics.

Discuss (1)

About the Authors

About Deepak Narayanan
Deepak Narayanan is a senior applied deep learning research scientist in the ADLR group at NVIDIA, where he looks at making the training and inference of LLMs faster and more reliable. He holds a PhD in Computer Science from Stanford University.

View all posts by Deepak Narayanan

About Mohammad Shoeybi
Mohammad Shoeybi is a senior research scientist and manages the NLP team within the Applied Deep Learning Research group at NVIDIA. His team focuses on language modeling, NLP applications such as question answering and dialogue systems, and large-scale training. He received his PhD. from Stanford University in 2010. Prior to NVIDIA, he worked at DeepMind and Baidu USA leading efforts on bringing deep learning and reinforcement learning to applications.

View all posts by Mohammad Shoeybi

About Jared Casper
Jared works as a Senior Deep Learning Scientist in the Applied Deep Learning Research team at NVIDIA. Prior to joining NVIDIA in 2017, Jared worked on systems for deep learning at Baidu’s Silicon Valley AI Lab and on domain-specific hardware accelerators at Oracle Labs. Jared received his Ph.D. from Stanford in 2015 focusing on Computer Architecture.

View all posts by Jared Casper

About Patrick LeGresley
Patrick LeGresley is a researcher in the NLP group within Applied Deep Learning Research at NVIDIA, where he focuses on system aspects of training large language models. He received his PhD in Aeronautics and Astronautics from Stanford University. In the past, he also worked at the Baidu Silicon Valley AI Lab, working on systems research and deep learning for speech recognition.

View all posts by Patrick LeGresley

About Mostofa Patwary
Mostofa Patwary is a senior deep learning research scientist at the Applied Deep Learning Research team at NVIDIA. Mostofa's research interests span in the areas of natural language processing, scalable deep learning, HPC, and algorithm engineering. Prior to joining NVIDIA, Mostofa worked on scaling large language models and the predictability of scaling deep learning applications at Baidu's Silicon Valley AI Lab. Mostofa also made significant contributions in developing large-scale code for several core kernels in machine learning capable of running on supercomputers.

View all posts by Mostofa Patwary

About Vijay Korthikanti
Vijay Korthikanti is a senior research scientist in the Applied Deep Learning Research group at NVIDIA. Vijay's research interests are in areas of scalable deep learning, HW-SW codesign, parallel algorithms, and formal verification. Prior to joining NVIDIA, Vijay worked at two HW startups, Nervana Systems and Cerebras Systems, focusing on parallel algorithm development, HW-SW codesign, and end-to-end deep learning acceleration. Earlier in his career, he spent a few years at Synopsys on their formal verification team. Vijay received his Ph. D from UIUC in 2011 focusing on energy-performance tradeoffs of parallel applications.

View all posts by Vijay Korthikanti

About Dmitri Vainbrand
Dmitri Vainbrand is a senior deep learning architect at GPU DL Architecture group at NVIDIA. Dmitri is working on GPU HW and system architecture, focusing on large deep learning networks, systems, and scale-out. Dmitri received his MSc from Technion in 2011 in the field of network-on-chip architectures for neural networks. Prior to NVIDIA, he worked at Intel, where he took part in the development of five generations of CPU Core processors and Intel’s first AI accelerator.

View all posts by Dmitri Vainbrand

About Bryan Catanzaro
Bryan Catanzaro is vice president of Applied Deep Learning Research at NVIDIA, where he leads a team finding new ways to use AI to improve projects ranging from language understanding to computer graphics and chip design. Bryan's research at NVIDIA led to the creation of CUDNN, and more recently, he helped lead the team that invented DLSS 2.0. Prior to NVIDIA, he worked at Baidu to create next-generation systems for training and deploying end-to-end, deep learning-based speech recognition. Bryan received his PhD in Electrical Engineering and Computer Sciences from the University of California, Berkeley.

View all posts by Bryan Catanzaro