
Speeding Up Variable-Length Training with Dynamic Context Parallelism and NVIDIA Megatron Core


This post introduces Dynamic Context Parallelism (Dynamic-CP), a scheduling approach in NVIDIA Megatron Core used for LLM post-training or DiT pre-training. It dynamically selects the CP size per microbatch to efficiently handle variable-length sequences, achieving up to 1.48x speedup on real-world datasets.

In large-scale model training, an often-overlooked bottleneck arises from the sequence-length variability in real-world datasets. Both LLM training and large-scale video generation have clear long-tail distributions in sequence length: a small fraction of ultra-long samples accounts for a disproportionately large share of the computational workload and memory consumption.

In LLM training, this leads to wide-ranging text sequence lengths across batches. In video generation, high-resolution, multi-second videos can span tens of thousands of tokens. This results in imbalanced sample-level FLOPs and memory usage across data-parallel ranks, modalities, and micro-batches, hindering efficient scheduling and resource utilization.

To manage variable-length inputs, training systems commonly use sample-level packing, which combines multiple shorter sequences into a single micro-batch whose total token length is bounded by a target sequence length. In Figure 1, the sequences are packed to an equal length.

Diagram that explains unpacked samples and packed samples.
Figure 1. Unpacked compared to packed sequences
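
To make the packing step concrete, here is a minimal first-fit packing sketch in Python. It assumes only a list of sequence lengths and a target packed length; the function name and strategy are illustrative and not the packing algorithm used in Megatron Core.

```python
# Minimal first-fit packing sketch: group variable-length sequences into
# packed samples whose total token count stays within a target length.
# Illustrative only; not the Megatron Core packing implementation.
def pack_sequences(seq_lens, target_len):
    packs = []  # each pack is a list of sub-sequence lengths
    for s in sorted(seq_lens, reverse=True):
        for pack in packs:
            if sum(pack) + s <= target_len:
                pack.append(s)
                break
        else:
            packs.append([s])
    return packs

# Example: long-tail lengths packed into 4K-token samples.
print(pack_sequences([4096, 2048, 1024, 1024, 512, 512, 256], target_len=4096))
# [[4096], [2048, 1024, 1024], [512, 512, 256]]
```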

Even though the three packed samples are the same length, their compute workloads aren’t equivalent, as shown in Figure 2, due to the quadratic nature of dot product attention. This variation in compute workload across packed samples is known as data-parallel (DP) computational imbalance. This imbalance causes GPU idling, as some DP ranks wait for others with higher compute workloads to perform gradient synchronization. It also exacerbates pipeline-parallel (PP) bubbles.

The attention compute workload per sample in packed sequences varies, leading to GPU idling.
Figure 2. Attention compute imbalance among packed sequences
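
To see why equal packed length doesn't imply equal compute, note that attention is computed within each sub-sequence, so the attention cost of a packed sample scales with the sum of the squared sub-sequence lengths. A rough proxy (not an exact FLOPs formula):

```python
# Rough proxy for per-pack attention cost: attention is restricted to each
# sub-sequence, so cost scales with the sum of squared sub-sequence lengths.
def attention_cost(pack):
    return sum(s * s for s in pack)

# Three packs with the same total length (4096 tokens) but very different cost.
for pack in ([4096], [2048, 1024, 1024], [1024] * 4):
    print(pack, attention_cost(pack))
# [4096]                   -> 16777216
# [2048, 1024, 1024]       ->  6291456
# [1024, 1024, 1024, 1024] ->  4194304
```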

Figure 3 shows an NVIDIA Nsight Systems profile that captures this imbalance in VLM training, where image and video samples have variable sequence lengths and packing is employed. The capture shows synchronization overhead across different DP groups.

This diagram shows the Nsight System profiling timeline capture, with sync overhead caused by an imbalance in variable sequence lengths.
Figure 3. Sync overhead across different DP groups

Also, when using context parallelism, the CP sharding size is determined by the longest sequence in the batch to avoid out-of-memory errors across GPUs. As a result, shorter sequences that don’t require context parallelism are sharded as well: even though they fit on a single GPU, they’re partitioned because of a longer sequence in the same batch, resulting in unnecessary CP communication overhead.

Usually, computation hides CP communication. However, when CP sizes are large—especially when communication spans InfiniBand (IB) domains—the communication overhead can become exposed when packed sequences are short and the compute workload is small. This is CP computational inefficiency.

Figure 4 shows an example where TP2CP8 is required due to the large total sequence length. Many packed sequences are made of smaller sub-sequences and don’t have enough compute to hide the CP communication.

Diagram shows the capture of the Nsight System profiling timeline with large CP for all sequences. For some samples, the NCCL related kernel is longer than the cuDNN related compute kernel. CP communications cannot be hidden by the computation process.
Figure 4. Insufficient compute to hide CP communication under packing

These observations show the need for a dynamic approach to context parallelism. Instead of statically fixing the CP size to the longest sequence in a micro-batch, this approach adapts the CP size using the packing strategy per micro-batch. Relevant work, such as ByteScale and WLB-LLM, addresses similar problems.

Switching the CP size requires re-partitioning the sequence slices and re-forming the CP communication groups used by attention operations. Compared to alternative dynamic-parallelism schemes—such as adapting tensor-parallel or pipeline-parallel sizes based on sequence length—Dynamic-CP adds minimal overhead, because resizing TP/PP requires weight redistribution or pipeline graph restructuring, which are expensive.

Given a set of variable-length sequences, the solver determines how to pack them and which CP size to use, maximizing computational efficiency without exceeding GPU memory limits. By modeling compute and communication costs, the solver avoids over-sharding short sequences and unnecessary CP communication, mitigating both data-parallel imbalance and CP inefficiency.

The following example shows the benefit of using Dynamic-CP. Before workload balancing, the imbalance leads to pipeline bubbles across micro-batches, which further causes imbalance across DP ranks. After balancing, the bubbles across micro-batches and DP ranks are reduced.

This diagram shows the advantages of the Dynamic-CP method. Previously, imbalanced bubbles occurred between data-parallel (DP) ranks and micro-batches, and were further amplified throughout the pipeline. The Dynamic-CP method effectively reduces these bubbles.
Figure 5. Advantages of the Dynamic-CP method 

Megatron Core framework modifications for supporting Dynamic-CP

This section introduces how Dynamic-CP is integrated into Megatron Core.

A figure showing how Dynamic-CP integrates into Megatron-Core.
Figure 6. Dynamic-CP integration into Megatron Core

Building multiple context parallel groups per rank

With standard context parallelism, each rank belongs to a single group (cp_group) with a fixed cp_size that is statically determined during initialization. However, dynamic context parallelism has a different cp_size across iterations and microbatches.

To support this, a single rank must participate in multiple CP groups of different sizes. Multiple CP groups are constructed during initialization, with cp_size ranging from 1 up to dp × cp, restricted to powers of two. This design enables selecting the appropriate CP group at runtime based on the packing and scheduling result, without the overhead of dynamically creating communication groups.
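
The sketch below shows one way such groups could be built at initialization with torch.distributed. The contiguous rank layout and helper name are illustrative assumptions; the actual Megatron Core group construction interacts with the full TP/CP/DP/PP rank mapping.

```python
import torch.distributed as dist

def build_cp_groups(world_size, max_cp_size):
    """Create one process group per power-of-two CP size so that every rank
    belongs to a group of each size. Illustrative layout only."""
    cp_groups = {}
    cp_size = 1
    while cp_size <= max_cp_size:
        # new_group is collective: every rank must call it for every subgroup.
        for start in range(0, world_size, cp_size):
            ranks = list(range(start, start + cp_size))
            group = dist.new_group(ranks=ranks)
            if dist.get_rank() in ranks:
                cp_groups[cp_size] = group
        cp_size *= 2
    return cp_groups

# At runtime, the scheduler simply picks the pre-built group:
# cp_group = cp_groups[selected_cp_size]
```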

Dynamic rescheduling and packing data

Unlike pretraining, which typically uses the Batch × Sequence × Head × Dim (BSHD) layout, Dynamic-CP operates on a THD layout. In this format, variable-length sequences are packed together under a length constraint, collapsing the original BS dimensions into a token T dimension.

As a consequence, the number of micro-batches is no longer static. In the BSHD case, the number of micro-batches is given by num_micro_batches = global_batch_size / dp_size / micro_batch_size.

With THD packing, the number of original sequences contained in each packed sequence isn’t fixed, causing num_micro_batches to vary across iterations.

Megatron Core provides multiple training schedulers depending on whether pipeline parallelism (PP/VPP) is enabled. To minimize invasive changes to the existing scheduling logic, a lightweight data_iterator_wrapper around the original data_iterator is introduced. It performs three steps: 

  1. Rescheduling and packing sequences in the global batch to create a balanced workload across DP ranks.
  2. Selecting an appropriate cp_size based on the packing result to minimize CP communication inefficiency.
  3. Returning the effective num_micro_batches for the current iteration.

With this approach, Dynamic-CP support is added to all schedulers by inserting a single wrapper, keeping the original scheduling code largely intact.
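
A minimal sketch of such a wrapper is shown below. The class and method names (rebalance_and_pack, choose_cp_size) are illustrative placeholders for the scheduler described later, not Megatron Core APIs.

```python
# Sketch of a lightweight wrapper around the original data_iterator.
# Scheduler method names are illustrative placeholders.
class DataIteratorWrapper:
    def __init__(self, data_iterator, scheduler):
        self.data_iterator = data_iterator
        self.scheduler = scheduler

    def next_global_batch(self):
        samples = next(self.data_iterator)                          # variable-length sequences
        micro_batches = self.scheduler.rebalance_and_pack(samples)  # step 1: balance DP ranks
        for mb in micro_batches:
            mb.cp_size = self.scheduler.choose_cp_size(mb)          # step 2: pick cp_size per micro-batch
        return micro_batches, len(micro_batches)                    # step 3: effective num_micro_batches
```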

Broadcasting across pipeline stages and extending PackedSeqParams

Since num_micro_batches varies and only TP rank 0, and the first and last PP stages handle scheduling in Megatron Core, the framework broadcasts num_micro_batches, max_seqlen, and cu_seqlens to all relevant PP ranks. This ensures consistent execution across the pipeline stages under dynamic micro-batch scheduling.
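
A sketch of that broadcast using torch.distributed.broadcast_object_list is shown below; the function and argument names are illustrative, not the exact Megatron Core call sites.

```python
import torch.distributed as dist

def broadcast_schedule_metadata(metadata, pp_group, src_rank):
    """Broadcast num_micro_batches, max_seqlen, and cu_seqlens from the rank
    that produced the schedule to the other pipeline-parallel ranks.
    Illustrative sketch, not the exact Megatron Core implementation."""
    obj = [metadata if dist.get_rank() == src_rank else None]
    dist.broadcast_object_list(obj, src=src_rank, group=pp_group)
    return obj[0]
```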

With Dynamic-CP, the effective cp_size can vary between iterations, making it unsafe to rely on globally static CP settings. To address this, PackedSeqParams is extended to carry both cp_size and cp_group.

All components that depend on context parallelism—such as position embedding and Transformer Engine attention—now retrieve the CP configuration from PackedSeqParams, replacing the original global CP variables. This guarantees that all CP-related operations remain consistent with the dynamically selected CP layout.
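
Conceptually, the extension might look like the following simplified dataclass; the actual PackedSeqParams in Megatron Core carries additional fields.

```python
from dataclasses import dataclass
import torch

@dataclass
class PackedSeqParams:
    # Existing THD packing metadata (simplified; the real class has more fields).
    qkv_format: str = "thd"
    cu_seqlens_q: torch.Tensor = None
    cu_seqlens_kv: torch.Tensor = None
    max_seqlen_q: int = None
    max_seqlen_kv: int = None
    # Dynamic-CP additions: the CP configuration travels with each micro-batch
    # instead of being read from global static settings.
    cp_size: int = 1
    cp_group: object = None  # torch.distributed.ProcessGroup for the chosen cp_size
```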

Loss computation and FLOPs calculation

Given variable-length sequences and the THD layout, different sequences contribute different numbers of valid tokens. As a result, the loss is computed on a per-token basis: loss = loss_over_valid_tokens / total_number_valid_tokens. This avoids bias introduced by padding tokens.
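
A minimal sketch of this per-token normalization, assuming per-token losses and a 0/1 loss mask marking valid tokens:

```python
import torch

def per_token_loss(token_losses, loss_mask):
    """loss = loss_over_valid_tokens / total_number_valid_tokens.
    Padding tokens (mask == 0) contribute neither to the numerator
    nor to the denominator."""
    valid = loss_mask.bool()
    return token_losses[valid].sum() / valid.sum().clamp(min=1)
```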

Previous versions of Megatron Core didn’t account for the THD layout and assumed max_seqlen was the effective sequence length when computing FLOPs, leading to systematic overestimation in variable-length scenarios.
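
A simplified sketch of length-aware FLOPs accounting is shown below. It counts only the S²-scaling attention matmuls and uses the cu_seqlens boundaries from the THD layout; the constants and the full per-layer terms in Megatron Core’s reporting differ.

```python
def attention_flops_thd(cu_seqlens, num_layers, hidden_size, causal=True):
    """Estimate attention FLOPs per packed micro-batch from the actual
    sub-sequence lengths (cu_seqlens boundaries) instead of assuming
    max_seqlen for every sample. Simplified sketch."""
    seq_lens = [end - start for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]
    flops = 0
    for s in seq_lens:
        f = 4 * s * s * hidden_size   # QK^T plus attention-times-V, 2 FLOPs per MAC
        flops += f / 2 if causal else f
    return num_layers * flops
```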

Data scheduler modeling

The transformer workload scales quadratically with sequence length, \mathcal{O}(S^2), while activation memory grows linearly, \mathcal{O}(S). This means even small variances in sequence length can lead to major imbalances in compute and memory across DP ranks and micro-batches. To balance a large sample’s workload, small samples may be packed together, but this causes severe memory pressure. It’s impossible to equalize FLOPs and memory simultaneously, which drives the scheduling and packing strategies.

The goal is to approximate an ideal, balanced distribution in which workload and memory are evenly split across DP ranks and micro-batches. With a fixed number of micro-batches per DP rank, a target workload and memory quota are set for each micro-batch. A three-stage scheduler then alternates between workload and memory objectives, increasing the CP size for heavier samples as needed to approach both compute and memory balance.

Collaboration of cost model, solver, and simulator

A complete scheduler workflow consists of three components: 

  1. The cost model estimates execution time for each sample based on its sequence length, modeling the per-sample workload across transformer operations (a toy sketch follows this list). This defines the basic load unit, and its accuracy impacts the final performance gains.
  2. The solver takes the cost model output and applies a heuristic algorithm to determine a near-optimal packing strategy for each sample. The packed samples are then grouped into micro-batches and assigned a CP size. The number of micro-batches per DP rank affects both pipeline-parallel and data-parallel imbalance bubbles, so iterating over different micro-batch counts per DP rank yields the best outcome.
  3. The simulator evaluates these micro-batches under the distributed pipeline parallel schedule. It selects the plan with the minimum execution time (i.e., the most balanced workload) that also satisfies peak-memory constraints.
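
The toy cost model below illustrates the kind of per-sample estimate the solver consumes. The coefficients would be fit from profiling data; the functional form is a simplification, not the cost model shipped with the scheduler.

```python
# Toy per-sample cost model: a quadratic attention term, a linear term for the
# remaining transformer ops, and a CP communication term. alpha, beta, gamma
# are placeholders that would be calibrated from profiling.
def sample_cost(seq_len, cp_size, alpha, beta, gamma):
    compute = (alpha * seq_len ** 2 + beta * seq_len) / cp_size  # sharded compute
    comm = gamma * seq_len * (cp_size - 1) / cp_size             # ring-style CP traffic
    return compute + comm
```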

Modeling process and bi-objective balance

The ideal balanced distribution splits workload and memory evenly across DP ranks and micro-batches. Given the same number of micro-batches across DP ranks, the target workload and memory quota for each micro-batch is determined. The pipeline bubble also differs across configurations and needs to be distributed evenly across micro-batches for end-to-end balance.

Equalizing the end-to-end training time across DP ranks suggests:

  (1) W_1 \cdot (m_1 V + p - 1) = W_2 \cdot (m_2 V + p - 1)
  • m_i is the number of micro-batches of the i-th DP rank
  • W_i is the workload quota of each micro-batch of the i-th DP rank
  • V is the number of virtual pipeline stages
  • p is the number of pipeline stages

The workload quotas across ranks satisfy:

  (2) W_2 = W_1 \cdot \frac{m_1 V + p - 1}{m_2 V + p - 1}

Meanwhile, the total workload of a global batch of samples can be represented as:

  (3) \sum_{i=1}^{DP} W_i \cdot m_i = \mathrm{sum}_{\mathrm{batch}}(\mathrm{workload})

Combining (2) and (3), the per-micro-batch workload of each DP rank can be determined.
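
A small worked example of equations (1)–(3), assuming two DP ranks and arbitrary workload units:

```python
# Worked example of the quota equations for two DP ranks.
V, p = 1, 4                       # virtual pipeline stages, pipeline stages
m = [4, 8]                        # micro-batches per DP rank
total_workload = 1200.0           # sum_batch(workload), arbitrary units

# From (2): W_i is proportional to 1 / (m_i * V + p - 1).
weights = [1.0 / (mi * V + p - 1) for mi in m]
# From (3): scale so that sum_i W_i * m_i equals the total workload.
scale = total_workload / sum(w * mi for w, mi in zip(weights, m))
W = [w * scale for w in weights]

print(W)                                                # approximately [132.0, 84.0]
print([Wi * (mi * V + p - 1) for Wi, mi in zip(W, m)])  # approximately [924.0, 924.0]: equal end-to-end time
```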

Because the computational workload scales as \mathcal{O}(S^2) with respect to sequence length, while memory consumption scales as \mathcal{O}(S), it is difficult to achieve both workload and memory balance simultaneously. Instead, the solver alternates between workload-oriented and memory-oriented objectives across stages, gradually approaching a balanced solution.

Samples whose workload exceeds the micro-batch workload quota are assigned a larger CP size. After this step, workload imbalance is reduced and memory becomes the dominant constraint. The target then shifts to memory, selecting the least compute-heavy sample to fill the bucket. The remaining samples are sorted in descending order of workload and assigned to each micro-batch using the same heuristic.
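
The following compresses that heuristic into a short sketch; memory accounting and the CP-group constraints of the real three-stage solver are omitted, and all names are illustrative.

```python
def assign_cp_and_pack(samples, quota, cost_fn, max_cp):
    """Compressed sketch: samples whose cost exceeds the quota get a larger
    (power-of-two) CP size; everything is then packed first-fit in descending
    cost order. Memory accounting omitted."""
    micro_batches = []
    for s in sorted(samples, key=cost_fn, reverse=True):
        cp = 1
        while cost_fn(s) / cp > quota and cp < max_cp:
            cp *= 2                                   # shard heavy samples across CP ranks
        for mb in micro_batches:                      # first-fit into a compatible micro-batch
            if mb["cp"] == cp and mb["cost"] + cost_fn(s) / cp <= quota:
                mb["samples"].append(s)
                mb["cost"] += cost_fn(s) / cp
                break
        else:
            micro_batches.append({"samples": [s], "cost": cost_fn(s) / cp, "cp": cp})
    return micro_batches
```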

Zero-overhead execution

At runtime, the scheduling workflow must not introduce noticeable overhead into the training loop. In practice, the system needs to overcome two sources of overhead: I/O pressure and solver runtime.

I/O pressure

First, constructing a scheduling plan requires an extra get_item pass over the global batch to collect sequence-length and shape information. Two complementary techniques alleviate this I/O pressure: distributing the probing get_item calls across the cluster, and gathering only lightweight shape and sequence-length metadata through an additional communication step.
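
A sketch of that metadata-gathering step, assuming each rank probes only its own shard of the global batch and shares the lightweight sequence-length metadata via torch.distributed.all_gather_object; the function name is illustrative.

```python
import torch.distributed as dist

def gather_sequence_metadata(local_seq_lens, group=None):
    """Gather only lightweight sequence-length metadata from every rank so
    that each rank can build (or read) the same scheduling plan without a
    full get_item pass over the whole global batch. Illustrative sketch."""
    gathered = [None] * dist.get_world_size(group=group)
    dist.all_gather_object(gathered, local_seq_lens, group=group)
    return [length for shard in gathered for length in shard]
```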

Solver runtime

To avoid blocking the main training process, the solver runs asynchronously in the data_sampler so that it overlaps with training iterations. To keep the search space manageable, exhaustive search is replaced with a small grid search: all DP ranks are constrained to use the same number of micro-batches, and this count is swept from PP up to a small multiple of PP.

Under a fixed global batch size, this one-dimensional grid captures the trade-off between per-microbatch workload and pipeline bubbles. Figure 7 shows that workload variance quickly shrinks as the microbatch count grows. The “knee” point on this curve is selected, and the search is limited to its neighborhood to keep solver overhead practical.

Diagram showing the relationship between the number of micro-batches and the workload per micro-batch. In this example, the PP size is 4.
Figure 7. Micro-batch FLOPs budget curve

Benchmark results

With all the enhancements introduced, the imbalance bubbles caused by variable-length sequence distributions can be substantially reduced.

In Table 1, Dynamic CP is evaluated against a pure packing baseline under the following setup: Llama 13B, global batch size 2048, PP=8, CP=8, and full recompute. 10 iterations are run, with the first discarded as warm-up, and the iteration time is averaged over the remaining 9. Dynamic CP achieves 1.48x and 1.25x speedups on the GitHub and CommonCrawl datasets, respectively.

In a multi-thousand-GPU industrial environment, the Dynamic CP method yields over 35% end-to-end performance improvement.

| Model size | Dataset type | Method       | TFLOPS/GPU |
|------------|--------------|--------------|------------|
| Llama 13B  | GitHub       | Only packing | 195.88     |
| Llama 13B  | GitHub       | Dynamic CP   | 289.32     |
| Llama 13B  | CommonCrawl  | Only packing | 139.17     |
| Llama 13B  | CommonCrawl  | Dynamic CP   | 174.39     |

Table 1. Comparison of Dynamic CP and pure packing methods across different datasets

Learn more

This post showed that Dynamic CP with the Megatron Core backend improves the training throughput for variable-length sequences compared to the fixed CP method. With sequence packing, 4D parallelism, and GPU-optimized kernels, Dynamic CP guarantees high training efficiency across model scales. 

Get started with:

  •  Megatron Core GitHub to start training your model with variable-length sequences using Megatron Core optimizations. 
  • The scheduler, which is also on GitHub.

Thanks to every colleague for their contributions to this project.

