NVIDIA Megatron Core
Train generative AI models at scale.
NVIDIA Megatron Core is a composable training library for large-scale generative AI. It provides GPU-optimized building blocks for training and post-training workflows, so teams can build custom systems with the performance, flexibility, and scale required for modern LLM, MoE, and multimodal development. Megatron Core integrates with Megatron Bridge, Transformer Engine, and ecosystem libraries to provide complete solutions for production training and research.
Explore the Features and Benefits of NVIDIA Megatron Core
Parallelism Techniques
The Megatron Core library offers advanced model parallelism techniques, including tensor, sequence, pipeline, context, and MoE expert parallelism, for large-scale training. Users can combine different parallelism strategies to optimize their training workloads.
Additionally, Megatron Core offers memory-saving features, including activation checkpointing, distributed optimizers, and distributed checkpointing.
Learn more in the API documentation.
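The parallelism strategies above compose multiplicatively: the tensor-, pipeline-, and data-parallel sizes must factor the total GPU count. A minimal sketch of that bookkeeping (illustrative only, not the Megatron Core API):

```python
# Illustrative sketch: how a fixed GPU budget decomposes into
# tensor- (TP), pipeline- (PP), and data-parallel (DP) groups.
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Return the data-parallel size for a given GPU count and
    model-parallel sizes; TP * PP must divide the world size."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tp * pp")
    return world_size // model_parallel

# Example: 1024 GPUs with 8-way tensor and 8-way pipeline parallelism
# leave 16-way data parallelism.
print(data_parallel_size(1024, tp=8, pp=8))  # 16
```

The same arithmetic appears in the scalability table later on this page, where each configuration's data-parallel size equals the GPU count divided by the product of the tensor and pipeline model-parallel sizes.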
Customizable Building Blocks
Megatron Core offers customizable building blocks with modular and composable APIs. For transformer models, it provides attention mechanisms, normalization layers, embedding techniques, and more.
Learn more about the MCore Spec system in the documentation.
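The core idea behind a spec system can be sketched in a few lines. The names below are illustrative inventions, not Megatron Core's actual `ModuleSpec` API: a layer is declared as data, so individual submodules can be swapped without subclassing the layer itself.

```python
from dataclasses import dataclass

# Hypothetical spec pattern (not the real Megatron Core API):
# the layer declares *which* implementations to use as plain data.
@dataclass
class LayerSpec:
    attention: type
    norm: type
    mlp: type

# Stand-in implementations for the sketch.
class CoreAttention: ...
class RMSNorm: ...
class SwiGLUMLP: ...

def build_layer(spec: LayerSpec):
    # Instantiate whichever submodule classes the spec names.
    return spec.attention(), spec.norm(), spec.mlp()

# Swapping the norm or MLP is a one-line change to the spec,
# not a new subclass.
spec = LayerSpec(attention=CoreAttention, norm=RMSNorm, mlp=SwiGLUMLP)
attn, norm, mlp = build_layer(spec)
```

This declarative style is what makes the building blocks composable: research variants plug in at the spec level rather than by forking model code.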
Scalability and Training Resiliency
Efficiently train large models at scale with training resiliency features such as automatic restart, fault/hang detection, and fast distributed checkpointing.
Learn more about how Megatron Core enabled training of the Nemotron-4 340B model on more than 6,000 H100 GPUs while sustaining high per-GPU throughput.
See performance details in this scalability benchmark. Learn more about the resiliency features supported in the NVIDIA Resiliency Extension.
Cutting-Edge Research
Leverage NVIDIA's cutting-edge research to stay at the forefront of distributed training by simply upgrading to the latest Megatron Core.
Pioneering large-model training since 2019, Megatron Core continues to drive innovation in large-scale training with features like FP8 mixed precision and advanced parallelism.
Learn about some of the recent advancements in our latest news.
Train With Mixture-of-Experts
Pretrain models with Mixture-of-Experts (MoE), a popular technique for achieving better accuracy without a proportional increase in compute.
Megatron Core offers performant functionality for both dropless and token-dropping routing, with training speed optimizations for models such as DeepSeek and Qwen MoE.
Learn more about MoE features in our repository.
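The dropless vs. token-dropping distinction comes down to expert capacity. A minimal sketch (illustrative only, not Megatron Core's implementation) of top-1 routing shows the trade-off: a finite capacity bounds each expert's workload but discards overflow tokens, while dropless routing keeps every token at the cost of uneven expert load.

```python
# Illustrative top-1 MoE token routing with an optional capacity limit
# (not Megatron Core's implementation).
def route_top1(scores, capacity=None):
    """scores: per-token list of per-expert gate scores.
    Returns {expert_index: [token_indices]}. With a finite
    `capacity`, overflow tokens are dropped; with capacity=None
    (dropless), every token is kept."""
    assignments = {}
    for tok, s in enumerate(scores):
        expert = max(range(len(s)), key=s.__getitem__)  # argmax gate
        bucket = assignments.setdefault(expert, [])
        if capacity is None or len(bucket) < capacity:
            bucket.append(tok)  # token kept by its top expert
        # else: token dropped (output falls back to the residual path)
    return assignments

gates = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
print(route_top1(gates, capacity=2))  # expert 0 keeps tokens 0-1, drops 2
print(route_top1(gates))              # dropless: all four tokens kept
```

Real implementations add load-balancing losses and batched expert dispatch, but the capacity decision above is the core of the dropping vs. dropless choice.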
Beyond Transformers: Hybrid Models
Megatron Core has expanded beyond Transformer-based models to hybrid models that combine attention with state space models, state space dualities, and recurrent neural networks.
Hybrid models have emerged as a compelling architecture for sequence modeling, as they overcome several limitations of attention.
Multimodal Training
Train multimodal models using Megatron Core's parallelism techniques and its multimodal data loader library, which provides determinism and reproducibility when blending multimodal datasets.
Get started with the LLaVA (large language and vision assistant) training pipeline.
Get Started
Using Megatron Core with NVIDIA NeMo
Megatron Bridge is a library within the NeMo Framework that serves as a bridge between 🤗 Hugging Face and Megatron Core with verified scripts for the latest models. It provides bidirectional checkpoint conversion between these formats, enabling other projects to leverage Megatron Core’s parallelism capabilities or export models for various inference engines.
Learn More
Using Megatron Core with NVIDIA Megatron-LM
Megatron-LM is an open-source training reference framework for exploring Megatron Core. It is easily customizable and suits researchers who prefer minimal abstraction layers on top of Megatron Core's training techniques.
Learn More
World-Leading Training Speed and Scalability
Megatron Core efficiently trains large language models with its parallelism techniques. In the weak-scaling experiments below, with GPT models ranging from 2.1 billion to 509 billion parameters, Megatron Core demonstrates superlinear scaling up to 6144 H100 GPUs.
Each model size was run at two scales; paired cells list the smaller-scale run first, then the larger-scale run.

| Model size | Tensor MP size | Pipeline MP size | Data-parallel size | Number of GPUs | Batch size | Per-GPU teraFLOP/s | MFU | Aggregate petaFLOP/s |
|---|---|---|---|---|---|---|---|---|
| 2.1B | 1 | 1 | 16 / 64 | 16 / 64 | 256 | 441 / 412 | 45% / 42% | 7.1 / 26.3 |
| 4.2B | 2 | 1 | 16 / 64 | 32 / 128 | 256 | 431 / 415 | 44% / 42% | 13.8 / 53.1 |
| 8.3B | 4 | 1 | 16 / 64 | 64 / 256 | 256 | 457 / 426 | 46% / 43% | 29.3 / 109.1 |
| 19.7B | 8 | 1 | 16 / 64 | 128 / 512 | 512 | 439 / 429 | 44% / 43% | 56.2 / 219.7 |
| 41B | 8 | 1 | 32 / 128 | 256 / 1024 | 768 | 469 / 439 | 47% / 44% | 119.9 / 449.8 |
| 78B | 8 | 2 | 32 / 96 | 512 / 1536 | 960 | 446 / 418 | 45% / 42% | 228.6 / 641.3 |
| 148B | 8 | 4 | 24 / 72 | 768 / 2304 | 1152 | 456 / 432 | 46% / 44% | 349.8 / 996.2 |
| 314B | 8 | 8 | 16 / 48 | 1024 / 3072 | 1152 | 490 / 464 | 50% / 47% | 502.2 / 1425.1 |
| 509B | 8 | 20 | 8 / 24 | 1280 / 3840 | 1440 | 473 / 426 | 48% / 43% | 605.8 / 1635.1 |
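The MFU column relates achieved per-GPU throughput to the hardware peak. A quick sanity check of the table's figures, assuming an H100 BF16 dense peak of roughly 989 teraFLOP/s (an assumption; the table does not state which peak was used):

```python
# MFU (model FLOPs utilization) = achieved FLOP/s / peak FLOP/s.
# 989 TFLOP/s is the assumed H100 BF16 dense peak.
H100_PEAK_TFLOPS = 989

def mfu(achieved_tflops: float) -> float:
    """Fraction of peak FLOP/s actually sustained per GPU."""
    return achieved_tflops / H100_PEAK_TFLOPS

# 441 TFLOP/s per GPU, as in the 2.1B row, works out to ~45% MFU.
print(f"{mfu(441):.0%}")  # 45%
```

Under that assumed peak, the computed values match the table's MFU column throughout.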
[Chart: Aggregate Throughput (Weak Scaling)]
[Chart: Aggregate Throughput (Strong Scaling)]
In the strong-scaling setting, with a 177-billion-parameter GPT-3 model and a fixed batch size of 1152 sequences throughout, Megatron Core demonstrates near-linear scaling from 96 to 4608 H100 GPUs.
Resources
Use Megatron to Train Large Models at Unparalleled Speed