NVIDIA Megatron Core
Train generative AI models from scratch at scale.
NVIDIA Megatron Core is an open-source library for training models with unparalleled speed at scale across thousands of GPUs. It features advanced parallelism strategies, cutting-edge optimizations such as FP8 training, and support for the latest LLM, MoE, and multimodal architectures. Megatron Core integrates with NVIDIA NeMo, Transformer Engine, and other ecosystem libraries to provide complete solutions for production training and research.
Explore Features and Benefits of NVIDIA Megatron-Core
Parallelism Techniques
The Megatron Core library offers advanced model parallelism techniques, including tensor, sequence, pipeline, context, and MoE expert parallelism, for large-scale training. Users can combine different parallelism strategies to optimize their training workloads.
Additionally, Megatron Core offers memory-saving features, including activation checkpointing, distributed optimizers, and distributed checkpointing.
Learn more in the API documentation.
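As a rough illustration, the sketch below shows how these strategies are typically combined by initializing Megatron Core's model-parallel process groups before building a model. It is a minimal sketch rather than an excerpt from this page: it assumes a recent Megatron Core release launched with torchrun, and keyword arguments such as context_parallel_size and expert_model_parallel_size may differ across versions.

```python
# Minimal sketch: combining tensor, pipeline, context, and expert parallelism by
# initializing Megatron Core's process groups. Assumes a recent Megatron Core
# release; launch with: torchrun --nproc-per-node=<gpus> init_parallel.py
import os

import torch
from megatron.core import parallel_state


def initialize_parallelism(tp: int = 2, pp: int = 2, cp: int = 1, ep: int = 1) -> None:
    """Set up torch.distributed and Megatron Core's model-parallel groups."""
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    torch.distributed.init_process_group(backend="nccl")

    # The data-parallel size is derived from the remaining GPUs:
    # world_size / (tp * pp * cp).
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=tp,
        pipeline_model_parallel_size=pp,
        context_parallel_size=cp,
        expert_model_parallel_size=ep,
    )


if __name__ == "__main__":
    initialize_parallelism(tp=2, pp=2)
    if torch.distributed.get_rank() == 0:
        print("data-parallel size:", parallel_state.get_data_parallel_world_size())
```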
Customizable Building Blocks
Megatron-Core offers customizable building blocks with modular and composable APIs. For transformer models, these include attention mechanisms, normalization layers, embedding techniques, and more.
Learn more about the MCore Spec system in documentation.
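For concreteness, here is a minimal sketch, following the pattern of the Megatron Core quickstart, of assembling a small GPT model from these building blocks via a layer spec. It assumes a recent release running on a single GPU via torchrun; names such as get_gpt_layer_local_spec follow current versions and may change.

```python
# Minimal sketch: building a tiny GPT model from Megatron Core's composable blocks.
# Run on one GPU with: torchrun --nproc-per-node=1 build_gpt.py
import torch
from megatron.core import parallel_state
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

torch.distributed.init_process_group(backend="nccl")
parallel_state.initialize_model_parallel()  # all parallel sizes default to 1
model_parallel_cuda_manual_seed(123)        # seed Megatron's tensor-parallel RNG

# TransformerConfig gathers architecture and training hyperparameters in one place.
config = TransformerConfig(
    num_layers=2,
    hidden_size=128,
    num_attention_heads=4,
    use_cpu_initialization=True,
    pipeline_dtype=torch.float32,
)

# The layer spec declares which attention, MLP, and normalization submodules each
# transformer layer is built from; swapping the spec (for example, for the
# Transformer Engine variant) customizes the stack without touching the model code.
gpt_model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=50257,
    max_sequence_length=1024,
)
print(gpt_model)
```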
Scalability and Training Resiliency
Efficiently train large models at scale with training resiliency features such as automatic restart, fault/hang detection, and fast distributed checkpointing.
Learn more about how Megatron-Core enabled training of the Nemotron-4 340B model on more than 6,000 H100 GPUs while sustaining high per-GPU throughput.
See performance details in this scalability benchmark, and learn more about the resiliency features supported in the NVIDIA Resiliency Extension.
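The distributed checkpointing API takes only a few lines to use; the sketch below follows the pattern in the Megatron Core quickstart and assumes a model (such as the GPTModel built earlier) whose parallel groups are already initialized.

```python
# Minimal sketch: saving and loading a sharded (distributed) checkpoint with
# Megatron Core. Each rank writes and reads only its own shards.
from megatron.core import dist_checkpointing


def save_distributed_checkpoint(checkpoint_path, model):
    sharded_state_dict = model.sharded_state_dict(prefix="")
    dist_checkpointing.save(
        sharded_state_dict=sharded_state_dict, checkpoint_dir=checkpoint_path
    )


def load_distributed_checkpoint(checkpoint_path, model):
    # Shards are resolved against the current layout, which allows resuming
    # with a different parallelism configuration than the one used for saving.
    sharded_state_dict = model.sharded_state_dict(prefix="")
    checkpoint = dist_checkpointing.load(
        sharded_state_dict=sharded_state_dict, checkpoint_dir=checkpoint_path
    )
    model.load_state_dict(checkpoint)
    return model
```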
Cutting-Edge Research
Leverage NVIDIA's cutting-edge research to stay at the forefront of distributed training by simply upgrading to the latest Megatron-Core.
Pioneering large-model training since 2019, Megatron Core continues to lead innovation in large-scale training with features such as FP8 mixed precision and advanced parallelism.
Learn about some of the recent advancements in this blog.
Train With Mixture-of-Experts
Pretrain models with Mixture-of-Experts (MoE), a popular technique for achieving better accuracy without a proportional increase in compute.
Megatron-Core offers performant implementations of both dropless and token-dropping routing, with training speed optimizations for models such as DeepSeek, Mixtral, and Qwen MoE.
Learn more about MoE features in our repository.
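As an illustrative sketch (not taken from the Megatron Core documentation), an MoE model is configured through the same TransformerConfig and layer-spec machinery shown earlier; the MoE-specific field names below follow recent releases and may differ in older ones.

```python
# Minimal sketch: configuring a Mixture-of-Experts model in Megatron Core.
# Field names follow recent releases; dropless routing is the default, and a
# capacity-factor setting switches to token-dropping behavior.
import torch
from megatron.core.transformer.transformer_config import TransformerConfig
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec

NUM_EXPERTS = 8

moe_config = TransformerConfig(
    num_layers=4,
    hidden_size=512,
    num_attention_heads=8,
    pipeline_dtype=torch.bfloat16,
    # MoE-specific settings
    num_moe_experts=NUM_EXPERTS,           # experts per MoE layer
    moe_router_topk=2,                     # route each token to its top-2 experts
    moe_aux_loss_coeff=1e-2,               # load-balancing auxiliary loss weight
    moe_token_dispatcher_type="alltoall",  # token exchange across expert-parallel ranks
)

# The layer spec must also be MoE-aware so the dense MLP is replaced by experts.
moe_layer_spec = get_gpt_layer_local_spec(
    num_experts=NUM_EXPERTS, moe_grouped_gemm=False
)
```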
Beyond Transformers: Hybrid Models
Megatron-Core has expanded its support beyond Transformer-based models to hybrid models that combine attention with state space models, state space dualities, and recurrent neural networks.
Hybrid models have emerged as a compelling architecture for sequence modeling tasks, as they address key limitations of attention, such as its quadratic compute cost on long sequences.
Learn more about training Mamba-based hybrid models in our paper and code example.
Multimodal Training
Train multimodal models using Megatron-Core's parallelism techniques and its multimodal data loader library, which provides deterministic, reproducible blending of multimodal datasets.
Get started with the LLaVA (large language and vision assistant) training pipeline.
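Purely as an illustration, the sketch below assumes the Megatron Energon data loader (the multimodal data loading library in the Megatron ecosystem); the dataset path is a placeholder and the exact API may differ by version.

```python
# Minimal sketch (assumed API, based on the Megatron Energon multimodal data
# loader): iterate over a prepared WebDataset-format multimodal dataset.
# "/data/my_multimodal_dataset" is a placeholder path.
from megatron.energon import WorkerConfig, get_loader, get_train_dataset

worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)

train_ds = get_train_dataset(
    "/data/my_multimodal_dataset",
    batch_size=8,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
    worker_config=worker_config,
)

train_loader = get_loader(train_ds)

for batch in train_loader:
    # Each batch carries the blended multimodal samples (e.g. images + text);
    # feed it to the model's forward/backward step here.
    pass
```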
Get Started
Using Megatron-Core with NVIDIA NeMo
NVIDIA NeMo is an end-to-end platform for developing custom generative AI, including LLMs and multimodal, vision, and speech AI. NeMo builds on Megatron Core and is suited for developers building enterprise-ready generative AI applications.
Using Megatron-Core with NVIDIA Megatron-LM
Megatron-LM is an open-source lightweight training framework with a native PyTorch training loop for exploring Megatron-Core. It’s easily customizable and is suitable for researchers who prefer minimum abstraction layers on top of Megatron-Core’s training techniques.
World-Leading Training Speed and Scalability
Megatron-Core is capable of efficiently training large language models with its parallelism techniques. In the weak scaling experiments below, with GPT models ranging from 2 billion to 462 billion parameters, Megatron-Core demonstrates superlinear scaling up to 6144 H100 GPUs.
Where a cell lists two values, they correspond to the smaller and larger GPU-count configurations of that weak-scaling experiment (shown as smaller / larger).

| Model size | Tensor MP size | Pipeline MP size | Data-parallel size | Number of GPUs | Batch size | Per-GPU teraFLOP/s | MFU |
|---|---|---|---|---|---|---|---|
| 2.1B | 1 | 1 | 16 / 64 | 16 / 64 | 256 | 441 / 412 | 45% / 42% |
| 4.2B | 2 | 1 | 16 / 64 | 32 / 128 | 256 | 431 / 415 | 44% / 42% |
| 8.3B | 4 | 1 | 16 / 64 | 64 / 256 | 256 | 457 / 426 | 46% / 43% |
| 19.7B | 8 | 1 | 16 / 64 | 128 / 512 | 512 | 439 / 429 | 44% / 43% |
| 41B | 8 | 1 | 32 / 128 | 256 / 1024 | 768 | 469 / 439 | 47% / 44% |
| 78B | 8 | 2 | 32 / 96 | 512 / 1536 | 960 | 446 / 418 | 45% / 42% |
| 148B | 8 | 4 | 24 / 72 | 768 / 2304 | 1152 | 456 / 432 | 46% / 44% |
| 314B | 8 | 8 | 16 / 48 | 1024 / 3072 | 1152 | 490 / 464 | 50% / 47% |
| 509B | 8 | 20 | 8 / 24 | 1280 / 3840 | 1440 | 473 / 426 | 48% / 43% |
Figure: Aggregate throughput (weak scaling).
Figure: Aggregate throughput (strong scaling).
In the strong scaling setting with a 177 billion parameter GPT-3 model using the same batch size of 1152 sequences throughout, Megatron-Core demonstrates near linear scaling from 96 to 4608 H100 GPUs.
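As a sanity check on the MFU column, the reported values are consistent with dividing the achieved per-GPU throughput by the H100's peak dense BF16 throughput of roughly 989 teraFLOP/s; this reference peak is an assumption, since the page does not state it.

```python
# Worked example: reproducing the MFU column from per-GPU throughput, assuming
# a peak dense BF16 throughput of ~989 teraFLOP/s per H100 (no sparsity).
PEAK_TFLOPS = 989.0

for model_size, achieved_tflops in [("2.1B", 441), ("314B", 490), ("509B", 473)]:
    mfu = achieved_tflops / PEAK_TFLOPS
    print(f"{model_size}: {achieved_tflops} TFLOP/s per GPU -> MFU ~ {mfu:.0%}")
# Prints roughly 45%, 50%, and 48%, matching the table above.
```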
Resources
Use Megatron to Train Large Models at Unparalleled Speed