NVIDIA is excited to announce the release of Nemotron-CC, a 6.3-trillion-token English-language Common Crawl dataset for pretraining highly accurate large language models (LLMs). The dataset includes 1.9 trillion tokens of synthetically generated data. One of the keys to training state-of-the-art LLMs is a high-quality pretraining dataset, and recent top LLMs, such as the Meta Llama series, were trained on vast amounts of data comprising 15 trillion tokens.
But little is known about the exact composition of these 15 trillion tokens. Nemotron-CC aims to remedy this and enable the wider community to train highly accurate LLMs. Internet crawl data, typically from Common Crawl, is generally the largest source of tokens. Recent open Common Crawl datasets, such as FineWeb-Edu and DCLM, have shown how to greatly improve benchmark accuracies over relatively short token horizons. However, this has come at the cost of removing 90% of the data, which limits their suitability for long-token-horizon training, such as the 15 trillion tokens used for Llama 3.1.
Nemotron-CC fills this gap and shows how to transform Common Crawl data into a high-quality dataset suitable for training LLMs that outperform Llama 3.1 8B, through a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters.
Results
Figure 1 shows MMLU scores for 8B-parameter models trained on 1 trillion tokens, varying only the English Common Crawl portion of the training data (73% of the total). Compared to DCLM, the leading open English Common Crawl dataset, the high-quality subset Nemotron-CC-HQ increases MMLU by 5.6 points.
![Chart showing MMLU accuracies when training 8B parameter models for 1 trillion tokens using different open English Common Crawl datasets. Nemotron-CC-HQ attains +5.6 MMLU compared to DCLM (59.0 versus 53.4). Nemotron-CC attains MMLU 53.0, comparable to 53.4 for DCLM, but has 4x more data. FineWeb-Edu and FineWeb-Edu-2 attain MMLU accuracies of 42.9 and 42.4, respectively.](https://developer-blogs.nvidia.com/wp-content/uploads/2025/01/mmlu-accuracy-nemotron-cc.png)
Furthermore, the full 6.3-trillion-token dataset matches DCLM on MMLU but contains four times more unique real tokens. This unlocks effective training over a long token horizon: an 8-billion-parameter model trained for 15 trillion tokens, 7.2 trillion of which came from Nemotron-CC, outperforms the Llama 3.1 8B model by +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks.
Key insights
Some of the key insights that led to these results include:
- Ensembling different model-based classifiers can help select a larger and more diverse set of high-quality tokens (a minimal sketch of this idea follows the list).
- Rephrasing can effectively reduce noise and errors in low-quality data and produce diverse variants of high-quality data with fresh unique tokens, leading to better results on downstream tasks.
- Disabling traditional non-learned heuristic filters for high-quality data can further boost high-quality token yield without hurting accuracy.
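To make the first insight concrete, here is a minimal sketch of classifier ensembling, not the exact Nemotron-CC recipe: each document is scored by several quality classifiers, raw scores are converted to percentile ranks so they are comparable, and the maximum rank across classifiers is used so that a document rated highly by any one classifier can still be selected. The example scores and the 0.75 threshold are made up for illustration.

```python
# Minimal sketch of ensembling model-based quality classifiers.
# The scores below are stand-ins; in practice they would come from
# classifiers such as the DCLM and FineWeb-Edu quality classifiers.
import numpy as np


def percentile_ranks(scores: np.ndarray) -> np.ndarray:
    """Convert raw scores to percentile ranks so classifiers are comparable."""
    order = scores.argsort().argsort()
    return order / (len(scores) - 1)


def ensemble_quality(score_matrix: np.ndarray) -> np.ndarray:
    """score_matrix has shape (num_docs, num_classifiers) of raw scores.

    Taking the max over per-classifier percentile ranks keeps a document
    if any single classifier rates it highly, which yields a larger and
    more diverse high-quality subset than relying on one classifier.
    """
    ranks = np.stack(
        [percentile_ranks(score_matrix[:, j]) for j in range(score_matrix.shape[1])],
        axis=1,
    )
    return ranks.max(axis=1)


# Toy example: 5 documents scored by 2 hypothetical classifiers.
scores = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.5, 0.5],
    [0.2, 0.1],
    [0.9, 0.8],
])
quality = ensemble_quality(scores)
high_quality = quality >= 0.75  # keep the top bucket from either classifier
print(quality, high_quality)
```

In this toy example, documents 0 and 1 are each favored by only one of the two classifiers, yet both land in the high-quality set, which is what makes the ensembled selection larger and more diverse than any single classifier's.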
Data curation steps
Using NVIDIA NeMo Curator, we extracted and cleaned data from Common Crawl and then performed the following steps (a simplified sketch of how they fit together follows the list):
- Filtered it for the English language
- Performed global fuzzy deduplication as well as exact substring deduplication
- Leveraged model-based classifiers, such as the DCLM and FineWeb-Edu quality classifiers, for quality classification
- Applied various heuristic and perplexity filters to further remove lower-quality data
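To show how these stages fit together, here is a minimal schematic sketch in plain Python. It is not the NeMo Curator API: the helper functions (`detect_language`, `fuzzy_deduplicate`, `quality_score`, `passes_heuristics`), their internals, and the 0.5 threshold are hypothetical placeholders for the corresponding NeMo Curator components.

```python
# Schematic sketch of the curation flow described above.
# All helpers are hypothetical placeholders, not NeMo Curator APIs.
from typing import Iterable


def detect_language(doc: dict) -> str:
    """Placeholder for a fastText-style language identifier."""
    return doc.get("language", "unknown")


def fuzzy_deduplicate(docs: list[dict]) -> list[dict]:
    """Placeholder for global fuzzy and exact substring deduplication."""
    seen, unique = set(), []
    for doc in docs:
        key = doc["text"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def quality_score(doc: dict) -> float:
    """Placeholder for an ensemble of model-based quality classifiers
    (see the ensembling sketch under Key insights)."""
    return min(1.0, len(doc["text"]) / 10_000)


def passes_heuristics(doc: dict) -> bool:
    """Placeholder for heuristic and perplexity filters."""
    return len(doc["text"].split()) >= 50


def curate(docs: Iterable[dict], quality_threshold: float = 0.5) -> list[dict]:
    english = [d for d in docs if detect_language(d) == "en"]
    deduped = fuzzy_deduplicate(english)
    scored = [{**d, "quality": quality_score(d)} for d in deduped]
    # High-scoring documents bypass the heuristic filters, reflecting the
    # insight above that disabling them for high-quality data boosts yield.
    return [
        d for d in scored
        if d["quality"] >= quality_threshold or passes_heuristics(d)
    ]
```

Here a document is just a dict with text and language fields; in the real pipeline each stage is backed by the corresponding NeMo Curator component running at Common Crawl scale.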
We also leveraged synthetic data generation pipelines, such as the rephrasing described under Key insights, to generate ~2 trillion tokens of synthetic data; a rough sketch of the rephrasing idea follows.
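The actual prompts, models, and pipeline code are described in the paper and will land in NeMo Curator; the snippet below is only a rough illustration of the rephrasing idea. The prompt text and the `generate` helper are assumptions for illustration, not the real pipeline.

```python
# Rough illustration of LLM-based rephrasing of low-quality documents.
# `generate` is a hypothetical stand-in for whatever instruction-tuned
# LLM inference client is available; it is not part of any specific library.

REPHRASE_PROMPT = (
    "Rewrite the following text into clear, well-structured English prose. "
    "Preserve all factual information, remove boilerplate and noise, and do "
    "not add new facts.\n\nText:\n{document}\n\nRewritten text:"
)


def generate(prompt: str) -> str:
    """Hypothetical call to an instruction-tuned LLM (plug in your own client)."""
    raise NotImplementedError("Replace with your LLM inference client.")


def rephrase(document: str) -> str:
    """Produce a cleaner synthetic variant of a noisy document."""
    return generate(REPHRASE_PROMPT.format(document=document))
```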
The full recipe, including the synthetic data generation pipelines, will be merged into the NVIDIA/NeMo-Curator GitHub repo soon. To receive updates, star the repo.
Conclusion
Nemotron-CC is an open, large, high-quality English Common Crawl dataset that enables pretraining highly accurate LLMs over both short and long token horizons. In the future, we hope to release more datasets that are key ingredients for state-of-the-art LLM pretraining, such as a specialized math pretraining dataset.
- Download the dataset from Common Crawl.
- Use NeMo Curator to curate your own datasets.
- Learn more about the technical details in Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset.
Acknowledgments
We thank the Common Crawl Foundation for hosting the dataset. We thank Pedro Ortiz Suarez for valuable feedback that improved the paper and Greg Lindahl for help with improving the data formatting and layout.