In the rapidly evolving landscape of artificial intelligence, the quality of the data used for training models is paramount. High-quality data ensures that models are accurate, reliable, and capable of generalizing well across various applications. The recent NVIDIA webinar, Enhance Generative AI Model Accuracy with High-Quality Multimodal Data Processing, dove into the intricacies of data curation and processing, highlighting the capabilities of NVIDIA NeMo Curator.
This post shares the key insights from the webinar, focusing on the importance of data curation, the role of synthetic data generation, and the various features available to developers for building fully customized and scalable data-processing pipelines.
The importance of data curation
Data curation is a critical step in the development of generative AI models. It involves cleaning, organizing, and preparing data to ensure that it is suitable for training.
The webinar emphasized that generative models derive their understanding from the data on which they are trained. Ensuring that this data is free from duplicates, personally identifiable information (PII), and toxic content is crucial.
Proper data curation not only reduces training time but also enhances model quality, making it a vital process for developers aiming to build robust AI systems.
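As a rough illustration of one of these cleaning steps, the following sketch redacts PII with regular expressions. It is not NeMo Curator code: the patterns, the placeholder tokens, and the function name `redact_pii` are assumptions made for this example, and real PII detection needs far broader coverage than two patterns.

```python
import re

# Simplified, illustrative patterns (assumptions for this sketch).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    return PHONE_RE.sub("<PHONE>", text)


if __name__ == "__main__":
    print(redact_pii("Contact jane.doe@example.com or 555-123-4567."))
```

Replacing matches with tokens, rather than dropping whole records, keeps the surrounding text available for training.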
Overview of NeMo Curator
NeMo Curator is a powerful tool designed to help you extract the most value from your raw datasets, transforming them into high-quality, consumable data to ensure high downstream model accuracy. As data volumes have exploded, having a scalable and efficient data pipeline is more important than ever.
NeMo Curator supports the processing of text, image, and video modalities and can scale to 100+ PB of data quickly and efficiently, helping your models remain up to date without suffering from model drift.
NeMo Curator provides a customizable and modular interface, enabling you to select the building blocks for your data-processing pipelines and run them in the order that makes sense for your business-specific use case.
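The idea of ordered, interchangeable building blocks can be sketched in plain Python. This is a minimal illustration, not the NeMo Curator interface: `Step`, `run_pipeline`, and the two example steps are hypothetical names introduced here.

```python
from typing import Callable, Iterable

# A step takes a list of documents and returns a (possibly smaller) list.
Step = Callable[[list[str]], list[str]]


def strip_whitespace(docs: list[str]) -> list[str]:
    """Trim leading and trailing whitespace from every document."""
    return [doc.strip() for doc in docs]


def drop_empty(docs: list[str]) -> list[str]:
    """Remove documents that are empty strings."""
    return [doc for doc in docs if doc]


def run_pipeline(docs: list[str], steps: Iterable[Step]) -> list[str]:
    """Apply each step in the given order, feeding its output to the next step."""
    for step in steps:
        docs = step(docs)
    return docs


if __name__ == "__main__":
    raw = ["  Hello ", "   ", "World"]
    print(run_pipeline(raw, [strip_whitespace, drop_empty]))
    print(run_pipeline(raw, [drop_empty, strip_whitespace]))
```

The two calls give different results because the whitespace-only document only becomes empty after stripping, which is why the order of the blocks is a design decision.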
Text-processing pipelines
NeMo Curator provides comprehensive features for building data-processing pipelines, including pipelines for text.
A reference pipeline starts with data extraction from sources such as the Internet or private repositories, converting content into a standardized format such as Parquet or JSON. The pipeline then cleans the data by removing boilerplate text, unifying Unicode characters, and discarding redundant information. It also deduplicates content using exact, fuzzy, and semantic deduplication filters, so that unique and valuable knowledge is retained.
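The cleaning and deduplication stages above can be sketched with the Python standard library. This is a toy illustration under stated assumptions, not how NeMo Curator implements them: the function names, the three-word shingle size, and the 0.7 threshold are choices made for this example. The fuzzy step compares every pair of documents, which is quadratic; large-scale systems typically use MinHash with locality-sensitive hashing instead.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Unify Unicode forms (NFKC) and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def exact_dedup(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared by SHA-256 digest."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedup(docs: list[str], threshold: float = 0.7) -> list[str]:
    """Drop a document if it is too similar to any document already kept."""
    kept: list[tuple[str, set[str]]] = []
    for doc in docs:
        signature = shingles(doc)
        if all(jaccard(signature, other) < threshold for _, other in kept):
            kept.append((doc, signature))
    return [doc for doc, _ in kept]


if __name__ == "__main__":
    docs = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy cat",
        "completely different sentence about data curation",
    ]
    cleaned = [normalize(doc) for doc in docs]
    print(fuzzy_dedup(exact_dedup(cleaned)))
```

Running the cheap exact filter first shrinks the input to the more expensive fuzzy comparison.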
Finally, NeMo Curator enhances the data with quality filters, adding metadata and annotations to ensure it’s ready for blending and shuffling before model training. This streamlined, high-quality data processing results in models with higher accuracy.
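A heuristic quality filter with metadata annotation, followed by blending and shuffling, might look like the sketch below. The signals, thresholds, and record fields (`word_count`, `alpha_ratio`, `source`) are assumptions chosen for illustration; they are not the filters that ship with NeMo Curator.

```python
import random


def quality_signals(text: str) -> dict:
    """Compute simple heuristic signals that are stored as metadata."""
    words = text.split()
    alpha_chars = sum(ch.isalpha() for ch in text)
    return {
        "word_count": len(words),
        "alpha_ratio": alpha_chars / len(text) if text else 0.0,
    }


def passes(signals: dict, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Accept documents that are long enough and mostly alphabetic."""
    return (signals["word_count"] >= min_words
            and signals["alpha_ratio"] >= min_alpha_ratio)


def annotate_and_filter(docs: list[str], source: str) -> list[dict]:
    """Attach metadata to each document and keep only those that pass the filter."""
    kept = []
    for text in docs:
        signals = quality_signals(text)
        if passes(signals):
            kept.append({"text": text, "source": source, **signals})
    return kept


def blend_and_shuffle(datasets: list[list[dict]], seed: int = 0) -> list[dict]:
    """Concatenate several curated datasets and shuffle them reproducibly."""
    blended = [record for dataset in datasets for record in dataset]
    random.Random(seed).shuffle(blended)
    return blended


if __name__ == "__main__":
    web = annotate_and_filter(
        ["Data curation improves the accuracy of trained models.",
         "$$$ 123 !!! ###",
         "too short"],
        source="web",
    )
    print(web)
```

Keeping the signals on each record, rather than discarding them after filtering, lets later stages re-filter with stricter thresholds without recomputing.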
Image- and video-processing pipelines
In the webinar, we discussed the canonical pipelines for image and video processing and the features that are currently available for you to try.
At a high level, the image-processing pipelines contain several steps: cleaning and preprocessing, model-based filtering, semantic deduplication, and sharding. For more information about image curation, see the Image Curation in NeMo Curator tutorial on GitHub.
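To make the last two image steps concrete, here is a minimal sketch of semantic deduplication followed by sharding. It assumes each image already has an embedding vector from some image model; the file names, the toy three-dimensional embeddings, and the 0.95 threshold are hypothetical, and the greedy pairwise comparison is a simplification rather than the method NeMo Curator uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_dedup(embeddings: dict[str, list[float]], threshold: float = 0.95) -> list[str]:
    """Keep an item only if it is not too similar to any item already kept."""
    kept: list[str] = []
    for item_id, vector in embeddings.items():
        if all(cosine_similarity(vector, embeddings[k]) < threshold for k in kept):
            kept.append(item_id)
    return kept


def shard(ids: list[str], shard_size: int) -> list[list[str]]:
    """Split the surviving items into fixed-size shards."""
    return [ids[i:i + shard_size] for i in range(0, len(ids), shard_size)]


if __name__ == "__main__":
    # Toy embeddings (assumptions); real ones come from an image embedding model.
    embeddings = {
        "cat_1.jpg": [1.0, 0.0, 0.0],
        "cat_1_resized.jpg": [0.99, 0.05, 0.0],
        "dog_1.jpg": [0.0, 1.0, 0.0],
        "car_1.jpg": [0.0, 0.0, 1.0],
    }
    unique = semantic_dedup(embeddings)
    print(shard(unique, shard_size=2))
```

Unlike exact or fuzzy matching on raw bytes, comparing embeddings can catch resized or re-encoded copies of the same image.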
The video processing pipelines also contain several steps, including splitting and transcoding, filtering, annotation, deduplication, and dataset creation. To get notified about support for video processing, sign up for NVIDIA Generative AI News.
Synthetic data generation
Synthetic data generation is a powerful tool for creating entirely new datasets or augmenting existing ones, especially when real-world data is scarce or difficult to obtain.
The webinar showcased how NVIDIA NeMo Curator can generate synthetic records using large language models (LLMs). By employing prompt templates, you can create diverse data variants, which are then scored for quality using reward models. This iterative process of generating and curating synthetic data ensures that the final dataset is both comprehensive and high-quality, ready for model training.
NeMo Curator offers prebuilt pipelines that help you get started quickly. It also enables the integration of customizable building blocks into existing workflows.
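The generate-then-score loop described in this section can be sketched as follows. Everything named here is a hypothetical stand-in: `PROMPT_TEMPLATE`, `toy_llm`, and `toy_reward_model` replace a real LLM endpoint and reward model, and the scoring rule is arbitrary. The sketch shows only the control flow, not the NeMo Curator API.

```python
from typing import Callable

# Hypothetical prompt template; the placeholders are filled per variant.
PROMPT_TEMPLATE = "Write a {style} question about {topic}."


def build_prompts(topics: list[str], styles: list[str]) -> list[str]:
    """Expand the template into one prompt per (topic, style) combination."""
    return [PROMPT_TEMPLATE.format(style=s, topic=t) for t in topics for s in styles]


def generate_and_curate(
    prompts: list[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    min_score: float,
) -> list[dict]:
    """Generate one response per prompt and keep those the reward model rates highly."""
    kept = []
    for prompt in prompts:
        response = generate(prompt)
        reward = score(prompt, response)
        if reward >= min_score:
            kept.append({"prompt": prompt, "response": response, "reward": reward})
    return kept


def toy_llm(prompt: str) -> str:
    """Stand-in for an LLM call (assumption): echoes the prompt."""
    return f"[synthetic answer to: {prompt}]"


def toy_reward_model(prompt: str, response: str) -> float:
    """Stand-in for a reward model (assumption): favors 'detailed' prompts."""
    return 0.9 if "detailed" in prompt else 0.3


if __name__ == "__main__":
    prompts = build_prompts(["deduplication", "PII removal"], ["short", "detailed"])
    for record in generate_and_curate(prompts, toy_llm, toy_reward_model, min_score=0.5):
        print(record)
```

Passing the generator and scorer in as callables keeps the loop independent of any particular model, so either can be swapped without touching the curation logic.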
World-class performance
Scalability is a key concern when working with large datasets.
The webinar highlighted how NeMo Curator can handle petabytes of data, thanks to its GPU-accelerated architecture. By using NVIDIA RAPIDS libraries such as cuDF, cuGraph, and cuML and integrating tools like Ray for video processing and Dask for text and image processing, you can scale your data-processing pipelines and process data up to 17x faster.
This scalability ensures that data processing pipelines can grow alongside the increasing demands of AI model training.
Get started
Building data processing pipelines from scratch can be challenging, especially when dealing with different data modalities.
The webinar addressed common challenges such as the lack of optimized models and tooling for synthetic data generation. NVIDIA solutions, including pretrained models and enterprise support, help you overcome these hurdles.
NeMo Curator is available in multiple ways:
- NeMo Framework container
- /NVIDIA/NeMo-Curator GitHub repo
- /nemo-curator PyPI package
To get started in production, obtain an NVIDIA AI Enterprise license to get production-ready branches, security updates, API stability, and support from NVIDIA AI experts.
Conclusion
The NVIDIA webinar underscored the significance of high-quality data in generative AI model development. With NeMo Curator, you have access to powerful resources for data curation, synthetic data generation, and building scalable data processing pipelines.
As the field of AI continues to grow, the importance of data quality and processing will remain at the forefront of successful model development. By addressing the challenges of data processing and offering solutions that enhance efficiency and accuracy, NVIDIA empowers you to build the next generation of AI models with confidence.
For more information about NeMo Curator, see the full webinar at Enhance Generative AI Model Accuracy Through High-Quality Multimodal Data Processing.