NVIDIA NeMo Curator for Developers
NVIDIA NeMo™ Curator is a GPU-accelerated data-curation tool that improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides pre-built pipelines for generating synthetic data to customize and evaluate generative AI systems.
How NVIDIA NeMo Curator Works
NeMo Curator streamlines data-processing tasks such as downloading, extraction, cleaning, quality filtering, deduplication, and blending or shuffling. It exposes these tasks as Pythonic APIs, making it easier for developers to build data-processing pipelines. High-quality data processed with NeMo Curator enables you to achieve higher accuracy with less data and faster model convergence, reducing training time.
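To make the stages named above concrete, here is a minimal sketch of a curation pipeline (cleaning, quality filtering, exact deduplication, shuffling) in plain standard-library Python. It does not use the NeMo Curator API: every function name, the word-count threshold, and the sample documents are hypothetical and exist only for illustration. It runs on the CPU with no GPU acceleration.

```python
import hashlib
import random
import re
from typing import Callable, List

# A stage takes a list of documents and returns a new list (hypothetical interface).
Stage = Callable[[List[str]], List[str]]


def clean(docs: List[str]) -> List[str]:
    """Collapse runs of whitespace and trim each document."""
    return [re.sub(r"\s+", " ", doc).strip() for doc in docs]


def quality_filter(docs: List[str], min_words: int = 3) -> List[str]:
    """Keep documents with at least min_words words (an assumed, toy heuristic)."""
    return [doc for doc in docs if len(doc.split()) >= min_words]


def exact_dedup(docs: List[str]) -> List[str]:
    """Drop documents whose lowercased text hashes to an already-seen digest."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def shuffle(docs: List[str], seed: int = 0) -> List[str]:
    """Return a shuffled copy, seeded so the order is reproducible."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def run_pipeline(docs: List[str], stages: List[Stage]) -> List[str]:
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        docs = stage(docs)
    return docs


raw_docs = [
    "GPU-accelerated   curation scales to large corpora.",
    "gpu-accelerated curation scales to large corpora.",  # case-variant duplicate
    "too short",                                          # fails the quality filter
    "Deduplication removes repeated documents\nfrom the dataset.",
]

curated = run_pipeline(raw_docs, [clean, quality_filter, exact_dedup, shuffle])
```

Because each stage shares the same list-in, list-out signature, stages can be reordered, removed, or swapped without touching the others, which is the general idea behind building a pipeline from modular blocks.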
NeMo Curator supports text, image, and video modalities and can scale to 100+ PB of data. It leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGraph, paired with Dask and Ray, to scale workloads across multi-node, multi-GPU environments, significantly reducing data-processing time.
NeMo Curator provides a customizable and modular interface, allowing you to select the building blocks for your data processing pipelines. Please refer to the architecture diagrams below to see how you can build data processing pipelines.
The architecture diagram below shows the various features available for processing text.
Introductory Blog
Learn about the various features NeMo Curator offers for processing high-quality data in this introductory blog.
Tutorial Notebooks
These tutorials provide the coding foundation for building applications that consume data curated with NeMo Curator.
Introductory Webinar
Explore how to easily build scalable data-processing pipelines to create high-quality datasets for training and customization.
Documentation
These docs provide an in-depth overview of the various features supported, best practices, and tutorials.
Ways to Get Started With NVIDIA NeMo Curator
Use the right tools and technologies to generate high-quality datasets for LLM training.
Apply
Request early access to the NeMo Curator microservice, a GPU-accelerated data processing microservice to prepare large-scale, high-quality datasets for training and customizing generative AI models.
Download
For those looking to use the NeMo framework for development, the container is available to download for free on the NGC catalog. You can also request a free license to use NVIDIA AI Enterprise in production for 90 days using your existing infrastructure.
Access Code
To use the latest pre-release features and source code, NeMo Curator is available as an open-source project on GitHub.
Starter Kits
Start developing your generative AI application with NeMo Curator by accessing tutorials, best practices, and documentation for various use cases.
Text Processing
Process high-quality text data with features such as deduplication, quality filtering, and synthetic data generation.
Image Processing
Process high-quality image data with features such as semantic deduplication, CLIP image embedding, and NSFW and aesthetic filters.
Video Processing
Process high-quality video data with features such as splitting, transcoding, filtering, annotation, and semantic deduplication.
Support for video processing is coming soon!
NVIDIA NeMo Curator Learning Library
More Resources
Ethical AI
NVIDIA’s platforms and application frameworks enable developers to build a wide array of AI applications. Consider potential algorithmic bias when choosing or creating the models being deployed. Work with the model’s developer to ensure that it meets the requirements for the relevant industry and use case; that the necessary instruction and documentation are provided to understand error rates, confidence intervals, and results; and that the model is being used under the conditions and in the manner intended.
