How to Build a Generative AI-Enabled Synthetic Data Pipeline for Perception-Based Physical AI

Training physical AI models used to power autonomous machines, such as robots and autonomous vehicles, requires huge amounts of data. Acquiring large sets of diverse training data can be difficult, time-consuming, and expensive. Data is often limited due to privacy restrictions or concerns, or simply may not exist for novel use cases. In addition, the available data may not apply to the full range of potential situations, limiting the model’s ability to accurately predict and respond to diverse scenarios.

Synthetic data, generated from digital twin simulations built in NVIDIA Omniverse and now, upscaled with NVIDIA Cosmos, addresses the gaps in real-world data, enabling developers to bootstrap physical AI model training. You can quickly generate large, diverse datasets by varying many different parameters such as layout, asset placement, location, color, object size, and lighting conditions. This data can then be used to aid in the creation of a generalized model.

To help you build generative AI-powered synthetic data generation pipelines and workflows, check out the Synthetic Data Generation with Generative AI Reference Workflow.

Accelerating the data generation process with generative AI

Achieving physical accuracy is crucial for bridging the sim-to-real domain gap in training perception AI models. A typical synthetic data generation (SDG) process begins by meticulously recreating objects in a virtual environment, and then accurately replicating the materials, textures, and other attributes of their real-world counterparts.

Once the initial 3D scene is constructed, developers employ domain randomization techniques to systematically vary aspects such as lighting, colors, and textures. This randomization generates a diverse set of annotated images, enhancing the model’s ability to generalize. The process iterates, continuously refining the synthetic data and training the model until the desired key performance indicators (KPIs) are achieved.

Despite the effectiveness of traditional SDG methods, many steps—from scene construction to parameter randomization—remain manual and time-consuming. Generative AI offers a powerful solution to streamline this process significantly.

Advanced diffusion models, such as Edify and SDXML, can rapidly generate high-quality visual content from text or image descriptions. Provided the right guardrails, these models excel at programmatically altering image parameters, including layout, asset placement, color schemes, object dimensions, and lighting conditions, vastly reducing manual effort.

Moreover, generative AI enables efficient image augmentation without modifying the entire 3D scene. Using simple text-based prompts, developers can swiftly add realistic details like surface rust or apply selective blur effects. This approach dramatically accelerates the creation of diverse datasets.

To illustrate, Figure 1 shows how a single base image was augmented in four distinct ways using straightforward text prompts. Traditionally, a technical artist would require several hours to implement such modifications and regenerate new images. Generative AI accomplishes this task in a fraction of the time, significantly enhancing productivity and dataset diversity.

Prompt 1

white tiled linoleum floor
green shiny new counterbalance forklift
wooden pallet light colored pine wood, softwood
garbage container

Prompt 2

dark cracked dirty concrete floor
yellow counterbalance forklift
wooden pallet light colored pine wood, softwood
black garbage container

Prompt 3

cracked concrete floor
white counterbalance forklift
wooden pallet light colored pine wood, softwood
garbage container

Prompt 4

green chipped linoleum floor
blue rusty counterbalance forklift
wooden pallet light colored pine wood, softwood
garbage container

Reference workflow overview

The reference workflow is suitable for developers who are training computer vision models in robotics, as well as computer vision applications for smart spaces. The following section describes the key steps of the reference workflow and its core technologies.

Scene creation: A comprehensive 3D warehouse scene serves as the foundation, incorporating essential assets like shelves, boxes, and pallets. This base environment can be dynamically enhanced using 3D NIM microservices, enabling the seamless addition of diverse objects and the integration of 360° HDRI backgrounds.
Domain randomization: Developers can leverage USD Code NIM, a cutting-edge large language model (LLM) specialized in OpenUSD, to perform domain randomization. This powerful tool not only answers OpenUSD-related queries but also generates USD Python code to make changes in the scene, streamlining the process of programmatically altering various scene parameters within Omniverse Replicator.
Data generation: The third step involves exporting the initial set of annotated images. Replicator offers a wide array of built-in annotators, including 2D bounding boxes, semantic segmentation, depth maps, surface normals, and numerous others. The choice of output format (such as bounding boxes or segmentation masks) depends on the specific model requirements or use case. Data can be exported using various writers: the BasicWriter for standard output, the KittiWriter for KITTI format, or custom writers for COCO format.
Data augmentation: In the final stage, developers can leverage generative AI models like SDXL and Edify with ComfyUI, a versatile open-source platform for constructing and executing diffusion model pipelines.

Some of core technologies in this workflow include:

Edify 360 NIM: Shutterstock Early Access preview of Generative 3D service for 360 HDRI (High Dynamic Range Image) generation. Trained on NVIDIA Edify using Shutterstock’s licensed creative libraries.
Edify 3D NIM: Shutterstock Generative 3D service for 3D asset generation, used for additional 3D objects for scene dressing. Trained on NVIDIA Edify using Shutterstock’s licensed creative libraries.
USD Code : A language model that answers OpenUSD knowledge queries and generates USD Python code.
USD Search: An AI-powered search for OpenUSD data, 3D models, images, and assets using text or image-based inputs.
Image generation models plus ComfyUI: Fast generative text-to-image models that can synthesize photorealistic images from a text prompt in a single network evaluation with a graph and nodes interface for advanced developers.
Omniverse Replicator: A framework for developing custom SDG pipelines and services and is integrated as an extension in NVIDIA Isaac Sim.

Datasets can be further multiplied using NVIDIA Cosmos world foundation models. Developers can output images or video renders from NVIDIA Omniverse, then upscale them from 3D-to-real with the models, helping generate an exponentially larger dataset.

Using this workflow guide, you’ll be able to develop your custom SDG pipeline that can be used for training various types of perception AI models, from detection to classification and segmentation. By implementing this reference workflow, you or your client will benefit from:

Accelerated AI model training: Overcome the data gap and accelerate AI model development while reducing the overall cost of acquiring and labeling data required to train text, visual, and physical AI models.
Privacy and security: Address privacy issues and reduce bias by generating diverse synthetic datasets to represent the real world.
Increased model accuracy: Create highly accurate, generalized AI models by training with diverse data that includes rare but crucial corner cases that are otherwise impossible to collect.
Scalability: Procedurally generate data with an automated pipeline that scales with your use case across manufacturing, automotive, robotics, and more.

Get started step-by-step with the Synthetic Data Generation with Generative AI Reference Workflow.

Stay up to date by subscribing to our newsletter, and following NVIDIA Robotics on YouTube, Discord, and the NVIDIA Developer forums.