Physical AI enables autonomous systems—think robots, self-driving cars, and smart spaces—to perceive, understand, and act intelligently in the real world. However, effectively training these complex systems requires immense and diverse datasets. Relying solely on real-world data collection is often costly, time-consuming, and limited by safety and practical constraints.
To overcome this, developers and roboticists are embracing synthetic data generation (SDG) to rapidly create diverse, realistic scenarios in controlled, scalable environments. Yet current SDG practice is often manual, with limited tooling that prevents full automation. To truly accelerate and scale it, AI agents that connect reasoning models with generative AI models are crucial. These multi-agent systems empower developers to tackle complex tasks with a network of specialized AI agents, boosting synthetic data quality and speeding up physical AI development.
This blog post shows a multi-agent workflow using generative AI to systematically produce high-quality synthetic datasets, accelerating robot training and deployment. It uses the NVIDIA NeMo Agent toolkit, NVIDIA Omniverse, OpenUSD, NVIDIA Cosmos, and NVIDIA NIM microservices to build an automated pipeline for augmenting 3D environments, enhancing realism, and scaling synthetic data for robotic policy training.

Multi-agent SDG workflow overview
This section outlines the fully automated SDG workflow and how it addresses key challenges faced by robotics developers.
Robotics developers building robust navigation systems for warehouse environments often run up against the limits of default simulation setups, which typically lack the complexity and obstacle variety needed to properly exercise navigation algorithms. To improve algorithm robustness and adaptability, it becomes essential to expand synthetic training environments with diverse, realistic scenarios.
However, scaling such environments typically demands deep expertise in 3D workflows or advanced prompting techniques, creating barriers for robotics developers. To remove these barriers and enable robotics developers to quickly generate high-quality synthetic data, we demonstrate a natural language-driven, fully automated SDG and augmentation workflow, powered by a multi-agent system.
The following diagram shows the core components of the multi-agent synthetic data generation workflow.

First, a robotics developer begins the process by describing the complete workflow through a single prompt:
Please use planning to complete the following task.
First, locate and open the sceneblox scene, it's in the /usd/Scene_Blox directory.
Once the scene is loaded, create an initial robot path from point (-18.222, -17.081) to point (-18.904, -26.693).
Next, search for warehouse-appropriate assets such as shipping crates, storage containers, and mobile trolleys.
Then place two of them in the scene as obstacles that the robot would need to navigate around.
After placing the obstacles, create a new robot path using the same start point (-18.222, -17.081) and end point (-18.904, -26.693) as the original path, but this time ensuring the robot avoids all the obstacles you've added.
Once both paths are set up, capture a video of the viewport showing the robot's navigation.
Finally, use cosmos transfer to enhance the captured video into a photorealistic rendering.
For the enhancement, create a detailed prompt that transforms the scene into a modern e-commerce fulfillment center with direct sunlight streaming through large windows and skylights, creating very bright natural lighting throughout the space. The polished concrete floors reflect the sunlight, tall metal shelving units are organized in rows, conveyor systems are visible, and packing stations are neatly arranged. The warehouse has clean white walls, organized inventory areas, and a professional daytime atmosphere. The final video should look like real footage from an e-commerce fulfillment warehouse during a sunny day.
This activates a network of specialized agents to execute the required tasks collaboratively:
- Planning agent: Interprets the user’s high-level goal, breaks it down into sequential, actionable steps, and coordinates other agents to fulfill each step efficiently. It calls on USD Search NIM, USD Code NIM, and an AI understanding agent for environment variation, such as adding obstacles by locating suitable 3D assets and accurately placing them within the scene. See the step-by-step plan generated by the planning agent after this list.
- Realism augmentation agent: Uses world foundation models (WFMs) such as the NVIDIA Cosmos Transfer microservice (preview) to enhance the realism and visual fidelity of video outputs, building on ground-truth simulations captured in Omniverse to produce high-quality, robust training data across diverse environments.
- Reasoning agent: Uses NVIDIA Cosmos Reason to evaluate generated videos and determine their suitability for navigation policy training, providing critical assessment and quality control.
- Supporting helper agent: Streamlines routine subtasks, including loading scenes into Omniverse and capturing video outputs for subsequent agent processing, enabling a fully automated end-to-end workflow.
All of these AI and helper agents are orchestrated with the NeMo Agent toolkit, which provides multi-agent orchestration and a unified configuration system to automate the SDG workflow. The open source toolkit also offers observability into the multi-agent system, tracking cost and performance and surfacing opportunities for optimization.
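Before looking at the generated plan, the sketch below illustrates what such a unified configuration might capture. The structure and names (the `agents` and `tools` keys, the agent labels, and the telemetry flags) are hypothetical and do not reflect the NeMo Agent toolkit's actual configuration schema; the point is simply that the planning, realism augmentation, reasoning, and helper agents and their tools can be declared in one place.
```python
# Hypothetical illustration only: a unified, declarative description of the
# multi-agent SDG workflow. This is not the NeMo Agent toolkit's real schema.
SDG_WORKFLOW_CONFIG = {
    "agents": {
        "planning": {
            "role": "Break the user prompt into ordered steps and dispatch them",
            "tools": ["usd_search_nim", "usd_code_nim", "scene_understanding"],
        },
        "realism_augmentation": {
            "role": "Enhance captured videos with a world foundation model",
            "tools": ["cosmos_transfer"],
        },
        "reasoning": {
            "role": "Score enhanced videos for policy-training suitability",
            "tools": ["cosmos_reason"],
        },
        "helper": {
            "role": "Load USD stages and capture viewport video",
            "tools": ["kit_open_stage", "kit_videocapture"],
        },
    },
    # Observability settings such as cost and latency tracking would also live
    # in the same configuration, so runs stay reproducible and comparable.
    "telemetry": {"track_cost": True, "track_latency": True},
}
```
Keeping the agent network declarative in this way makes it practical to track cost and performance per agent and to swap tools without touching the orchestration logic.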
ChatUSD_Planning
Locate and open the sceneblox scene in /usd/Scene_Blox. Create an initial robot path from (-18.222, -17.081) to (-18.904, -26.693). Locate warehouse assets (plastic bins, cardboard boxes, hand trucks). Add two obstacles for navigation. Create a second path avoiding obstacles. Capture viewport video of navigation. Enhance video into a photorealistic e-commerce fulfillment center.
------------------------------------
PLAN: Create robot navigation paths with obstacles and enhance the video to a photorealistic e-commerce fulfillment center
Step 1: Open the sceneblox scene
- Use `kit_open_stage` with path `/usd/Scene_Blox` to load the scene.
Step 2: Create initial robot path
- Use `robot_path {"start_point": [-18.222, -17.081], "end_point": [-18.904, -26.693]}` to generate the first navigation path.
Step 3: Locate warehouse-appropriate assets
- Use `ChatUSD_USDSearch` to search for assets such as plastic storage bins, cardboard shipping boxes, and wheeled hand trucks in the available directories.
Step 4: Place obstacles in the scene
- Use `create_obstacles_along_path {"assets": ["path_to_selected_asset_1", "path_to_selected_asset_2"]}` to add two obstacles along the robot's navigation path.
Step 5: Create alternate robot path avoiding obstacles
- Use `robot_path {"start_point": [-18.222, -17.081], "end_point": [-18.904, -26.693]}` to generate a new path that avoids the placed obstacles.
Step 6: Capture a video of the viewport showing robot navigation
- Use `kit_videocapture /temp/robot_navigation.mp4` to record the animation.
Step 7: Enhance the video to a photorealistic fulfillment center
- Use `cosmos_transfer` with the prompt: "Transform the video into a photorealistic modern e-commerce fulfillment center with direct sunlight streaming through large windows and skylights, polished concrete floors reflecting the sunlight, tall metal shelving units organized in rows, visible conveyor systems, neatly arranged packing stations, clean white walls, organized inventory areas, and a professional daytime atmosphere resembling real footage during a sunny day."
This step-by-step plan is generated from the planning agent’s understanding of the initial user prompt.
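To make the plan concrete, the following sketch shows what its seven steps amount to when executed in order. The tool names come from the plan above, but the `call_tool` dispatcher, its signature, and the return values are assumptions made for illustration; in the real workflow the planning agent issues these calls through the toolkit rather than a hand-written script.
```python
# Illustrative only: executing the planning agent's seven steps in order.
# `call_tool` is a hypothetical dispatcher standing in for the agent runtime.
def run_plan(call_tool):
    start, end = [-18.222, -17.081], [-18.904, -26.693]

    call_tool("kit_open_stage", {"path": "/usd/Scene_Blox"})            # Step 1
    call_tool("robot_path", {"start_point": start, "end_point": end})   # Step 2

    # Step 3: assumed to return a list of matching asset paths.
    assets = call_tool("ChatUSD_USDSearch",
                       {"query": "plastic bins, cardboard boxes, hand trucks"})
    call_tool("create_obstacles_along_path", {"assets": assets[:2]})    # Step 4

    call_tool("robot_path", {"start_point": start, "end_point": end})   # Step 5: re-plan around obstacles
    call_tool("kit_videocapture", {"output": "/temp/robot_navigation.mp4"})  # Step 6

    call_tool("cosmos_transfer", {                                      # Step 7
        "video": "/temp/robot_navigation.mp4",
        "prompt": "Transform the video into a photorealistic modern "
                  "e-commerce fulfillment center ...",
    })
```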
Technical preview
This section outlines the architecture and key components that enable a fully automated, natural language-driven multi-agent SDG workflow.
Core Omniverse sample extensions
- omni.ai.aiq.sdg
This sample extension forms the backbone of the SDG workflow. It coordinates a multi-agent system that interprets prompts, modifies scenes, and controls the entire video generation pipeline within Omniverse Kit. Its key functions include:
- Prompt interpretation: Uses a planning agent to break down user goals into actionable steps.
- Scene understanding: Queries the USD stage for object metadata, spatial relationships, and layout.
- Scene modification: Automatically generates and runs Python code to add or move objects, reference assets, or trigger animations with USD Code NIM and USD Search NIM (see the sketch after this extension list).
- Recording control: Captures rendered video from the viewport with optional arbitrary output variables (AOV) support.
- Realism and validation: Integrates Cosmos NIM (e.g., cosmos_transfer, cosmos_reason) for video enhancement and content assessment.
- omni.ai.langchain.agent.headless
This headless automation system extension enables the entire SDG workflow to run in a non-GUI (headless) mode, ideal for cloud deployment and batch processing. It can load a USD stage, execute agents, run video synthesis, and save outputs—all through API calls without user interaction.
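To give a feel for the scene modification step called out above, here is a minimal, hand-written USD Python sketch that references an obstacle asset into a stage and translates it into place. The file paths, asset URLs, and positions are placeholders; in the workflow, code of this kind is generated by USD Code NIM from the asset paths returned by USD Search NIM.
```python
from pxr import Usd, UsdGeom, Gf

def place_obstacle(stage: Usd.Stage, asset_url: str, position, index: int = 0):
    """Reference an obstacle asset into the stage and translate it into place."""
    prim_path = f"/World/Obstacles/obstacle_{index:02d}"
    prim = stage.DefinePrim(prim_path, "Xform")
    prim.GetReferences().AddReference(asset_url)  # pull in the found asset
    UsdGeom.XformCommonAPI(prim).SetTranslate(Gf.Vec3d(*position))
    return prim

# Placeholder stage path and asset URLs for illustration; the real asset
# locations come from USD Search NIM query results.
stage = Usd.Stage.Open("/usd/Scene_Blox/scene.usd")
place_obstacle(stage, "omniverse://assets/shipping_crate.usd", (-18.5, -20.0, 0.0), 0)
place_obstacle(stage, "omniverse://assets/mobile_trolley.usd", (-18.7, -23.5, 0.0), 1)
stage.Save()
```
In practice, code generated inside Omniverse Kit would operate on the already-open stage rather than opening and saving a file as this standalone sketch does.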
System architecture
The SDG workflow is split into two systems that work in tandem:
1. Scenario definition system
This system is responsible for generating a wide range of scene variations using text prompts. These prompts are designed to control the types of objects, their positions, and spatial relationships within the scene.
- Prompt generation: Based on configurable rules or templates (e.g., “Place 3 red cups near the robot”), the system can produce hundreds or thousands of unique prompts.
- Prompt management: Prompts can be stored, versioned, and reused, enabling reproducible experiments.
- API integration: Each prompt is sent to the video synthesis system through a simple HTTP API call.
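A minimal sketch of this prompt generation and storage step follows. The templates, object lists, and JSON file name are illustrative assumptions; the point is that a small set of rules can expand into many reproducible, versionable prompts.
```python
import itertools
import json

# Illustrative rule-based prompt generation: a few templates and object lists
# expand combinatorially into many unique scene-variation prompts.
TEMPLATES = [
    "Place {count} {asset} near the robot's path and re-plan around them.",
    "Scatter {count} {asset} between the path's start and end points.",
]
ASSETS = ["shipping crates", "storage containers", "mobile trolleys"]
COUNTS = [2, 3, 4]

prompts = [
    template.format(count=count, asset=asset)
    for template, asset, count in itertools.product(TEMPLATES, ASSETS, COUNTS)
]

# Store the batch so the experiment can be versioned and reproduced later.
with open("prompt_batch_v1.json", "w") as f:
    json.dump(prompts, f, indent=2)

print(f"Generated {len(prompts)} prompts")
```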
2. Video synthesis system
This system interprets each prompt and produces a corresponding video. It’s built for scalable deployment in the cloud using the NVIDIA NIM framework.
- USD stage loading: Opens a pre-configured base scene that includes a robot animation and static environment.
- Multi-agent execution: The core agent interprets the prompt, searches for matching assets using USD Search NIM, and modifies the scene using scripted actions.
- Animation and recording: Once the scene is set, the robot performs its animation, and the viewport is recorded into an MP4 video.
- Post-processing: Video is optionally enhanced using cosmos_transfer and validated with cosmos_reason to ensure quality and relevance.
- Output delivery: The final video is stored in a cloud bucket or file server, and a URL is returned to the calling service.
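The post-processing step can be sketched as follows. The endpoints, request fields, and response shapes shown are assumptions for illustration rather than the actual Cosmos NIM APIs; they only capture the pattern of enhancing a captured video with cosmos_transfer and then gating it with a cosmos_reason-style suitability check.
```python
import requests

# Hypothetical endpoints standing in for locally deployed Cosmos NIM microservices.
COSMOS_TRANSFER_URL = "http://localhost:8800/v1/cosmos_transfer"
COSMOS_REASON_URL = "http://localhost:8801/v1/cosmos_reason"

def enhance_and_validate(video_path: str, style_prompt: str, min_score: float = 0.7):
    """Enhance a captured viewport video, then keep it only if it scores well."""
    enhanced = requests.post(
        COSMOS_TRANSFER_URL,
        json={"video": video_path, "prompt": style_prompt},
        timeout=600,
    ).json()  # assumed to return {"video_url": ...}

    review = requests.post(
        COSMOS_REASON_URL,
        json={"video": enhanced["video_url"],
              "question": "Is this video suitable for navigation policy training?"},
        timeout=300,
    ).json()  # assumed to return {"score": float, "explanation": str}

    return enhanced["video_url"] if review["score"] >= min_score else None
```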
End-to-end workflow
- Batch prompt generation: The scenario system generates prompts describing different object configurations and scene variations.
- Prompt submission: Each prompt is sent to the synthesis API.
- Scene assembly and recording: The system builds the scene, runs the animation, and records the video.
- Enhancement and validation: Style transfer is applied, and the result is evaluated.
- Output aggregation: Video links are collected and returned for training dataset assembly.
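Putting these steps together, a client-side driver for the end-to-end loop might look like the sketch below: it reads a versioned prompt batch, submits each prompt to the synthesis API, and collects the returned video URLs into a manifest for dataset assembly. The endpoint, payload, and response fields are assumptions for illustration.
```python
import json
import requests

# Hypothetical synthesis endpoint exposed by the headless Omniverse service.
SYNTHESIS_API_URL = "http://synthesis-service:8000/v1/generate"

def run_batch(prompt_file: str, manifest_file: str) -> None:
    with open(prompt_file) as f:
        prompts = json.load(f)

    manifest = []
    for prompt in prompts:
        # Each call builds the scene, records the video, enhances and validates it,
        # and returns a URL to the final clip (assumed response: {"video_url": ...}).
        response = requests.post(SYNTHESIS_API_URL, json={"prompt": prompt}, timeout=1800)
        response.raise_for_status()
        manifest.append({"prompt": prompt, "video_url": response.json()["video_url"]})

    # Aggregate results for downstream training-dataset assembly.
    with open(manifest_file, "w") as f:
        json.dump(manifest, f, indent=2)

run_batch("prompt_batch_v1.json", "sdg_manifest_v1.json")
```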
Design goals
To summarize, the multi-agent SDG workflow is designed with the following goals:
- Scalability: Move from manual, single-scene authoring to large-scale, automated data generation.
- Developer efficiency: Eliminate the need for deep 3D or USD scripting expertise—developers only need to describe the scene in plain language.
- Diversity and coverage: Enable rapid generation of a wide variety of training scenarios to improve generalization in robotics models.
- Modularity: Easily integrate additional agents, assets, or downstream services.
Next steps
The rapid advancement of physical AI hinges on efficient and scalable training data. As shown, using multi-agent systems provides a powerful solution to the limitations of traditional, manual synthetic data generation. By automating the creation of high-quality synthetic datasets, we can significantly accelerate robotic policy training, validation, and ultimately, real-world deployment.
Dive deeper into the technologies that made this possible:
- Explore the NeMo Agent toolkit documentation and get started with agentic workflows.
- Get started building generative physical AI-powered applications with NVIDIA Omniverse developer resources and documentation.
- Understand the power of Universal Scene Description for simulation and discover the capabilities of NVIDIA Cosmos world foundation models for generating photorealistic synthetic data.
- Accelerate your AI deployments with optimized inference microservices using the NVIDIA NIM documentation portal.
Watch the NVIDIA GTC Paris keynote from NVIDIA founder and CEO Jensen Huang at VivaTech 2025, and explore GTC Paris sessions.
Stay up to date by subscribing to NVIDIA news and following NVIDIA Omniverse on Discord and YouTube.
- Visit our Omniverse developer page to get all the essentials you need to get started
- Access a collection of OpenUSD resources, including the new self-paced Learn OpenUSD training curriculum
- Tune into upcoming OpenUSD Insiders livestreams and connect with the NVIDIA Developer Community
Get started with developer starter kits to quickly develop and enhance your own applications and services.