How can an AI system understand the difference between a plausible accident and a physically impossible event? Or plan a multi-step interaction across humans, objects, and environments in an edge-case scenario? These questions sit at the core of physical intelligence: the kind of intelligence that underpins how robots manipulate the world, how autonomous vehicles make split-second decisions, and how virtual agents simulate reality.
NVIDIA Cosmos Reason is a world foundation model (WFM) for physical AI—built not just to see, but to reason. Trained to understand space, time, and physics, it can critique synthetic data and build curated datasets to train embodied AI systems like robots and autonomous vehicles to act more realistically. This post covers how Cosmos Reason is developed, where it’s used, and how you can use openly available model checkpoints and scripts to run the model for physical AI tasks.
Recap: NVIDIA Cosmos world foundation models for physical AI
Cosmos is a WFM development platform. At its core are Cosmos WFMs: pretrained, multimodal models designed to understand and generate world states as video, replicating physical environments to train physical AI systems.
These models learn from over 20M hours of robotics and driving data, enabling them to predict how environments change over time or adapt scenes to new conditions. With NVIDIA Cosmos Predict, developers can generate future frames from text, images, or video. With NVIDIA Cosmos Transfer, they can relight or change environments in videos to develop diverse, physics-aware training data at scale. Cosmos also provides tools to curate data, tokenize it, and post-train the models for specific robots, autonomous systems, or other downstream tasks.
Cosmos Reason for scalable robotics training data
First unveiled at NVIDIA GTC 2025, Cosmos Reason is now available to transform how synthetic data is generated and curated for training physical AI systems. It is an open, spatiotemporally aware reasoning model that interprets visual input, analyzes it in the context of a provided text prompt, applies chain-of-thought reasoning to evaluate and reward candidate responses, and generates optimal decisions or captions.

Inside Cosmos Reason
Cosmos Reason is built using supervised fine-tuning (SFT) and reinforcement learning that bridges multimodal perception and real-world decision-making:
- Physical AI SFT: Focuses on real-world reasoning. Learns object affordances (e.g., “a pan conducts heat”), action chains (multi-step plans), and spatial feasibility (e.g., “a person can’t walk through walls”) using curated physical interaction datasets.
- Reinforcement learning for embodied decisions: The long chain-of-thought reasoning capability in Cosmos Reason enables training with a small training set while still generalizing to held-out test scenarios. Verifiable physical AI rewards, such as an "arrow-of-time" reward, enable the model to learn world dynamics without human annotations (see the sketch below).
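To make the reward idea concrete, here is a minimal sketch of what a verifiable "arrow-of-time" reward could look like. The helper names and model interface are hypothetical and not the actual Cosmos Reason training code; the point is that the label (forward versus reversed playback) comes from how the clip is constructed, so no human annotation is needed.

```python
import random

def arrow_of_time_reward(model, clip_frames):
    """Reward 1.0 if the model correctly detects whether a clip plays forward or in reverse.

    `model.answer(frames, prompt)` is a hypothetical inference wrapper; the ground-truth
    label comes from whether we reversed the frames ourselves, so no human labels are needed.
    """
    reversed_clip = random.random() < 0.5                   # flip a coin per training sample
    frames = clip_frames[::-1] if reversed_clip else clip_frames
    prompt = "Is this video playing forward or in reverse? Answer 'forward' or 'reverse'."
    answer = model.answer(frames, prompt).strip().lower()   # hypothetical model API
    target = "reverse" if reversed_clip else "forward"
    return 1.0 if answer.startswith(target) else 0.0
```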
Testing Cosmos Reason on common sense
Cosmos Reason excels at understanding real-world physical situations—like how objects and people interact in dynamic environments—using both video and text. Evaluated across benchmarks like BridgeData V2, RoboVQA, and Agibot, the model shows strong common-sense reasoning and situational awareness.
Fine-tuning on physical AI tasks boosts the base vision-language model’s performance by over 10%, while reinforcement learning adds another 5% gain. On average, Cosmos Reason achieves a score of 65.7 across key benchmarks, setting a high bar for AI systems in robotics, autonomous vehicles, and embodied agents.
There’s still room for improvement: post-training on high-quality, task-specific curated data and continued reinforcement learning can further enhance performance of Cosmos Reason.
Cosmos Reason scores across physical AI benchmarks:

| Common Sense | BridgeData V2 | RoboVQA | Agibot | HoloAssist | AV | RoboFail | Avg. |
|---|---|---|---|---|---|---|---|
| 56.2 | 73.5 | 86.8 | 54.2 | 60.0 | 67.0 | 62.0 | 65.7 |
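As a quick sanity check, the reported average follows directly from the per-benchmark scores: (56.2 + 73.5 + 86.8 + 54.2 + 60.0 + 67.0 + 62.0) / 7 ≈ 65.7.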
How to use Cosmos Reason
Developers can download the model checkpoints from Hugging Face and get the inference and post-training scripts from GitHub.
The model takes a low-resolution video input (for example, 604×480) along with a text prompt that specifies the developer's intent, such as a question or a request for an explanation, guiding the model to reason and respond accordingly. Developers can also use the prompt upsampler model to improve text prompts.
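As a starting point, the snippet below is a minimal, hedged inference sketch. The checkpoint name (nvidia/Cosmos-Reason1-7B) and the Qwen2.5-VL-compatible interface are assumptions based on the published model cards; check the official inference scripts on GitHub for the supported entry points and parameters.

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "nvidia/Cosmos-Reason1-7B"  # assumed checkpoint name; verify on Hugging Face

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# A low-resolution video plus a text prompt that describes the developer's intent.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 4},
        {"type": "text", "text": "Is the robot's grasp in this clip physically plausible? Reason step by step."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate the chain-of-thought response and decode only the newly generated tokens.
output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```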
Cosmos WFMs, including Cosmos Reason, are optimized for best performance on NVIDIA AI infrastructure. To run the models, developers can set up a Docker environment or run them in their own environment.
For larger industrial workloads and to run vision AI pipelines, developers can use the power of NVIDIA Blackwell GB200 on NVIDIA DGX Cloud and run accelerated inference on NVIDIA Hopper H100 or NVIDIA Ampere A100 GPUs using inference scripts.
Cosmos WFMs power scalable synthetic data generation pipelines that help train robotic systems with greater efficiency and coverage than traditional methods.
Cosmos Reason generates diverse, realistic prompts for Cosmos Predict and curates high-quality synthetic data from video using text-based controls. Together, they power workflows like NVIDIA Isaac GR00T Dreams to produce physically accurate motion data at scale.
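To illustrate the curation step, here is a hedged sketch of using Cosmos Reason as a critic that scores and filters synthetic clips before they enter a training set. The `reason_model.answer()` wrapper, prompt wording, and threshold are illustrative assumptions; the official curation scripts on GitHub define the actual criteria.

```python
def score_clip(reason_model, clip_path):
    """Ask the reasoning model to rate the physical plausibility of a clip from 0 to 10."""
    prompt = (
        "Rate from 0 to 10 how physically plausible the motion in this video is "
        "(object contacts, gravity, temporal continuity). Reply with only the number."
    )
    reply = reason_model.answer(clip_path, prompt)  # hypothetical inference wrapper
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0  # unparsable replies are treated as rejections

def curate(reason_model, clip_paths, threshold=7.0):
    """Keep only the clips the critic considers physically plausible."""
    return [clip for clip in clip_paths if score_clip(reason_model, clip) >= threshold]
```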
Integrated with NVIDIA Omniverse for high-fidelity simulation, Cosmos streamlines the entire loop—from data generation to deployment—accelerating robotics development beyond the limits of real-world data.
Get started
Download the model from Hugging Face to start experimenting with model checkpoints.
Access inference and post-training scripts on GitHub to customize for your own data.
Explore the Cosmos documentation for in-depth tutorials, implementation details, and practical use cases.
Watch the COMPUTEX keynote from NVIDIA founder and CEO Jensen Huang, as well as NVIDIA GTC Taipei 2025 sessions.
Tune into our upcoming OpenUSD Insiders livestream, Wednesday, May 28, at 11 am PDT for a recap of the Cosmos Reason release and other top physical AI announcements from NVIDIA GTC Taipei at COMPUTEX.
Stay up to date by subscribing to NVIDIA news and following NVIDIA Omniverse on Discord and YouTube.
- Visit our Omniverse developer page to get all the essentials you need to get started
- Access a collection of OpenUSD resources, including the new self-paced Learn OpenUSD training curriculum
- Connect with the Omniverse Developer Community
Get started with developer starter kits to quickly develop and enhance your own applications and services.