As physical AI systems advance, the demand for richly labeled datasets is accelerating beyond what we can manually capture in the real world. World foundation models (WFMs), which are generative AI models trained to simulate, predict, and reason about future world states based on the dynamics of real-world environments, can help overcome this data challenge.
NVIDIA Cosmos is a platform for developing WFMs for physical AI, such as robotics and autonomous vehicles. Cosmos WFMs include three model types that can be post-trained for specific applications: Cosmos Predict, Cosmos Transfer, and Cosmos Reason.
Cosmos Predict generates future world states as videos from image, video, and text prompts. Cosmos Transfer enables developers to perform photoreal style transfers from 2D inputs and text prompts. Cosmos Reason is a reasoning vision language model (VLM) that can curate and annotate the generated data, and it can also be post-trained to function as a robot vision-language-action (VLA) model. This data is used to train physical AI and industrial vision AI to develop spatial awareness, plan motion trajectories, and perform complex tasks.
This edition of NVIDIA Robotics Research and Development Digest (R2D2) explores Cosmos WFMs and workflows from NVIDIA Research. We dive into how they play an important role in synthetic data generation (SDG) and data curation for physical AI applications:
- Cosmos Predict
- Single2MultiView for autonomous vehicles
- Cosmos-Drive-Dreams
- NVIDIA Isaac GR00T-Dreams
- DiffusionRenderer
- Accelerated video generation
- Cosmos Transfer
- Cosmos Transfer for Autonomous Vehicles
- Edge model distillation
- Cosmos Reason
Cosmos Predict: future simulation models from NVIDIA Research for robotics
Cosmos Predict models can be post-trained for physical AI applications, like robotics and autonomous vehicles. Cosmos Predict takes input in the form of text, images, or videos and generates future frames that are coherent and physically accurate. This accelerates SDG for post-training AI models to perform complex physical tasks. Let’s see some examples of post-training.
Cosmos Predict post-training applications
- Single2MultiView for autonomous vehicles is a post-trained version of the Cosmos Predict model. It generates multiple consistent camera perspectives from a single front-view autonomous driving video. The result is synchronized multi-view camera footage for autonomous vehicle (AV) development.
Inference example with a single-view input video:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/video2world_view_extend_multiview.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Cosmos-Predict1-7B-Video2World-Sample-AV-Single2MultiView/t2w_model.pt \
--view_condition_video assets/diffusion/sv2mv_input_view.mp4 \
--num_input_frames 1 \
--condition_location "first_cam" \
--prompt "${PROMPT}" \
--prompt_left "${PROMPT_LEFT}" \
--prompt_right "${PROMPT_RIGHT}" \
--prompt_back "${PROMPT_BACK}" \
--prompt_back_left "${PROMPT_BACK_LEFT}" \
--prompt_back_right "${PROMPT_BACK_RIGHT}" \
--video_save_name diffusion-single2multiview-text2world
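The per-view prompt variables referenced above must be exported before running the command. Here is a minimal sketch with illustrative prompt text; substitute your own scene descriptions for real runs:
# Illustrative per-camera prompts; the wording is an example, not part of the model release
export PROMPT="The video is captured from a camera mounted on a car. The camera is facing forward."
export PROMPT_LEFT="The video is captured from a camera mounted on a car. The camera is facing to the left."
export PROMPT_RIGHT="The video is captured from a camera mounted on a car. The camera is facing to the right."
export PROMPT_BACK="The video is captured from a camera mounted on a car. The camera is facing backwards."
export PROMPT_BACK_LEFT="The video is captured from a camera mounted on a car. The camera is facing the rear left side."
export PROMPT_BACK_RIGHT="The video is captured from a camera mounted on a car. The camera is facing the rear right side."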
- Cosmos-Drive-Dreams is a workflow for generating driving data under challenging conditions for AVs. The Cosmos Drive models are post-trained for the driving domain to produce multi-view, high-fidelity, spatiotemporally consistent driving data. The generated multi-view data is then amplified with a post-trained Cosmos Transfer model to improve generalization in low-visibility conditions such as fog and rain, for tasks like 3D lane detection, 3D object detection, and driving policy learning.

- Isaac GR00T-Dreams, based on DreamGen research, is a blueprint for large-scale synthetic trajectory data generation, a real-to-real data workflow for humanoid robot training. GR00T-Dreams uses Cosmos Predict to create diverse, photorealistic videos of robots performing tasks from image and text prompts, then extracts action data, called neural trajectories, for training robot policies. This helps robots learn new skills and adapt to different environments with minimal human demonstrations.

Example of post-training GR00T on GR1 data:
EXP=predict2_video2world_training_2b_groot_gr1_480
torchrun --nproc_per_node=8 --master_port=12341 -m scripts.train --config=cosmos_predict2/configs/base/config.py -- experiment=${EXP}
- DiffusionRenderer is a neural rendering framework that enables photorealistic relighting, material editing, and object insertion from a single video input without requiring explicit 3D geometry or lighting data. It leverages video diffusion models to estimate scene properties, then generates realistic new images. Using Cosmos Predict’s diffusion model improves the quality of DiffusionRenderer’s lighting capability, enabling more accurate and temporally consistent results. This is helpful for physical AI simulation, since it makes scene editing highly efficient and controllable.


Here is a sample command for video relighting. It applies novel lighting to the G-buffer frames produced by the inverse renderer and generates relit video frames:
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/inference_forward_renderer.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Diffusion_Renderer_Forward_Cosmos_7B \
--dataset_path=asset/example_results/video_delighting/gbuffer_frames \
--num_video_frames 57 \
--envlight_ind 0 1 2 3 \
--use_custom_envmap=True \
--video_save_folder=asset/example_results/video_relighting/
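The command above consumes G-buffer frames produced by DiffusionRenderer's inverse (de-lighting) pass. As a rough sketch of that preceding step, assuming the inverse-renderer entry point, checkpoint name, and paths mirror the forward-renderer pattern shown above (verify the exact script name and flags in the repository):
# Hypothetical inverse-rendering step; script, checkpoint, and paths are assumptions based on the forward-renderer pattern
CUDA_HOME=$CONDA_PREFIX PYTHONPATH=$(pwd) python cosmos_predict1/diffusion/inference/inference_inverse_renderer.py \
--checkpoint_dir checkpoints \
--diffusion_transformer_dir Diffusion_Renderer_Inverse_Cosmos_7B \
--dataset_path=asset/example_results/video_delighting/input_frames \
--num_video_frames 57 \
--video_save_folder=asset/example_results/video_delighting/gbuffer_frames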
- Accelerated video generation: Cosmos Predict2 now uses Neighborhood Attention (NATTEN), improving its focus on relevant video regions. This attention mechanism is layer-adaptive and dynamically balances global and local context for optimal speed and quality. Implementing sparse attention within model layers minimizes unnecessary computation during video generation. NATTEN's efficiency is further boosted by backend code optimized for NVIDIA hardware. As a result, video inference is 2 to 2.5 times faster on advanced GPUs such as the NVIDIA H100 and NVIDIA B200.
Cosmos Transfer: controlled synthetic data generation for robotics and AVs
Cosmos Transfer models generate world simulations based on multiple control inputs like segmentation maps, depth, edge maps, lidar scans, keypoints, and HD maps. These different modalities enable users to control scene composition while generating diverse visual features via user text prompts. The aim is to augment synthetic datasets with large visual diversity and improve overall sim-to-real transfer in robotics and autonomous driving applications.
Cosmos Transfer applications
Let’s now take a look at some workflows that use Cosmos Transfer.
- Cosmos Transfer for AVs generates new conditions, such as weather, lighting, and terrain, from a single driving scenario using different text prompts. It uses multimodal controls as inputs to amplify data variation, as in the Cosmos-Drive-Dreams use case. This is helpful when creating AV training datasets because it can scale up data generation from a single video, based on user text prompts.

Example command using Cosmos Transfer to generate an RGB video from a text prompt and an edge-control condition video (the same workflow applies to other control inputs, such as HD maps):
export CUDA_VISIBLE_DEVICES="${CUDA_VISIBLE_DEVICES:=0}"
export CHECKPOINT_DIR="${CHECKPOINT_DIR:=./checkpoints}"
export NUM_GPU="${NUM_GPU:=1}"
PYTHONPATH=$(pwd) torchrun --nproc_per_node=$NUM_GPU --nnodes=1 --node_rank=0 cosmos_transfer1/diffusion/inference/transfer.py \
--checkpoint_dir $CHECKPOINT_DIR \
--video_save_folder outputs/example1_single_control_edge_distilled \
--controlnet_specs assets/inference_cosmos_transfer1_single_control_edge.json \
--offload_text_encoder_model \
--offload_guardrail_models \
--num_gpus $NUM_GPU \
--use_distilled
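The --controlnet_specs argument points to a JSON file that pairs the text prompt and input video with one or more control modalities and their weights. Below is a minimal sketch of such a spec, written out with a heredoc; the field names follow the repository's example assets and should be verified there, and the prompt and video path are placeholders:
# Hypothetical minimal controlnet spec for single edge control; verify field names against the repo's example assets
cat > my_single_control_edge.json << 'EOF'
{
    "prompt": "A first-person driving video on a rainy city street at dusk, with reflections on wet asphalt.",
    "input_video_path": "assets/example1_input_video.mp4",
    "edge": {
        "control_weight": 1.0
    }
}
EOF
The resulting file can then be passed to the command above via --controlnet_specs my_single_control_edge.json.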
- Edge model distillation is an improved version of Cosmos Transfer. The original Cosmos Transfer model required 70 passes to generate a video, incurring a large computational cost. Distilling the edge modality has produced a smaller student model that performs the same task in a single step while closely matching the original model's quality. Other control modalities (such as depth, segmentation, HDMap, and lidar) can be distilled in the same way for similar gains. Reducing the computational work for video generation enables faster and more affordable deployment. The distilled variant is enabled through the --use_distilled flag, as shown in the example command above.
Cosmos Reason: long-horizon reasoning for physical AI
As a world foundation model focused on reasoning for physical AI, Cosmos Reason understands physical common sense and generates appropriate embodied decisions through long chain-of-thought reasoning. This is useful for curating high-quality training data by using Cosmos Reason as a critic during SDG, as it understands action sequences and real-world constraints. The model has been trained in two stages: supervised fine-tuning (SFT) and reinforcement learning.

SFT can improve the Reason model's performance on specific tasks. For example, training with the RoboVQA dataset can improve performance on robotics visual question answering use cases. Here is an example command to launch SFT training:
cosmos-rl --config configs/cosmos-reason1-7b-fsdp2-sft.toml ./tools/dataset/cosmos_sft.py
Getting started
Check out the following resources to learn more:
- Cosmos Predict2: Project Website, GitHub, Hugging Face, Paper
- Cosmos Transfer1: Project Website, GitHub, Hugging Face, Paper
- Cosmos Reason1: Project Website, GitHub, Hugging Face, Paper
- Isaac GR00T-Dreams: GitHub, Paper
- Cosmos-Drive-Dreams: Project Website, GitHub, Paper, Dataset
- DiffusionRenderer: Project Website, GitHub, Paper, Hugging Face
Experience the next era of world foundation models with NVIDIA at SIGGRAPH 2025:
- A special address on Monday, Aug. 11, with NVIDIA AI research leaders Sanja Fidler, Aaron Lefohn, and Ming-Yu Liu, who'll chart the next frontier in computer graphics and physical AI.
- Hands-on: Learn to use NVIDIA Cosmos, a platform of generative world foundation models, to generate data and scenarios for training physical AI.
This post is part of our NVIDIA Robotics Research and Development Digest (R2D2) to give developers deeper insight into the latest breakthroughs from NVIDIA Research across physical AI and robotics applications.
Stay up to date by subscribing to the newsletter and following NVIDIA Robotics on YouTube, Discord, and developer forums. To start your robotics journey, enroll in free NVIDIA Robotics Fundamentals courses.
Acknowledgments
For their contributions to the research mentioned in this post, thanks to Niket Agarwal, Arslan Ali, Mousavian Arsalan, Alisson Azzolini, Yogesh Balaji, Hannah Brandon, Tiffany Cai, Tianshi Cao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yin Cui, Ying Cui, Yifan Ding, Daniel Dworakowski, Francesco Ferroni, Sanja Fidler, Dieter Fox, Ruiyuan Gao, Songwei Ge, Rama Govindaraju, Siddharth Gururani, Zekun Hao, Ali Hassani, Ethan He, Fengyuan Hu, Shengyu Huang, Spencer Huang, Michael Isaev, Pooya Jannaty, Brendan Johnson, Alexander Keller, Rizwan Khan, Seung Wook Kim, Gergely Klár, Grace Lam, Shiyi Lan, Elena Lantz, Tobias Lasser, Nayeon Lee, Anqi Li, Zhaoshuo Li, Chen-Hsuan Lin, Tsung-Yi Lin, Zhi-Hao Lin, Zongyu Lin, Ming-Yu Liu, Xian Liu, Xiangyu Lu, Yifan Lu, Alice Luo, Ajay Mandlekar, Hanzi Mao, Andrew Mathau, Seungjun Nah, Avnish Narayan, Yun Ni, Sriprasad Niverty, Despoina Paschalidou, Tobias Pfaff, Wei Ping, Morteza Ramezanali, Fabio Ramos, Fitsum Reda, Zheng Ruiyuan, Amirmojtaba Sabour, Ed Schmerling, Tianchang Shen, Stella Shi, Misha Smelyanskiy, Shuran Song, Bartosz Stefaniak, Steven Sun, Xinglong Sun, Shitao Tang, Przemek Tredak, Wei-Cheng Tseng, Nandita Vijaykumar, Andrew Z. Wang, Guanzhi Wang, Ting-Chun Wang, Zian Wang, Fangyin Wei, Xinyue Wei, Wen Xiao, Stella Xu, Yao Xu, Yinzhen Xu, Dinghao Yang, Xiaodong Yang, Zhuolin Yang, Seonghyeon Ye, Yuchong Ye, Xiaohui Zeng, Yuxuan Zhang, Zhe Zhang, Ruijie Zheng, Yuke Zhu, and Artur Zolkowski.