Building smarter robots and autonomous vehicles (AVs) starts with physical AI models that understand real-world dynamics. These models serve two critical roles: accelerating synthetic data generation (SDG) to help autonomous machines learn about real-world physics and interactions—including rare edge cases—and serving as base models that can be post-trained for specialized tasks or adapted to different output types.
Cosmos Predict-1 was built for this purpose, generating realistic, physics-aware future world states.
Now, the new Cosmos Predict-2 introduces major upgrades in speed, visual quality, and customization. In this post, you’ll learn about the model and how to post-train it for domain-specific use cases.
Cosmos Predict-2
Cosmos Predict-2 is a top-performing world foundation model with architectural refinements that improve speed and scalability and add resolution and framerate flexibility across use cases and hardware platforms. There are two model variants optimized for task complexity:
- Cosmos Predict-2 2B: Offers faster inference and lower memory usage compared to Predict-1, ideal for prototyping, low-latency applications, and edge deployments.
- Cosmos Predict-2 14B: Designed for high-fidelity world modeling tasks that demand complex scene understanding, extended temporal coherence, and prompt precision.
Developers can start by generating a preview using the text-to-image model, which then conditions the Video2World model to produce consistent, physically accurate world states as video. This accelerates iterative prompting and scenario design.
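The exact entry points for this two-stage flow live in the nvidia-cosmos/cosmos-predict2 repository; the sketch below only illustrates the structure. The functions `generate_preview_image` and `generate_world_video` are placeholders standing in for the repo's text-to-image and Video2World inference steps, not its actual API.

```python
# Illustrative two-stage flow: text -> preview image -> Video2World video.
# The function names below are placeholders; the real entry points come from
# the nvidia-cosmos/cosmos-predict2 repository (see its README).
from pathlib import Path

def generate_preview_image(prompt: str, out_path: Path) -> Path:
    """Placeholder for the Cosmos Predict-2 text-to-image step."""
    out_path.write_bytes(b"")  # stand-in for the rendered preview frame
    return out_path

def generate_world_video(prompt: str, conditioning_image: Path, out_path: Path,
                         resolution: str = "704p", fps: int = 16) -> Path:
    """Placeholder for the Video2World step, conditioned on the preview image."""
    out_path.write_bytes(b"")  # stand-in for the generated video
    return out_path

prompt = "A robot arm picks a ripe apple from a crate in a dim warehouse."
preview = generate_preview_image(prompt, Path("preview.png"))
video = generate_world_video(prompt, preview, Path("world_state.mp4"))
print(f"Preview: {preview}, video: {video}")
```

Reviewing the preview image before committing to video generation keeps the prompt-iteration loop short.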


Cosmos Predict-2 will soon provide multiple resolution and framerate options, as detailed below:
- Resolution: Supports 704p (~720p) and 480p. The 480p option offers faster throughput when high resolution isn't needed.
- Framerate: 10 fps and 16 fps are available now, with 24 fps support coming soon; these options suit 10 Hz simulation and AV training pipelines (see the settings sketch after this list).
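One way to make the resolution/framerate trade-off explicit in your own pipeline is a small settings object like the sketch below. The field names and values are assumptions for illustration, not the repo's actual configuration schema.

```python
# Hypothetical generation settings illustrating the resolution/framerate trade-off.
# Field names are placeholders, not the cosmos-predict2 config schema.
from dataclasses import dataclass

@dataclass
class GenerationSettings:
    resolution: str  # "480p" for throughput, "704p" for fidelity
    fps: int         # 10 or 16 today; 24 planned

# 480p @ 10 fps: fast previews that match a 10 Hz AV simulation step.
fast = GenerationSettings(resolution="480p", fps=10)
# 704p @ 16 fps: higher fidelity for final synthetic clips.
hifi = GenerationSettings(resolution="704p", fps=16)
print(fast, hifi)
```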
Inference and performance optimizations
Cosmos Predict-2 is designed for fast, flexible inference across a range of hardware and use cases.
For quick prototyping or low-latency applications, the 2B model variant delivers fast performance—generating image previews in under 5 seconds on NVIDIA GPUs like NVIDIA GB200 NVL72, NVIDIA DGX B200, and NVIDIA RTX PRO 6000. For more complex tasks requiring higher fidelity and temporal coherence, the 14B variant enhances quality while still achieving fast turnaround on GB200 and B200 systems.
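Latency depends heavily on hardware, resolution, and settings, so it is worth measuring on your own system. The sketch below times a single preview generation; `generate_preview_image` is a placeholder for the repo's text-to-image entry point, not its actual API.

```python
# Minimal timing harness for checking preview-image latency on your own GPU.
# generate_preview_image is a stand-in for the repo's text-to-image call.
import time

def generate_preview_image(prompt: str) -> bytes:
    """Placeholder for the Cosmos Predict-2 2B text-to-image step."""
    return b""

start = time.perf_counter()
generate_preview_image("A forklift moves pallets across a wet warehouse floor.")
elapsed = time.perf_counter() - start
print(f"Preview generated in {elapsed:.2f} s")
```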
For full setup instructions, visit the nvidia-cosmos/cosmos-predict2 GitHub repository.
Post-training Cosmos models for downstream foundation models
Developers can post-train Cosmos Predict-2 to specialize in applications like robotics, AVs, and industrial automation. This section walks through post-training for these domains, using the GR00T-Dreams blueprint as a case study, and covers evaluation methods to ensure optimal performance.
Follow the steps in this section to post-train the model and generate custom synthetic training data for the example task of picking an apple.
| Domain | Post-training capability | Example application |
| --- | --- | --- |
| Robotics | Instruction control, object manipulation | Adapting a robot arm to pick apples with varying stem strength |
| AVs | Multiview generation, edge-case simulation | Simulating rainy highway driving with lidar/camera sync |
| Industrial | Action-conditioned workflows | Predictive maintenance for conveyor belt robots |
| Vision | Camera pose conditioning | 3D-consistent video from single images |

Table 1. Cosmos Predict-2 post-training use cases and example applications in robotics, autonomous vehicles, industrial automation, and vision
Step 1: Prepare the data
Collect roughly 100 hours of teleoperation video and use the data curator to segment it into clips. Ensure the data reflects your setup (robot model, lighting, and object types) and that each clip is paired with a text description.
For captions, developers can use any visual language model, including Cosmos Reason (see Step 4 for details).
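One way to organize the resulting video-text pairs is a simple JSONL manifest, as sketched below. The manifest layout, metadata fields, and the `caption_clip` helper are assumptions for illustration; in practice the caption would come from your chosen VLM (such as Cosmos Reason) and the data format should follow the post-training scripts in the repo.

```python
# Sketch of building video-text pairs for post-training: one JSON record per clip.
# caption_clip is a placeholder for a VLM captioning call; the manifest layout
# is an assumption, not a required format.
import json
from pathlib import Path

def caption_clip(clip_path: Path) -> str:
    """Stand-in for a VLM captioning call (e.g., Cosmos Reason)."""
    return f"A robot arm manipulates objects ({clip_path.stem})."

clips = sorted(Path("curated_clips").glob("*.mp4"))
with open("train_manifest.jsonl", "w") as f:
    for clip in clips:
        record = {
            "video": str(clip),
            "caption": caption_clip(clip),    # text half of the pairing
            "robot": "example_arm",           # metadata reflecting your setup
            "lighting": "warehouse_low_light",
        }
        f.write(json.dumps(record) + "\n")
print(f"Wrote {len(clips)} video-text pairs")
```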
Step 2: Post-train the model
Use the curated video-text pairs to post-train Cosmos Predict-2 on your specific task and environment, following the post-training scripts in the nvidia-cosmos/cosmos-predict2 GitHub repo.
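The repo's scripts drive the actual training run; the sketch below only shows how you might collect the run parameters in one place before launching. Every name and value here (checkpoint identifier, hyperparameters, the `launch_post_training` helper) is an illustrative assumption, not the repo's real interface.

```python
# Hypothetical post-training configuration; real hyperparameters and entry
# points come from the post-training scripts in nvidia-cosmos/cosmos-predict2.
post_train_config = {
    "base_checkpoint": "cosmos-predict2-2b",   # or the 14B variant for higher fidelity
    "train_manifest": "train_manifest.jsonl",  # curated video-text pairs from Step 1
    "resolution": "480p",                      # lower resolution speeds up experimentation
    "fps": 16,
    "learning_rate": 1e-5,                     # illustrative values only
    "max_steps": 5000,
}

def launch_post_training(config: dict) -> None:
    """Placeholder for invoking the repo's post-training script with this config."""
    print("Launching post-training with:", config)

launch_post_training(post_train_config)
```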
Step 3: Generate synthetic scenarios
Prompt the model with text such as “Pick up the bruised apple under low light.” You can also prompt the model with an initial image to create domain-specific “dream” videos.
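In practice you would sweep over many such prompts, optionally pairing each with a seed image, to cover edge cases. The sketch below illustrates that loop; `generate_world_video` is a placeholder for inference with the post-trained model, not the repo's actual API.

```python
# Sketch of scenario generation with the post-trained model: vary prompts and
# optional seed images to produce domain-specific "dream" videos.
from pathlib import Path

def generate_world_video(prompt: str, seed_image: Path | None, out_path: Path) -> Path:
    """Stand-in for inference with the post-trained Cosmos Predict-2 model."""
    out_path.write_bytes(b"")  # placeholder for the generated clip
    return out_path

scenarios = [
    ("Pick up the bruised apple under low light.", None),
    ("Pick the apple with a thick stem without dropping it.", Path("bin_view.png")),
]
for i, (prompt, seed) in enumerate(scenarios):
    generate_world_video(prompt, seed, Path(f"dream_{i:03d}.mp4"))
```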
Step 4: Validate for physical accuracy
Cosmos Reason is an open, spatiotemporally aware reasoning model that interprets visual input alongside a text prompt, performs chain-of-thought reasoning, and generates optimal text decisions or captions. It helps evaluate generated data; in this example, it critiques the generated data, or "dreams," with questions such as (see the validation sketch after this list):
- Does the robot grasp the apple properly?
- Are joint angles within limits?
- Are there any object collisions or motion artifacts?
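A validation pass can run these checklist questions against every generated clip and keep only the ones that pass. The sketch below shows that filtering loop; `critique_video` is a placeholder for the actual Cosmos Reason call, and the boolean pass/fail schema is an assumption for illustration.

```python
# Sketch of a validation pass: ask a reasoning VLM (e.g., Cosmos Reason) the
# checklist questions above and keep only clips that pass.
from pathlib import Path

CHECKS = [
    "Does the robot grasp the apple properly?",
    "Are joint angles within limits?",
    "Are there any object collisions or motion artifacts?",
]

def critique_video(video: Path, question: str) -> bool:
    """Stand-in for a Cosmos Reason critique; returns True if the clip passes."""
    return True

accepted = []
for video in sorted(Path(".").glob("dream_*.mp4")):
    if all(critique_video(video, q) for q in CHECKS):
        accepted.append(video)      # feed back into downstream training
    else:
        print(f"Rejected {video}")  # regenerate or drop
print(f"{len(accepted)} clips passed validation")
```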

The post-train, generate, validate loop enables iterative improvement of synthetic data quality and downstream model performance.
Developers can also use Cosmos Transfer to expand their datasets by adding variety—like different environments or lighting conditions—based on structured inputs or simulations created in NVIDIA Omniverse. Find more information on using Cosmos Transfer for synthetic dataset augmentation.
How NVIDIA Research uses Cosmos Predict
NVIDIA Research is leveraging Cosmos Predict-1 for advanced video and 3D applications. The DiffusionRenderer method, integrated into Cosmos, combines high-quality synthetic data and real-world video to improve lighting realism, geometry, and material accuracy in long video sequences, and provides a general-purpose framework for video lighting control, randomization, and editing.

Difix3D+, a one-step diffusion model, enhances 3D reconstruction and novel view synthesis across NeRF and 3DGS pipelines. Integrated with Cosmos Predict-1, it improves temporal consistency, reduces flicker, and sharpens details—addressing key challenges in high-framerate rendering.
NVIDIA Research also built a synthetic data generation pipeline for AV development, referred to as Cosmos-Drive-Dreams, based on Cosmos Transfer and Cosmos Predict-1. The two models generate diverse driving videos conditioned on HD maps, LiDAR depth, and text prompts, enabling realistic scenes under diverse conditions and extending single-view footage into multi-view consistent videos.
Get started with Cosmos Predict-2
Cosmos Predict-2 marks a significant leap forward in generating physics-aware, high-fidelity synthetic data for robotics, vision, and autonomous systems. With faster inference, scalable performance, and flexible resolution and framerate options, it’s built to adapt across diverse domains and hardware platforms.
Paired with other world foundation models in the Cosmos family, including Cosmos Reason for physical AI reasoning and Cosmos Transfer for augmentation, it enables a complete loop: post-train, generate, validate, and refine. This accelerates the development of domain-specific models and smarter, safer physical AI systems.
Experiment with Cosmos Predict-2 on GitHub. It includes inference and post-training scripts for running open model checkpoints from Hugging Face. Visit the nvidia-cosmos GitHub repo for more information.
Follow NVIDIA on Hugging Face to get notified about new open model releases.
Watch the NVIDIA GTC Paris keynote from NVIDIA founder and CEO Jensen Huang at VivaTech 2025, and explore GTC Paris sessions.
NVIDIA Cosmos and NVIDIA Omniverse are advancing physical AI. Stay up to date by subscribing to NVIDIA news, and connect with the Omniverse Developer Community for livestreams on leading physical AI advancements.
Get started with Omniverse developer starter kits to quickly develop and enhance your own applications and services.