Building smarter robots and autonomous vehicles (AVs) starts with physical AI models that understand real-world dynamics. These models serve two critical roles: accelerating synthetic data generation (SDG) to help autonomous machines learn about real-world physics and interactions—including rare edge cases—and serving as base models that can be post-trained for specialized tasks or adapted to different output types.
Cosmos Predict-1 was built for this purpose, generating realistic, physics-aware future world states.
Now, the new Cosmos Predict-2 introduces major upgrades in speed, visual quality, and customization. In this post, you’ll learn about the model and how to post-train it for domain-specific use cases.
Cosmos Predict-2
Cosmos Predict-2 is a top-performing world foundation model with architectural refinements that improve speed and scalability and add resolution and framerate flexibility across use cases and hardware platforms. There are two model variants optimized for task complexity:
- Cosmos Predict-2 2B: Offers faster inference and lower memory usage compared to Predict-1, ideal for prototyping, low-latency applications, and edge deployments.
- Cosmos Predict-2 14B: Designed for high-fidelity world modeling tasks that demand complex scene understanding, extended temporal coherence, and prompt precision.
Developers can start by generating a preview using the text-to-image model, which then conditions the Video2World model to produce consistent, physically accurate world states as video. This accelerates iterative prompting and scenario design.
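The exact entry points for this two-stage flow live in the nvidia-cosmos/cosmos-predict2 repository; the sketch below only illustrates the structure. The functions `generate_preview_image` and `generate_world_video` are placeholders standing in for the repo's text-to-image and Video2World inference steps, not its actual API.

```python
# Illustrative two-stage flow: text -> preview image -> Video2World video.
# The function names below are placeholders; the real entry points come from
# the nvidia-cosmos/cosmos-predict2 repository (see its README).
from pathlib import Path

def generate_preview_image(prompt: str, out_path: Path) -> Path:
    """Placeholder for the Cosmos Predict-2 text-to-image step."""
    out_path.write_bytes(b"")  # stand-in for the rendered preview frame
    return out_path

def generate_world_video(prompt: str, conditioning_image: Path, out_path: Path,
                         resolution: str = "704p", fps: int = 16) -> Path:
    """Placeholder for the Video2World step, conditioned on the preview image."""
    out_path.write_bytes(b"")  # stand-in for the generated video
    return out_path

prompt = "A robot arm picks a ripe apple from a crate in a dim warehouse."
preview = generate_preview_image(prompt, Path("preview.png"))
video = generate_world_video(prompt, preview, Path("world_state.mp4"))
print(f"Preview: {preview}, video: {video}")
```

Reviewing the preview image before committing to video generation keeps the prompt-iteration loop short.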


Cosmos Predict-2 will soon provide multiple resolution and framerate options, as detailed below:
- Resolution: Supports 704p (~720p) and 480p. The 480p option offers faster throughput when high resolution isn't needed.
- Framerate: 10 fps and 16 fps are available now, with 24 fps support coming soon; these options suit 10 Hz simulation and AV training pipelines (see the settings sketch after this list).
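One way to make the resolution/framerate trade-off explicit in your own pipeline is a small settings object like the sketch below. The field names and values are assumptions for illustration, not the repo's actual configuration schema.

```python
# Hypothetical generation settings illustrating the resolution/framerate trade-off.
# Field names are placeholders, not the cosmos-predict2 config schema.
from dataclasses import dataclass

@dataclass
class GenerationSettings:
    resolution: str  # "480p" for throughput, "704p" for fidelity
    fps: int         # 10 or 16 today; 24 planned

# 480p @ 10 fps: fast previews that match a 10 Hz AV simulation step.
fast = GenerationSettings(resolution="480p", fps=10)
# 704p @ 16 fps: higher fidelity for final synthetic clips.
hifi = GenerationSettings(resolution="704p", fps=16)
print(fast, hifi)
```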
Inference and performance optimizations
Cosmos Predict-2 is designed for fast, flexible inference across a range of hardware and use cases.
For quick prototyping or low-latency applications, the 2B model variant delivers fast performance—generating image previews in under 5 seconds on NVIDIA GPUs like NVIDIA GB200 NVL72, NVIDIA DGX B200, and NVIDIA RTX PRO 6000. For more complex tasks requiring higher fidelity and temporal coherence, the 14B variant enhances quality while still achieving fast turnaround on GB200 and B200 systems.
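Latency depends heavily on hardware, resolution, and settings, so it is worth measuring on your own system. The sketch below times a single preview generation; `generate_preview_image` is a placeholder for the repo's text-to-image entry point, not its actual API.

```python
# Minimal timing harness for checking preview-image latency on your own GPU.
# generate_preview_image is a stand-in for the repo's text-to-image call.
import time

def generate_preview_image(prompt: str) -> bytes:
    """Placeholder for the Cosmos Predict-2 2B text-to-image step."""
    return b""

start = time.perf_counter()
generate_preview_image("A forklift moves pallets across a wet warehouse floor.")
elapsed = time.perf_counter() - start
print(f"Preview generated in {elapsed:.2f} s")
```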
For full setup instructions, visit the nvidia-cosmos/cosmos-predict2 GitHub repository.
Post-training Cosmos models for downstream foundation models
Developers can post-train Cosmos Predict-2 to specialize in applications like robotics, AVs, and industrial automation. This section walks through post-training for these domains, using the GR00T-Dreams blueprint as a case study, and covers evaluation methods to ensure optimal performance.
Follow the steps in this section to post-train the model and generate custom synthetic training data for the example task of picking an apple.
| Domain | Post-training capability | Example application |
| --- | --- | --- |
| Robotics | Instruction control, object manipulation | Adapting a robot arm to pick apples with varying stem strength |
| AVs | Multiview generation, edge-case simulation | Simulating rainy highway driving with lidar/camera sync |
| Industrial | Action-conditioned workflows | Predictive maintenance for conveyor belt robots |
| Vision | Camera pose conditioning | 3D-consistent video from single images |

Table 1. Cosmos Predict-2 post-training use cases and example applications in robotics, autonomous vehicles, industrial automation, and vision
Step 1: Prepare the data
Collect roughly 100 hours of teleoperation video and use the data curator to segment it into clips. Ensure the data reflects your setup (robot model, lighting, and object types) and that each clip is paired with a text description.
For captions, developers can use any visual language model, including Cosmos Reason (see Step 4 for details).
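One way to organize the resulting video-text pairs is a simple JSONL manifest, as sketched below. The manifest layout, metadata fields, and the `caption_clip` helper are assumptions for illustration; in practice the caption would come from your chosen VLM (such as Cosmos Reason) and the data format should follow the post-training scripts in the repo.

```python
# Sketch of building video-text pairs for post-training: one JSON record per clip.
# caption_clip is a placeholder for a VLM captioning call; the manifest layout
# is an assumption, not a required format.
import json
from pathlib import Path

def caption_clip(clip_path: Path) -> str:
    """Stand-in for a VLM captioning call (e.g., Cosmos Reason)."""
    return f"A robot arm manipulates objects ({clip_path.stem})."

clips = sorted(Path("curated_clips").glob("*.mp4"))
with open("train_manifest.jsonl", "w") as f:
    for clip in clips:
        record = {
            "video": str(clip),
            "caption": caption_clip(clip),    # text half of the pairing
            "robot": "example_arm",           # metadata reflecting your setup
            "lighting": "warehouse_low_light",
        }
        f.write(json.dumps(record) + "\n")
print(f"Wrote {len(clips)} video-text pairs")
```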
Step 2: Post-train the model
Use the curated video-text pairs to post-train Cosmos Predict-2 on your specific task and environment, following the post-training scripts in the nvidia-cosmos/cosmos-predict2 GitHub repo.
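The repo's scripts drive the actual training run; the sketch below only shows how you might collect the run parameters in one place before launching. Every name and value here (checkpoint identifier, hyperparameters, the `launch_post_training` helper) is an illustrative assumption, not the repo's real interface.

```python
# Hypothetical post-training configuration; real hyperparameters and entry
# points come from the post-training scripts in nvidia-cosmos/cosmos-predict2.
post_train_config = {
    "base_checkpoint": "cosmos-predict2-2b",   # or the 14B variant for higher fidelity
    "train_manifest": "train_manifest.jsonl",  # curated video-text pairs from Step 1
    "resolution": "480p",                      # lower resolution speeds up experimentation
    "fps": 16,
    "learning_rate": 1e-5,                     # illustrative values only
    "max_steps": 5000,
}

def launch_post_training(config: dict) -> None:
    """Placeholder for invoking the repo's post-training script with this config."""
    print("Launching post-training with:", config)

launch_post_training(post_train_config)
```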
Step 3: Generate synthetic scenarios
Prompt the model with text such as “Pick up the bruised apple under low light.” You can also prompt the model with an initial image to create domain-specific “dream” videos.
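In practice you would sweep over many such prompts, optionally pairing each with a seed image, to cover edge cases. The sketch below illustrates that loop; `generate_world_video` is a placeholder for inference with the post-trained model, not the repo's actual API.

```python
# Sketch of scenario generation with the post-trained model: vary prompts and
# optional seed images to produce domain-specific "dream" videos.
from pathlib import Path

def generate_world_video(prompt: str, seed_image: Path | None, out_path: Path) -> Path:
    """Stand-in for inference with the post-trained Cosmos Predict-2 model."""
    out_path.write_bytes(b"")  # placeholder for the generated clip
    return out_path

scenarios = [
    ("Pick up the bruised apple under low light.", None),
    ("Pick the apple with a thick stem without dropping it.", Path("bin_view.png")),
]
for i, (prompt, seed) in enumerate(scenarios):
    generate_world_video(prompt, seed, Path(f"dream_{i:03d}.mp4"))
```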
Step 4: Validate for physical accuracy
Cosmos Reason is an open, spatiotemporally aware reasoning model that interprets visual input alongside a text prompt, performs chain-of-thought reasoning, and generates optimal text decisions or captions. It helps evaluate generated data; in this example, it critiques the generated data, or "dreams," with questions such as (see the validation sketch after this list):
- Does the robot grasp the apple properly?
- Are joint angles within limits?
- Are there any object collisions or motion artifacts?
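A validation pass can run these checklist questions against every generated clip and keep only the ones that pass. The sketch below shows that filtering loop; `critique_video` is a placeholder for the actual Cosmos Reason call, and the boolean pass/fail schema is an assumption for illustration.

```python
# Sketch of a validation pass: ask a reasoning VLM (e.g., Cosmos Reason) the
# checklist questions above and keep only clips that pass.
from pathlib import Path

CHECKS = [
    "Does the robot grasp the apple properly?",
    "Are joint angles within limits?",
    "Are there any object collisions or motion artifacts?",
]

def critique_video(video: Path, question: str) -> bool:
    """Stand-in for a Cosmos Reason critique; returns True if the clip passes."""
    return True

accepted = []
for video in sorted(Path(".").glob("dream_*.mp4")):
    if all(critique_video(video, q) for q in CHECKS):
        accepted.append(video)      # feed back into downstream training
    else:
        print(f"Rejected {video}")  # regenerate or drop
print(f"{len(accepted)} clips passed validation")
```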

The post-train, generate, validate loop enables iterative improvement of synthetic data quality and downstream model performance.
Developers can also use Cosmos Transfer to expand their datasets by adding variety—like different environments or lighting conditions—based on structured inputs or simulations created in NVIDIA Omniverse. Find more information on using Cosmos Transfer for synthetic dataset augmentation.
How NVIDIA Research uses Cosmos Predict
NVIDIA Research is leveraging Cosmos Predict-1 for advanced video and 3D applications. The DiffusionRenderer method, integrated into Cosmos, combines high-quality synthetic data and real-world video to improve lighting realism, geometry, and material accuracy in long video sequences, and provides a general-purpose framework for video lighting control, randomization, and editing.

Difix3D+, a one-step diffusion model, enhances 3D reconstruction and novel view synthesis across NeRF and 3DGS pipelines. Integrated with Cosmos Predict-1, it improves temporal consistency, reduces flicker, and sharpens details—addressing key challenges in high-framerate rendering.
NVIDIA Research also built a synthetic data generation pipeline for AV development, referred to as Cosmos-Drive-Dreams, based on Cosmos Transfer and Cosmos Predict-1. The two models generate diverse driving videos conditioned on HD maps, LiDAR depth, and text prompts, enabling realistic scenes under diverse conditions and extending single-view footage into multi-view consistent videos.
Get started with Cosmos Predict-2
Cosmos Predict-2 marks a significant leap forward in generating physics-aware, high-fidelity synthetic data for robotics, vision, and autonomous systems. With faster inference, scalable performance, and flexible resolution and framerate options, it’s built to adapt across diverse domains and hardware platforms.
Paired with other world foundation models in the Cosmos family, including Cosmos Reason for physical AI reasoning and Cosmos Transfer for augmentation, it enables a complete loop: post-train, generate, validate, and refine. This accelerates the development of domain-specific models and smarter, safer physical AI systems.
Experiment with Cosmos Predict-2 on GitHub. It includes inference and post-training scripts for running open model checkpoints from Hugging Face. Visit the nvidia-cosmos GitHub repo for more information.
Follow NVIDIA on Hugging Face to get notified about new open model releases.
Watch the NVIDIA GTC Paris keynote from NVIDIA founder and CEO Jensen Huang at VivaTech 2025, and explore GTC Paris sessions.
NVIDIA Cosmos and NVIDIA Omniverse are advancing physical AI. Stay up to date by subscribing to NVIDIA news, and connect with the Omniverse Developer Community for livestreams on leading physical AI advancements.
Get started with Omniverse developer starter kits to quickly develop and enhance your own applications and services.