The shift to end-to-end planning models for powering autonomous vehicles (AVs) is increasing the demand for high-quality, physically based sensor data. These models need a general understanding of multimodal data, including the relationships among sensor streams, vehicle trajectories, and driving actions, to support downstream training and validation tasks.
By adapting and post-training NVIDIA Cosmos world foundation models (WFMs)—Predict, Transfer, and Reason—to the AV domain, developers can create world models to accelerate end-to-end AV training. These models can be used for synthetic data generation (SDG), as shown in this post, as well as for closed-loop training and in-vehicle inference.
There are many ways to adapt Cosmos models to AV applications. In this post, we walk through several approaches to post-training; all of the models we discuss are currently available to developers.
Developing synthetic data generation pipelines on Cosmos
NVIDIA Research post-trained Cosmos WFMs on 20,000 hours of driving data to build a collection of models for AV development workflows. In a paper published at CVPR, researchers detailed how using data generated by Cosmos models improved performance in AV model training.
AV-specific models
Cosmos WFMs accelerate SDG for AV training, particularly through data augmentation using Cosmos-Transfer1-7B-Sample-AV and Cosmos-Transfer1-7B-Single2Multiview-Sample-AV. The Transfer model generates diverse driving videos conditioned on HD maps, lidar depth, and text prompts, enabling realistic scenes under different conditions. It uses structured inputs such as 3D cuboids, lane lines, road boundaries, and traffic elements to ensure precise, geometry-aware control. The multiview model then extends single-view videos into multi-view consistent videos. Cosmos Transfer can also be post-trained for multi-view sensor generation, a process developers can apply to post-train their own versions.
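To make the conditioning concrete, here is a minimal Python sketch of how the control signals for one driving clip might be assembled before invoking a Transfer-style model. The directory layout, file names, and the generate_transfer_video entry point are illustrative assumptions, not the actual Cosmos Transfer API.

```python
from pathlib import Path

def build_conditioning(sample_dir: Path) -> dict:
    """Collect the structured control signals for one driving clip (illustrative layout)."""
    return {
        # Rasterized HD map with lane lines, road boundaries, and traffic elements
        "hdmap_video": sample_dir / "hdmap.mp4",
        # Per-frame depth rendered from lidar sweeps
        "lidar_depth_video": sample_dir / "lidar_depth.mp4",
        # 3D cuboids for dynamic agents, enabling geometry-aware control
        "cuboid_annotations": sample_dir / "cuboids.json",
    }

conditioning = build_conditioning(Path("clips/clip_0001"))
prompt = "The same drive at dusk in heavy rain, with wet asphalt reflecting headlights."

# Placeholder for whichever inference entry point your post-trained checkpoint exposes:
# video = generate_transfer_video(prompt=prompt, **conditioning)
```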
A third model, a vision-language model (VLM) post-trained from a reasoning model such as Cosmos Reason, performs automated rejection sampling to discard low-quality or unrealistic outputs, helping ensure the quality and realism of the generated synthetic dataset.
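The rejection step itself is simple to express. Below is a minimal sketch, assuming a score_clip callable that stands in for whatever quality or realism check the post-trained VLM provides; its name, signature, and the threshold value are assumptions for illustration.

```python
from typing import Callable, Iterable

def reject_sample(clips: Iterable[str],
                  score_clip: Callable[[str], float],
                  threshold: float = 0.8) -> list[str]:
    """Keep only the generated clips whose quality score clears the threshold."""
    accepted = []
    for clip in clips:
        score = score_clip(clip)  # e.g., VLM-rated physical plausibility in [0, 1]
        if score >= threshold:
            accepted.append(clip)
    return accepted

# Example with a dummy scorer; in practice score_clip would query the VLM.
kept = reject_sample(["clip_0001.mp4", "clip_0002.mp4"], score_clip=lambda c: 0.9)
```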
Synthetic data pipelines
When used together, these models form a pipeline that starts with text prompts and real-world data, and outputs high-fidelity, physically based multi-view videos.
Multiview generation helps solve common challenges such as broken or occluded cameras: developers can replace footage from a faulty camera with a generated view that is consistent with the remaining cameras. It also unlocks dashcam and internet video as training data, since single-view footage can be expanded to mimic a developer's own AV sensor rig.
The synthetic video data generated from this pipeline can mitigate long-tail distribution problems and enhance generalization in downstream tasks such as 3D lane detection, 3D object detection, and driving policy learning, particularly in challenging scenarios like extreme weather and nighttime conditions.
Attendees at CVPR 2025 this week can learn more about this project at the Embodied AI Workshop.
Developers can use this data in their own development: 40,000 Cosmos-generated clips are now available in the NVIDIA Physical AI Dataset.
Integrating Cosmos into existing AV workflows
Open-source simulators and AV companies have also post-trained Cosmos models on their own data, and have begun to integrate these models into their toolchains, opening up accelerated synthetic data generation pipelines to AV developers worldwide.
Cosmos Transfer
Announced at GTC Paris, the Cosmos Transfer NIM is a containerized version of Cosmos Transfer for accelerated inference. Developers can rapidly post-train and deploy Cosmos Transfer using NIM microservices to speed up their SDG workflow.
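As a rough illustration of what calling a locally deployed NIM container could look like, here is a hedged HTTP example; the port, route, and payload fields are assumptions, and the actual request schema should be taken from the NIM documentation.

```python
import requests

# Assumed endpoint and payload shape for a locally running container.
payload = {
    "prompt": "Snowy night, low visibility, oncoming headlights",
    "control_video": "inputs/hdmap_0001.mp4",  # conditioning input (path or URL)
}

resp = requests.post("http://localhost:8000/v1/infer", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json())  # typically a generated asset reference or a job handle
```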
Open source AV simulator CARLA will integrate Cosmos Transfer to augment simulation outputs, making physically based synthetic data generation available to a community of 150,000 developers. With the integration, users can generate endless high-quality video variations from CARLA sequences using simple prompts. This integration is in early access and will continue to be developed with community feedback.
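For example, a developer might capture RGB and depth sequences from CARLA to serve as conditioning inputs for augmentation. The sketch below uses the standard CARLA Python API; the hand-off to Cosmos Transfer is left as a placeholder.

```python
import carla

# Connect to a running CARLA server.
client = carla.Client("localhost", 2000)
client.set_timeout(10.0)
world = client.get_world()

# Spawn an autopilot vehicle with RGB and depth cameras attached.
blueprints = world.get_blueprint_library()
vehicle_bp = blueprints.filter("vehicle.tesla.model3")[0]
spawn_point = world.get_map().get_spawn_points()[0]
vehicle = world.spawn_actor(vehicle_bp, spawn_point)
vehicle.set_autopilot(True)

cam_transform = carla.Transform(carla.Location(x=1.5, z=2.4))
rgb_cam = world.spawn_actor(blueprints.find("sensor.camera.rgb"),
                            cam_transform, attach_to=vehicle)
depth_cam = world.spawn_actor(blueprints.find("sensor.camera.depth"),
                              cam_transform, attach_to=vehicle)

# Stream frames to disk; these sequences become the conditioning inputs
# handed to Cosmos Transfer (hand-off not shown here).
rgb_cam.listen(lambda image: image.save_to_disk(f"out/rgb/{image.frame:06d}.png"))
depth_cam.listen(lambda image: image.save_to_disk(f"out/depth/{image.frame:06d}.png"))
```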
Mcity, a public-private partnership for AV development and testing, is integrating Cosmos Transfer into the open source digital twin of its 32-acre physical test track. Developers using Mcity for research and development can quickly scale scenarios, adding new weather, lighting, and terrain.
In addition, autonomous vehicle toolchain providers such as Foretellix and Parallel Domain have integrated Cosmos Transfer into their existing solutions. Voxel51, a visual AI data platform, provides the toolkit to manage, visualize, and refine the data generated by Cosmos Transfer. As a result, end customers can easily access the scale and variability of Cosmos Transfer without having to switch from their desired toolchain.
Finally, autonomous vehicle software company Oxa has integrated Cosmos Transfer into its own development toolchain, Oxa Foundry. Cosmos Transfer supports image and image sequence transformation for quick and easy synthesis, customized to specific use cases. This work has included different weather (snow, fog, rain) and illumination (night, dusk, dawn) transformations of real on-road and off-road data.
Cosmos Predict
Also announced at GTC Paris, Cosmos Predict-2 is our best-performing world foundation model yet for future world state prediction, offering higher fidelity, fewer hallucinations, and better text, object, and motion control in the generated video than Predict-1. The model will soon support multiple frame rates and resolutions and generate up to 30 seconds of video predicting what could happen next, specifically the physical interactions in the world, guided by the image prompt.
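Conceptually, using such a model amounts to rolling the world forward from a single image prompt plus a text description of what should happen next. The loader and predict call below are hypothetical placeholders, not the actual Cosmos Predict-2 interface.

```python
# Illustrative placeholders, not the actual Cosmos Predict-2 interface.
frame_path = "inputs/front_camera.png"  # current world state as the image prompt
prompt = "The lead vehicle brakes hard and traffic slows in the right lane."

# model = load_predict_model("cosmos-predict-2")          # hypothetical loader
# video = model.predict(image=frame_path, prompt=prompt,  # hypothetical call
#                       num_seconds=10, fps=24)           # predicted future clip
```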
Cosmos Predict-2 is built for customization; models can be easily post-trained on specific environments, tasks, or camera systems using curated data and tools like NVIDIA NeMo Curator and Cosmos Reason. In addition, Cosmos Predict-2 was pre-trained on AV data from Cosmos-Predict-7B-Single2Multiview-Sample-AV, facilitating faster post-training for the AV domain.
Autonomous trucking company Plus has post-trained Cosmos Predict-1 with extensive amounts of real-world driving data to create multiview videos that match the fidelity of actual video captured by truck cameras. These synthetic multiview videos can then be used to generate edge cases to rigorously test and validate the autonomous trucking system. Plus is also distilling world knowledge from Cosmos to improve end-to-end model performance and the ability to generalize to new operational design domains (ODDs).
Oxa is also using Cosmos Predict to support the generation of comprehensive multi-camera perspectives from around the vehicle, creating temporally consistent video footage across all these viewpoints.
The AV industry embraces end-to-end WFMs
As the AV industry adopts end-to-end foundation models, the need for vast, diverse, and physically accurate sensor data becomes critical. Real-world data alone cannot scale to meet the demands of safe and comprehensive training, especially across diverse operational domains and edge-case scenarios. Cosmos WFMs—Reason, Predict, and Transfer—close this gap by enabling developers to generate, expand, and customize high-fidelity data with unprecedented control and scalability.
Together, these models supercharge the AV development flywheel. Cosmos Predict introduces behavioral diversity and accelerates scenario expansion, Cosmos Transfer brings physical realism across environments, and Cosmos Reason provides the curation and quality filtering that keeps generated data trustworthy. With open access and seamless integration into leading simulation platforms and toolchains, developers can unlock the full potential of end-to-end autonomy, paving the way for safer, smarter, and more scalable AV deployment.
Explore the NVIDIA research papers to be presented at CVPR 2025, and watch the NVIDIA GTC Paris keynote from NVIDIA founder and CEO Jensen Huang.