
Maximize Robotics Performance by Post-Training NVIDIA Cosmos Reason


First unveiled at NVIDIA GTC 2025, NVIDIA Cosmos Reason is an open and fully customizable reasoning vision language model (VLM) for physical AI and robotics. The VLM enables robots and vision AI agents to reason with prior knowledge, physics understanding, and common sense so they can understand and act in the real world. 

Given a video and a text prompt, Cosmos Reason first converts the video into tokens: a vision encoder produces visual embeddings, and a projector maps those embeddings into the language model's token space. The video tokens are combined with the tokenized text prompt and fed into the LLM backbone, which thinks step-by-step and gives detailed, logical responses. 
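
Conceptually, the flow looks like the sketch below. Every name in it is an illustrative stand-in, not the actual Cosmos Reason internals:

# A minimal, illustrative sketch of the pipeline described above.
# All functions are hypothetical stand-ins, not real Cosmos Reason APIs.

def vision_encoder(video_frames):
    # Stand-in: the real encoder produces one embedding per frame
    return [[float(x) for x in frame] for frame in video_frames]

def projector(frame_embeddings):
    # Stand-in: the real projector maps embeddings into the LLM token space
    return [f"<video_token_{i}>" for i in range(len(frame_embeddings))]

def llm_backbone(tokens):
    # Stand-in: the real backbone reasons step by step over the sequence
    return f"<think>reasoning over {len(tokens)} tokens</think><answer>...</answer>"

def reason_over_video(video_frames, text_prompt):
    video_tokens = projector(vision_encoder(video_frames))
    input_tokens = video_tokens + text_prompt.split()
    return llm_backbone(input_tokens)

print(reason_over_video([[0, 1], [2, 3]], "Is it safe to turn right?"))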

Cosmos Reason is built using supervised fine-tuning and reinforcement learning to bridge multimodal perception and real-world decision-making. It uses chain-of-thought reasoning capabilities to understand world dynamics without human annotations.

Fine-tuning on physical AI tasks boosts Cosmos Reason’s base model performance by over 10%, with reinforcement learning adding another 5% gain, enabling the model to achieve a 65.7 average score across key benchmarks in robotics and autonomous vehicle applications.

Figure 1. Cosmos Reason takes in video and text, thinks step-by-step, and makes optimal decisions through reinforcement learning

Use cases for Cosmos Reason

Cosmos Reason supports a range of robotics and physical AI applications, such as the robot planning and reasoning example shown below. 

Video 1. An example of robot planning and reasoning

How to use Cosmos Reason 

Developers can download the model checkpoints from Hugging Face and get the inference and post-training scripts from GitHub.
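
For example, the checkpoint can be fetched with the huggingface_hub library; the local directory name here is just a placeholder:

from huggingface_hub import snapshot_download

# Download the Cosmos Reason checkpoint from Hugging Face.
# local_dir is a placeholder; choose any path you like.
snapshot_download(
    repo_id="nvidia/Cosmos-Reason1-7B",
    local_dir="checkpoints/Cosmos-Reason1-7B",
)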

The model can take in videos at different resolutions and frame rates, along with a text prompt that specifies the developer’s intent, such as a question or explanation, guiding the model to reason and respond accordingly. Developers can also use the prompt upsampler model to improve text prompts.

Here is a snippet showing inference with Cosmos Reason on a sample video:

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

# MODEL_PATH can also point to a local folder containing safetensors checkpoints
MODEL_PATH = "nvidia/Cosmos-Reason1-7B"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_tokens=4096,
)

video_messages = [
    {"role": "system", "content": "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."},
    {"role": "user", "content": [
            {"type": "text", "text": (
                    "Is it safe to turn right?"
                )
            },
            {
                "type": "video", 
                "video": "assets/sample.mp4",
                "fps": 4,
            }
        ]
    },
]

# Here we use video messages as a demonstration
messages = video_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

# Package the multimodal inputs and run generation
mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
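
Because the system prompt asks the model to wrap its reasoning in <think> tags and its final response in <answer> tags, it is often useful to split the two during post-processing. Here is a minimal sketch; the helper name and the sample string are made up for illustration:

import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer) using the <think> and
    <answer> tags requested in the system prompt. Falls back to the raw
    text if the tags are missing."""
    think = re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return (
        think.group(1) if think else "",
        answer.group(1) if answer else text,
    )

# Hypothetical model output, for demonstration only
sample = "<think>\nThe oncoming lane is clear.\n</think>\n\n<answer>\nYes, it is safe to turn right.\n</answer>"
reasoning, answer = split_reasoning(sample)
print(answer)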

Fine-tuning Cosmos Reason

Supervised fine-tuning (SFT) can improve a model’s capability on certain tasks. For example, training with the robovqa dataset can improve model performance on robotics-specific visual question answering scenarios.

import copy
import os

from datasets import load_dataset
from torch.utils.data import ConcatDataset, Dataset
from transformers import AutoTokenizer

# Config and basename_from_modelpath come from the cosmos-rl toolkit;
# the exact import paths may vary between versions
from cosmos_rl.policy.config import Config
from cosmos_rl.utils.util import basename_from_modelpath

FPS = 1
MAX_PIXELS = 81920

class CosmosSFTDataset(Dataset):
    def setup(self, config: Config, tokenizer: AutoTokenizer, *args, **kwargs):
        """
        Called by launcher after being mounted
        """
        self.config = config
        self.tokenizer = tokenizer

        # Load the dataset named in the training config (a sketch: the
        # full script also handles datasets cached on local disk)
        self.dataset = load_dataset(
            config.train.train_policy.dataset.name,
            config.train.train_policy.dataset.subset,
        )

        if config.train.train_policy.dataset.split:
            if isinstance(config.train.train_policy.dataset.split, list):
                dataset_list = []
                for split_name in config.train.train_policy.dataset.split:
                    dataset_list.append(self.dataset[split_name])
                self.dataset = ConcatDataset(dataset_list)
            else:
                assert isinstance(config.train.train_policy.dataset.split, str)
                self.dataset = self.dataset[config.train.train_policy.dataset.split]

        # get multi-modal files paths
        cosmos_cache_dir = os.environ.get(
            "COSMOS_CACHE", os.path.join(os.path.expanduser("~"), ".cache/cosmos/")
        )
        video_clips_path = os.path.join(
            cosmos_cache_dir,
            "datasets",
            basename_from_modelpath(config.train.train_policy.dataset.name),
            config.train.train_policy.dataset.subset,
            "video_clips",
        )

        # Map clip basenames to absolute paths so __getitem__ can resolve
        # the video referenced by each sample
        self.mm_files_paths = {}
        for root, _, files in os.walk(video_clips_path):
            for fname in files:
                self.mm_files_paths[fname] = os.path.join(root, fname)

    def __getitem__(self, idx: int) -> list[dict]:
        """
        Return the conversation list for one sample, with the user turn
        rewritten to attach the video clip
        """
        payload = self.dataset[idx]
        conversations = copy.deepcopy(payload["conversations"])

        for conv in conversations:
            if conv["role"] == "user":
                assert isinstance(conv["content"], str), "User message must be string"
                # Rewrite to support image/video tokens
                content = [
                    {
                        "type": "video",
                        "video": self.mm_files_paths[payload["video"].split("/")[-1]],
                        "max_pixels": MAX_PIXELS,
                        "fps": FPS,
                    },
                    {
                        "type": "text",
                        "text": conv["content"],
                    },
                ]
                conv["content"] = content

        return conversations
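
For reference, a single item returned by __getitem__ looks like the structure below. The paths and strings are made-up placeholders, not real dataset contents:

# Hypothetical example of one robovqa-style training sample after
# __getitem__ rewrites the user turn; all values are placeholders
sample_conversations = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "~/.cache/cosmos/datasets/robovqa/video_clips/clip_0001.mp4",
                "max_pixels": 81920,
                "fps": 1,
            },
            {"type": "text", "text": "What is the robot arm holding?"},
        ],
    },
    {"role": "assistant", "content": "The robot arm is holding a red cube."},
]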

Find more information and fine-tuning scripts on GitHub.

Cosmos Reason is optimized to perform best on NVIDIA GPUs. To run the model, developers can set up a Docker environment or run it directly in their own environment. 
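
As one possible setup (an assumption for illustration, not an official recipe), the checkpoint can be served with vLLM's OpenAI-compatible API, in a container or locally, and queried with the standard openai Python client:

# Assumes the model is already being served, for example with:
#   vllm serve nvidia/Cosmos-Reason1-7B --port 8000
# (this serving command is an assumption, not an official recipe)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Cosmos-Reason1-7B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is it safe to turn right?"},
                # vLLM accepts video inputs through the video_url content type
                {"type": "video_url", "video_url": {"url": "file:///path/to/sample.mp4"}},
            ],
        },
    ],
    max_tokens=4096,
    temperature=0.6,
)
print(response.choices[0].message.content)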

For vision AI pipelines, developers can deploy the VLM from edge to cloud, on hardware such as NVIDIA DGX Spark, NVIDIA RTX PRO 6000, NVIDIA H100 Tensor Core GPUs, or NVIDIA Blackwell GB200 NVL72 on NVIDIA DGX Cloud. 

Get started

Explore the Cosmos documentation for in-depth tutorials, implementation details, and practical use cases.

Stay up to date by subscribing to NVIDIA news, following NVIDIA AI on LinkedIn, Instagram, X, and Facebook, and joining the NVIDIA Cosmos Reason forum.
