
Maximize Robotics Performance by Post-Training NVIDIA Cosmos Reason


First unveiled at NVIDIA GTC 2025, NVIDIA Cosmos Reason is an open and fully customizable reasoning vision language model (VLM) for physical AI and robotics. The VLM enables robots and vision AI agents to reason with prior knowledge, physics understanding, and common sense so they can understand and act in the real world. 

Given a video and a text prompt, Cosmos Reason first converts the video into tokens: a vision encoder produces visual embeddings, and a projector maps those embeddings into the language model's token space. The video tokens are combined with the tokenized text prompt and fed into the LLM backbone, which thinks step-by-step and gives detailed, logical responses. 
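
Conceptually, the flow looks like the sketch below. Every name in it is an illustrative stand-in, not the actual Cosmos Reason internals:

# A minimal, illustrative sketch of the pipeline described above.
# All functions are hypothetical stand-ins, not real Cosmos Reason APIs.

def vision_encoder(video_frames):
    # Stand-in: the real encoder produces one embedding per frame
    return [[float(x) for x in frame] for frame in video_frames]

def projector(frame_embeddings):
    # Stand-in: the real projector maps embeddings into the LLM token space
    return [f"<video_token_{i}>" for i in range(len(frame_embeddings))]

def llm_backbone(tokens):
    # Stand-in: the real backbone reasons step by step over the sequence
    return f"<think>reasoning over {len(tokens)} tokens</think><answer>...</answer>"

def reason_over_video(video_frames, text_prompt):
    video_tokens = projector(vision_encoder(video_frames))
    input_tokens = video_tokens + text_prompt.split()
    return llm_backbone(input_tokens)

print(reason_over_video([[0, 1], [2, 3]], "Is it safe to turn right?"))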

Cosmos Reason is built using supervised fine-tuning and reinforcement learning to bridge multimodal perception and real-world decision-making. It uses chain-of-thought reasoning capabilities to understand world dynamics without human annotations.

Fine-tuning on physical AI tasks boosts Cosmos Reason’s base model performance by over 10%, with reinforcement learning adding another 5% gain, enabling the model to achieve a 65.7 average score across key benchmarks in robotics and autonomous vehicle applications.

Figure 1. Cosmos Reason takes in video and text, thinks step-by-step, and makes optimal decisions through reinforcement learning

Use cases for Cosmos Reason

Cosmos Reason supports a range of robotics and physical AI applications, such as the robot planning and reasoning example shown below. 

Video 1. An example of robot planning and reasoning

How to use Cosmos Reason 

Developers can download the model checkpoints from Hugging Face and get the inference and post-training scripts from GitHub.
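
For example, the checkpoint can be fetched with the huggingface_hub library; the local directory name here is just a placeholder:

from huggingface_hub import snapshot_download

# Download the Cosmos Reason checkpoint from Hugging Face.
# local_dir is a placeholder; choose any path you like.
snapshot_download(
    repo_id="nvidia/Cosmos-Reason1-7B",
    local_dir="checkpoints/Cosmos-Reason1-7B",
)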

The model can take in videos at different resolutions and frame rates, along with a text prompt that specifies the developer’s intent, such as a question or explanation, guiding the model to reason and respond accordingly. Developers can also use the prompt upsampler model to improve text prompts.

Here is a snippet showing inference with Cosmos Reason on a sample video:

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

# MODEL_PATH can also point to a local folder containing safetensors checkpoints
MODEL_PATH = "nvidia/Cosmos-Reason1-7B"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_tokens=4096,
)

video_messages = [
    {"role": "system", "content": "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>."},
    {"role": "user", "content": [
            {"type": "text", "text": (
                    "Is it safe to turn right?"
                )
            },
            {
                "type": "video", 
                "video": "assets/sample.mp4",
                "fps": 4,
            }
        ]
    },
]

# Here we use video messages as a demonstration
messages = video_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

# Package the multimodal inputs and run generation
mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
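
Because the system prompt asks the model to wrap its reasoning in <think> tags and its final response in <answer> tags, it is often useful to split the two during post-processing. Here is a minimal sketch; the helper name and the sample string are made up for illustration:

import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer) using the <think> and
    <answer> tags requested in the system prompt. Falls back to the raw
    text if the tags are missing."""
    think = re.search(r"<think>\s*(.*?)\s*</think>", text, re.DOTALL)
    answer = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return (
        think.group(1) if think else "",
        answer.group(1) if answer else text,
    )

# Hypothetical model output, for demonstration only
sample = "<think>\nThe oncoming lane is clear.\n</think>\n\n<answer>\nYes, it is safe to turn right.\n</answer>"
reasoning, answer = split_reasoning(sample)
print(answer)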

Fine-tuning Cosmos Reason

Supervised fine-tuning (SFT) can improve a model’s capability on certain tasks. For example, training with the robovqa dataset can improve model performance on robotics-specific visual question answering scenarios.

import copy
import os

from datasets import load_dataset
from torch.utils.data import ConcatDataset, Dataset
from transformers import AutoTokenizer

# Config and basename_from_modelpath come from the cosmos-rl toolkit;
# the exact import paths may vary between versions
from cosmos_rl.policy.config import Config
from cosmos_rl.utils.util import basename_from_modelpath

FPS = 1
MAX_PIXELS = 81920

class CosmosSFTDataset(Dataset):
    def setup(self, config: Config, tokenizer: AutoTokenizer, *args, **kwargs):
        """
        Called by launcher after being mounted
        """
        self.config = config
        self.tokenizer = tokenizer

        # Load the dataset named in the training config (a sketch: the
        # full script also handles datasets cached on local disk)
        self.dataset = load_dataset(
            config.train.train_policy.dataset.name,
            config.train.train_policy.dataset.subset,
        )

        if config.train.train_policy.dataset.split:
            if isinstance(config.train.train_policy.dataset.split, list):
                dataset_list = []
                for split_name in config.train.train_policy.dataset.split:
                    dataset_list.append(self.dataset[split_name])
                self.dataset = ConcatDataset(dataset_list)
            else:
                assert isinstance(config.train.train_policy.dataset.split, str)
                self.dataset = self.dataset[config.train.train_policy.dataset.split]

        # get multi-modal files paths
        cosmos_cache_dir = os.environ.get(
            "COSMOS_CACHE", os.path.join(os.path.expanduser("~"), ".cache/cosmos/")
        )
        video_clips_path = os.path.join(
            cosmos_cache_dir,
            "datasets",
            basename_from_modelpath(config.train.train_policy.dataset.name),
            config.train.train_policy.dataset.subset,
            "video_clips",
        )

        # Map clip basenames to absolute paths so __getitem__ can resolve
        # the video referenced by each sample
        self.mm_files_paths = {}
        for root, _, files in os.walk(video_clips_path):
            for fname in files:
                self.mm_files_paths[fname] = os.path.join(root, fname)

    def __getitem__(self, idx: int) -> list[dict]:
        """
        Return the conversation list for one sample, with the user turn
        rewritten to attach the video clip
        """
        payload = self.dataset[idx]
        conversations = copy.deepcopy(payload["conversations"])

        for conv in conversations:
            if conv["role"] == "user":
                assert isinstance(conv["content"], str), "User message must be string"
                # Rewrite to support image/video tokens
                content = [
                    {
                        "type": "video",
                        "video": self.mm_files_paths[payload["video"].split("/")[-1]],
                        "max_pixels": MAX_PIXELS,
                        "fps": FPS,
                    },
                    {
                        "type": "text",
                        "text": conv["content"],
                    },
                ]
                conv["content"] = content

        return conversations
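
For reference, a single item returned by __getitem__ looks like the structure below. The paths and strings are made-up placeholders, not real dataset contents:

# Hypothetical example of one robovqa-style training sample after
# __getitem__ rewrites the user turn; all values are placeholders
sample_conversations = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "~/.cache/cosmos/datasets/robovqa/video_clips/clip_0001.mp4",
                "max_pixels": 81920,
                "fps": 1,
            },
            {"type": "text", "text": "What is the robot arm holding?"},
        ],
    },
    {"role": "assistant", "content": "The robot arm is holding a red cube."},
]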

Find more information and fine-tuning scripts on GitHub.

Cosmos Reason is optimized to perform best on NVIDIA GPUs. To run the model, developers can set up a Docker environment or run it directly in their own environment. 
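
As one possible setup (an assumption for illustration, not an official recipe), the checkpoint can be served with vLLM's OpenAI-compatible API, in a container or locally, and queried with the standard openai Python client:

# Assumes the model is already being served, for example with:
#   vllm serve nvidia/Cosmos-Reason1-7B --port 8000
# (this serving command is an assumption, not an official recipe)
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Cosmos-Reason1-7B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Is it safe to turn right?"},
                # vLLM accepts video inputs through the video_url content type
                {"type": "video_url", "video_url": {"url": "file:///path/to/sample.mp4"}},
            ],
        },
    ],
    max_tokens=4096,
    temperature=0.6,
)
print(response.choices[0].message.content)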

For vision AI pipelines, developers can deploy the VLM from edge to cloud, on hardware such as NVIDIA DGX Spark, NVIDIA RTX PRO 6000, NVIDIA H100 Tensor Core GPUs, or NVIDIA Blackwell GB200 NVL72 on NVIDIA DGX Cloud. 

Get started

Explore the Cosmos documentation for in-depth tutorials, implementation details, and practical use cases.

Stay up to date by subscribing to NVIDIA news, following NVIDIA AI on LinkedIn, Instagram, X, and Facebook, and joining the NVIDIA Cosmos Reason forum.
