
Visual Language Intelligence and Edge AI 2.0 with NVIDIA Cosmos Nemotron


Note: As of January 6, 2025, VILA is now part of the Cosmos Nemotron VLM family.

NVIDIA is proud to announce the release of NVIDIA Cosmos Nemotron, a family of state-of-the-art vision language models (VLMs) designed to query and summarize images and videos from physical or virtual environments. Cosmos Nemotron builds upon NVIDIA’s groundbreaking visual understanding research, including VILA, NVILA, NVLM, and more. This new model family represents a significant advancement in our multimodal AI capabilities, incorporating innovations such as multi-image analysis, video understanding, spatial-temporal reasoning, in-context learning, and zero-shot and few-shot tasks.

In this post, we describe how Cosmos Nemotron performs against other models to deliver edge AI 2.0. 

Initial versions of edge AI involved deploying compressed AI models onto edge devices. This phase, known as Edge AI 1.0, focused on task-specific models. The challenge with this approach lay in the need to train different models with different datasets, where negative samples are hard to collect and outlier situations are difficult to handle. This process was time-consuming and highlighted the need for more adaptable AI solutions with better generalization.

Edge AI 2.0: The rise of generative AI

Edge AI 2.0 marks a shift towards enhanced generalization, powered by foundational visual language models (VLMs). 

VLMs such as Cosmos Nemotron demonstrate incredible versatility, understanding complex instructions and swiftly adapting to new scenarios. This flexibility positions them as vital tools in a wide array of applications. They can optimize decision-making in self-driving vehicles, create personalized interactions within IoT and AIoT environments, enable event detection, and enhance smart home experiences.

The core strength of VLMs lies in their world knowledge acquired during language pre-training and the ability for users to query them with natural language. This leads to dynamic processing abilities for AI-powered smart cameras without needing to hardcode bespoke vision pipelines.
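To make this concrete, the following is a minimal sketch of a natural-language camera query loop. The vlm_query function and the frame dictionaries are hypothetical placeholders standing in for a real VLM inference call, not part of any Cosmos Nemotron API:

```python
# Hypothetical smart-camera loop: the "pipeline" is just a natural-language question.
def vlm_query(frame: dict, prompt: str) -> str:
    """Placeholder for a real VLM inference call (TinyChat, an API endpoint, and so on)."""
    return "yes" if frame.get("has_package") else "no"

def monitor(camera_frames):
    """Yield frames where the VLM answers 'yes' to a natural-language question."""
    question = "Is there a package at the front door? Answer yes or no."
    for frame in camera_frames:
        if vlm_query(frame, question).strip().lower().startswith("yes"):
            yield frame

frames = [{"id": 1, "has_package": False}, {"id": 2, "has_package": True}]
print([f["id"] for f in monitor(frames)])  # [2]
```

Changing what the camera looks for is a matter of editing the question, not retraining or re-plumbing a detection pipeline.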

VLM on the edge: Cosmos Nemotron and NVIDIA Jetson Orin

To accomplish edge AI 2.0, a VLM must be high-performance and easy to deploy. Cosmos Nemotron achieves both by using the following:

  • A carefully designed training pipeline with a high-quality data mixture
  • AWQ 4-bit quantization with negligible accuracy loss
Diagram shows a vision encoder, projector, and LLM. The training recipe contains three stages: projector training, interleaved pre-training, and vision-text joint SFT.
Figure 1. Cosmos Nemotron model architecture and training recipe

Cosmos Nemotron is a visual language model that brings visual information into LLMs. The Cosmos Nemotron model consists of a visual encoder, an LLM, and a projector that bridges the embeddings from the two modalities. To leverage powerful LLMs, Cosmos Nemotron uses the visual encoder to encode images or video as visual tokens and then feeds these visual tokens into the LLM as if they were a foreign language. This design can handle an arbitrary number of interleaved image-text inputs.
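As a rough illustration of this wiring, the sketch below concatenates projected visual tokens with text embeddings into a single sequence for the language model. The module choices, dimensions, and token counts are illustrative assumptions, not the actual Cosmos Nemotron implementation:

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Minimal vision-language wiring: vision encoder -> projector -> LLM."""
    def __init__(self, vision_dim=1152, llm_dim=2560, vocab_size=32000):
        super().__init__()
        self.vision_encoder = nn.Identity()              # stand-in for CLIP/SigLIP
        self.projector = nn.Linear(vision_dim, llm_dim)  # bridges the two modalities
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(                # stand-in for a decoder LLM
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image_feats, text_ids):
        # image_feats: [B, N_vis, vision_dim] patch features from the vision encoder
        vis_tokens = self.projector(self.vision_encoder(image_feats))
        txt_tokens = self.text_embed(text_ids)           # [B, N_txt, llm_dim]
        # Visual tokens join the sequence like "foreign language" tokens
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)
        return self.llm(seq)

model = TinyVLM()
out = model(torch.randn(1, 196, 1152), torch.randint(0, 32000, (1, 32)))
print(out.shape)  # torch.Size([1, 228, 2560])
```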

Cosmos Nemotron’s success stems from its enhanced pretraining recipe. We observed three major findings after an ablation study on visual language model pretraining choices: 

  • Freezing LLMs during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM.
  • Interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal. 
  • Re-blending text-only instruction data with image-text data during instruction fine-tuning not only remedies the degradation on text-only tasks but also boosts VLM task accuracy (a minimal blending sketch follows this list).
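The third finding, re-blending text-only instruction data into the SFT mixture, can be sketched as follows; the dataset contents and the 10% text-only ratio are illustrative assumptions, not the actual Cosmos Nemotron data recipe:

```python
import random

# Hypothetical instruction-tuning mixture: re-blend text-only instruction data
# with image-text data during SFT.
image_text_sft = [{"image": f"img_{i}.jpg", "text": f"caption {i}"} for i in range(900)]
text_only_sft = [{"image": None, "text": f"instruction {i}"} for i in range(100)]

def blend(image_text, text_only, text_ratio=0.1, seed=0):
    """Return a shuffled mixture where roughly `text_ratio` of samples are text-only."""
    rng = random.Random(seed)
    n_text = int(len(image_text) * text_ratio / (1 - text_ratio))
    mixture = image_text + rng.sample(text_only, min(n_text, len(text_only)))
    rng.shuffle(mixture)
    return mixture

sft_data = blend(image_text_sft, text_only_sft)
print(len(sft_data), sum(1 for s in sft_data if s["image"] is None))
```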

We observed that the pre-training process unlocked several interesting capabilities for the model: 

  • Multi-image reasoning, despite the model only seeing single image-text pairs during SFT (supervised fine-tuning)
  • Stronger in-context learning capabilities
  • Enhanced world knowledge

For more information, see the On Pre-training for Visual Language Models and Efficient Frontier Visual Language Models papers, and the NVLabs/cosmos-nemotron GitHub repo.

NVIDIA Jetson Orin offers unparalleled AI compute, large unified memory, and comprehensive AI software stacks, making it the perfect platform to deploy Cosmos Nemotron on energy-efficient edge devices. Jetson Orin can run fast inference on any generative AI model powered by the transformer architecture and leads edge performance on MLPerf.

AWQ quantization 

To deploy Cosmos Nemotron on Jetson Orin, we integrated Activation-aware Weight Quantization (AWQ) to enable 4-bit quantization. AWQ enables us to quantize Cosmos Nemotron to 4-bit precision with negligible accuracy loss, paving the way for VLMs to transform edge computing while upholding performance standards.
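The core idea can be illustrated with a toy version of activation-aware, group-wise 4-bit quantization in PyTorch. This is a simplified sketch of the concept, not the llm-awq implementation or the optimized kernels used in deployment:

```python
import torch

def awq_style_quantize(weight, act_scale, group_size=128, n_bits=4):
    """
    Toy activation-aware 4-bit weight quantization.
      weight:    [out_features, in_features] floating-point linear weight
      act_scale: [in_features] average activation magnitude per input channel
    Salient input channels (large activations) are scaled up before quantization
    so they lose less precision; the scaling is undone at dequantization time.
    """
    # 1. Activation-aware per-channel scaling (alpha = 0.5 used for illustration)
    scales = act_scale.clamp(min=1e-5) ** 0.5
    w = weight * scales                               # scale input channels by importance

    # 2. Group-wise asymmetric quantization to n_bits
    out_f, in_f = w.shape
    w = w.reshape(out_f, in_f // group_size, group_size)
    w_min, w_max = w.amin(dim=-1, keepdim=True), w.amax(dim=-1, keepdim=True)
    qmax = 2 ** n_bits - 1
    step = (w_max - w_min).clamp(min=1e-8) / qmax
    q = ((w - w_min) / step).round().clamp(0, qmax)

    # 3. Dequantize and undo the activation scaling (for accuracy inspection)
    w_deq = (q * step + w_min).reshape(out_f, in_f) / scales
    return q.to(torch.uint8), w_deq

w = torch.randn(256, 256)
act = torch.rand(256) + 0.1
q, w_deq = awq_style_quantize(w, act)
print((w - w_deq).abs().mean())  # small reconstruction error
```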

Despite advancements like AWQ, deploying large language and visual models on edge devices remains a complex task. Four-bit weights lack byte alignment and demand specialized computation for optimal efficiency. 
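For example, two 4-bit values must be packed into each byte and unpacked again inside the kernel. A minimal NumPy sketch of that packing follows; the layout is illustrative, as production kernels use layouts tuned to the GPU memory hierarchy:

```python
import numpy as np

def pack_int4(q):
    """Pack an even-length array of 4-bit values (0..15) into bytes, two per byte."""
    q = np.asarray(q, dtype=np.uint8)
    assert q.size % 2 == 0 and q.max() <= 15
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover the original 4-bit values from the packed byte array."""
    lo = packed & 0x0F
    hi = packed >> 4
    return np.stack([lo, hi], axis=-1).reshape(-1)

q = np.array([3, 15, 0, 9], dtype=np.uint8)
packed = pack_int4(q)                        # 2 bytes instead of 4
assert np.array_equal(unpack_int4(packed), q)
```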

TinyChat is an efficient inference framework designed specifically for LLMs and VLMs on edge devices. TinyChat’s adaptable nature enables it to run on various hardware platforms, from NVIDIA RTX 4070 laptop GPUs to NVIDIA Jetson Orin, attracting significant interest from the open-source community. 

Now, TinyChat expands its reach to support VILA, enabling understanding and reasoning over visual data. TinyChat delivers exceptional efficiency and flexibility in combining textual and visual processing, empowering edge devices to execute cutting-edge multimodal tasks.

Benchmarks

The following tables show benchmark results for VILA 1.5-3B, which performs strongly on both image QA and video QA benchmarks for its size. AWQ 4-bit quantization loses virtually no accuracy, and integrating Scaling on Scales (S2) lets the model perceive higher-resolution images and boosts performance further.

| Model | Precision | VQA-V2 | VizWiz | GQA | VQA-T | ScienceQA-I | MME | SEED-I | MMMU (val) | MMMU (test) |
|---|---|---|---|---|---|---|---|---|---|---|
| VILA-1.5-3B-S2 | fp16 | 79.8 | 61.3 | 61.4 | 63.4 | 69.6 | 1432 | 66.5 | 33.1 | 31.3 |
| VILA1.5-3B | fp16 | 80.4 | 53.5 | 61.5 | 60.4 | 69.0 | 1442 | 67.9 | 33.3 | 30.8 |
| VILA1.5-3B | int4 | 80.0 | 53.8 | 61.1 | 60.4 | 67.8 | 1437 | 66.6 | 32.7 | 31.1 |
Table 1. Model evaluation results on image QA benchmarks (before/after quantization)
| Model | ActivityNet | MSVD | MSR-VTT | TGIF | Perception Test |
|---|---|---|---|---|---|
| VILA1.5-3B | 50.2 | 76.6 | 57.5 | 51.7 | 39.3 |
Table 2. Model evaluation results on Video QA benchmarks

Deploying on Jetson Orin and NVIDIA RTX

With the increasing prevalence of cameras and vision systems in real-world environments, running Cosmos Nemotron inference on edge devices is an important task. Depending on the model size, you can choose from the seven Jetson Orin modules, ranging from entry-level to high-performance AI. This gives you the ultimate flexibility to build generative AI applications for smart home devices, medical instruments, autonomous robots, and video analytics that users can reconfigure and query dynamically.

Figure 3 shows the end-to-end multimodal pipeline performance for running Cosmos Nemotron on Jetson AGX Orin and Jetson Orin Nano, with both achieving interactive rates on video streams. 

VILA1.5 2.7B runs at up to 7.5 frames per second on Jetson AGX Orin, much faster than comparable works. Jetson Orin Nano 8GB can run models up to 7B/8B.
Figure 3. Cosmos Nemotron inference speed comparison

These benchmarks include the overall time to query a frame, including vision encoding (with CLIP or SigLIP), multimodal projection, assembly of the chat embeddings, and generation of the language model output with 4-bit quantization. The VILA-1.5 models include a novel adaptation that reduces the number of tokens used to represent each image embedding from 729 down to 196 tokens, which boosts performance while retaining accuracy under increased spatial resolution in the vision encoder. 
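The exact token-reduction adaptation is not detailed here, but its effect can be pictured as downsampling the patch-token grid: 729 tokens form a 27×27 grid, which can be pooled down to 14×14 = 196 tokens. The sketch below uses adaptive average pooling as one plausible stand-in for such a reduction; this is an assumption for illustration, not necessarily the VILA-1.5 mechanism:

```python
import torch
import torch.nn.functional as F

def reduce_visual_tokens(vis_tokens, out_side=14):
    """
    Shrink a square grid of patch tokens by 2D pooling.
      vis_tokens: [B, N, D] with N a perfect square (for example, 729 = 27 x 27)
    Returns [B, out_side * out_side, D] (for example, 196 tokens).
    """
    b, n, d = vis_tokens.shape
    side = int(n ** 0.5)
    grid = vis_tokens.transpose(1, 2).reshape(b, d, side, side)   # [B, D, 27, 27]
    pooled = F.adaptive_avg_pool2d(grid, out_side)                # [B, D, 14, 14]
    return pooled.flatten(2).transpose(1, 2)                      # [B, 196, D]

tokens = torch.randn(1, 729, 1152)
print(reduce_visual_tokens(tokens).shape)  # torch.Size([1, 196, 1152])
```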

This highly optimized VLM pipeline is open source and integrates advanced features such as multimodal RAG with one-shot image tagging, along with efficient re-use of the image embeddings for other vision-related tasks across the system.
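As a simplified picture of how cached image embeddings can be reused for retrieval, the sketch below stores normalized embeddings once and looks them up by cosine similarity. The class, keys, and embedding dimension are illustrative stand-ins, not the actual open-source pipeline or NanoDB code:

```python
import numpy as np

class EmbeddingCache:
    """Store image embeddings once; reuse them for retrieval and other vision tasks."""
    def __init__(self, dim):
        self.keys, self.vecs = [], np.empty((0, dim), dtype=np.float32)

    def add(self, key, embedding):
        v = embedding / (np.linalg.norm(embedding) + 1e-8)   # normalize once at insert
        self.keys.append(key)
        self.vecs = np.vstack([self.vecs, v[None, :]])

    def search(self, query_embedding, top_k=3):
        q = query_embedding / (np.linalg.norm(query_embedding) + 1e-8)
        sims = self.vecs @ q                                  # cosine similarity
        order = np.argsort(-sims)[:top_k]
        return [(self.keys[i], float(sims[i])) for i in order]

# Usage: tag a frame once ("one-shot"), then retrieve similar frames later.
cache = EmbeddingCache(dim=512)
cache.add("frame_0001: delivery at front door", np.random.randn(512).astype(np.float32))
cache.add("frame_0042: empty driveway", np.random.randn(512).astype(np.float32))
print(cache.search(np.random.randn(512).astype(np.float32), top_k=1))
```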

In this GIF, VILA1.5-2.7B reasons about a building’s condition.
Figure 4. VILA1.5-2.7B (4-bit) running on Jetson Orin

Consumer GPU experiences

Cosmos Nemotron can also be deployed in consumer GPUs such as NVIDIA RTX on laptops and PC workstations to enhance user productivity and interaction experiences.

In this GIF, VILA1.5-2.7B reasons about two images of temperature changes.
Figure 5. VILA1.5-2.7B (4-bit) running on NVIDIA RTX 4090

Multi-image reasoning

TinyChat’s newest release supports Cosmos Nemotron’s impressive multi-image reasoning capabilities, enabling you to upload multiple images simultaneously for enhanced interactions. This unlocks exciting possibilities.

Figure 6 shows that Cosmos Nemotron can understand the content and order of image sequences, opening new avenues for creative applications.

GIF shows VILA reasoning about three images and figuring out what the user had for lunch and the time.
Figure 6. VILA1.5-2.7B (4-bit) on multi-image understanding
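Multi-image prompts are typically expressed by interleaving image placeholders with text in the chat message. The layout below is a hypothetical example; the <image> placeholder name and message structure are assumptions rather than the exact Cosmos Nemotron template:

```python
# Hypothetical interleaved multi-image prompt; placeholder names are illustrative.
messages = [
    {
        "role": "user",
        "content": (
            "<image>\nThis is what I ate first.\n"
            "<image>\nThis came next.\n"
            "<image>\nAnd this was last.\n"
            "In what order did I eat, and what was the full meal?"
        ),
        "images": ["soup.jpg", "sandwich.jpg", "coffee.jpg"],
    }
]
# At inference time, each <image> placeholder is replaced by that image's
# projected visual tokens before the sequence is passed to the LLM.
```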

In-context learning

Cosmos Nemotron also demonstrates remarkable in-context learning abilities. Without the need for explicit system prompts, Cosmos Nemotron can seamlessly infer patterns from previous image-text pairs to generate relevant text for new image inputs. 

In Figure 7, Cosmos Nemotron successfully recognizes the NVIDIA logo and, mirroring the style of previous examples, outputs NVIDIA’s most famous products.

GIF shows that VILA1.5-2.7B figured out what NVIDIA is famous for, given the example of Google and Apple.
Figure 7. VILA1.5-2.7B (4-bit) on an in-context learning task
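The few-shot pattern behind this behavior can be written out as an interleaved prompt, where earlier image-text pairs establish the format and the final image is left for the model to complete. The structure below is a hypothetical illustration, not the literal prompt used in Figure 7:

```python
# Hypothetical in-context learning prompt: two worked examples, then a query image.
few_shot_prompt = [
    {"image": "google_logo.png", "text": "Google is famous for the Search engine."},
    {"image": "apple_logo.png",  "text": "Apple is famous for the iPhone."},
    {"image": "nvidia_logo.png", "text": ""},  # left blank for the model to complete
]
```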

Get started with Cosmos Nemotron

We plan to continue to innovate on Cosmos Nemotron, including extending context length, increasing resolution, and curating a better dataset for vision and language alignment.

For more information about this family of models, see the following resources.

  • To get started with Cosmos Nemotron, see the NVLabs/VILA GitHub repo.
  • For a multimodal web UI where you can speak to Cosmos Nemotron with ASR/TTS running on Jetson Orin, see the llamaspeak agent tutorial.
  • For streaming Cosmos Nemotron on a camera or video feed, see the Live Llava agent tutorial. 

For more ideas about generative AI at the edge, see the Jetson AI Lab, especially the following videos.

Video 1. JETSON AI LAB | Live Llava 2.0 – VILA + Multimodal NanoDB on Jetson Orin
Video 2. JETSON AI LAB | One-Shot Multimodal RAG on Jetson Orin
