Generative AI

Visual Language Intelligence and Edge AI 2.0


VILA is a family of high-performance vision language models developed by NVIDIA Research and MIT. The family ranges from ~3B parameters in the smallest model to ~40B in the largest, and it is fully open source, including model checkpoints, training code, and training data.

In this post, we describe how VILA performs against other models to deliver edge AI 2.0. 

Initial versions of edge AI involved deploying compressed AI models onto edge devices. This phase, known as Edge AI 1.0, focused on task-specific models. The challenge with this approach lay in the need to train different models with different datasets, where negative samples are hard to collect and outlier situations are difficult to handle. This process was time-consuming and highlighted the need for more adaptable AI solutions with better generalization.

Edge AI 2.0: The rise of generative AI

Edge AI 2.0 marks a shift towards enhanced generalization, powered by foundational visual language models (VLMs). 

VLMs such as VILA demonstrate incredible versatility, understanding complex instructions and swiftly adapting to new scenarios. This flexibility positions them as vital tools in a wide array of applications: they can optimize decision-making in self-driving vehicles, create personalized interactions within IoT and AIoT environments, detect events, and enhance smart home experiences. 

The core strength of VLMs lies in their world knowledge acquired during language pre-training and the ability for users to query them with natural language. This leads to dynamic processing abilities for AI-powered smart cameras without needing to hardcode bespoke vision pipelines.

VLM on the edge: VILA and NVIDIA Jetson Orin

To accomplish edge AI 2.0, a VLM must be high-performance and easy to deploy. VILA achieves both by using the following:

  • A carefully designed training pipeline with a high-quality data mixture
  • AWQ 4-bit quantization with negligible accuracy loss
Diagram shows a vision encoder, projector, and LLM. The training recipe contains three stages: train the projector, do interleaved pre-training, and vision-text joint SFT.
Figure 1. VILA model architecture and training recipe

VILA is a visual language model that brings visual information into LLMs. The VILA model consists of a vision encoder, an LLM, and a projector that bridges the embeddings from the two modalities. To leverage powerful LLMs, VILA uses the vision encoder to encode images or video as visual tokens and then feeds these visual tokens into the LLM as if they were a foreign language. This design can handle an arbitrary number of interleaved image-text inputs.
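The token flow described above can be sketched in a few lines. This is a minimal illustration with made-up dimensions and stand-in functions (`vision_encoder`, `projector` here are random placeholders, not VILA's actual components), showing how projected visual tokens are simply concatenated with text embeddings before reaching the LLM:

```python
import numpy as np

# Illustrative dimensions, not VILA's actual sizes
NUM_PATCHES = 196   # visual tokens per image
VISION_DIM = 1024   # vision encoder embedding size
LLM_DIM = 4096      # LLM hidden size

rng = np.random.default_rng(0)

def vision_encoder(image):
    """Stand-in for a CLIP/SigLIP encoder: image -> patch embeddings."""
    return rng.standard_normal((NUM_PATCHES, VISION_DIM))

def projector(visual_feats, W):
    """Linear projector bridging vision features into the LLM embedding space."""
    return visual_feats @ W

def embed_text(tokens, embedding_table):
    """Ordinary LLM token embedding lookup."""
    return embedding_table[tokens]

# Fake inputs
image = None  # placeholder for a real image tensor
W_proj = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.01
vocab = rng.standard_normal((32000, LLM_DIM)) * 0.01
prompt_tokens = np.array([1, 42, 7])  # e.g. "<s> Describe this"

# Visual tokens are interleaved with text tokens as if they were
# a foreign language the LLM reads natively.
visual_tokens = projector(vision_encoder(image), W_proj)
text_embeds = embed_text(prompt_tokens, vocab)
llm_input = np.concatenate([text_embeds, visual_tokens], axis=0)

print(llm_input.shape)  # (3 + 196, 4096)
```

Because visual tokens live in the same embedding space as text tokens after projection, any number of images can be interleaved with text at arbitrary positions in the sequence.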

VILA’s success stems from its enhanced pre-training recipe. Our ablation study on visual language model pre-training choices yielded three major findings: 

  • Freezing LLMs during pre-training can achieve decent zero-shot performance but lacks in-context learning capability, which requires unfreezing the LLM.
  • Interleaved pre-training data is beneficial, whereas image-text pairs alone are not optimal. 
  • Re-blending text-only instruction data to image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks but also boosts VLM task accuracy.
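The three-stage recipe in Figure 1 can be summarized as a freezing schedule over the model's components. The sketch below uses plain Python flags to mirror the stages (the module and stage names follow the recipe, but the mechanics are illustrative, not VILA's training code):

```python
# Illustrative sketch of VILA's three-stage training recipe as a
# freezing schedule; not the actual training implementation.

class Module:
    def __init__(self, name):
        self.name = name
        self.trainable = False

vision_encoder = Module("vision_encoder")
projector = Module("projector")
llm = Module("llm")

def configure_stage(stage):
    # Stage 1: train only the projector to align the two modalities.
    # Stage 2: interleaved pre-training; the LLM is unfrozen, since
    #          freezing it preserves zero-shot accuracy but sacrifices
    #          in-context learning (finding 1 above).
    # Stage 3: vision-text joint SFT, re-blending text-only instruction
    #          data alongside image-text data (finding 3 above).
    vision_encoder.trainable = False
    projector.trainable = True
    llm.trainable = stage >= 2

configure_stage(1)
print(llm.trainable)   # False: LLM frozen while the projector warms up
configure_stage(2)
print(llm.trainable)   # True: unfrozen for interleaved pre-training
```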

We observed that the pre-training process unlocked several interesting capabilities for the model: 

  • Multi-image reasoning, despite the model only seeing single image-text pairs during supervised fine-tuning (SFT)
  • Stronger in-context learning capabilities
  • Enhanced world knowledge

For more information, see Visual Language Models on NVIDIA Hardware with VILA, the VILA: On Pre-training for Visual Language Models paper, and the /Efficient-Large-Model/VILA GitHub repo.

NVIDIA Jetson Orin offers unparalleled AI compute, large unified memory, and comprehensive AI software stacks, making it the perfect platform for deploying VILA on energy-efficient edge devices. Jetson Orin can run fast inference on any generative AI model powered by the transformer architecture, leading edge performance on MLPerf.

AWQ quantization 

To deploy VILA on Jetson Orin, we integrated Activation-aware Weight Quantization (AWQ) to enable 4-bit quantization. AWQ enables us to quantize VILA to 4-bit precision with negligible accuracy loss, paving the way for VLMs to transform edge computing while upholding performance standards.

Despite advancements like AWQ, deploying large language and visual models on edge devices remains a complex task. Four-bit weights lack byte alignment and demand specialized computation for optimal efficiency. 
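To make the 4-bit idea concrete, here is a simplified group-wise quantize/dequantize round trip in numpy. This is only a sketch of the storage format's core arithmetic: the real AWQ method additionally applies activation-aware per-channel scaling to protect salient weights, and the group size and rounding scheme here are illustrative choices, not AWQ's exact implementation:

```python
import numpy as np

def quantize_4bit_groupwise(w, group_size=128):
    """Simplified 4-bit asymmetric group-wise quantization.
    (Illustrative only; real AWQ also rescales salient channels
    based on activation statistics before quantizing.)"""
    w = w.reshape(-1, group_size)
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0          # 4 bits -> 16 levels (0..15)
    scale = np.where(scale == 0, 1e-8, scale)
    q = np.clip(np.round((w - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    return q * scale + w_min

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale, zero = quantize_4bit_groupwise(w)
w_hat = dequantize(q, scale, zero).reshape(-1)

# Worst-case rounding error is scale/2 per group
err = np.abs(w - w_hat).max()
print(f"max reconstruction error: {err:.3f}")
```

Note that each 4-bit value occupies only half a byte, which is why (as mentioned above) real kernels must pack two values per byte and use specialized compute paths for efficiency.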

TinyChat is an efficient inference framework designed specifically for LLMs and VLMs on edge devices. TinyChat’s adaptable nature enables it to run on various hardware platforms, from NVIDIA RTX 4070 laptop GPUs to NVIDIA Jetson Orin, attracting significant interest from the open-source community. 

Now, TinyChat expands its reach to support VILA, enabling the vital understanding and reasoning of visual data. TinyChat delivers exceptional efficiency and flexibility in combining textual and visual processing, empowering edge devices to execute cutting-edge, multi-modal tasks.


The following tables show the benchmark results for VILA 1.5-3B, which performs well on both image QA and video QA benchmarks for its size. AWQ 4-bit quantization loses almost no accuracy, and integrating Scaling on Scales (S2) lets the model perceive higher-resolution images, further boosting performance.

Table 1. Model evaluation results on image QA benchmarks (before/after quantization). Columns: Model, Precision, VQA-v2, VizWiz, GQA, VQA-T, ScienceQA-I, MME, SEED-I, MMMU (val), MMMU (test).
Table 2. Model evaluation results on video QA benchmarks. Columns: Model, ActivityNet, MSVD, MSR-VTT, TGIF, Perception Test.

Deploying on Jetson Orin and NVIDIA RTX

With the increasing prevalence of cameras and vision systems in real-world environments, inferencing VILA on edge devices is an important task. Depending on the model size, you can choose from the seven Jetson Orin modules, ranging from entry-level AI to high-performance. This gives you the ultimate flexibility to build generative AI applications for smart home devices, medical instruments, autonomous robots, and video analytics that users can reconfigure and query dynamically.

Figure 3 shows the end-to-end multimodal pipeline performance for running VILA on Jetson AGX Orin and Jetson Orin Nano, with both achieving interactive rates on video streams. 

VILA-2.7B runs up to 7.5 frames per second on Jetson AGX Orin, much faster than other works. Jetson Orin Nano 8GB runs up to the 7B/8B models.
Figure 3. VILA inference speed comparison

These benchmarks include the overall time to query a frame, including vision encoding (with CLIP or SigLIP), multimodal projection, assembly of the chat embeddings, and generation of the language model output with 4-bit quantization. The VILA-1.5 models include a novel adaptation that reduces the number of tokens used to represent each image embedding from 729 down to 196 tokens, which boosts performance while retaining accuracy under increased spatial resolution in the vision encoder. 
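One common way to reduce the visual token count, as in the 729-to-196 reduction mentioned above (a 27×27 patch grid down to 14×14), is spatial pooling over the token grid. The sketch below implements adaptive average pooling in numpy as an illustration of the idea; it is not necessarily VILA-1.5's exact mechanism:

```python
import numpy as np

def adaptive_avg_pool(tokens, in_hw, out_hw):
    """Average-pool a (H*W, D) token grid down to (h*w, D).
    Bin edges follow the usual adaptive-pooling convention."""
    H, W = in_hw
    h, w = out_hw
    D = tokens.shape[1]
    grid = tokens.reshape(H, W, D)
    out = np.empty((h, w, D))
    for i in range(h):
        r0, r1 = (i * H) // h, -(-((i + 1) * H) // h)   # floor / ceil edges
        for j in range(w):
            c0, c1 = (j * W) // w, -(-((j + 1) * W) // w)
            out[i, j] = grid[r0:r1, c0:c1].mean(axis=(0, 1))
    return out.reshape(h * w, D)

# 27x27 = 729 visual tokens with a toy embedding size of 64
visual_tokens = np.random.default_rng(0).standard_normal((27 * 27, 64))
reduced = adaptive_avg_pool(visual_tokens, (27, 27), (14, 14))
print(visual_tokens.shape[0], "->", reduced.shape[0])  # 729 -> 196
```

Shrinking the per-image token count cuts both the prefill cost and the KV-cache footprint of the language model, which is where the end-to-end speedup comes from.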

This highly optimized VLM pipeline is open source and integrates advanced features such as multimodal RAG with one-shot image tagging, along with efficient re-use of the image embeddings for other vision-related tasks across the system.

In this GIF, VILA reasons about a building’s condition.
Figure 4. VILA-3B (4-bit) running on Jetson Orin

Consumer GPU experiences

VILA can also be deployed in consumer GPUs such as NVIDIA RTX on laptops and PC workstations to enhance user productivity and interaction experiences.

In this GIF, VILA reasons about two images of temperature changes.
Figure 5. VILA-3B (4-bit) running on NVIDIA RTX 4090

Multi-image reasoning

TinyChat’s newest release supports VILA’s impressive multi-image reasoning capabilities, enabling you to upload multiple images simultaneously for enhanced interactions. This unlocks exciting possibilities. 

Figure 6 shows that VILA can understand the content and order of image sequences, opening new avenues for creative applications.

GIF shows VILA reasoning about three images and figuring out what the user had for lunch and the time.
Figure 6. VILA-3B (4-bit) on multi-image understanding

In-context learning

VILA also demonstrates remarkable in-context learning abilities. Without the need for explicit system prompts, VILA can seamlessly infer patterns from previous image-text pairs to generate relevant text for new image inputs. 
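An in-context prompt like the one in Figure 7 is just an interleaved sequence of images and text, with the final answer left for the model to complete. The sketch below assembles such a prompt; the `<image:...>` placeholder convention is purely illustrative, not VILA's actual input format:

```python
# Hypothetical few-shot prompt for the Figure 7 task: two worked
# image-text examples, then a query image with no answer.
few_shot = [
    ("<image:google_logo>", "Google is famous for its search engine."),
    ("<image:apple_logo>",  "Apple is famous for the iPhone."),
    ("<image:nvidia_logo>", None),   # the model completes this one
]

def build_prompt(examples):
    parts = []
    for image, answer in examples:
        parts.append(image)
        if answer is not None:
            parts.append(answer)
    return "\n".join(parts)

prompt = build_prompt(few_shot)
print(prompt)
```

No system prompt or task description is needed: the model infers the "logo, then famous product" pattern from the two preceding image-text pairs, which is exactly the in-context learning ability unlocked by unfreezing the LLM during interleaved pre-training.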

In Figure 7, VILA successfully recognizes the NVIDIA logo and, mirroring the style of previous examples, outputs NVIDIA’s most famous products.

GIF shows that VILA figured out what NVIDIA is famous for, given the example of Google and Apple.
Figure 7. VILA-3B (4-bit) on an in-context learning task

Get started with VILA

We plan to continue to innovate on VILA, including extending context length, increasing resolution, and curating a better dataset for vision and language alignment.

For more information about this family of models, see the following resources.

For more ideas about generative AI at the edge, see the Jetson AI Lab, especially the following videos.

Video 1. JETSON AI LAB | Live Llava 2.0 – VILA + Multimodal NanoDB on Jetson Orin
Video 2. JETSON AI LAB | One-Shot Multimodal RAG on Jetson Orin