Vision Language Models (VLMs) are an exciting breakthrough in AI technology that offers a more dynamic and flexible method for video analysis. VLMs enable users to interact with image and video input using natural language, making the technology more accessible and adaptable. These models can run on the NVIDIA Jetson Orin edge AI platform or on discrete GPUs through NVIDIA NIM microservices. This blog post explores how to build VLM-based visual AI agents that can run from the edge to the cloud.
What is a visual AI agent?
A visual AI agent is powered by a VLM: you can ask a broad range of questions in natural language and get insights that reflect true intent and context from recorded or live video. These agents can be accessed through easy-to-use REST APIs and integrated with other services and even mobile apps. This new generation of visual AI agents helps summarize scenes, create a wide range of alerts, and extract actionable insights from video using natural language.
NVIDIA Metropolis provides visual AI agent workflows: reference solutions that accelerate the development of VLM-powered AI applications that extract insights with contextual understanding from video, whether deployed at the edge or in the cloud.
For cloud deployment, developers can use NVIDIA NIM, a set of inference microservices that includes industry-standard APIs, domain-specific code, optimized inference engines, and enterprise runtime, to power visual AI agents. Get started by visiting the API catalog to explore and try the foundation models directly from a browser. View examples of NIM-powered visual AI agents on the Metropolis NIM Workflows GitHub page.
This blog post focuses on edge use cases on Jetson Orin and explores how to use Jetson Platform Services, a new feature of the NVIDIA JetPack SDK, for edge deployment. We'll build a generative AI-powered application capable of detecting events set by the user in natural language on live video streams and then notifying the user, as shown in Figure 1.
Build visual AI agents for the edge using Jetson Platform Services
Jetson Platform Services is a suite of prebuilt microservices that provide essential out-of-the-box functionality for building computer vision solutions on NVIDIA Jetson Orin. Included in these microservices are AI services with support for generative AI models such as zero-shot detection and state-of-the-art VLMs. Learn more about the feature highlights of Jetson Platform Services in this blog post.
VLMs combine a large language model with a vision transformer, enabling complex reasoning on text and visual input. This flexibility enables VLMs to be used for a variety of use cases, and they can be adapted on the fly by adjusting the prompts.
The VLM of choice on Jetson is VILA, given its state-of-the-art reasoning capabilities and its speed, achieved by optimizing the number of tokens per image. An overview of the VILA architecture and its benchmark performance is shown in Figure 3.
Read more about VILA and its performance on Jetson in the post Visual Language Intelligence and Edge AI 2.0.
While VLMs are fun to experiment with and enable interactive conversations about input images, the real value comes from applying this technology in practical scenarios: making the models perform helpful tasks and incorporating them into larger systems. By combining VLMs with Jetson Platform Services, we can create a VLM-based visual AI agent application that detects events on a live-streaming camera and sends notifications to the user through a mobile app.
The application is powered by generative AI and uses several components from Jetson Platform Services. Figure 4 illustrates how these components work together to create the full system. The system can also be used with the firewall, IoT gateway, and cloud services for secure remote access.
Building a VLM-based visual AI agent application
The following sections walk through the high-level steps of building a visual AI agent system with Jetson Platform Services. The full source code for this application is on GitHub.
VLM AI service
The first step is to build a microservice around the VLM.
VLM support on Jetson Orin is provided by the NanoLLM project. We can use the NanoLLM library to download, quantize, and run VLMs on Jetson through a Python API and then turn the model into a microservice, as shown in Figure 4.
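As a rough sketch of what the NanoLLM Python API looks like, loading VILA and running it on a single frame might resemble the following. The model name, backend, and keyword arguments here are illustrative assumptions and may vary between NanoLLM releases.

# Illustrative only: model name, backend, and keyword arguments are assumptions
# and may differ between NanoLLM releases.
from nano_llm import NanoLLM, ChatHistory

model = NanoLLM.from_pretrained(
    "Efficient-Large-Model/VILA1.5-3b",  # assumed model identifier
    api="mlc",                           # assumed inference backend
    quantization="q4f16_ft",             # assumed quantization mode
)

chat = ChatHistory(model)
chat.append(role="user", image="frame.jpg")  # a frame pulled from the stream
chat.append(role="user", msg="Is there a fire in this image?")

embedding, _ = chat.embed_chat()
reply = model.generate(embedding, kv_cache=chat.kv_cache, streaming=False, max_new_tokens=64)
print(reply)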
We take the following steps:
- Wrap the model in easy-to-call functions.
- Add REST API and WebSocket using FastAPI.
- Add RTSP stream input and output using mmj_utils.
- Output metadata to desired channels such as Prometheus, WebSocket, or Redis.
The microservice then has a main loop that retrieves a frame, updates the prompt from the REST API, calls the model, and then outputs the results. This is captured in the following pseudocode:
# Add REST API
api_server = APIServer(prompt_queue)
api_server.start()

# Add monitoring metrics
prometheus_metric = Gauge()
prometheus.start_http_server()

# Add RTSP I/O
v_input = VideoSource(rtsp_input)
v_output = VideoOutput(rtsp_output)

# Load model
model = load_model()

while True:
    # Update image and prompt
    image = v_input.capture()
    prompt = prompt_queue.get()

    # Run model inference
    model_output = model.predict(image, prompt)

    # Generate outputs
    metadata = generate_metadata(image, model_output)
    overlay = generate_overlay(image, model_output)

    # Output to Redis, monitoring, and RTSP
    redis_server.xadd(metadata)
    prometheus_metric.set(metadata)
    v_output.render(overlay)
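As an illustration of the REST API step, the APIServer in this pseudocode could be a thin FastAPI wrapper that pushes incoming alert rules onto the shared queue consumed by the main loop. The endpoint path, payload fields, and port below are assumptions for this sketch, not the exact API of the reference application.

# Illustrative APIServer sketch; endpoint path, payload, and port are assumptions.
from queue import Queue
from threading import Thread

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

class AlertRequest(BaseModel):
    alerts: list[str]  # e.g. ["Is there a fire", "Is there smoke"]

class APIServer:
    def __init__(self, prompt_queue: Queue):
        self.prompt_queue = prompt_queue
        self.app = FastAPI()
        self.app.post("/alerts")(self.set_alerts)

    def set_alerts(self, request: AlertRequest):
        # The main loop pops this list and folds it into the VLM prompt.
        self.prompt_queue.put(request.alerts)
        return {"status": "alerts updated"}

    def start(self):
        # Serve in a background thread so the main loop keeps running.
        Thread(
            target=uvicorn.run,
            args=(self.app,),
            kwargs={"host": "0.0.0.0", "port": 5010},
            daemon=True,
        ).start()

prompt_queue = Queue()
api_server = APIServer(prompt_queue)
api_server.start()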
We provide a utility library to use as a starting point for integrating many of these common components, as well as full reference examples, on GitHub.
Prompt engineering
VLMs are prompted with three main components: the system prompt, the user prompt, and the input frame, as shown in Figure 5. We can adjust the system and user prompts of the VLM to teach it how to evaluate alerts on a live stream and output the results in a structured format that can be parsed and integrated with other services.
In this example, we use the system prompt to explain the output format and the goal of the model. The system prompt tells the model that the user will supply a list of alerts, and that it should evaluate each alert as either true or false on the input frame and output the results in JSON format.
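For illustration, assembling the full prompt might look something like the sketch below; the exact wording used in the reference application may differ.

# Illustrative prompt assembly; the wording in the reference application may differ.
system_prompt = (
    "You are an AI assistant monitoring a live camera stream. "
    "The user will provide a list of alert conditions. For each alert, "
    "decide whether it is true or false for the current frame and respond "
    'only with JSON in the form {"alert text": true or false}.'
)

user_alerts = ["Is there a fire", "Is there smoke"]
user_prompt = "Evaluate these alerts: " + ", ".join(user_alerts)

# system_prompt, user_prompt, and the current frame are then passed to the VLM.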
The user prompt can then be supplied through the REST API. An endpoint is exposed that enables query and alert input. The user input is combined with the system prompt and given to the VLM along with a frame from the input live stream. The VLM then evaluates the full prompt on the frame and generates a response.
This response is parsed as JSON and then integrated with the alert monitoring service and WebSockets to track alerts and send them to the mobile app.
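A minimal sketch of that parsing step, assuming the model follows the JSON format requested by the system prompt, might look like this:

import json

def parse_alert_response(reply: str, alerts: list[str]) -> dict[str, bool]:
    # Parse the VLM reply into {alert: bool}; default to False if the reply
    # is not valid JSON or an alert is missing from it.
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        return {alert: False for alert in alerts}
    return {alert: bool(parsed.get(alert, False)) for alert in alerts}

states = parse_alert_response('{"Is there a fire": true}', ["Is there a fire"])
# states == {"Is there a fire": True}, ready to forward to Redis or a WebSocket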
Integration with Jetson Platform Services and a mobile app
The full end-to-end system can now come together and integrate with a mobile app to build the VLM-powered visual AI agent. Figure 6 shows the architecture diagram for the VLM, Jetson Platform Services, cloud, and mobile app.
To get video input for the VLM, the Jetson Platform Services networking service and the Video Storage Toolkit (VST) automatically discover and serve IP cameras connected to the network. The streams are then made available to the VLM service and the mobile app through the VST REST APIs.
The APIs exposed by VST and the VLM service are accessed by the mobile app through the API Gateway. The mobile app can now use the VST APIs to get a list of live streams and present a preview of them to the user on the app’s home screen.
From the app, the user can then set custom alerts in natural language, such as "Is there a fire," on their selected live stream. Once submitted, the app calls the stream control API of the VLM service to tell it which live-streaming camera to use as input. It then calls the alert API to set the alert rules for the VLM. Once these two requests are received, the VLM begins evaluating the alert rules on the live stream.
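From the client side, those two requests might look roughly like the following; the endpoint paths, payload fields, and port are placeholders rather than the exact API of the reference application.

# Hypothetical client-side calls; endpoint paths, fields, and port are placeholders.
import requests

VLM_SERVICE = "http://localhost:5010"  # in practice, the Jetson address behind the API gateway

# 1. Tell the VLM service which live stream (discovered by VST) to use as input.
requests.post(
    f"{VLM_SERVICE}/stream",
    json={"stream_url": "rtsp://camera-address:554/stream1"},
    timeout=10,
)

# 2. Set the natural-language alert rules for the VLM to evaluate.
requests.post(
    f"{VLM_SERVICE}/alerts",
    json={"alerts": ["Is there a fire"]},
    timeout=10,
)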
When the VLM determines an alert is True, it outputs the alert state on a WebSocket connected to the mobile app. This will then trigger a popup notification on the mobile device that the user can click on to enter chat mode and ask follow-up questions.
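On the service side, broadcasting alert states to connected clients could be sketched as follows, again assuming a FastAPI-based service; the actual protocol between the reference service and the mobile app may differ.

# Illustrative WebSocket broadcast of alert states, assuming a FastAPI service.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()
clients: set[WebSocket] = set()

@app.websocket("/ws/alerts")
async def alert_socket(websocket: WebSocket):
    await websocket.accept()
    clients.add(websocket)
    try:
        while True:
            await websocket.receive_text()  # keep the connection open
    except WebSocketDisconnect:
        pass
    finally:
        clients.discard(websocket)

async def broadcast_alert(states: dict[str, bool]):
    # Called by the main loop whenever an alert evaluates to true.
    for ws in list(clients):
        await ws.send_json(states)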
As shown in Figure 7, the user can chat back and forth with the VLM about the input live stream and even directly view the live stream in the app using WebRTC from VST.
With the VLM, Jetson Platform Services, and mobile app, you can now set any custom alerts on live-streaming cameras connected to your Jetson and get real-time notifications.
Conclusion
This blog post discussed how VLMs can be combined with Jetson Platform Services to build a visual AI agent. Go to the Jetson Platform Services product page to get started. We provide a prebuilt container to launch the VLM AI service, along with a prebuilt mobile app APK for Android.
The full source code for the VLM AI service is available on GitHub. This is a great reference to learn how to use VLMs and build your own microservices. For technical questions, visit the forum.
For more information, visit: