In today’s data-driven world, organizations increasingly rely on video to capture critical information, yet extracting meaningful, real-time insights from massive amounts of footage remains a challenge. The NVIDIA Metropolis Blueprint for video search and summarization (VSS) overcomes this hurdle by transforming millions of live video streams or hours of recorded video into instantly searchable, actionable intelligence.
VSS provides a reference architecture for building video analytics AI agents that perceive, reason, and act in real time on massive volumes of live video streams and recorded data. It uses accelerated vision-based microservices, vision-language models (VLMs), large language models (LLMs), and retrievers for real-time video intelligence, agentic search, and automated reporting. VSS helps enterprises monitor operations, detect trends, and make informed decisions faster than ever. The latest version of VSS brings a new modular design, an advanced fusion search capability, and a set of skills for easy integration with autonomous agents.
In this post, you will learn how to use the new VSS skills with coding agents to automate VSS deployment and integrate VSS into custom applications, building autonomous video analytics AI agents. Then, we take a deep dive into the technology behind VSS 3.
You can also join us live on Wednesday, May 13, at 9 am PT, to learn how to build a video analytics AI agent with VSS skills.

Build a video AI agent with VSS skills and coding agents
In the past, developers had to manually configure, deploy, and integrate the rich set of microservices VSS provides for video management, search, summarization, and more to build video analytics applications. Today, you can use coding agents augmented with VSS skills to automate the deployment, usage, and integration of VSS, all through a simple agentic chat interface.
VSS skills are hosted in the VSS GitHub repository and follow the agent skills specification, so they can be used with a wide variety of agents. The prerequisites for using these skills are a system set up to run VSS and a skills-compatible agent such as Codex, Claude Code, OpenClaw, or NemoClaw.
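Each skill lives in its own folder with a SKILL.md file that tells the agent when and how to use the skill. Below is a rough sketch of the repository layout, inferred from the paths used later in this post; the skill folder name is a placeholder:

```
~/video-search-and-summarization/skills/
├── README.md         # catalog describing the available skills
└── <skill-name>/     # one folder per skill
    └── SKILL.md      # name, description, and instructions the agent follows
```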
First, we will show an example of how to add VSS skills to Codex and use it to deploy the VSS search profile. Then, we will show how to add VSS skills to OpenClaw, which will allow you to interact with your VSS deployment through nearly any chat interface to search and analyze large volumes of video.
Setting up the VSS prerequisites
The first step is to prepare a system to run VSS. The easiest way to do this is to use the NVIDIA Brev Launchable for VSS. Go to the VSS launchable documentation page, click the “Launch Blueprint” button, and then click “Deploy Launchable.”
Once deployed, click the Open Notebook button and navigate to the /video-search-and-summarization/scripts/deploy_vss_launchable.ipynb notebook. Paste your NGC_CLI_API_KEY from NGC into the first cell, then execute the entire notebook, including the tear-down section. This ensures the system is fully set up for VSS, so you can use the deployment skill to manage your VSS deployment from your coding agent.
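If you need the key outside the notebook, it is just an environment variable; a minimal sketch (the key value is a placeholder):

```bash
# Personal NGC API keys are generated at ngc.nvidia.com (Setup > Generate API Key)
export NGC_CLI_API_KEY="nvapi-..."   # placeholder; paste your actual key
```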
Once the notebook has run to completion, install the Brev CLI on your host system, launch VSCode, and remotely connect to your Brev instance by following the Using Brev CLI (SSH) section on your Launchable page, as shown in Figure 2, below.
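As a rough sketch, the CLI flow looks like the following; refer to your Launchable page for the exact install command and your instance name (the name below is a placeholder):

```bash
brev login                      # authenticate the CLI through your browser
brev ls                         # list your instances and find the VSS one
brev shell <your-instance-name> # open an SSH session to the instance
```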

Once you have remote access configured, you can install Codex through the VSCode extension to use as the coding agent.
Deploying VSS with Codex
In VSCode, use the extensions tab to search for and install Codex. Once installed, you need to install the VSS skills. You can do this by telling Codex to self-install the VSS skills and providing it with the location of the VSS GitHub repository, as shown in the following prompt:
Read ~/video-search-and-summarization/skills/README.md and every SKILL.md file under ~/video-search-and-summarization/skills/. For each skill in the catalog, install it for this host so I can invoke it from a shell or chat session. Use the host's standard skills directory:
Claude Code: ~/.claude/skills/<name>/
Codex: ~/.codex/skills/<name>/
Hosts that follow the agentskills.io universal path: ~/.agents/skills/<name>/
Symlink each skill folder rather than copying it so a git pull here keeps every install up to date. Skip skills that are already installed and pointing at this checkout. When you're done, list the skills you registered and which directory you used.
Figure 3, below, shows how the agent will respond, verifying that it can access the VSS skills.
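Under the hood, the install that this prompt requests amounts to symlinking each skill folder from the checkout into the host’s skills directory. A minimal sketch for Codex (the other hosts differ only in the target directory):

```bash
# Symlink every skill from the repo into Codex's skills directory so that
# a `git pull` in the checkout updates every installed skill in place
mkdir -p ~/.codex/skills
for skill in ~/video-search-and-summarization/skills/*/; do
  name=$(basename "$skill")
  ln -sfn "${skill%/}" ~/.codex/skills/"$name"
done
ls -l ~/.codex/skills   # verify the symlinks point at the checkout
```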

Once your agent is loaded with the VSS skills, you can use it to deploy the various VSS components and profiles. For example, you can ask Codex to deploy the new VSS search profile, as shown in Figure 4, below.

Codex will then plan out the deployment, configure the necessary environment variables, and deploy all the containers needed to enable the VSS search capability. From here, you can continue using Codex to interact with VSS for searching videos, or continue to the next section to see how to also use OpenClaw with VSS skills.
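Behind the chat interface, the deployment skill drives the blueprint’s Docker Compose profile system, covered in more detail later in this post. As a rough sketch of the manual equivalent, assuming the compose files live in the repository checkout:

```bash
# Hypothetical manual equivalent of what the deployment skill automates;
# the compose file location inside the repo is an assumption
cd ~/video-search-and-summarization
export NGC_CLI_API_KEY="nvapi-..."      # placeholder; same key used in the notebook
docker compose --profile search up -d   # start the containers for the search profile
docker compose ps                       # confirm the containers are up and healthy
```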
Searching videos with VSS and OpenClaw
With the search profile running, you can install and configure OpenClaw to act as an autonomous agent for analyzing videos using VSS.
We will show you how to set up OpenClaw on the Brev system to see what a powerful autonomous agent can do. Follow the standard OpenClaw installation instructions from the VSCode terminal connected to the Brev instance, using the recommended installer script.
After running through the initial configuration, you can hatch your agent, shown in Figure 5, below, and give it some context that it will be an agent for building video analytics applications using VSS.

After the initial setup, you need to provide OpenClaw with the VSS skills. The easiest way to do this is to copy the skills into the OpenClaw workspace manually:
mkdir -p ~/.openclaw/workspace/skills
cp -r ~/video-search-and-summarization/skills/* ~/.openclaw/workspace/skills
Now, open up the OpenClaw UI by running the openclaw dashboard command in the terminal, which will return a clickable link to access the OpenClaw UI. Once opened, you can verify that OpenClaw has access to the VSS Skills.

Now you can tell OpenClaw to use the VSS search profile deployed in the previous section to start analyzing large volumes of video data. For this example, you will provide a path to three 10-minute videos captured in a warehouse that need to be analyzed for safe ladder usage. You want OpenClaw to use the search capability to find all instances of ladder usage in the videos and verify that the worker is wearing a hardhat and safety vest. For this, use the following prompt:
I have a set of warehouse videos located at ~/warehouse_videos. I need to find any instances of a worker climbing a ladder and verify they are wearing a hardhat and safety vest. Can you do this with the VSS Search profile that is deployed?
Once prompted, OpenClaw will start working behind the scenes to figure out the necessary skills and associated tool calls it needs to make to complete the task.
OpenClaw makes use of the VSS skills to upload your video files to VIOS, ingest the videos through the embedding microservices to generate searchable indexes, and then use the fusion search capability in VSS to find the video clips where a worker wearing a hardhat and safety vest is climbing a ladder.
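Conceptually, the skills wrap VSS’s REST endpoints for file upload, ingestion, and search. The sketch below illustrates only the shape of that flow; the base URL, endpoint paths, and payload fields are assumptions, not the documented VSS API:

```bash
BASE=http://localhost:8100   # placeholder base URL for the VSS API

# 1. Upload a video so VSS can ingest it (hypothetical endpoint and fields)
curl -s -X POST "$BASE/files" -F "file=@warehouse_cam1.mp4"

# 2. VSS ingests the file through the embedding microservices to build its indexes

# 3. Run a fusion search over the indexed videos (hypothetical endpoint)
curl -s -X POST "$BASE/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "worker wearing a hardhat and safety vest climbing a ladder"}'
```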

Once it’s done, OpenClaw returns a concise report of all ladder usage seen across the videos as well as screenshots from the videos.
This section covered just one simple example: using Codex for deployment and OpenClaw for video analysis with VSS skills. Augmenting agents with VSS skills opens up endless possibilities for gaining valuable insights from video data and building new applications with VSS.
Now you can dive deeper into the technology that powers the rich set of video analysis capabilities in VSS 3.
Smarter video: From alerts to search
Large-scale video search remains one of the most challenging frontiers in modern information retrieval. User queries are inherently complex and ambiguous—capturing full semantic intent within a single visual embedding is fundamentally insufficient, particularly when objects and events carry multi-layered attributes that resist simple vector representation.
At massive scale, locating a specific moment across millions of hours of footage becomes a true “needle in a haystack” problem, where nearest-neighbor search over a monolithic embedding space quickly degrades in both precision and recall.
Addressing these limitations requires a more sophisticated search architecture built on two core capabilities:
- Multi-type embedding extraction and retrieval, combined with relevance filtering and semantic deduplication.
- Search orchestration driven by agentic reasoning: decomposing complex queries into tractable sub-queries, applying reasoning-based retrieval strategies at each step, and running iterative verification and reflection loops to progressively refine results. For example, “a worker climbing a ladder without a hardhat” decomposes into a ladder-climbing retrieval step followed by per-clip verification of protective equipment.
The search architecture first uses RTVI-CV with embedding and RTVI-embedding microservices to ingest video and extract features. The VSS agent then uses this feature data and vision-aware tools to perform a deep, iterative search on video, creating a plan and retrieving results to locate specific objects or events in the video timeline.

Modular architecture brings high flexibility and performance
VSS is designed around a Docker Compose-based modular developer profile system: a base agent deploys in under five minutes, and additional workflows are layered on top as needed, as sketched after the table below.
| Workflow | Profile | Core Capability |
| --- | --- | --- |
| Base / Q&A | base | VLM-based Q&A and report generation on short clips |
| Alert Verification | alerts (verification) | CV pipeline + Behavior Analytics + VLM verification |
| Real-Time VLM Alerts | alerts (VLM) | Continuous VLM anomaly detection on live streams |
| Search | search | Agentic multi-embedding search across video archives |
| Video Summarization | lvs | Chunked summarization of extended recordings |
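Because the profiles map to Docker Compose profiles, workflows can be layered by enabling multiple profiles on one deployment. A hedged one-line sketch, using the profile names from the table above (the compose invocation details are assumptions, as in the earlier deployment sketch):

```bash
# Bring up the base Q&A agent and layer the search workflow on top of it;
# docker compose accepts repeated --profile flags in a single invocation
docker compose --profile base --profile search up -d
```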
Each workflow is supported on several types of GPUs in various configurations to meet your hardware and performance needs.
Let’s look at some benchmarks for the various workflows and configurations.
The agentic search workflow can be characterized by its maximum number of concurrent input streams, the time it takes to ingest the incoming streams, and the retrieval latency to receive a search result. Table 2, below, shows these metrics on single-GPU configurations for the H100 and NVIDIA RTX PRO 6000.
| GPU | Max Concurrent Streams | Max Ingestion Latency (s) | Retrieval Latency (s) |
| --- | --- | --- | --- |
| 1x H100 | 33 | 0.079 | 2.24 |
| 1x RTX PRO 6000 | 51 | 0.101 | 1.87 |
For the alert verification workflow, the maximum number of concurrent streams is measured along with the latency for the verification to take place. Table 3, below, shows these metrics measured using RT-DETR as the detector and Cosmos Reason 2 as the VLM verifier, operating on streams with an average of one alert event per minute.
| GPU | Max Concurrent Streams | Verification Latency (s) |
| --- | --- | --- |
| 1x DGX Spark / 1x AGX Thor | 14 | 0.89 |
| 1x H100 | 147 | 1.01 |
| 1x RTX PRO 6000 | 87 | 0.82 |
The long video summarization (LVS) microservice rapidly produces summaries of hours of video footage. The figure below shows the time it takes for a given GPU configuration to summarize an hour-long video. Scaling the LVS microservice to multiple GPUs can greatly decrease the summarization time.

Get started with VSS skills
VSS skills enable developers to transform video into searchable, meaningful data using natural language—making it easier to uncover insights, generate summaries, and build smarter applications.
To dive deeper into VSS, see the documentation. Explore all the VSS skills on GitHub.
For technical questions, visit our forum.