This post was originally published July 29, 2024 but has been extensively revised with NVIDIA AI Blueprint information.
Traditional video analytics applications and their development workflow are typically built on fixed-function, limited models that are designed to detect and identify only a select set of predefined objects.
With generative AI, NVIDIA NIM microservices, and foundation models, you can now build applications with fewer models that have broad perception and rich contextual understanding.
The new class of generative AI models, vision language models (VLMs), powers visual AI agents that can understand natural language prompts and perform visual question answering. By combining VLMs, LLMs, and the latest Graph-RAG techniques, you can build a powerful visual AI agent capable of long-form video understanding.
These visual AI agents will be deployed throughout factories, warehouses, retail stores, airports, traffic intersections, and more. They’ll help operations teams make better decisions using richer insights generated from natural interactions.
In this post, we show you how to seamlessly build an AI agent for long-form video understanding using NVIDIA AI Blueprint for Video Search and Summarization. You can apply for early access to this new AI Blueprint.
Releasing NVIDIA AI Blueprint for Video Search and Summarization
NVIDIA AI Blueprints, powered by NVIDIA NIM, are reference workflows for canonical generative AI use cases. NVIDIA NIM is a set of microservices that includes industry-standard APIs, domain-specific code, optimized inference engines, and enterprise runtime. It delivers multiple VLMs for building a visual AI agent that can process live or archived images or videos to extract actionable insight using natural language.
The new AI Blueprint for Video Search and Summarization accelerates the development of visual AI agents by providing a recipe for long-form video understanding using VLMs, LLMs, and the latest RAG techniques.
To interact with the agent, a set of easy-to-use REST APIs is available that enables video summarization, interactive Q&A over videos, and custom alerts on live streams to find specific events. These REST APIs can be used to integrate the agent into your own application, and they are also used by the reference UI for quick testing.
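To give a rough sense of what such an integration could look like, here is a minimal sketch using Python's requests library. The base URL, endpoint paths, and field names are assumptions for illustration; refer to the blueprint's API reference for the actual schema.

```python
# Illustrative only: the base URL, endpoint paths, and field names below are
# assumptions, not the blueprint's documented API.
import requests

BASE_URL = "http://localhost:8100"  # assumed address of the agent's REST server

# Upload a video file so the agent can ingest it (hypothetical endpoint).
with open("warehouse.mp4", "rb") as f:
    upload = requests.post(f"{BASE_URL}/files", files={"file": f})
video_id = upload.json()["id"]  # assumed response field

# Request a summary of the ingested video (hypothetical endpoint and fields).
summary = requests.post(
    f"{BASE_URL}/summarize",
    json={"id": video_id, "prompt": "Summarize notable events in the video."},
)
print(summary.json())
```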
The models used in the blueprint can come from the NVIDIA API Catalog of model preview APIs and downloadable NIM microservices. For example, the AI Blueprint uses the NVIDIA-hosted llama-3_1-70b-instruct NIM microservice as the LLM for NVIDIA NeMo Guardrails, Context-Aware RAG (CA-RAG), and Graph-RAG modules. You can choose from a wide range of different LLMs and VLMs from the API Catalog, either NVIDIA-hosted or locally deployed.
Visual AI agent for video search and summarization
Building a visual AI agent capable of understanding long-form videos requires VLMs and LLMs working together with datastores. The blueprint provides a recipe for combining all of these components into scalable, GPU-accelerated video understanding agents that can perform tasks such as summarization, Q&A, and detecting events on live streaming video.
The blueprint consists of the following components:
- Stream handler: Manages the interaction and synchronization with the other components such as NeMo Guardrails, CA-RAG, the VLM pipeline, chunking, and the Milvus Vector DB.
- NeMo Guardrails: Filters out invalid user prompts. It makes use of the REST API of an LLM NIM microservice.
- VLM pipeline: Decodes video chunks generated by the stream handler, generates embeddings for the video chunks using an NVIDIA TensorRT-based visual encoder model, and then uses a VLM to generate a per-chunk response for the user query. It is based on the NVIDIA DeepStream SDK.
- VectorDB: Stores the intermediate per-chunk VLM response.
- CA-RAG module: Extracts useful information from the per-chunk VLM responses and aggregates it to generate a single unified summary. CA-RAG (Context-Aware Retrieval-Augmented Generation) uses the REST API of an LLM NIM microservice.
- Graph-RAG module: Captures the complex relationships present in the video and stores important information in a graph database as sets of nodes and edges. This is then queried by an LLM for interactive Q&A.
Here’s more information about the video ingestion and retrieval pipeline and how the blueprint is capable of summarization, Q&A, and alerts over live streams and long videos.
Video ingestion
To summarize a video or perform Q&A, a comprehensive index of the video must be built that captures all the important information. This is done by combining VLMs and LLMs to produce dense captions and metadata to build a knowledge graph of the video. This video ingestion pipeline is GPU-accelerated and scales with more GPUs to lower processing time.
VLM pipeline and CA-RAG
Most VLMs today accept only a limited number of frames, for example, 8, 10, or 100, and they can't accurately generate captions for longer videos. For longer content, such as hour-long videos, the sampled frames could be tens of seconds apart or even farther, which can result in details being missed or actions going unrecognized.
A solution to this problem is to create smaller chunks from long videos, analyze the chunks individually using VLMs to produce dense captions, and then summarize and aggregate the results to generate a single summary for the entire file. This part of the ingestion process is handled by the VLM pipeline and CA-RAG module.
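A minimal sketch of this chunk-caption-aggregate pattern is shown below. The helper functions stand in for VLM and LLM calls and are placeholders, not part of the blueprint.

```python
# Minimal sketch of the chunk -> caption -> aggregate pattern. caption_chunk()
# and summarize_captions() are placeholders for VLM and LLM calls, not
# blueprint APIs.
from typing import List, Tuple

def split_into_chunks(total_s: int, chunk_duration_s: int) -> List[Tuple[int, int]]:
    """Cover [0, total_s) with consecutive chunk_duration_s-second windows."""
    return [
        (start, min(start + chunk_duration_s, total_s))
        for start in range(0, total_s, chunk_duration_s)
    ]

def caption_chunk(video_path: str, start_s: int, end_s: int) -> str:
    """Placeholder for a VLM call: sample N frames from the window and ask
    for a dense caption of what happens in it."""
    return f"caption for {video_path} [{start_s}-{end_s}s]"

def summarize_captions(captions: List[str]) -> str:
    """Placeholder for an LLM call that aggregates the per-chunk captions."""
    return " ".join(captions)

def summarize_video(video_path: str, total_s: int, chunk_duration_s: int = 60) -> str:
    captions = [
        caption_chunk(video_path, start, end)
        for start, end in split_into_chunks(total_s, chunk_duration_s)
    ]
    return summarize_captions(captions)

print(summarize_video("warehouse.mp4", total_s=3600, chunk_duration_s=60))
```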
This strategy of chunking and captioning can also be applied to live streams. The blueprint includes a streaming pipeline that receives streaming data from an RTSP server. The NVIDIA AI Blueprint continuously generates video-chunk segments based on the user-configured chunk duration. The VLM pipeline then generates the captions for these chunks.
The NVIDIA AI Blueprint continuously gathers the captions from the VLM pipeline. When enough chunks have been processed to cover the user-configured summary duration, the gathered captions are sent to CA-RAG for summarization and aggregation while the blueprint continues processing the next chunks. The summaries are streamed to the client using HTTP server-sent events.
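For illustration, a client might consume these server-sent events along the following lines; the endpoint path, query parameter, and event payload format are assumptions, not the blueprint's documented interface.

```python
# Illustrative only: the endpoint path, query parameter, and event payload
# format are assumptions, not the blueprint's documented interface.
import requests

BASE_URL = "http://localhost:8100"   # assumed agent address
STREAM_ID = "dock-camera-01"         # hypothetical registered live stream

with requests.get(
    f"{BASE_URL}/summarize",
    params={"stream_id": STREAM_ID},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        # Server-sent events arrive as "data: ..." lines separated by blanks.
        if line and line.startswith("data:"):
            print("summary event:", line[len("data:"):].strip())
```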
Knowledge graph and Graph-RAG module
To capture the complex information produced by the VLM, a knowledge graph is built and stored during video ingestion. An LLM converts the dense captions into a set of nodes, edges, and associated properties, which are stored in a graph database. Using Graph-RAG techniques, an LLM can access this information to extract key insights for summarization, Q&A, and alerts, going beyond what VLMs are capable of on their own.
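The sketch below illustrates the general idea of turning dense captions into graph nodes and edges. The triple-extraction step stands in for an LLM call and the in-memory graph stands in for a real graph database; both are placeholders rather than blueprint components.

```python
# Illustrative sketch: extract_triples() stands in for an LLM call that emits
# (subject, relation, object) triples, and TinyGraph stands in for a real
# graph database. Neither is part of the blueprint.
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def extract_triples(caption: str) -> List[Triple]:
    """Placeholder: prompt an LLM to turn a dense caption into triples."""
    return [("forklift", "enters", "loading dock")]  # illustrative output only

class TinyGraph:
    """Toy in-memory stand-in for a graph database."""
    def __init__(self) -> None:
        self.nodes: set = set()
        self.edges: List[Triple] = []

    def add(self, triple: Triple) -> None:
        subject, _, obj = triple
        self.nodes.update({subject, obj})
        self.edges.append(triple)

graph = TinyGraph()
for caption in ["A forklift enters the loading dock at 10:02."]:
    for triple in extract_triples(caption):
        graph.add(triple)

print(graph.nodes)
print(graph.edges)
```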
Video retrieval
When the video has been ingested, the databases behind the CA-RAG and Graph-RAG modules contain an immense amount of information about the objects, events, and descriptions of what occurred in the video. This information can be queried and consumed by an LLM for several tasks, including summarization, Q&A, and alerts.
For each of these tasks, the blueprint exposes simple REST APIs that can be called to integrate with your application. A reference UI is also provided to enable you to quickly experiment with the features of the blueprint and tune the agent with several configuration options.
Summarization
When a video file has been uploaded to the agent through the APIs, call the summarize endpoint to get a summary of the video. The blueprint takes care of all the heavy lifting while providing a lot of configurable parameters.
When submitting the summarize request, several prompts are used to tune the output. These control the VLM dense captioning and the LLM-based caption aggregation that produce the final summary; a sketch of such a request follows the list below.
- Prompt (VLM): Prompt given to the VLM to produce dense captions. This prompt can be tuned to tell the VLM exactly what type of objects, events, and actions it should pay attention to.
- Caption summarization (LLM): An LLM prompt used to combine the VLM captions. This can be used to control how fine-grained the captions should be and the level of detail to include.
- Summary aggregation (LLM): Produces the final summary output based on the aggregated captions. This prompt should be tuned to specify an output format, length of the summary, and a list of any key pieces of information that should be included in the output.
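As a rough illustration of where these prompts might plug into a summarize request, consider the following sketch; the endpoint and field names are assumptions, not the documented schema.

```python
# Hypothetical request body: the endpoint and the field names (prompt,
# caption_summarization_prompt, summary_aggregation_prompt) are assumptions
# for illustration, not the blueprint's documented schema.
import requests

payload = {
    "id": "<video-id>",  # ID returned when the file was uploaded
    "prompt": "Describe vehicles, people, and safety events in each segment.",
    "caption_summarization_prompt": "Merge overlapping captions and keep timestamps.",
    "summary_aggregation_prompt": (
        "Produce a bulleted summary under 200 words listing key events in order."
    ),
}
resp = requests.post("http://localhost:8100/summarize", json=payload)
print(resp.json())
```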
In addition to the prompt configuration, the strategy to chunk the video is also important to tune based on your use case. There are a few different options depending on whether the summarization is over a video file or a live stream.
Video files
- chunk_duration: The entire video is divided into chunk_duration-length segments, and N (VLM-dependent) frames are sampled from each chunk and sent to the VLM for inference. The chunk duration should be small enough that the N frames can capture the event.
- chunk_overlap: If an event occurs at the chunk intersection, the sampled frames might not capture the complete event and the model can't detect it. The NVIDIA AI Blueprint alleviates this problem by using a sliding window approach, where chunk_overlap is the overlap duration between the chunks. (Default: 0)
Streams
- chunk_duration: Similar to video files, the live stream is divided into segments of chunk_duration and sent to the VLM for inference. The chunk duration should be small enough that the N frames can capture the event.
- summary_duration: The duration for which the user wants a summary. This enables the user to control the duration of the stream for which the summary should be produced. For instance, if chunk_duration is 1 minute and summary_duration is 30 minutes, the stream is divided into 1-minute chunks for VLM inference, and the VLM output of 30 chunks is aggregated to provide the user with a concise 30-minute summary.
These are just guidelines, and the actual parameters must be tuned for specific use cases. It's a tradeoff between accuracy and performance: smaller chunk sizes result in better descriptions but take longer to process.
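For example, a live-stream summarization request matching the scenario above (1-minute chunks aggregated into a 30-minute summary) might look roughly like this; the endpoint and field names are assumptions.

```python
# Illustrative request matching the example above: 1-minute chunks aggregated
# into a 30-minute summary. The endpoint and field names are assumptions.
import requests

payload = {
    "id": "<live-stream-id>",  # hypothetical ID of a registered live stream
    "chunk_duration": 60,      # seconds per chunk sent to the VLM
    "summary_duration": 1800,  # seconds per summary: 1800 / 60 = 30 chunks
}
resp = requests.post("http://localhost:8100/summarize", json=payload)
print(resp.status_code)
```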
Q&A
The knowledge graph built during video ingestion can be queried by an LLM to provide a natural language interface into the video. This enables users to ask open-ended questions over the input video and have a chatbot experience. In the reference UI, this feature is available after the video has been ingested.
The LLM used to power Q&A is configurable and can be adjusted through the blueprint configuration after deployment. It gives you the control to choose a model that best suits your local deployment or point it to an LLM deployed in the cloud.
The prompts given to the LLM to retrieve the needed information from the knowledge graph are adjustable and can be tuned to improve the accuracy of the responses.
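A hypothetical Q&A request might look like the following; the endpoint path and payload shape are assumptions for illustration.

```python
# Hypothetical Q&A call: the endpoint path and payload shape are assumptions
# for illustration, not the blueprint's documented API.
import requests

question = {
    "id": "<video-id>",
    "messages": [
        {"role": "user", "content": "How many forklifts entered the loading dock?"}
    ],
}
resp = requests.post("http://localhost:8100/chat", json=question)
print(resp.json())
```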
Alerts
In addition to video files, the blueprint can also accept a live video stream as input. For live streaming use cases, it is often critical to know when certain events take place in near real time. To accomplish this, the blueprint enables live streams to be registered and alert rules to be set to monitor the stream. These alert rules are written in natural language and trigger notifications when user-defined events occur.
For example, a camera in a forest could be configured with alert rules to detect when animals come into view or a fire breaks out. Once the stream is registered and the alert rules are set, the agent monitors the stream. If it detects that any of the alert rules are true, it triggers a notification that can be received through the APIs.
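To illustrate, registering alert rules and retrieving triggered notifications might look roughly like this; the endpoints, fields, and polling approach are assumptions, not the blueprint's documented API.

```python
# Hypothetical alert workflow: the endpoints, fields, and polling approach are
# assumptions for illustration, not the blueprint's documented API.
import requests

BASE_URL = "http://localhost:8100"
STREAM_ID = "<live-stream-id>"  # ID of a registered live stream

# Register natural-language alert rules for the stream.
requests.post(f"{BASE_URL}/alerts", json={
    "id": STREAM_ID,
    "events": ["an animal comes into view", "a fire breaks out"],
})

# Retrieve any notifications the agent has triggered so far.
alerts = requests.get(f"{BASE_URL}/alerts", params={"id": STREAM_ID})
print(alerts.json())
```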
Getting started
Build powerful VLM-based AI agents using NVIDIA AI Blueprint for Video Search and Summarization. The REST APIs make it easy to integrate this workflow and VLMs into existing customer applications. Apply for early access to this AI Blueprint now, and see the Visual AI Agents forum for technical questions.