This post was originally published July 29, 2024 but has been extensively revised with NVIDIA AI Blueprint information.
Traditional video analytics applications and their development workflow are typically built on fixed-function, limited models that are designed to detect and identify only a select set of predefined objects.
With generative AI, NVIDIA NIM microservices, and foundation models, you can now build applications with fewer models that have broad perception and rich contextual understanding.
The new class of generative AI models, vision language models (VLMs), powers visual AI agents that can understand natural language prompts and perform visual question answering. By combining VLMs, LLMs, and the latest Graph-RAG techniques, you can build a powerful visual AI agent capable of long-form video understanding.
These visual AI agents will be deployed throughout factories, warehouses, retail stores, airports, traffic intersections, and more. They’ll help operations teams make better decisions using richer insights generated from natural interactions.
In this post, we show you how to seamlessly build an AI agent for long-form video understanding using NVIDIA AI Blueprint for Video Search and Summarization. You can apply for early access to this new AI Blueprint.
Releasing NVIDIA AI Blueprint for Video Search and Summarization
NVIDIA AI Blueprints, powered by NVIDIA NIM, are reference workflows for canonical generative AI use cases. NVIDIA NIM is a set of microservices that includes industry-standard APIs, domain-specific code, optimized inference engines, and enterprise runtime. It delivers multiple VLMs for building a visual AI agent that can process live or archived images or videos to extract actionable insight using natural language.
The new AI Blueprint for Video Search and Summarization accelerates the development of visual AI agents by providing a recipe for long-form video understanding using VLMs, LLMs, and the latest RAG techniques.
To interact with the agent, a set of easy-to-use REST APIs is available that enables video summarization, interactive Q&A over videos, and custom alerts on live streams to find specific events. These REST APIs can be used to integrate the agent into your own application, and they are also used by the reference UI for quick testing.
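To give a rough sense of what such an integration could look like, here is a minimal sketch using Python's requests library. The base URL, endpoint paths, and field names are assumptions for illustration; refer to the blueprint's API reference for the actual schema.

```python
# Illustrative only: the base URL, endpoint paths, and field names below are
# assumptions, not the blueprint's documented API.
import requests

BASE_URL = "http://localhost:8100"  # assumed address of the agent's REST server

# Upload a video file so the agent can ingest it (hypothetical endpoint).
with open("warehouse.mp4", "rb") as f:
    upload = requests.post(f"{BASE_URL}/files", files={"file": f})
video_id = upload.json()["id"]  # assumed response field

# Request a summary of the ingested video (hypothetical endpoint and fields).
summary = requests.post(
    f"{BASE_URL}/summarize",
    json={"id": video_id, "prompt": "Summarize notable events in the video."},
)
print(summary.json())
```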
The models used in the blueprint can come from the NVIDIA API Catalog of model preview APIs and downloadable NIM microservices. For example, the AI Blueprint uses the NVIDIA-hosted llama-3_1-70b-instruct NIM microservice as the LLM for NVIDIA NeMo Guardrails, Context-Aware RAG (CA-RAG), and Graph-RAG modules. You can choose from a wide range of different LLMs and VLMs from the API Catalog, either NVIDIA-hosted or locally deployed.
Visual AI agent for video search and summarization
Building a visual AI agent capable of understanding long-form videos requires VLMs and LLMs working together with datastores. The blueprint provides a recipe for combining all of these components into scalable, GPU-accelerated video understanding agents that can perform tasks such as summarization, Q&A, and detecting events on live streaming video.
The blueprint consists of the following components:
- Stream handler: Manages the interaction and synchronization with the other components such as NeMo Guardrails, CA-RAG, the VLM pipeline, chunking, and the Milvus Vector DB.
- NeMo Guardrails: Filters out invalid user prompts. It makes use of the REST API of an LLM NIM microservice.
- VLM pipeline: Decodes video chunks generated by the stream handler, generates embeddings for the video chunks using an NVIDIA TensorRT-based visual encoder model, and then uses a VLM to generate a per-chunk response for the user query. It is based on the NVIDIA DeepStream SDK.
- VectorDB: Stores the intermediate per-chunk VLM response.
- CA-RAG module: Extracts useful information from the per-chunk VLM responses and aggregates it to generate a single unified summary. CA-RAG (Context-Aware Retrieval-Augmented Generation) uses the REST API of an LLM NIM microservice.
- Graph-RAG module: Captures the complex relationships present in the video and stores important information in a graph database as sets of nodes and edges. This is then queried by an LLM for interactive Q&A.
Here’s more information about the video ingestion and retrieval pipeline and how the blueprint is capable of summarization, Q&A, and alerts over live streams and long videos.
Video ingestion
To summarize a video or perform Q&A, a comprehensive index of the video must be built that captures all the important information. This is done by combining VLMs and LLMs to produce dense captions and metadata to build a knowledge graph of the video. This video ingestion pipeline is GPU-accelerated and scales with more GPUs to lower processing time.
VLM pipeline and CA-RAG
Most VLMs today accept only a limited number of frames, for example, 8, 10, or 100, and they can't accurately generate captions for longer videos. For longer content, such as hour-long videos, the sampled frames could be tens of seconds apart or even farther, which can result in details being missed or actions going unrecognized.
A solution to this problem is to create smaller chunks from long videos, analyze the chunks individually using VLMs to produce dense captions, and then summarize and aggregate the results to generate a single summary for the entire file. This part of the ingestion process is handled by the VLM pipeline and CA-RAG module.
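A minimal sketch of this chunk-caption-aggregate pattern is shown below. The helper functions stand in for VLM and LLM calls and are placeholders, not part of the blueprint.

```python
# Minimal sketch of the chunk -> caption -> aggregate pattern. caption_chunk()
# and summarize_captions() are placeholders for VLM and LLM calls, not
# blueprint APIs.
from typing import List, Tuple

def split_into_chunks(total_s: int, chunk_duration_s: int) -> List[Tuple[int, int]]:
    """Cover [0, total_s) with consecutive chunk_duration_s-second windows."""
    return [
        (start, min(start + chunk_duration_s, total_s))
        for start in range(0, total_s, chunk_duration_s)
    ]

def caption_chunk(video_path: str, start_s: int, end_s: int) -> str:
    """Placeholder for a VLM call: sample N frames from the window and ask
    for a dense caption of what happens in it."""
    return f"caption for {video_path} [{start_s}-{end_s}s]"

def summarize_captions(captions: List[str]) -> str:
    """Placeholder for an LLM call that aggregates the per-chunk captions."""
    return " ".join(captions)

def summarize_video(video_path: str, total_s: int, chunk_duration_s: int = 60) -> str:
    captions = [
        caption_chunk(video_path, start, end)
        for start, end in split_into_chunks(total_s, chunk_duration_s)
    ]
    return summarize_captions(captions)

print(summarize_video("warehouse.mp4", total_s=3600, chunk_duration_s=60))
```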
This strategy of chunking and captioning can also be applied to live streams. The blueprint includes a streaming pipeline that receives streaming data from an RTSP server. The NVIDIA AI Blueprint continuously generates video-chunk segments based on the user-configured chunk duration. The VLM pipeline then generates the captions for these chunks.
The NVIDIA AI Blueprint continuously gathers the captions from the VLM pipeline. When enough chunks have been processed to cover the user-configured summary duration, the gathered captions are sent to CA-RAG for summarization and aggregation while the blueprint continues processing the next chunks. The summaries are streamed to the client using HTTP server-sent events.
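For illustration, a client might consume these server-sent events along the following lines; the endpoint path, query parameter, and event payload format are assumptions, not the blueprint's documented interface.

```python
# Illustrative only: the endpoint path, query parameter, and event payload
# format are assumptions, not the blueprint's documented interface.
import requests

BASE_URL = "http://localhost:8100"   # assumed agent address
STREAM_ID = "dock-camera-01"         # hypothetical registered live stream

with requests.get(
    f"{BASE_URL}/summarize",
    params={"stream_id": STREAM_ID},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        # Server-sent events arrive as "data: ..." lines separated by blanks.
        if line and line.startswith("data:"):
            print("summary event:", line[len("data:"):].strip())
```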
Knowledge graph and Graph-RAG module
To capture the complex information produced by the VLM, a knowledge graph is built and stored during video ingestion. An LLM converts the dense captions into a set of nodes, edges, and associated properties, which are stored in a graph database. Using Graph-RAG techniques, an LLM can access this information to extract key insights for summarization, Q&A, and alerts, going beyond what VLMs are capable of on their own.
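The sketch below illustrates the general idea of turning dense captions into graph nodes and edges. The triple-extraction step stands in for an LLM call and the in-memory graph stands in for a real graph database; both are placeholders rather than blueprint components.

```python
# Illustrative sketch: extract_triples() stands in for an LLM call that emits
# (subject, relation, object) triples, and TinyGraph stands in for a real
# graph database. Neither is part of the blueprint.
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def extract_triples(caption: str) -> List[Triple]:
    """Placeholder: prompt an LLM to turn a dense caption into triples."""
    return [("forklift", "enters", "loading dock")]  # illustrative output only

class TinyGraph:
    """Toy in-memory stand-in for a graph database."""
    def __init__(self) -> None:
        self.nodes: set = set()
        self.edges: List[Triple] = []

    def add(self, triple: Triple) -> None:
        subject, _, obj = triple
        self.nodes.update({subject, obj})
        self.edges.append(triple)

graph = TinyGraph()
for caption in ["A forklift enters the loading dock at 10:02."]:
    for triple in extract_triples(caption):
        graph.add(triple)

print(graph.nodes)
print(graph.edges)
```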
Video retrieval
When the video has been ingested, the databases behind the CA-RAG and Graph-RAG modules contain an immense amount of information about the objects, events, and descriptions of what occurred in the video. This information can be queried and consumed by an LLM for several tasks, including summarization, Q&A, and alerts.
For each of these tasks, the blueprint exposes simple REST APIs that can be called to integrate with your application. A reference UI is also provided to enable you to quickly experiment with the features of the blueprint and tune the agent with several configuration options.
Summarization
When a video file has been uploaded to the agent through the APIs, call the summarize endpoint to get a summary of the video. The blueprint takes care of all the heavy lifting while providing a lot of configurable parameters.
When submitting the summarize request, several prompts are used to tune the output. These control the VLM dense captioning and the LLM-based caption aggregation that produce the final summary; a sketch of such a request follows the list below.
- Prompt (VLM): Prompt given to the VLM to produce dense captions. This prompt can be tuned to tell the VLM exactly what type of objects, events, and actions it should pay attention to.
- Caption summarization (LLM): An LLM prompt used to combine the VLM captions. This can be used to control how fine-grained the captions should be and the level of detail to include.
- Summary aggregation (LLM): Produces the final summary output based on the aggregated captions. This prompt should be tuned to specify an output format, length of the summary, and a list of any key pieces of information that should be included in the output.
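As a rough illustration of where these prompts might plug into a summarize request, consider the following sketch; the endpoint and field names are assumptions, not the documented schema.

```python
# Hypothetical request body: the endpoint and the field names (prompt,
# caption_summarization_prompt, summary_aggregation_prompt) are assumptions
# for illustration, not the blueprint's documented schema.
import requests

payload = {
    "id": "<video-id>",  # ID returned when the file was uploaded
    "prompt": "Describe vehicles, people, and safety events in each segment.",
    "caption_summarization_prompt": "Merge overlapping captions and keep timestamps.",
    "summary_aggregation_prompt": (
        "Produce a bulleted summary under 200 words listing key events in order."
    ),
}
resp = requests.post("http://localhost:8100/summarize", json=payload)
print(resp.json())
```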
In addition to the prompt configuration, the strategy to chunk the video is also important to tune based on your use case. There are a few different options depending on whether the summarization is over a video file or a live stream.
Video files
- chunk_duration: The entire video is divided into chunk_duration-length segments, and N (VLM-dependent) frames are sampled from each chunk and sent to the VLM for inference. The chunk duration should be small enough that the N frames can capture the event.
- chunk_overlap: If an event occurs at the chunk intersection, the sampled frames might not capture the complete event and the model can't detect it. The NVIDIA AI Blueprint alleviates this problem by using a sliding window approach, where chunk_overlap is the overlap duration between the chunks. (Default: 0)
Streams
- chunk_duration: Similar to video files, the live stream is divided into segments of chunk_duration and sent to the VLM for inference. The chunk duration should be small enough that the N frames can capture the event.
- summary_duration: The duration for which the user wants a summary. This enables the user to control the duration of the stream for which the summary should be produced. For instance, if chunk_duration is 1 minute and summary_duration is 30 minutes, the stream is divided into 1-minute chunks for VLM inference, and the VLM output of 30 chunks is aggregated to provide the user with a concise 30-minute summary.
These are just guidelines, and the actual parameters must be tuned for specific use cases. It's a tradeoff between accuracy and performance: smaller chunk sizes result in better descriptions but take longer to process.
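For example, a live-stream summarization request matching the scenario above (1-minute chunks aggregated into a 30-minute summary) might look roughly like this; the endpoint and field names are assumptions.

```python
# Illustrative request matching the example above: 1-minute chunks aggregated
# into a 30-minute summary. The endpoint and field names are assumptions.
import requests

payload = {
    "id": "<live-stream-id>",  # hypothetical ID of a registered live stream
    "chunk_duration": 60,      # seconds per chunk sent to the VLM
    "summary_duration": 1800,  # seconds per summary: 1800 / 60 = 30 chunks
}
resp = requests.post("http://localhost:8100/summarize", json=payload)
print(resp.status_code)
```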
Q&A
The knowledge graph built during video ingestion can be queried by an LLM to provide a natural language interface into the video. This enables users to ask open-ended questions over the input video and have a chatbot experience. In the reference UI, this feature is available after the video has been ingested.
The LLM used to power Q&A is configurable and can be adjusted through the blueprint configuration after deployment. It gives you the control to choose a model that best suits your local deployment or point it to an LLM deployed in the cloud.
The prompts given to the LLM to retrieve the needed information from the knowledge graph are adjustable and can be tuned to improve the accuracy of the responses.
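A hypothetical Q&A request might look like the following; the endpoint path and payload shape are assumptions for illustration.

```python
# Hypothetical Q&A call: the endpoint path and payload shape are assumptions
# for illustration, not the blueprint's documented API.
import requests

question = {
    "id": "<video-id>",
    "messages": [
        {"role": "user", "content": "How many forklifts entered the loading dock?"}
    ],
}
resp = requests.post("http://localhost:8100/chat", json=question)
print(resp.json())
```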
Alerts
In addition to video files, the blueprint can also accept a live video stream as input. For live streaming use cases, it is often critical to know when certain events take place in near real time. To accomplish this, the blueprint enables live streams to be registered and alert rules to be set to monitor the stream. These alert rules are written in natural language and trigger notifications when user-defined events occur.
For example, a camera in a forest could be configured with alert rules to detect when animals come into view or a fire breaks out. Once the stream is registered and the alert rules are set, the agent monitors the stream. If it detects that any of the alert rules are true, it triggers a notification that can be received through the APIs.
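To illustrate, registering alert rules and retrieving triggered notifications might look roughly like this; the endpoints, fields, and polling approach are assumptions, not the blueprint's documented API.

```python
# Hypothetical alert workflow: the endpoints, fields, and polling approach are
# assumptions for illustration, not the blueprint's documented API.
import requests

BASE_URL = "http://localhost:8100"
STREAM_ID = "<live-stream-id>"  # ID of a registered live stream

# Register natural-language alert rules for the stream.
requests.post(f"{BASE_URL}/alerts", json={
    "id": STREAM_ID,
    "events": ["an animal comes into view", "a fire breaks out"],
})

# Retrieve any notifications the agent has triggered so far.
alerts = requests.get(f"{BASE_URL}/alerts", params={"id": STREAM_ID})
print(alerts.json())
```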
Getting started
Build powerful VLM-based AI agents using NVIDIA AI Blueprint for Video Search and Summarization. The REST APIs make it easy to integrate this workflow and VLMs into existing customer applications. Apply for early access to this AI Blueprint now, and see the Visual AI Agents forum for technical questions.