Computer Vision / Video Analytics

Advance Video Analytics AI Agents Using the NVIDIA AI Blueprint for Video Search and Summarization

Vision language models (VLMs) have transformed video analytics by enabling broader perception and richer contextual understanding compared to traditional computer vision (CV) models. However, challenges remain: limited context length restricts how much video a VLM can process at a time, and audio transcription is missing entirely.

To overcome this, the NVIDIA AI Blueprint for video search and summarization (VSS) integrates VLMs, large language models (LLMs), and retrieval-augmented generation (RAG) with efficient ingestion, retrieval, and storage mechanisms that together enable both stored and real-time video analysis. Visual AI agents can be applied to a multitude of use cases, such as monitoring smart spaces, warehouse automation, and standard operating procedure (SOP) validation.

NVIDIA announces a new release and general availability (GA) of the NVIDIA AI Blueprint for video search and summarization. This release includes several new features, including multi-live-stream support, burst-mode ingestion, a customizable CV pipeline, and audio transcription. These updates further streamline the development of video analytics AI agents, providing a comprehensive solution for long-form video understanding.

This post follows a previous one, Build a Video Search and Summarization Agent with NVIDIA AI Blueprint, which provides an overview of this blueprint’s foundational capabilities.

Video 1. Learn how to advance video analytics with AI agents accelerated by NVIDIA NIM and NVIDIA Metropolis

AI agents for advanced video analytics

VSS accelerates the development of video analytics AI agents by providing a recipe for long-form video understanding that combines VLMs, LLMs, and the latest RAG techniques with an efficient video ingestion pipeline. The early access release (v2.0.0) allowed for the ingestion of streamed and recorded videos by a visual agent that provides summaries, performs Q&A, and sends alerts.

This general availability release (v2.3.0) includes the following key features. Figure 1 illustrates the updated architecture diagram reflecting these enhancements.

  • Single-GPU deployment and hardware support expansion: Depending on your performance requirements, VSS can now be deployed across a variety of hardware configurations. For smaller workloads, single-GPU deployment is now supported on NVIDIA A100, H100, and H200 GPUs.
  • Multi-live stream and burst clip modes: Concurrently process hundreds of live streams or prerecorded video files.
  • Audio transcription: Convert speech to text for a multimodal understanding of a scene. This is useful for use cases where audio is a key component, such as instructional videos, keynotes, team meetings, or company training content.
  • Computer vision pipeline: Enhance accuracy by tracking objects in a scene with zero-shot object detection, and by using bounding boxes and segmentation masks with Set-of-Mark (SoM) prompting, which guides the VLM with a predefined set of reference marks or labels so it can reason about individual objects.
  • Contextually aware RAG (CA-RAG) and GraphRAG accuracy and performance improvements: Improve performance with batched summarization and entity extraction, dynamic graph creation during chunk ingestion, and running CA-RAG in a dedicated process with an independent event loop, significantly reducing latency and improving scalability.
An architecture diagram shows two new blocks for audio processing and CV pipeline, compared to the previous architecture. These new blocks are optional and must be enabled during deployment.
Figure 1. High-level architecture of the VSS GA release

Single-GPU deployment

A single-GPU deployment recipe using low memory modes and smaller LLMs has been introduced. It is available on NVIDIA H100, H200, and A100 (80 GB+, HBM) machines, with support for additional GPUs coming soon. This setup is ideal for smaller workloads that do not necessitate multi-GPU environments, offering significant cost savings and simplicity in deployment.

This deployment runs the VLM, LLM, embedding, and reranker models locally on a single GPU. The configuration details are as follows:

  • Model allocation: All models (VLM, LLM, embedding, reranking) are configured to share a single GPU.
  • Memory optimization: Low memory mode and relaxed memory constraints are enabled for the LLM to ensure efficient usage of GPU resources.
  • Model selection: Uses a smaller LLM (Llama 3.1 8B Instruct) specifically chosen for optimal performance in a single-GPU deployment. The VSS engine is set to use the NVILA model for vision tasks.
  • Service initialization: Appropriate init containers are configured to ensure that services start in the correct order.

Multi-live stream and burst clip modes

With the growing demand for real-time video analysis and the necessity to process large volumes of video clips simultaneously, the latest features ensure that deployed AI agents can manage multiple live streams and burst clips to scale video analytics solutions.

With this update, the VSS backend takes care of queuing and scheduling the requests for multiple streams in parallel. With the help of CA-RAG, it also maintains a separate context for each source. Call any of the APIs, including Summarization (POST /summarize) and Q&A (POST /chat/completions), in parallel across different threads or processes for various video files or live streams.

To facilitate multi-stream processing, each chunk of data—whether it is a caption generated by the VLM or an extracted entity—is tagged with a unique stream ID. This stream ID serves as a key identifier, ensuring that all related captions, entities, and relationships remain associated with their respective streams. 

Users have the flexibility to query across all streams by setting multi_channel: true or to restrict their queries to a specific stream by setting multi_channel: false, enabling both broad and targeted analysis.
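The snippet below is a minimal sketch of issuing summarization and Q&A requests for several sources in parallel from a client. The endpoint paths and the multi_channel flag come from the description above; the base URL, request fields such as id and prompt, and the helper functions are illustrative assumptions, not the exact VSS request schema.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

VSS_BASE_URL = "http://localhost:8100"  # assumed address of the VSS backend


def summarize(source_id: str) -> dict:
    """Request a summary for one video file or live stream.

    The request body is illustrative; consult the VSS API reference for the
    exact schema of POST /summarize.
    """
    payload = {
        "id": source_id,  # assumed field: the uploaded file or live-stream ID
        "prompt": "Summarize the key events in this video.",
    }
    response = requests.post(f"{VSS_BASE_URL}/summarize", json=payload, timeout=600)
    response.raise_for_status()
    return response.json()


def ask_across_streams(question: str) -> dict:
    """Q&A across all ingested streams (multi_channel: true) or one stream (false)."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        "multi_channel": True,  # set False and target a specific stream for narrow queries
    }
    response = requests.post(f"{VSS_BASE_URL}/chat/completions", json=payload, timeout=600)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    source_ids = ["stream-entrance", "stream-loading-dock", "clip-0042"]  # hypothetical IDs
    # The VSS backend handles queuing and scheduling; the client only issues
    # the requests concurrently.
    with ThreadPoolExecutor(max_workers=len(source_ids)) as pool:
        summaries = list(pool.map(summarize, source_ids))
    answer = ask_across_streams("Were any safety incidents observed today?")
```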

Audio transcription

NVIDIA empowered blueprint-generated visual agents with the ability to hear, leading to improved contextual understanding and unlocking information not captured by the video frames. This feature greatly improves accuracy on media such as keynotes, lectures, video meetings, and point-of-view footage.

To integrate audio into VSS, we apply techniques similar to those used for video processing. After chunking the video to parallelize ingestion across GPUs, the audio is processed through the following steps (a simplified sketch follows the list):

  • Split the audio from the video clip: Create a separate audio file from the video. 
  • Decode the audio: Each audio chunk is then converted to 16 kHz mono audio. 
  • Process with automatic speech recognition (ASR): The converted audio is then passed to the NVIDIA Riva ASR NIM microservice, which generates the audio transcript for the chunk.
  • Combine audio and visual information: For each chunk, the video description from the VLM and the audio transcript from the ASR service, along with additional metadata, like timestamp information, are sent to the retrieval pipeline for further processing and indexing.
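As a rough illustration of the first two steps, this sketch extracts 16 kHz mono audio from a video chunk with ffmpeg before handing it to an ASR service. The transcribe_chunk function is a hypothetical placeholder for a call to the Riva ASR NIM microservice; none of this is the blueprint's actual ingestion code.

```python
import subprocess
from pathlib import Path


def extract_audio(video_chunk: Path, out_dir: Path) -> Path:
    """Split the audio track from a video chunk and decode it to 16 kHz mono WAV."""
    out_path = out_dir / (video_chunk.stem + ".wav")
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", str(video_chunk),
            "-vn",           # drop the video stream
            "-ac", "1",      # mono
            "-ar", "16000",  # 16 kHz sample rate
            str(out_path),
        ],
        check=True,
    )
    return out_path


def transcribe_chunk(wav_path: Path) -> str:
    """Hypothetical placeholder: send the WAV file to the Riva ASR NIM microservice
    and return the transcript for this chunk."""
    raise NotImplementedError("Wire this to your Riva ASR NIM endpoint.")

# Per chunk, the transcript is later combined with the VLM caption and timestamp
# metadata before being sent to the retrieval pipeline.
```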

The audio processing feature in VSS can be enabled or disabled during initialization. Each summarization request can also be configured to enable or disable audio transcription. This flexibility enables audio transcription in batch processing of video files, as well as online processing of live streams. 

By using the Riva ASR NIM microservice, we can provide state-of-the-art audio features as they are introduced into the NIM microservice. These configuration options ensure that you can tailor the audio processing capabilities to your specific needs, enhancing the overall functionality and adaptability of VSS.

This capability has already been used to enable chat on the NVIDIA GTC keynote, letting users interact with and discuss the content in real time through its audio transcription.

Computer vision pipeline 

Integrating specific CV models with VLMs enhances video analysis by providing detailed metadata on objects, including their positions, masks, and tracking IDs. SoM prompting enables effective visual grounding, allowing VLMs to generate responses based on individual objects rather than overall scenes. This is particularly useful for complex queries involving multiple objects and for understanding the temporal behavior of objects over longer periods using their object IDs.

Video 2. Watch a comparison of prompting with and without CV metadata

The CV and tracking pipeline in VSS is designed to generate comprehensive CV metadata for both videos and live streams. This metadata encompasses detailed information about the objects within the video, such as their positions, masks, and tracking IDs. The pipeline achieves this through the following:

  • Object detection: Each chunk undergoes object detection using a zero-shot object detector (Grounding DINO). This identifies objects based on text prompts, enabling the specification of multiple object classes and detection confidence thresholds.
  • Mask generation and tracking: After identifying the objects, a GPU-accelerated multi-object tracker, using the NvDCF tracker from NVIDIA DeepStream, tracks all objects. This tracker integrates Meta's SAM2 model to generate instance segmentation masks with enhanced precision.
  • Metadata fusion: A major challenge in CV processing is that the same object may appear in different chunks and get assigned different IDs. To address this, VSS includes a CV Metadata Fusion module that merges and fuses CV metadata from each chunk into a single, comprehensive metadata set, as if generated from a continuous video file.
  • Data processing pipeline: The fused CV metadata is then passed to the data processing pipeline, which generates input frames overlaid with CV metadata for the VLM to perform SoM prompting (a minimal overlay sketch follows this list).
  • Dense caption generation: Using the overlaid frames and fused CV metadata, the VLM generates dense captions that reference the tracked object IDs.
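To make the SoM idea concrete, here is a minimal sketch that overlays object IDs and bounding boxes on a sampled frame before it is sent to the VLM. It assumes detections (boxes plus tracker IDs) are already available from the detector and tracker described above; the drawing code is illustrative and not the blueprint's data processing pipeline.

```python
from dataclasses import dataclass

from PIL import Image, ImageDraw


@dataclass
class TrackedObject:
    track_id: int                            # ID assigned by the multi-object tracker
    label: str                               # class from the zero-shot detector, e.g. "truck"
    box: tuple[float, float, float, float]   # (x1, y1, x2, y2) in pixels


def overlay_set_of_marks(frame: Image.Image, objects: list[TrackedObject]) -> Image.Image:
    """Draw each object's box and numeric mark so the VLM can refer to objects by ID."""
    annotated = frame.copy()
    draw = ImageDraw.Draw(annotated)
    for obj in objects:
        x1, y1, x2, y2 = obj.box
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), f"{obj.track_id}: {obj.label}", fill="red")
    return annotated


# Example: the VLM prompt can now reference marks directly, for instance
# "Describe what object 20 and object 21 are doing."
frame = Image.new("RGB", (1280, 720))  # stand-in for a sampled video frame
objects = [TrackedObject(20, "car", (100, 200, 400, 450))]
som_frame = overlay_set_of_marks(frame, objects)
```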

Here’s an example. For traffic monitoring, configuring the CV pipeline with user-specified object classes such as “vehicle, truck” enables the detection and tracking of these objects in the video. Each video chunk is processed by the VLM, with sampled frames overlaid with object IDs and segmentation masks. The VLM uses these IDs to generate dense captions and facilitate question-and-answer interactions. For instance, if multiple red cars appear in a long traffic intersection video, specifying the exact object ID ensures clarity about which car is being referred to (Figure 2).

A sample frame of traffic intersection video shows a CV overlay that includes object IDs and segmentation masks. An example Q&A text box shows how labeled IDs provide more contextual answers. The query says, “Do you see any abnormal events in the video clip? If so, which cars are involved?” The response says, “Yes, I see an abnormal event in the video clip, which is a collision between two cars. The cars involved are a red car (labeled 20) and a yellow car (labeled 21). The collision occurs at the intersection and is described in Event 1: Collision.”
Figure 2. Sample frame with object IDs and segmentation masks along with Q&A

Following the VLM, audio, and CV pipelines, the video captions from the VLM, the audio transcripts, and the bounding boxes and segmentation masks, along with additional metadata such as timestamp information, are sent to the retrieval pipeline for further processing and indexing, as shown in Figure 3.

The diagram shows three different modality outputs from a basketball video: dense captions, CV and tracking metadata, and the audio transcript. All are sent over to the databases (vector DB and graph DB).
Figure 3. A sample output of multiple modalities from a basketball video

This fused data is embedded and stored in both the vector database and graph database to be accessed during the retrieval pipeline. This enables the agent to form temporal and spatial relationships between entities within the scene while reinforcing its visual understanding based on the audio transcript.

Optimizing agentic retrieval through CA-RAG

CA-RAG is a specialized module within the Video Search and Summarization Agent that enhances the retrieval and generation of contextually accurate information from video data. 

CA-RAG extracts useful information from the per-chunk VLM response and aggregates it to perform tasks such as summarization, Q&A, and alerts. For more information about each task, see Build a Video Search and Summarization Agent with NVIDIA AI Blueprint.

The capabilities this enables include:

  • Temporal reasoning: Understands sequences of events across time.
  • Multi-hop reasoning: Connects multiple pieces of information to answer complex queries.
  • Anomaly detection: Identifies unusual patterns or behaviors in video content.
  • Scalability: Handles extensive video datasets efficiently.

To enhance performance and efficiency, several key improvements have been made to CA-RAG:

  • Batched summarization and entity extraction
  • GraphRAG optimization
  • Separate process

Batched summarization and entity extraction

CA-RAG now features a built-in Batcher implementation to optimize performance. This method handles out-of-order video chunk captions by organizing documents into batches for asynchronous processing. 

When all batches are complete, final tasks, such as summarization aggregation, can proceed, enhancing efficiency and reducing latency.
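A simplified sketch of the batching idea follows: chunk captions may arrive out of order, are grouped into fixed-size batches for asynchronous processing, and the final aggregation step runs only after every batch completes. The function names and the batch size are illustrative, not the CA-RAG implementation.

```python
import asyncio


async def process_batch(batch: list[tuple[int, str]]) -> str:
    """Stand-in for batched summarization/entity extraction over several chunk captions."""
    batch = sorted(batch)       # restore chunk order inside the batch
    await asyncio.sleep(0)      # placeholder for the actual LLM call
    return " ".join(caption for _, caption in batch)


async def summarize_stream(captions: list[tuple[int, str]], batch_size: int = 6) -> str:
    # Group (chunk_index, caption) pairs, which may arrive out of order, into batches.
    batches = [captions[i:i + batch_size] for i in range(0, len(captions), batch_size)]
    partial_summaries = await asyncio.gather(*(process_batch(b) for b in batches))
    # Final aggregation only proceeds once all batches are complete.
    return " | ".join(partial_summaries)


captions = [(2, "cars queue at the light"), (0, "a truck enters"), (1, "a red car turns left")]
summary = asyncio.run(summarize_stream(captions))
```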

GraphRAG optimization

Previously, CA-RAG waited for all VLM captions before constructing the graph, introducing latency. 

Now, CA-RAG dynamically creates the graph while ingesting chunks, enabling parallel processing of graph creation and summarization. This reduces overall processing time and improves scalability.
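The toy sketch below illustrates this overlap: as each chunk caption is ingested, an entity-extraction task immediately updates a simple in-memory graph, so graph construction runs concurrently with the rest of ingestion instead of waiting for all captions. The graph structure and the extraction step are placeholders, not the blueprint's GraphRAG code.

```python
import asyncio
from collections import defaultdict

graph: dict[str, set[str]] = defaultdict(set)  # entity -> related entities (toy graph)


async def add_chunk_to_graph(caption: str) -> None:
    """Placeholder entity extraction: link all capitalized words found in a caption."""
    await asyncio.sleep(0)  # stands in for an LLM entity-extraction call
    entities = [w.strip(".,") for w in caption.split() if w[0].isupper()]
    for a in entities:
        graph[a].update(e for e in entities if e != a)


async def ingest(captions: list[str]) -> None:
    graph_tasks = []
    for caption in captions:
        # Graph construction starts as soon as a chunk caption arrives...
        graph_tasks.append(asyncio.create_task(add_chunk_to_graph(caption)))
        # ...while summarization of already-ingested chunks can proceed concurrently.
    await asyncio.gather(*graph_tasks)


asyncio.run(ingest(["Forklift 3 approaches Dock A", "Worker 7 signals Forklift 3"]))
```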

Separate process

CA-RAG now runs in its own dedicated process with an independent event loop for handling asynchronous requests. This change eliminates bottlenecks from shared execution contexts, enabling true parallelism between the data processing pipeline and CA-RAG. 

The result is improved system responsiveness, reduced latency, and maximized resource utilization for large-scale workloads.
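This pattern can be sketched as follows: the CA-RAG-style worker runs in its own process with its own asyncio event loop and receives work items over a multiprocessing queue, so it never shares an execution context with the data processing pipeline. The names and the queue protocol here are illustrative assumptions.

```python
import asyncio
import multiprocessing as mp


def ca_rag_worker(queue) -> None:
    """Runs in a dedicated process with an independent event loop."""

    async def handle(item: str) -> None:
        await asyncio.sleep(0)  # placeholder for retrieval/aggregation work
        print(f"processed: {item}")

    async def main() -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block on the queue in a thread so the event loop stays responsive.
            item = await loop.run_in_executor(None, queue.get)
            if item is None:  # sentinel: shut down
                break
            await handle(item)

    asyncio.run(main())


if __name__ == "__main__":
    queue = mp.Queue()
    worker = mp.Process(target=ca_rag_worker, args=(queue,))
    worker.start()
    # The data processing pipeline keeps its own event loop and simply enqueues results.
    for caption in ["chunk 0 caption", "chunk 1 caption", None]:
        queue.put(caption)
    worker.join()
```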

VSS blueprint performance

The VSS blueprint is highly optimized for NVIDIA GPUs, achieving up to a 100x speedup on video summarization tasks. Designed for flexibility, it can be deployed in various topologies tailored to specific use cases, ensuring optimal resource utilization.

For a single-stream input, performance is measured by the latency required to complete a summarization request. In contrast, for burst video file input, performance is determined by the maximum number of video clips of a specified length that can be processed concurrently within an acceptable latency. For a given deployment topology, the primary factors that affect latency are:

  • Video length
  • Chunk size 
  • Aggregation batch size
  • Enabling VectorRAG or GraphRAG

Video length and chunk size both affect the total number of video chunks that need to be processed, which determines the number of VLM and LLM calls required to ingest the video. Aggregation batch size determines the number of VLM outputs that will be combined in a single LLM request:

\text{Video Chunks} = \frac{\text{Video Length}}{\text{Chunk Size}}

\text{VLM Calls} = \text{Number of Video Chunks}

\text{LLM Calls} = \frac{\text{Number of Video Chunks}}{\text{Aggregation Batch Size}} + 1

\text{LLM Calls (with GraphRAG)} = \frac{\text{Number of Video Chunks}}{\text{Aggregation Batch Size}} + \text{Number of Video Chunks} + 1
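As a worked example, consider the 60-minute video with 10-second chunks used in Figure 4, and assume an illustrative aggregation batch size of 6 (the batch size is an assumption, not a published configuration):

\text{Video Chunks} = \frac{3600\ \text{s}}{10\ \text{s}} = 360

\text{VLM Calls} = 360

\text{LLM Calls} = \frac{360}{6} + 1 = 61

\text{LLM Calls (with GraphRAG)} = \frac{360}{6} + 360 + 1 = 421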

The overall latency of a summarization session can be defined as the end-to-end (E2E) latency:

\text{E2E latency} = \text{Upload (or streaming) latency} + \text{Summarization latency}

Upload or streaming latency depends on the network. Summarization latency includes splitting the video into chunks, generating VLM captions for each chunk, and the LLM calls for aggregation and final summary generation, as captured in the equations above.

Figure 4 compares the summarization latency for a 60-minute video using 10-second chunk sizes across various topologies and models. Figure 5 illustrates how many clips can be processed in 1 minute given an input video length, showcasing the system’s throughput for burst file input.

Bar graph showing time to summarize a 60-minute video using 10-second chunk size, across different GPU deployments: 8xH100, 8xH200, 4xH100, 8xA100 (80GB), 8xL40S, 1xH100.
Figure 4. Time to summarize a 60-minute video (with 10-second chunk size), across different GPU deployments
Bar graphs showing burst file throughput, with number of videos processed in 1 minute on the y-axis, and varying lengths of input videos on the x-axis.
Figure 5. Burst file throughput, showing number of videos processed in 1 minute for varying lengths of input videos

Optimal chunk size depends on the dynamics of the video and the level of detail required in the summary or Q&A output. A small chunk size increases the temporal granularity, allowing fast-moving objects, events, or actions to be captured, such as a car speeding down a highway. However, if the events of interest are slow moving and spread out over time, such as detecting wildfire spread, a higher chunk size could be used to reduce redundant processing.

Development and deployment options

Thanks to the blueprint's modular design, NVIDIA offers a variety of deployment options to suit different needs. This flexibility enables easy configuration and customization, ensuring that the solution can be tailored to your specific requirements.

  • NVIDIA API Catalog
  • NVIDIA Launchables
  • Docker or Helm chart deployment
  • Cloud deployment

NVIDIA API Catalog

For more information about the blueprint and to try out some examples, see the VSS blueprint demo on build.nvidia.com.

NVIDIA Launchables

NVIDIA Launchables deliver preconfigured, fully optimized compute and software environments in the cloud. 

This deployment uses the docker compose method to set up the VSS blueprint, providing a streamlined and efficient deployment process. Deploy the VSS blueprint to try it on your own videos.

Docker or Helm chart deployment

NVIDIA provides deployment options using both docker compose and one-click Helm charts. These methods can be individually configured for more fine-grained deployments, such as swapping models. For more information, see the VSS Deployment Guide.

Cloud deployment

VSS includes a collection of deployment scripts that offer robust, flexible, and secure ways to deploy applications across multiple cloud platforms. Currently, AWS is supported, with Azure and GCP support coming soon. 

This comprehensive toolkit enables consistent deployment across different cloud environments. For more information about AWS deployment, see the VSS Cloud Deployment Guide.

The modular architecture consists of the following layers:

  • Infrastructure: Handles cloud provider-specific setup.
  • Platform: Manages Kubernetes and related platform components.
  • Application: Deploys the actual application workloads.

Summary

Download the blueprint and start developing with your NVIDIA developer account.

To learn more, join NVIDIA founder and CEO Jensen Huang for the COMPUTEX 2025 keynote and attend GTC Taipei sessions at COMPUTEX 2025 through May 23.

Stay up to date by subscribing to our newsletter and following NVIDIA AI on LinkedIn, Instagram, X, and Facebook. Explore the NVIDIA documentation and YouTube channels, and join the NVIDIA Developer Vision AI Forum.
