Vision language models (VLMs) have transformed video analytics by enabling broader perception and richer contextual understanding compared to traditional computer vision (CV) models. However, challenges remain: limited context length restricts how much video a VLM can process at a time, and VLMs on their own cannot transcribe audio.
To overcome this, the NVIDIA AI Blueprint for video search and summarization (VSS) integrates VLMs, LLMs, and retrieval-augmented generation (RAG) with efficient ingestion, retrieval, and storage mechanisms that combine to enable both stored and real-time video analysis. Visual AI agents can be applied to a multitude of use cases such as monitoring smart spaces, warehouse automation, and standard operating procedure (SOP) validation.
NVIDIA announces the general availability (GA) release of the NVIDIA AI Blueprint for video search and summarization. This release adds several new features, including multi-live stream and burst mode ingestion, a customizable CV pipeline, and audio transcription. These updates further streamline the development of video analytics AI agents, providing a comprehensive solution for long-form video understanding.
This post follows a previous one, Build a Video Search and Summarization Agent with NVIDIA AI Blueprint, which provides an overview of this blueprint’s foundational capabilities.
AI agents for advanced video analytics
VSS accelerates the development of video analytics AI agents by providing a recipe for long-form video understanding that combines VLMs, large language models (LLMs), and the latest RAG techniques with an efficient video ingestion pipeline. The early access release (v2.0.0) enabled a visual agent to ingest streamed and recorded videos, generate summaries, perform Q&A, and send alerts.
This general availability release (v2.3.0) includes the following key features. Figure 1 illustrates the updated architecture diagram reflecting these enhancements.
- Single-GPU deployment and hardware support expansion: Depending on your performance requirements, VSS can now be deployed across a variety of hardware configurations. For smaller workloads, single-GPU deployment is now supported on NVIDIA A100, H100, and H200 GPUs.
- Multi-live stream and burst clip modes: Concurrently process hundreds of live streams or prerecorded video files.
- Audio transcription: Convert speech to text for a multimodal understanding of a scene. This is useful for use cases where audio is a key component, such as instructional videos, keynotes, team meetings, or company training content.
- Computer vision pipeline: Enhance accuracy by tracking objects in a scene with zero-shot object detection and by overlaying bounding boxes and segmentation masks with Set-of-Mark (SoM) prompting, which guides vision-language models with a predefined set of reference marks or labels so they can reason about individual objects.
- Contextually aware RAG (CA-RAG) and GraphRAG accuracy and performance improvements: Improve performance with batched summarization and entity extraction, dynamic graph creation during chunk ingestion, and running CA-RAG in a dedicated process with an independent event loop, significantly reducing latency and improving scalability.
Single-GPU deployment
A single-GPU deployment recipe using low memory modes and smaller LLMs has been introduced. It is available on NVIDIA H100, H200, and A100 (80 GB+, HBM) machines, with support for additional GPUs coming soon. This setup is ideal for smaller workloads that do not necessitate multi-GPU environments, offering significant cost savings and simplicity in deployment.
This deployment runs the VLM, LLM, embedding, and reranker models locally on a single GPU. The configuration details are as follows:
- Model allocation: All models (VLM, LLM, embedding, reranking) are configured to share a single GPU.
- Memory optimization: Low memory mode and relaxed memory constraints are enabled for the LLM to ensure efficient usage of GPU resources.
- Model selection: Uses a smaller LLM (Llama 3.1 8B Instruct) chosen for optimal performance in a single-GPU deployment. The VSS engine is set to use the NVILA model for vision tasks.
- Service initialization: Appropriate init containers are configured to ensure that services start in the correct order.
Multi-live stream and burst clip modes
With the growing demand for real-time video analysis and the necessity to process large volumes of video clips simultaneously, the latest features ensure that deployed AI agents can manage multiple live streams and burst clips to scale video analytics solutions.
With this update, the VSS backend handles queuing and scheduling requests for multiple streams in parallel. With the help of CA-RAG, it also maintains a separate context for each source. Any of the APIs, including summarization (POST /summarize) and Q&A (POST /chat/completions), can be called in parallel across different threads or processes for different video files or live streams.
To facilitate multi-stream processing, each chunk of data—whether it is a caption generated by the VLM or an extracted entity—is tagged with a unique stream ID. This stream ID serves as a key identifier, ensuring that all related captions, entities, and relationships remain associated with their respective streams.
Users have the flexibility to query across all streams by setting multi_channel: true or to restrict a query to a specific stream by setting multi_channel: false, enabling both broad and targeted analysis.
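As a minimal sketch of this pattern, the following Python snippet fans summarization requests out across several sources in parallel threads and issues a Q&A request scoped by multi_channel. The base URL, port, and payload field names here are illustrative assumptions, not the documented schema; refer to the VSS API reference for the exact request format.

```python
# Sketch only: field names, port, and identifiers below are assumptions.
from concurrent.futures import ThreadPoolExecutor

import requests

VSS_URL = "http://localhost:8100"  # assumption: locally deployed VSS backend


def summarize(source_id: str) -> dict:
    """Request a summary for one uploaded video file or live stream."""
    payload = {
        "id": source_id,        # assumed field: file or live-stream identifier
        "chunk_duration": 10,   # assumed field: chunk size in seconds
    }
    resp = requests.post(f"{VSS_URL}/summarize", json=payload, timeout=3600)
    resp.raise_for_status()
    return resp.json()


def ask(question: str, multi_channel: bool) -> dict:
    """Q&A across all streams (multi_channel=True) or a single stream."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        "multi_channel": multi_channel,  # assumed to be a top-level request field
    }
    resp = requests.post(f"{VSS_URL}/chat/completions", json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()


# Fan summarization out across many sources; VSS queues and schedules them.
source_ids = ["stream-01", "stream-02", "clip-0001", "clip-0002"]
with ThreadPoolExecutor(max_workers=8) as pool:
    summaries = list(pool.map(summarize, source_ids))

# Broad query across every stream; pass multi_channel=False to scope to one stream.
print(ask("Were any safety violations observed?", multi_channel=True))
```

Because the backend maintains a separate context per stream, concurrent callers do not need to coordinate with each other.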
Audio transcription
This release gives blueprint-built visual agents the ability to hear, improving contextual understanding and unlocking information not captured by the video alone. This feature greatly improves accuracy for media such as keynotes, lectures, video meetings, and point-of-view footage.
To integrate audio into VSS, we apply techniques similar to the video processing methods to handle the audio track of a given video. After the video is chunked to parallelize ingestion across GPUs, the audio is processed through the following steps:
- Split the audio from the video clip: Create a separate audio file from the video.
- Decode the audio: Each audio chunk is then converted to 16 kHz mono audio.
- Process with automatic speech recognition (ASR): The converted audio is then passed to the NVIDIA Riva ASR NIM microservice, which generates the audio transcript for the chunk.
- Combine audio and visual information: For each chunk, the video description from the VLM and the audio transcript from the ASR service, along with additional metadata, like timestamp information, are sent to the retrieval pipeline for further processing and indexing.
The audio processing feature in VSS can be enabled or disabled during initialization. Each summarization request can also be configured to enable or disable audio transcription. This flexibility enables audio transcription in batch processing of video files, as well as online processing of live streams.
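As a minimal illustration of the split and decode steps described above (a sketch, not the blueprint's internal implementation), the following Python snippet uses ffmpeg to separate a chunk's audio track and resample it to 16 kHz mono WAV, the format passed on to the ASR service.

```python
# Sketch: extract and resample the audio track of one video chunk with ffmpeg.
import subprocess


def extract_audio(video_chunk: str, wav_out: str) -> str:
    """Strip the video stream, downmix to mono, resample to 16 kHz, write WAV."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", video_chunk,  # input video chunk
            "-vn",              # drop the video stream
            "-ac", "1",         # mono
            "-ar", "16000",     # 16 kHz sample rate
            "-f", "wav",
            wav_out,
        ],
        check=True,
    )
    return wav_out


wav_path = extract_audio("chunk_0001.mp4", "chunk_0001.wav")
# wav_path would then be transcribed by the Riva ASR NIM microservice.
```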
By using the Riva ASR NIM microservice, VSS can provide state-of-the-art audio features as they are introduced into the NIM microservice. These customizations ensure that you can tailor the audio processing capabilities to your specific needs, enhancing the overall functionality and adaptability of VSS.
This capability has been used to enable chat with the NVIDIA GTC keynote, letting users interact with and discuss the content in real time through audio transcription.
Computer vision pipeline
Integrating specific CV models with VLMs enhances video analysis by providing detailed metadata on objects, including their positions, masks, and tracking IDs. SoM prompting enables effective visual grounding, allowing VLMs to generate responses about individual objects rather than the overall scene. This is particularly useful for complex queries involving multiple objects and for understanding the temporal behavior of objects over longer periods using their object IDs.
The CV and tracking pipeline in VSS is designed to generate comprehensive CV metadata for both videos and live streams. This metadata encompasses detailed information about the objects within the video, such as their positions, masks, and tracking IDs. The pipeline achieves this through the following:
- Object detection: Each chunk undergoes object detection using a zero-shot object detector (Grounding DINO). This identifies objects based on text prompts, enabling the specification of multiple object classes and detection confidence thresholds.
- Mask generation and tracking: After the objects are identified, a GPU-accelerated multi-object tracker, using the NvDCF tracker from NVIDIA DeepStream, tracks all objects. This tracker integrates the SAM2 model from Meta to generate instance segmentation masks for enhanced precision.
- Metadata fusion: A major challenge in CV processing is that the same object may appear in different chunks and get assigned different IDs. To address this, VSS includes a CV Metadata Fusion module that merges and fuses CV metadata from each chunk into a single, comprehensive metadata set, as if generated from a continuous video file.
- Data processing pipeline: The fused CV metadata is then passed to the data processing pipeline, which generates input frames overlaid with CV metadata for the VLM to perform SoM prompting.
- Dense caption generation: The VLM generates dense captions from the overlaid frames, grounded in the fused CV metadata.
Here’s an example. For traffic monitoring, enabling the CV pipeline with user-specified object classes such as “vehicle, truck” enables the detection and tracking of those objects in the video. Each video chunk is processed by the VLM, with sampled frames overlaid with object IDs and segmentation masks. The VLM uses these IDs to generate dense captions and to facilitate question-and-answer interactions. For instance, if multiple red cars appear in a long traffic intersection video, specifying the exact object ID removes any ambiguity about which car is being referred to (Figure 2).
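The following OpenCV sketch illustrates the idea behind SoM-style overlays: tracked object IDs and boxes are drawn onto a sampled frame so the VLM can refer to objects by ID. It is an illustration only, not the VSS CV pipeline; the frame and detections here are synthetic.

```python
# Sketch: draw track IDs and bounding boxes on a frame for SoM-style prompting.
import cv2
import numpy as np


def overlay_marks(frame, detections):
    """detections: list of dicts with 'track_id' and 'bbox' = (x1, y1, x2, y2)."""
    for det in detections:
        x1, y1, x2, y2 = det["bbox"]
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"ID {det['track_id']}", (x1, max(y1 - 8, 12)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    return frame


frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a sampled frame
detections = [{"track_id": 12, "bbox": (120, 80, 320, 240)},
              {"track_id": 27, "bbox": (360, 150, 560, 300)}]
cv2.imwrite("frame_som.jpg", overlay_marks(frame, detections))  # frame sent to the VLM
```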

Following the VLM, audio, and CV pipelines, the video captions from the VLM, the audio transcripts, and the bounding boxes and segmentation masks, along with additional metadata such as timestamp information, are sent to the retrieval pipeline for further processing and indexing, as shown in Figure 3.

This fused data is embedded and stored in both the vector database and graph database to be accessed during the retrieval pipeline. This enables the agent to form temporal and spatial relationships between entities within the scene while reinforcing its visual understanding based on the audio transcript.
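To make the retrieval side concrete, here is a toy in-memory illustration (not the blueprint's retrieval stack) of embedding fused per-chunk documents keyed by stream ID and querying either a single stream or all of them. The embed function is a placeholder standing in for the embedding model.

```python
# Toy sketch: index fused chunk documents per stream and query by similarity.
import numpy as np

index = []  # list of (stream_id, chunk_doc, embedding_vector)


def embed(text: str) -> np.ndarray:
    """Placeholder embedding; VSS uses a served embedding model instead."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(1024)
    return v / np.linalg.norm(v)


def add_chunk(stream_id: str, caption: str, transcript: str, timestamp: float) -> None:
    doc = f"[{timestamp:.1f}s] caption: {caption} | audio: {transcript}"
    index.append((stream_id, doc, embed(doc)))


def search(query: str, stream_id: str | None = None, k: int = 3):
    """stream_id=None mimics multi_channel: true; otherwise restrict to one stream."""
    q = embed(query)
    hits = [(float(v @ q), doc) for sid, doc, v in index
            if stream_id is None or sid == stream_id]
    return sorted(hits, reverse=True)[:k]


add_chunk("stream-01", "a forklift moves a pallet", "load the pallet onto dock two", 12.0)
print(search("Where was the pallet moved?", stream_id="stream-01"))
```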
Optimizing agentic retrieval through CA-RAG
CA-RAG is a specialized module within the Video Search and Summarization Agent that enhances the retrieval and generation of contextually accurate information from video data.
CA-RAG extracts useful information from the per-chunk VLM response and aggregates the information to perform useful tasks such as summarization, Q&A and alerts. For more information about each task, see Build a Video Search and Summarization Agent with NVIDIA AI Blueprint.
The capabilities this enables include:
- Temporal reasoning: Understands sequences of events across time.
- Multi-hop reasoning: Connects multiple pieces of information to answer complex queries.
- Anomaly detection: Identifies unusual patterns or behaviors in video content.
- Scalability: Handles extensive video datasets efficiently.
To enhance performance and efficiency, several key improvements have been made to CA-RAG:
- Batched summarization and entity extraction
- GraphRAG optimization
- Separate process
Batched summarization and entity extraction
CA-RAG now features a built-in Batcher implementation to optimize performance. This method handles out-of-order video chunk captions by organizing documents into batches for asynchronous processing.
When all batches are complete, final tasks, such as summarization aggregation, can proceed, enhancing efficiency and reducing latency.
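The following asyncio sketch shows the batching idea in simplified form (assumed behavior, not the CA-RAG source): captions that finish out of order are re-ordered, grouped into fixed-size batches, summarized concurrently, and only then aggregated into a final summary.

```python
# Sketch: batch out-of-order chunk captions and aggregate when all batches finish.
import asyncio

BATCH_SIZE = 6


async def summarize_batch(batch):
    """Stand-in for one batched LLM call over several chunk captions."""
    await asyncio.sleep(0.1)  # placeholder for LLM latency
    return " ".join(caption for _, caption in batch)


async def summarize_video(chunk_captions):
    # Restore chunk order, since VLM captions can arrive out of order.
    ordered = sorted(chunk_captions, key=lambda item: item[0])
    batches = [ordered[i:i + BATCH_SIZE] for i in range(0, len(ordered), BATCH_SIZE)]
    partial = await asyncio.gather(*(summarize_batch(b) for b in batches))
    return " ".join(partial)  # final aggregation once every batch is done


captions = [(idx, f"caption for chunk {idx}") for idx in range(20)]
print(asyncio.run(summarize_video(captions)))
```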
GraphRAG optimization
Previously, CA-RAG waited for all VLM captions before constructing the graph, introducing latency.
Now, CA-RAG dynamically creates the graph while ingesting chunks, enabling parallel processing of graph creation and summarization. This reduces overall processing time and improves scalability.
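Conceptually, dynamic graph creation resembles the following sketch (illustrative only; the entities and relations are made up): each ingested chunk immediately merges its extracted entities and relations into a growing graph, so graph construction overlaps with the rest of the pipeline.

```python
# Sketch: merge per-chunk entities and relations into a graph as chunks arrive.
import networkx as nx

graph = nx.MultiDiGraph()


def ingest_chunk(stream_id, chunk_index, entities, relations):
    """entities: list of names; relations: list of (subject, predicate, object)."""
    for name in entities:
        graph.add_node(name, stream_id=stream_id)
    for subj, pred, obj in relations:
        graph.add_edge(subj, obj, predicate=pred,
                       stream_id=stream_id, chunk=chunk_index)


ingest_chunk("stream-01", 0, ["forklift", "pallet"], [("forklift", "lifts", "pallet")])
ingest_chunk("stream-01", 1, ["forklift", "dock"], [("forklift", "moves_to", "dock")])
print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```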
Separate process
CA-RAG now runs in its own dedicated process with an independent event loop for handling asynchronous requests. This change eliminates bottlenecks from shared execution contexts, enabling true parallelism between the data processing pipeline and CA-RAG.
The result is improved system responsiveness, reduced latency, and maximized resource utilization for large-scale workloads.
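The general pattern resembles the following sketch (a generic illustration, not the VSS source): a worker process runs its own asyncio event loop and consumes work from a queue, so the producing pipeline never blocks on retrieval work.

```python
# Sketch: a dedicated worker process with an independent asyncio event loop.
import asyncio
import multiprocessing as mp


async def handle(doc: str) -> None:
    await asyncio.sleep(0.05)  # placeholder for indexing / retrieval work
    print("indexed:", doc)


def worker(queue) -> None:
    async def run() -> None:
        loop = asyncio.get_running_loop()
        tasks = []
        while True:
            doc = await loop.run_in_executor(None, queue.get)  # blocking get off the loop
            if doc is None:  # sentinel: no more work
                break
            tasks.append(asyncio.create_task(handle(doc)))
        await asyncio.gather(*tasks)  # drain in-flight work before exiting

    asyncio.run(run())


if __name__ == "__main__":
    q = mp.Queue()
    proc = mp.Process(target=worker, args=(q,))
    proc.start()
    for i in range(5):
        q.put(f"chunk-{i} fused document")  # the producer never blocks on indexing
    q.put(None)
    proc.join()
```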
VSS blueprint performance
The VSS blueprint is highly optimized for NVIDIA GPUs, achieving up to 100x speedup on video summarization tasks. Designed for flexibility, it can be deployed in various topologies tailored to specific use cases, ensuring optimal resource utilization.
For a single stream input, performance is measured by the latency required to complete a summarization request. In contrast, for burst video file input, performance is determined by the maximum number of video clips of a specified length that can be processed concurrently within an acceptable latency. For a given deployment topology, the primary factors that affect the latency are:
- Video length
- Chunk size
- Aggregation batch size
- Enabling VectorRAG or GraphRAG
Video length and chunk size together determine the total number of video chunks that need to be processed, which in turn determines the number of VLM and LLM calls required to ingest the video. The aggregation batch size determines how many VLM outputs are combined in a single LLM request.
The overall latency of a summarization session is defined as the end-to-end (E2E) latency.
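In equation form (the notation and decomposition below are an illustrative sketch of the factors listed above, not an official performance model):

\[
N_{\text{chunks}} = \left\lceil \frac{\text{video length}}{\text{chunk size}} \right\rceil,
\qquad
N_{\text{aggregation calls}} \approx \left\lceil \frac{N_{\text{chunks}}}{\text{aggregation batch size}} \right\rceil
\]

\[
T_{\text{E2E}} = T_{\text{upload or stream}} + T_{\text{summarization}},
\qquad
T_{\text{summarization}} \approx T_{\text{chunking}} + T_{\text{VLM captioning}} + T_{\text{LLM aggregation}} + T_{\text{final summary}}
\]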
Upload or streaming latency depends on the network. Summarization latency includes splitting the video into chunks, generating VLM captions for each chunk, and the LLM calls for aggregation and final summary generation, as captured in the equations above.
Figure 4 compares the summarization latency for a 60-minute video using 10-second chunk sizes across various topologies and models. Figure 5 illustrates how many clips can be processed in 1 minute given an input video length, showcasing the system’s throughput for burst file input.


Optimal chunk size depends on the dynamics of the video and the level of detail required in the summary or Q&A output. A small chunk size increases the temporal granularity, allowing fast-moving objects, events, or actions to be captured, such as a car speeding down a highway. However, if the events of interest are slow moving and spread out over time, such as detecting wildfire spread, a higher chunk size could be used to reduce redundant processing.
Development and deployment options
NVIDIA offers a variety of deployment options to suit different needs, thanks to the modular blueprint. This flexibility enables easy configuration and customization, ensuring that these solutions can be tailored to your specific requirements.
- NVIDIA API Catalog
- NVIDIA Launchables
- Docker or Helm chart deployment
- Cloud deployment
NVIDIA API Catalog
To learn more about the blueprint and try out some examples, see the VSS blueprint demo on build.nvidia.com.
NVIDIA Launchables
NVIDIA Launchables deliver preconfigured, fully optimized compute and software environments in the cloud.
This deployment uses the docker compose method to set up the VSS blueprint, providing a streamlined and efficient deployment process. Deploy the VSS blueprint to try it on your own videos.
Docker or Helm chart deployment
NVIDIA provides deployment options using both docker compose and one-click Helm charts. These methods can be individually configured for more fine-grained deployments, such as swapping models. For more information, see the VSS Deployment Guide.
Cloud deployment
VSS includes a collection of deployment scripts that offer robust, flexible, and secure ways to deploy applications across multiple cloud platforms. Currently, AWS is supported, with Azure and GCP support coming soon.
This comprehensive toolkit enables consistent deployment across different cloud environments. For more information about AWS deployment, see the VSS Cloud Deployment Guide.
The modular architecture consists of the following layers:
- Infrastructure: Handles cloud provider-specific setup.
- Platform: Manages Kubernetes and related platform components.
- Application: Deploys the actual application workloads.
Summary
Download the blueprint and start developing with your NVIDIA developer account. For more information, see the following resources:
- Preview VSS blueprint
- VSS on NVIDIA Launchable (use your own videos)
- NVIDIA-AI-Blueprints/video-search-and-summarization GitHub repo
- Visual AI Agent Forum
To learn more, join NVIDIA founder and CEO Jensen Huang for the COMPUTEX 2025 keynote and attend GTC Taipei sessions at COMPUTEX 2025 through May 23.