Make Sense of Video Analytics by Integrating NVIDIA AI Blueprints

Organizations are increasingly seeking ways to extract insights from video, audio, and other complex data sources. Retrieval-augmented generation (RAG) enables generative AI systems to use proprietary enterprise data. However, incorporating video content into these workflows introduces new technical hurdles, such as efficient ingestion, indexing, and maintaining compliance across diverse sources.

This blog post introduces an integrated approach for enriching video analysis and summarization using the NVIDIA AI Blueprint for video search and summarization (VSS) and the NVIDIA AI Blueprint for retrieval-augmented generation (RAG). By composing these workflows, developers can supplement video understanding with trusted, context-rich enterprise data, unlocking deeper insights for business-critical applications.

In this post, you’ll learn how to:

  • Integrate VSS and RAG Blueprints for multimodal search and summarization.
  • Enrich video analytics with contextual enterprise knowledge.
  • Architect scalable, modular workflows for real-time video Q&A and summarization.
  • Apply these solutions to real-world use cases across industries.

Following up on our earlier post about the VSS Blueprint, we’ll now explain how merging VSS with RAG improves video analysis. This combination provides more accurate, context-aware insights for enterprise AI applications.

What are NVIDIA AI Blueprints?

NVIDIA AI Blueprints are customizable reference workflows for building generative AI pipelines, including multimodal RAG pipelines. The RAG Blueprint is built on NVIDIA NeMo Retriever models, which continuously index multimodal documents for fast, accurate semantic search at enterprise scale. The VSS Blueprint ingests massive volumes of streaming or archival video for search, summarization, interactive Q&A, and event-triggered actions such as alerting.

A real-world application: Building AI-powered health insights with RAG and VSS Blueprints

The following example compares raw VSS Blueprint output with context-enriched insights from the RAG Blueprint. The input video shows someone making breakfast, and the use case illustrates how AI can analyze what a person eats for breakfast and comment on how healthy their eating habits are. In the first example, the AI generates a video summary without any additional RAG information; in the second, it uses data retrieved through RAG, resulting in a more detailed and informative summary.

The first screen capture shows the VSS Blueprint’s default video event summarization of a breakfast preparation routine. The output clusters key actions under categories such as ingredient selection, cooking techniques, nutritional insights, hygiene practices, and presentation tips. The default VSS output is factual and descriptive, but it doesn’t connect the observed activities to nutritional value or healthy habits.

Figure 1 shows a bullet-point summary of a breakfast video, with categories for ingredient selection, cooking techniques, nutritional insights, hygiene practices, and presentation tips. The entries are factual descriptions of observed actions, such as pouring milk and making oatmeal.
Figure 1. Default VSS Blueprint summary of a breakfast preparation video, listing observed actions and basic categories

The next figure shows a summarization enriched by the Wiki page for a healthy diet. After integrating with the RAG Blueprint, VSS draws on these nutritional guidelines and best practices to add context. The enriched summary describes the actions and highlights the benefits of choosing whole grains, the importance of fiber, the nutritional value of dairy, and the role of hygiene in food safety.

A bullet-point summary of the same breakfast video, but with added context from external nutritional sources. The entries include the health benefits of particular foods, the importance of hygiene, and practical advice for making nutritious choices, such as choosing whole grains and highlighting the protein and calcium in milk.

Figure 2. VSS summary enriched with RAG, connecting observed actions to nutritional value and healthy habits

By connecting video understanding to external knowledge, the enriched summary helps viewers make informed decisions about food choices and healthy habits. It translates video content into practical insights that support everyday well-being—making nutrition information accessible and actionable for all.

Deployment steps

To deploy this solution, follow these steps.

NOTE: This example assumes that the RAG Blueprint is already installed and accessible via a remote endpoint.

  1. Download and deploy the RAG Blueprint from https://github.com/NVIDIA-AI-Blueprints/rag.
  2. Clone the video-search-and-summarization repo:
$ git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
  3. Edit src/vss-engine/docker/Dockerfile to apply the integration patch:
diff --git a/src/vss-engine/docker/Dockerfile b/src/vss-engine/docker/Dockerfile
index 58b25e3..e1df783 100644
--- a/src/vss-engine/docker/Dockerfile
+++ b/src/vss-engine/docker/Dockerfile
@@ -17,7 +17,7 @@ RUN --mount=type=bind,source=binaries/gradio_videotimeline-1.0.2-py3-none-any.wh
     pip install --no-deps /tmp/gradio_videotimeline-1.0.2-py3-none-any.whl

 
-RUN git clone https://github.com/NVIDIA/context-aware-rag.git -b v1.0.0 /tmp/vss-ctx-rag
+RUN git clone https://github.com/NVIDIA/context-aware-rag.git -b dev/vss-external-rag-support-v2 /tmp/vss-ctx-rag
 ARG TARGETARCH
 RUN pip install /tmp/vss-ctx-rag --no-deps && \
     if [ "$TARGETARCH" = "amd64" ]; then \
  4. Proceed with the VSS deployment steps in src/vss-engine/README.md to deploy the patched VSS Blueprint.

Test the integration

The following code snippet shows the kubectl exec syntax for running the summarization request inside the VSS pod in Kubernetes with the decorated prompt. It analyzes a meal preparation video and enriches the summary with relevant nutritional guidelines.

import subprocess, textwrap

# Pod and video asset identifiers from the VSS deployment.
deployment_id = "vss-vss-deployment-595d5b4ccb-8678v"
vid_id        = "6482b573-3aa6-4231-b981-a3e75806826b"

def run_in_vss(pod, cmd):
    # Run a shell command inside the vss container of the given pod.
    subprocess.run(
        ["kubectl", "exec", pod, "-c", "vss", "--",
         "/bin/bash", "-c", cmd],
        check=True, text=True)

# The <e>...<e> decoration marks the sub-prompt that VSS forwards to the RAG Blueprint.
prompt = textwrap.dedent("""
  Summarize key events only.
  <e>Breakfast nutritional guidelines?<e>
""")

# Summarize the video in 10-second chunks with chat enabled, using the decorated prompt.
cmd = f"""python3 via_client_cli.py summarize \
  --id {vid_id} --model vila-1.5 --enable-chat \
  --chunk-duration 10 \
  --caption-summarization-prompt "{prompt}"
"""

run_in_vss(deployment_id, cmd)

Everything inside <e>…<e> tags is sent to the RAG Blueprint.

The returned context is inserted into the enrichment prompt, which is set by the tunable VECTOR_RAG_ENRICHMENT_PROMPT, before LLM generation.
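To make this concrete, the following sketch shows how a decorated prompt could be handled: extract the <e>-tagged sub-prompt and post it to the RAG server. The endpoint URL, payload, and response schema here are illustrative assumptions rather than the Blueprint’s exact API; in the integration, the patched VSS engine performs this step internally.

import re
import requests  # assumes the requests package is installed

# Hypothetical RAG server address; substitute your deployment's endpoint.
RAG_SEARCH_URL = "http://rag-server:8081/search"

def extract_rag_subprompt(prompt: str):
    """Return the text decorated with <e>...<e> tags, or None if absent."""
    match = re.search(r"<e>(.*?)<e>", prompt, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def fetch_external_context(subprompt: str, top_k: int = 4) -> str:
    """Query the external RAG service and join the returned passages.
    The JSON shape used here is a placeholder for whatever your RAG endpoint returns."""
    resp = requests.post(RAG_SEARCH_URL,
                         json={"query": subprompt, "top_k": top_k},
                         timeout=30)
    resp.raise_for_status()
    chunks = resp.json().get("chunks", [])
    return "\n".join(chunk.get("content", "") for chunk in chunks)

subprompt = extract_rag_subprompt(
    "Summarize key events only.\n<e>Breakfast nutritional guidelines?<e>")
if subprompt:
    external_context = fetch_external_context(subprompt)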

The tunable enrichment prompt used in the nutritional example is shown below.

Here is the summary generated about the meal preparation video:  
{original_response}

Here is additional nutritional and food safety information:  
{external_context}

Please enrich the summary by naturally incorporating relevant nutritional facts, food safety guidelines, and practical advice from the external context. Connect observed actions in the video to their health benefits, such as highlighting the value of specific ingredients, cooking methods, or hygiene practices. Ensure the enrichment is contextual, informative, and supports everyday healthy choices.

Do not include any introductory phrases, notes, explanations, or comments about how the inputs were combined. Do not reference the original summary or external context. Only provide the enriched summary itself, organized as bullet points under the categories: Ingredient Selection, Cooking Techniques, Nutritional Insights, Hygiene Practices, and Presentation Tips.
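As a minimal sketch of the fusion step, assuming {original_response} and {external_context} are the only placeholders, the template can be filled with standard Python string formatting before the fused prompt is handed to the VSS LLM. The template below is an abbreviated stand-in for the full prompt shown above.

# Minimal illustration of filling the enrichment prompt template.
# ENRICHMENT_TEMPLATE stands in for the value of VECTOR_RAG_ENRICHMENT_PROMPT.
ENRICHMENT_TEMPLATE = (
    "Here is the summary generated about the meal preparation video:\n"
    "{original_response}\n\n"
    "Here is additional nutritional and food safety information:\n"
    "{external_context}\n\n"
    "Please enrich the summary by naturally incorporating relevant nutritional "
    "facts, food safety guidelines, and practical advice from the external context."
)

fused_prompt = ENRICHMENT_TEMPLATE.format(
    original_response="The person pours milk over oatmeal and adds berries.",
    external_context="Whole grains provide fiber; milk adds protein and calcium.",
)
# fused_prompt is what the VSS LLM sees during the final generation step.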

How it works

  1. Ingestion
    • VSS ingests video streams, creates caption chunks, and indexes the visual metadata. 
    • RAG ingests proprietary documents such as manuals, historical event statistics, and media guides into a GPU-accelerated vector store.
  2. Query flow  
    • A user asks, “Am I eating healthy today?”  
    • VSS surfaces candidate segments of the user’s meal.
    • VSS also queries the RAG server to fetch the relevant knowledge indexed from various health guidelines.
  3. Knowledge fusion
    • The RAG Blueprint retrieves relevant enterprise health knowledge and feeds it to the VSS LLM, which crafts a grounded answer alongside the candidate segments from the video (see the sketch after this list).
  4. Response 
    • The final response is anchored in the video data, enriched with relevant external knowledge, and delivered to the user in real time with proper citations.
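The sketch below ties these steps together in plain Python. The function bodies are simple stand-ins for what the VSS engine and RAG server actually do; only the fusion pattern, combining video evidence with retrieved knowledge before generation, mirrors the integration.

def vss_retrieve_segments(question: str) -> list:
    # Stand-in: VSS performs semantic search over indexed caption chunks.
    return ["00:12-00:22 pours milk over oatmeal",
            "00:45-00:55 adds berries and whole-grain toast"]

def rag_retrieve_context(subprompt: str) -> str:
    # Stand-in: the RAG Blueprint retrieves passages from indexed health guidelines.
    return "Whole grains and dairy contribute fiber, protein, and calcium."

def build_grounded_prompt(question: str, rag_subprompt: str) -> str:
    segments = vss_retrieve_segments(question)      # step 2: candidate segments
    context = rag_retrieve_context(rag_subprompt)   # steps 2-3: external knowledge
    # Steps 3-4: fuse both sources into the prompt the LLM answers from.
    return (f"Question: {question}\n"
            "Video evidence:\n- " + "\n- ".join(segments) + "\n"
            f"External guidance: {context}")

print(build_grounded_prompt("Am I eating healthy today?",
                            "Breakfast nutritional guidelines?"))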

VSS and RAG Blueprints integration architecture

Figure 3 shows the modular integration architecture that produces these results. 

  1. VSS ingests video streams, generates captions and metadata, and supports question-answering and summarization over video content.
  2. The RAG Blueprint is deployed as a stand-alone microservice. It indexes, searches, and retrieves knowledge from enterprise-wide data sources such as text documents, PDFs, tables, and policy manuals.
  3. VSS and RAG Blueprints communicate over defined APIs. Whenever a prompt includes text within <e> … <e> tags, the VSS Blueprint sends that sub-prompt to the external RAG server.
  4. The RAG Blueprint receives the sub-prompt and returns relevant context.
  5. The VSS Blueprint uses a customizable enrichment prompt to fuse the retrieved context into its final summarization or chat Q&A response.  

This modular, API-based integration enables the blueprints to be used together or separately, and to scale independently based on user demand. 

Architecture diagram showing the integration of the VSS and RAG Blueprints, detailing the connection of the video analysis pipeline to the external RAG service and emphasizing modular composability and separate microservices.
Figure 3. Architecture diagram of the VSS and RAG Blueprint solution

Connecting workflows: How composable AI Blueprints support collaboration

By composing multiple NVIDIA AI Blueprints, developers can integrate specialized pipelines—such as video analytics and enterprise retrieval—to solve cross-functional challenges. This modular composability accelerates development while extending functionality beyond what any single blueprint can achieve.

Let’s break down how composability delivers flexible integration, cross-team collaboration, and context-rich results:

  • Flexible integration: Combine specialized blueprints, such as VSS for video processing and RAG for knowledge retrieval, to build tailored, scalable solutions.
  • Cross-functional collaboration: Distinct blueprints enable cooperation between video engineers, data scientists, and subject-matter experts, enriching video analytics with enterprise knowledge.
  • Context-aware results: User queries in the VSS Blueprint can draw on the RAG Blueprint to supplement video summaries with relevant information from organizational documents, yielding precise, actionable insights.

The VSS Blueprint processes video streams for detection and captioning, while the RAG Blueprint retrieves relevant information from text and structured data sources. User queries to the VSS Blueprint can be forwarded to the RAG Blueprint for additional context, and the combined response incorporates both video analysis and enterprise knowledge.

Optimizing for enterprise workflows: The case for dedicated RAG

A key architectural decision was to keep the RAG Blueprint as a separate, standalone server instead of merging all sources, such as video and documents, into a single pipeline. This choice was driven by several real-world factors:

  • Multi-workstream support: The RAG Blueprint serves multiple workflows (search portals, chatbots, dashboards, compliance tools) as a unified knowledge layer. The VSS Blueprint acts as one of many clients accessing this backend.
  • Decoupled scaling: The blueprints can be scaled and optimized independently for targeted resource allocation for video and document workloads.
  • Rapid innovation and security: Centralized RAG management simplifies updates, patching, and security improvements without affecting VSS deployments.
  • Minimal integration overhead: VSS integration requires only the RAG server endpoint and environment variables; there is no need to rebuild or re-index video data for new use cases.

Note that the VSS Blueprint also includes RAG capability. While it can retrieve enterprise documents, its pipeline is highly tuned for accurate video search and retrieval. Similarly, the RAG Blueprint supports many of the same modalities as the VSS Blueprint, but it is optimized to search and retrieve multilingual, multimodal business documents such as PDFs that include text, tables, and charts. Loosely coupling the pipelines via API calls gives developers a best-of-both-worlds experience across two highly specialized pipelines.

Latency impact

We also assessed the performance impact of combining the blueprints for video summarization and Q&A. The total latency is the sum of time spent in VSS operations, time spent in RAG retrieval, and time spent fusing the results during LLM generation.

\text{Latency}_{\text{total}} = \text{Latency}_{\text{VSS}} + \text{Latency}_{\text{RAG}} + \text{Latency}_{\text{LLM}}

The system latency for each use case is depicted in Table 1. 

In the chat Q&A use case, the added RAG retrieval and LLM fusion account for about 10% of the overall latency. Enriching the video summarization with RAG data adds only about 1% to the overall pipeline latency.

Bar chart displaying runtime percentages for each system component in the VSS and Enterprise RAG pipeline. The chart shows VSS as the largest portion, with smaller shares for Enterprise RAG and LLM Fusion.
Figure 4. VSS and RAG Blueprint runtime percent by component
| Pipeline Stage | VSS Summarization Latency (seconds) | VSS Chat Q&A Latency (seconds) |
|---|---|---|
| RAG retrieval | 1.69 | 1.81 |
| LLM fusion | 1.24 | 1.35 |
| VSS summarization / chat Q&A (main task) | 247.07 | 26.61 |
| End-to-end | 250 | 29.77 |
Table 1. VSS and Enterprise RAG composable Blueprint expected system runtimes per pipeline
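As a quick sanity check, a few lines of Python reproduce the overhead percentages from the Table 1 values:

# Overhead of the RAG integration, computed from Table 1 (values in seconds).
rag_retrieval = {"summarization": 1.69, "chat_qa": 1.81}
llm_fusion    = {"summarization": 1.24, "chat_qa": 1.35}
end_to_end    = {"summarization": 250.0, "chat_qa": 29.77}

for use_case in ("summarization", "chat_qa"):
    overhead = rag_retrieval[use_case] + llm_fusion[use_case]
    share = overhead / end_to_end[use_case]
    print(f"{use_case}: {overhead:.2f} s added ({share:.1%} of end-to-end latency)")

# Prints roughly 1.2% for summarization and 10.6% for chat Q&A,
# in line with the ~1% and ~10% figures quoted above.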

How industries are using blueprints to make smarter, faster decisions

From construction sites to forests to stadiums, the integration of VSS and RAG Blueprints through prompt fusion converts raw video into valuable, context-rich insights with minimal additional latency. The following examples highlight how the integration is helping address real-world challenges:

  • Shimizu implements the technology on construction sites to stream job-site footage, monitor development progress, prevent unsafe behaviors, and improve safety and compliance.
  • Cloudian’s HyperScale AIDP forestry management demo deploys VSS and RAG Blueprints for detecting overgrowth and invasive species, instantly retrieving relevant policy documents, and generating actionable reports for fire insurance and compliance.
  • Monks uses the solution to quickly generate personalized sports highlights, turning large content libraries into tailored, engaging clips for social and broadcast platforms.

Annotated summary of a forestry video, with bullet points describing fallen trees, dense vegetation, and dirt paths. The notes highlight how the VSS and RAG Blueprints enrich scene understanding with BLM wilderness inventory criteria and show that invasive species were detected.
Figure 5. Cloudian VSS + RAG Blueprints forestry evaluation based on Bureau of Land Management (BLM) criteria

Visit https://build.nvidia.com/blueprints to start developing your own complex, accelerated pipelines.
