Enterprise data is inherently complex: real-world documents are multimodal, spanning text, tables, charts and graphs, images, diagrams, scanned pages, forms, and embedded metadata. Financial reports carry critical insights in tables, engineering manuals rely on diagrams, and legal documents often include annotated or scanned content.
Retrieval-augmented generation (RAG) was created to ground LLMs in trusted enterprise knowledge—retrieving relevant source data at query time to reduce hallucinations and improve accuracy. But if a RAG system processes only surrounding text, it misses key signals embedded in tables, charts, and diagrams—resulting in incomplete or incorrect answers.
An intelligent agent is only as good as the data foundation it’s built on. Modern RAG must therefore be inherently multimodal—able to understand both visual and textual context to achieve enterprise-grade accuracy. The NVIDIA Enterprise RAG Blueprint is built for this, providing a modular reference architecture that connects unstructured enterprise data to the intelligent systems built on top of it.
The blueprint also serves as a foundational layer for the NVIDIA AI Data Platform, helping to bridge the traditional gap between compute and data. By enabling retrieval and reasoning closer to the data layer, it preserves governance, reduces operational friction, and makes enterprise knowledge immediately usable by intelligent systems. The result is a modern AI data stack—storage that can retrieve, enrich, and reason alongside your models.
While the Enterprise RAG Blueprint provides many configurable options, this post highlights the following five key configurations that most directly improve accuracy and contextual relevance across enterprise use cases:
- Baseline multimodal RAG pipeline
- Reasoning
- Query decomposition
- Filtering metadata for faster and precise retrieval
- Visual reasoning for multimodal data
The post also explains how the blueprint can be embedded into AI data platforms to transform traditional repositories into AI-ready knowledge systems.
Accuracy metrics in this post are measured with the RAGAS framework on well-known public datasets. Learn more about evaluating your NVIDIA RAG Blueprint system.
1. Document ingestion and understanding
Before an agent can deliver insights, it must be perfectly grounded in your data. This foundational configuration focuses on intelligent document ingestion and core RAG functionality.
The Enterprise RAG Blueprint uses NVIDIA Nemotron RAG models to extract multimodal enterprise content—text, tables, charts and graphs, and infographics—converts that content to text, and embeds it for indexing in a vector database. At query time, the blueprint runs semantic retrieval and reranking, then uses a Nemotron LLM to generate a grounded answer.
To maximize performance, this baseline intentionally avoids image captioning and heavy reasoning, making it the ideal starting point for production deployments. Deploy this baseline on Docker.
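The retrieve-rerank-generate flow above can be sketched in a few lines. This is a toy illustration only: the bag-of-words embedding and cosine scorer stand in for the Nemotron RAG embedding and reranking models the blueprint actually uses, and the corpus is invented.

```python
# Minimal sketch of the retrieve -> rerank -> generate flow.
# Toy bag-of-words "embeddings" stand in for Nemotron RAG models.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # toy "embedding": term-frequency vector over word tokens
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Extracted content: table text, a diagram caption, plain prose.
corpus = [
    "Q3 revenue grew 12 percent driven by data center sales.",
    "The cooling diagram shows airflow across the chassis.",
    "Employee handbook: vacation policy and holidays.",
]
index = [(doc, embed(doc)) for doc in corpus]

def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

top_hit = retrieve("How much did Q3 revenue grow?")[0]
# The revenue sentence ranks first; an LLM then generates the grounded answer.
```

In the real pipeline, the top-ranked chunks (including text extracted from tables and charts) are passed to the LLM as grounding context.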
Benefits of document ingestion and understanding
This foundational configuration is the blueprint’s highest-efficiency pipeline, optimized for accuracy and throughput while keeping GPU cost and time to first token (TTFT) low. This configuration establishes your baseline performance for retrieval quality and LLM grounding.

Table 1 summarizes the overall impact across a few datasets.
*Table 1. Accuracy with the v2.3 default configuration (MM = multimodal, TO = text-only)*

| Dataset | Type | Accuracy |
|---|---|---|
| RAG Battle | MM | 0.809 |
| KG RAG | MM | 0.565 |
| FinanceBench | MM | 0.633 |
| BO767 | MM | 0.910 |
| HotpotQA | TO | 0.671 |
| Google Frames | MM | 0.509 |
2. Reasoning
When you turn on reasoning in the RAG blueprint, you enable the LLM to interpret the retrieved evidence and synthesize logically grounded answers. For many applications, this is the easiest change to get an accuracy boost. Enable reasoning for the NVIDIA Enterprise RAG Blueprint.
Table 2 summarizes the overall impact across several sample datasets.
*Table 2. Accuracy with the v2.3 default configuration plus reasoning (MM = multimodal, TO = text-only)*

| Dataset | Type | Reasoning on | Default |
|---|---|---|---|
| RAG Battle | MM | 0.850 | 0.809 |
| KG RAG | MM | 0.580 | 0.565 |
| FinanceBench | MM | 0.690 | 0.633 |
| BO767 | MM | 0.880 | 0.910 |
Benefits of reasoning
For any use case involving mathematical operations or complex data comparisons, a simple similarity or hybrid search will not suffice. Reasoning is required to correct errors and ensure precise contextual understanding. Accuracy improvements across datasets averaged ~5%, with several cases demonstrating dramatic reasoning-driven corrections.
Examples
In the FinanceBench dataset, the baseline configuration incorrectly computed the Adobe FY2017 operating cash flow ratio as 2.91. After enabling reasoning, the model produced the correct answer, 0.83. The RAG Battle dataset similarly demonstrates the accuracy improvement from enabling a VLM.
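The FinanceBench failure illustrates why: the ratio never appears verbatim in the filings, so it must be derived from retrieved figures. A reasoning-enabled model performs that derivation step explicitly, as in this sketch (the dollar figures below are illustrative placeholders, not Adobe's actual FY2017 numbers):

```python
# Retrieval surfaces the raw figures; the ratio itself is never stated,
# so it must be computed. Values are illustrative, not real filings data.
retrieved = {
    "operating_cash_flow": 2913.0,   # $M, pulled from a cash-flow table
    "current_liabilities": 3527.0,   # $M, pulled from the balance sheet
}

# The reasoning step a similarity-only pipeline skips:
ratio = retrieved["operating_cash_flow"] / retrieved["current_liabilities"]
print(round(ratio, 2))  # prints 0.83
```

Without this explicit derivation, the model is left to pattern-match a number out of the context, which is how errors like the 2.91 answer arise.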
3. Query decomposition
Answering complex user questions often requires pulling facts from multiple places in the data foundation. Query decomposition breaks a single question into smaller subqueries, retrieves evidence for each, and recombines the results into a complete, grounded response. Turn on query decomposition for the NVIDIA Enterprise RAG Blueprint.
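The decompose-retrieve-recombine loop can be sketched over a toy two-document corpus. The subquery "planner" is hard-coded here for clarity; in the blueprint an LLM generates the subqueries from the original question, and the documents and names below are invented.

```python
# Hedged sketch of query decomposition: a multi-hop question is split
# into subqueries, each retrieved independently, and the evidence merged.
import re

corpus = {
    "ceo_doc": "Jane Smith has been CEO of Acme since 2019.",
    "edu_doc": "Jane Smith earned her PhD from MIT.",
}

def tokens(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str) -> list[str]:
    # toy keyword retrieval: any document sharing a content word with the query
    stop = {"who", "is", "the", "of", "where", "did", "a"}
    q = tokens(query) - stop
    return [doc for doc in corpus.values() if q & tokens(doc)]

# Multi-hop question: "Where did the CEO of Acme study?"
# Neither document alone answers it, so it is decomposed:
subqueries = ["Who is the CEO of Acme?", "Where did Jane Smith study?"]
evidence = []
for sq in subqueries:
    for doc in retrieve(sq):
        if doc not in evidence:
            evidence.append(doc)
# evidence now holds both facts, ready to be recombined into one answer.
```

A single retrieval pass for the original question would likely surface only one of the two facts; the per-subquery passes gather both before the final generation step.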

Benefits of query decomposition
Query decomposition significantly improves accuracy for multihop and context-rich questions that span multiple paragraphs or documents. It does add extra LLM calls (increasing latency and cost), but the accuracy gains are often worth it for mission-critical enterprise use cases. Query decomposition can also be paired with reasoning for an additional boost when needed.
Example
As NVIDIA AI Data Platform partners evolve to offer more relevant and accurate retrieval, query processing such as decomposition can either be built into the data platform itself or left to the agent. Learn more about how query decomposition can be an approach in some use cases.
Table 3 shows the overall impact across a few datasets.
*Table 3. Accuracy with the v2.3 default configuration plus query decomposition (MM = multimodal, TO = text-only)*

| Dataset | Type | Query decomposition | Default |
|---|---|---|---|
| RAG Battle | MM | 0.854 | 0.809 |
| FinanceBench | MM | 0.631 | 0.633 |
| BO767 | MM | 0.885 | 0.910 |
| HotpotQA | TO | 0.725 | 0.671 |
| Google Frames | MM | 0.600 | 0.509 |
4. Filtering metadata for faster and precise retrieval
Metadata, such as author, date, category, and security tags, has always been integral to enterprise data. In RAG pipelines, metadata filters narrow the search space and align retrieved content with the right context, significantly improving retrieval precision and speed.
The RAG blueprint supports custom metadata ingestion and automatic query generation based on that data. To leverage your custom metadata, see Advanced Metadata Filtering with Natural Language Generation. To learn more about what’s possible with this feature set, check out the example notebook on the NVIDIA-AI-Blueprints/rag GitHub repo.
Benefits of metadata filtering
Metadata filtering narrows the search space for faster retrieval and improves precision by aligning retrieved content with the right context. Developers can leverage metadata without writing manual filter logic, achieving higher throughput and contextual relevance. When metadata filtering capabilities are embedded directly into AI data platforms, they make your storage smarter, enabling faster retrieval and lower latency.
Example
To provide an example, consider two documents that are ingested with the following metadata:
```python
custom_metadata = [
    {
        "filename": "ai_guide.pdf",
        "metadata": {
            "category": "AI",
            "priority": 8,
            "rating": 4.5,
            "tags": ["machine-learning", "neural-networks"],
            "created_date": "2024-01-15T10:30:00",
        },
    },
    {
        "filename": "engineering_manual.pdf",
        "metadata": {
            "category": "engineering",
            "priority": 5,
            "rating": 3.8,
            "tags": ["hardware", "design"],
            "created_date": "2023-12-20T14:00:00",
        },
    },
]
```
With dynamic filter expressions enabled, a query such as "Show me high-rated AI documents with machine learning tags created after January 2024" is automatically translated into a filtering expression such as:
```python
filter_expression = (
    'content_metadata["category"] == "AI" '
    'and content_metadata["rating"] >= 4.0 '
    'and array_contains(content_metadata["tags"], "machine-learning") '
    'and content_metadata["created_date"] >= "2024-01-01"'
)
```
With metadata filtering enabled, the system retrieved 10 focused citations from one document, ai_guide.pdf, achieving 100% precision on the target domain while reducing search space by 50%.
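To see what that generated expression selects, here is a toy evaluator applying the same four conditions in plain Python to the two example documents. This is an illustration of the filter semantics, not the blueprint's actual filtering engine, which evaluates the expression inside the vector database.

```python
# Toy re-implementation of the generated filter over the example metadata.
docs = [
    {"filename": "ai_guide.pdf",
     "metadata": {"category": "AI", "priority": 8, "rating": 4.5,
                  "tags": ["machine-learning", "neural-networks"],
                  "created_date": "2024-01-15T10:30:00"}},
    {"filename": "engineering_manual.pdf",
     "metadata": {"category": "engineering", "priority": 5, "rating": 3.8,
                  "tags": ["hardware", "design"],
                  "created_date": "2023-12-20T14:00:00"}},
]

def matches(m: dict) -> bool:
    return (m["category"] == "AI"
            and m["rating"] >= 4.0
            and "machine-learning" in m["tags"]       # array_contains
            and m["created_date"] >= "2024-01-01")    # ISO dates sort lexically

selected = [d["filename"] for d in docs if matches(d["metadata"])]
# selected == ["ai_guide.pdf"]
```

Only ai_guide.pdf satisfies all four conditions, which is why the retrieval above collapses to citations from a single document.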
5. Visual reasoning for multimodal data
Enterprise data is visually rich. Where traditional text-only embeddings fall short, vision language models (VLMs) such as NVIDIA Nemotron Nano 2 VL (12B) introduce visual reasoning into the pipeline. Learn more about how to leverage a VLM for generation in the RAG Blueprint.

Benefits of visual reasoning
Visual reasoning is crucial for handling real-world enterprise documents. Integrating a VLM in the generation pathway enables the RAG system to interpret images, charts, and infographics, making it possible to accurately answer queries where the information lies in a structured visual element rather than just the surrounding text.
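In practice, routing a retrieved page image to the VLM means sending a multimodal chat request. NVIDIA NIM microservices expose an OpenAI-compatible chat API; the sketch below only constructs such a payload (the model ID, question, and image bytes are placeholders, and no request is sent).

```python
# Build an OpenAI-style multimodal chat payload for a VLM generation step.
# Model ID and image content are placeholders, not verified values.
import base64

fake_png = base64.b64encode(b"\x89PNG...chart bytes...").decode()

payload = {
    "model": "nvidia/nemotron-nano-12b-v2-vl",  # assumed model identifier
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What was the Q3 trend in this revenue chart?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{fake_png}"}},
        ],
    }],
}
# An OpenAI-compatible client pointed at the VLM endpoint would send this
# payload alongside the retrieved text context.
```

The key difference from the text-only path is the `image_url` content part: the retrieved visual element travels to the model intact rather than as a lossy text rendering.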
Example
A significant accuracy improvement was observed when a VLM was enabled for the RAG Battle dataset in the RAG Blueprint, especially when the answer was in a visual element. Note that enabling VLM inference can increase response latency due to the additional image processing. Consider this tradeoff between accuracy and speed based on your requirements. Learn more about the accuracy improvements with a VLM for the RAG Battle dataset.
Transforming enterprise storage into an active knowledge system
The Enterprise RAG Blueprint demonstrates how the progressive adoption of these five capabilities—from reasoning and metadata-driven retrieval to multimodal understanding—directly enhances the accuracy and groundedness of your intelligent agents. Each capability offers a unique balance between latency, token cost, and contextual precision, providing a flexible, tunable framework that can be adapted to various enterprise use cases.
This accelerates the evolution of the data foundation itself. The NVIDIA AI Data Platform transforms enterprise data into AI-searchable knowledge. As NVIDIA partners evolve their storage offerings, this blueprint serves as a reference for delivering embedded RAG capabilities that leverage metadata to enforce permissions, track changes, and provide highly accurate retrieval directly at the storage layer.
NVIDIA storage partners are building AI data platforms based on the NVIDIA reference design, transforming enterprise storage from a passive repository into an active, intelligent system in the AI workflow. The result is next-generation enterprise data infrastructure: faster, smarter, and purpose-built for the age of generative AI.
What’s new with the NVIDIA Enterprise RAG Blueprint
The latest release of the NVIDIA Enterprise RAG Blueprint deepens its focus on serving agentic workflows. It introduces first-class document-level summarization with both shallow and deep strategies, enabling agents to quickly assess relevance, narrow search space, and balance accuracy with latency. A new data catalog improves discoverability and governance across large corpora, while upgrades to the best-in-class Nemotron RAG models further enhance retrieval quality, reasoning, and generation performance—making RAG a more efficient, agent-ready foundation for enterprise-scale knowledge systems.
Get started with enterprise-grade RAG
Ready to integrate these five capabilities into your RAG use cases? Access the modular code, documentation, and evaluation notebooks for free within the NVIDIA Enterprise RAG Blueprint.
Make your enterprise data AI-ready and transform your production data into an intelligent knowledge system with embedded RAG capabilities using the NVIDIA AI Data Platform. Contact an NVIDIA AI storage partner to get started with your own NVIDIA-powered AI data platform.