What if your AI agent could instantly parse complex PDFs, extract nested tables, and “see” data within charts as easily as reading a text file? With NVIDIA Nemotron RAG, you can build a high-throughput intelligent document processing pipeline that handles massive document workloads with precision and accuracy.
This post walks you through the core components of a multimodal retrieval pipeline step-by-step. First, we show you how to use the open source NVIDIA NeMo Retriever library to decompose complex documents into structured data using GPU-accelerated microservices. Then, we demonstrate how to wire that data into Nemotron RAG models to ensure your assistant provides grounded, accurate answers with full traceability back to the source.
Let’s dive in.
Quick links to the model and code
Access the following resources for the tutorial:
🧠 Models on Hugging Face:
- nvidia/llama-nemotron-embed-vl-1b-v2 multimodal embedding
- nvidia/llama-nemotron-rerank-vl-1b-v2 cross-encoder reranker
- Extraction models from the Nemotron RAG collection
☁️ Cloud endpoints:
- Nemotron OCR document extraction
- nvidia/llama-3.3-nemotron-super-49b-v1.5 answer generation model
- More models from the NVIDIA NIM catalog
🛠️ Code and documentation:
- NeMo Retriever Library (GitHub)
- Tutorial Jupyter notebook available on GitHub
Prerequisites
To follow this tutorial, you need the following:
System requirements:
- Python 3.10 to 3.12 (tested on 3.12)
- NVIDIA GPU with at least 24 GB VRAM for local model deployment
- 250 GB of disk space (for models, datasets, and vector database)
API access:
- NVIDIA API key (obtain free access at build.nvidia.com)
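The notebook reads this key from the environment at runtime. As a minimal sketch, assuming you keep the key in a local .env file (which the python-dotenv dependency below supports):
# Minimal sketch: load the API key from a local .env file. The NVIDIA_API_KEY variable name
# is the convention used with build.nvidia.com endpoints.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY")
assert NVIDIA_API_KEY, "Set NVIDIA_API_KEY in .env or your shell before running the notebook."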
Python environment:
[project]
name = "idp-pipeline"
version = "0.1.0"
description = "IDP Nemotron RAG Pipeline Demo"
requires-python = ">=3.10,<3.13"
dependencies = [
"ninja", "packaging", "wheel", "requests", "python-dotenv", "ipywidgets", # Utils
"markitdown", "nv-ingest==26.1.1", "nv-ingest-api==26.1.1", "nv-ingest-client==26.1.1", # Ingest
"milvus-lite==2.4.12", "pymilvus", "openai>=1.51.0", # Database & API
"transformers", "accelerate", "pillow", "torch", "torchvision", "timm" # ML Core
]
Time required:
One to two hours for complete implementation (longer if compiling GPU-optimized deps like flash-attn)
What you’ll get: A production-ready multimodal RAG pipeline for document processing
The tutorial is available as a launchable Jupyter Notebook on GitHub for hands-on experimentation. The following is an overview of the build process.
- Unlocking trapped data: The process begins by using the NeMo Retriever library to extract information from complex documents.
- Context-aware orchestration: Using a microservice architecture, the pipeline decomposes documents and optimizes the data for Nemotron RAG models, creating a high-speed, contextually aware system.
- High-throughput transformation: By scaling the workload with GPU-accelerated computing and NVIDIA NIM microservices, massive datasets are transformed into searchable intelligence in parallel.
- High precision in retrieval: The refined data is fed into Nemotron RAG, enabling the AI agent to pinpoint exact tables or paragraphs to answer complex queries with high reliability.
- Source-grounded reliability: The final integration wires the retrieval output into an assistant that provides “source-grounded” answers, offering transparent citations back to the specific page or chart.
Why traditional OCR and text-only processing fails on complex documents
Before building your pipeline, it’s important to understand these core challenges that standard text extraction fails to solve:
- Structural complexity: Documents contain matrices and tables where relationships between data are critical. Standard PDF parsers merge columns and rows, destroying structure—turning “Model A: 95°C max” and “Model B: 120°C max” into unusable text. This causes errors in manufacturing, compliance, and decision-making.
- Multimodal content: Critical information lives in charts, diagrams, and scanned images that text-only parsers miss. Performance trends, diagnostic results, and process flowcharts require visual understanding.
- Citation requirements: Regulated industries demand precise citations for audit trails. Answers need traceable references like “Section 4.2, Page 47”—not just facts without provenance.
- Conditional logic: “If-then” rules often span multiple sections. Understanding “Use Protocol A below 0°C, otherwise Protocol B” requires preserving document hierarchy and cross-referencing across pages—essential for technical manuals, policies, and regulatory guidelines.
These challenges explain why Nemotron RAG uses specialized extraction models, structured embeddings, and citation-backed generation rather than simple text parsing.
Key considerations for intelligent document processing deployments
When building your document processing pipeline, these factors determine production viability:
- Chunk size tradeoffs: Smaller chunks (256-512 tokens) enable precise retrieval but may lose context. Larger chunks (1,024-2,048 tokens) preserve context but reduce precision. For enterprise documents, 512-1,024 tokens with 100-200 token overlap balances both needs (see the sketch below this list).
- Extraction depth: Decide whether to segment content by page or keep documents whole. Page-level splitting enables precise citations and verification, while document-level segmentation maintains narrative flow and broader context. Choose based on whether you need exact source locations or a comprehensive understanding.
- Table output format: Converting tables to markdown preserves row/column relationships in an LLM-native format, significantly reducing numeric hallucinations caused by plain text linearization.
- Library vs. container mode: Library mode (SimpleBroker) is suitable for development and small documents (<100 docs). Production deployments require container mode with Redis/Kafka for horizontal scaling across thousands of documents.
These configuration choices directly impact retrieval accuracy, citation precision, and system scalability.
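To make the chunk-size tradeoff concrete, the following is a minimal, framework-agnostic sketch of overlapping token windows. The split_into_chunks helper is hypothetical (the notebook relies on the NeMo Retriever library’s own splitting); it only illustrates how a 512-token window with 128-token overlap behaves.
# Hypothetical helper illustrating overlapping token chunks (not the NeMo Retriever splitter).
def split_into_chunks(tokens, chunk_size=512, overlap=128):
    """Yield token windows of length <= chunk_size, each overlapping the previous one by `overlap` tokens."""
    step = chunk_size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + chunk_size]

# Example: a 1,300-token section yields windows covering tokens 0-511, 384-895, 768-1279, and 1152-1299.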
What are the components of a multimodal RAG pipeline?
Your intelligent document processing pipeline has three major stages before generating the cited answer to your questions. Each has a clear input/output contract.
Stage 1: Extraction (Nemotron page elements, table/chart extraction, and OCR)
- Input: PDF files
- Output: JSON with structured items: text chunks, table markdown, chart images
- Runs: in library mode, self-hosted (Docker), or via a remote client
Stage 2: Embedding (llama-nemotron-embed-vl-1b-v2)
- Input: Extracted items (text, tables, chart images)
- Output: one 2,048-dim vector per item, plus the original content
- Key capability: Multimodal—encodes text-only, image-only, or image and text together
- Runs: Locally on your GPU or remotely on NIM (soon)
Stage 3: Reranking (llama-nemotron-rerank-vl-1b-v2)
- Input: Top-K candidates from embedding search
- Output: Ranked list (highest relevance first)
- Key capability: Cross-encoder; sees (query, document, optional image) together
- Runs: Locally on your GPU or remotely on NIM (soon)
- Why it matters: Filters out “looks similar but wrong” results; the VLM version also sees images to verify relevance
Once the processing pipeline is set up, answers can be generated:
Generation (Llama-3.3-Nemotron-Super-49B)
- Input: Top-ranked documents + user question
- Output: Grounded, cited answer
- Key capability: Follows strict system prompt to cite sources, admit uncertainty
- Runs: Locally or NIM on build.nvidia.com

Code for building each pipeline component
Try the starting code for each part of the document processing pipeline.
Extraction
Extraction converts a PDF from “pixels and layout” into structured, queryable units because downstream retrieval and reasoning models can’t reliably operate on raw page coordinates and flattened text without losing meaning. The NeMo Retriever library is built to preserve document structure (tables stay tables, figures stay figures) using specialized extraction capabilities (text, tables, charts/graphics) rather than treating everything as plain text. The World Bank’s “Peru 2017 Country Profile” is a strong stress test because it mixes narrative, charts, and dense appendix tables—the same failure modes that break enterprise RAG if extraction is weak.
# Imports from earlier notebook cells (exact module paths may vary slightly across nv-ingest releases).
import time

from nv_ingest.framework.orchestration.ray.util.pipeline.pipeline_runners import run_pipeline
from nv_ingest_client.client import Ingestor, NvIngestClient
from nv_ingest_api.util.message_brokers.simple_message_broker import SimpleClient

PDF_PATH = "data/worldbank_peru_2017.pdf"  # example path; point this at the downloaded Country Profile PDF

# Start nv-ingest (Library Mode) and connect a local client (SimpleClient on port 7671).
print("[INFO] Starting Ingestion Pipeline (Library Mode)...")
run_pipeline(block=False, disable_dynamic_scaling=True, run_in_subprocess=True, quiet=True)
time.sleep(15)  # give the pipeline subprocess time to warm up

client = NvIngestClient(
    message_client_allocator=SimpleClient,
    message_client_port=7671,  # default Library Mode port
    message_client_hostname="localhost"
)

# Submit an extraction job: keep tables as Markdown + crop charts (for downstream multimodal RAG).
ingestor = (
    Ingestor(client=client)
    .files([PDF_PATH])
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,    # chart crops
        extract_images=False,   # focus on charts/tables
        extract_method="pdfium",
        table_output_format="markdown"
    )
)

job_results = ingestor.ingest()
extracted_data = job_results[0]  # results for the first (and only) submitted PDF
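After the job completes, it helps to sanity-check what was extracted before embedding. The field names below follow the nv-ingest result schema (each element carries a document_type plus a metadata payload); treat them as an approximation and confirm against the version you installed.
# Rough sanity check of the extraction output; field names are based on the nv-ingest result
# schema and may differ slightly between releases.
from collections import Counter

type_counts = Counter(item.get("document_type", "unknown") for item in extracted_data)
print(f"[INFO] Extracted {len(extracted_data)} items: {dict(type_counts)}")
# Expect a mix of text chunks, structured items (tables/charts rendered as markdown), and chart crops.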
Embedding
Embedding turns each extracted item into a fixed-size vector for millisecond-scale similarity searches over large document collections. Using a multimodal embedder is key to unlocking visually rich PDFs. Because it’s designed to embed document pages as text, image, or image and text, charts and tables can be retrieved as evidence rather than ignored. In this pipeline each item is indexed into Milvus as a 2,048‑dim vector, and the resulting top‑K shortlist passes into reranking.
# torch and pymilvus imports; embed_model (llama-nemotron-embed-vl-1b-v2) is loaded in an earlier notebook cell.
import torch
from pymilvus import MilvusClient

# Vector DB contract: 2048-dim vectors + original payload/metadata stored in Milvus.
HF_EMBED_MODEL_ID = "nvidia/llama-nemotron-embed-vl-1b-v2"
COLLECTION_NAME = "worldbank_peru_2017"
MILVUS_URI = "milvus_wb_demo.db"  # Milvus Lite stores the index in this local file

milvus_client = MilvusClient(MILVUS_URI)
if milvus_client.has_collection(COLLECTION_NAME):
    milvus_client.drop_collection(COLLECTION_NAME)
milvus_client.create_collection(collection_name=COLLECTION_NAME, dimension=2048, auto_id=True)

# Multimodal encoding: text-only vs image-only vs image+text (table markdown + chart/table crop).
# modality, image_obj, and content_text are set per item in the loop over the extraction output.
with torch.inference_mode():
    if modality == "image_text":
        emb = embed_model.encode_documents(images=[image_obj], texts=[content_text])
    elif modality == "image":
        emb = embed_model.encode_documents(images=[image_obj])
    else:
        emb = embed_model.encode_documents(texts=[content_text])

# (Notebook then L2-normalizes emb[0] and inserts {vector, text, page, type, has_image, image_b64, ...} into Milvus.)
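As the closing comment notes, each embedding is then L2-normalized and stored together with its payload. A minimal sketch of that step follows; page_num, item_type, has_image, and image_b64 are stand-ins for the per-item metadata gathered in the same loop.
# Sketch of the normalization + insert step described above; the metadata variables are
# stand-ins for values collected per item in the extraction loop.
import numpy as np

vec = emb[0].float().cpu().numpy()
vec = vec / np.linalg.norm(vec)  # L2-normalize so inner-product search behaves like cosine similarity
milvus_client.insert(
    collection_name=COLLECTION_NAME,
    data=[{
        "vector": vec.tolist(),
        "text": content_text,
        "page": page_num,
        "type": item_type,
        "has_image": has_image,
        "image_b64": image_b64,
    }]
)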
Reranking
Reranking is the precision layer applied after embedding retrieval. Ranking every document with a cross-encoder is too expensive, so you only rerank the embedder’s shortlist. A multimodal cross‑encoder reranker is especially valuable for enterprise PDFs because it can judge relevance using the same evidence users trust—tables and figures (optionally alongside text)—so “looks similar” gets filtered out and “actually answers” rises. In the notebook, reranking starts from Milvus hits, then continues into a scoring loop (not shown here) that assigns logits per candidate and sorts to produce the final ranked context for answer generation.
# Stage 1: embed query -> dense retrieve from Milvus (high recall).
# query (the user question) and retrieve_k (the shortlist size) are defined earlier in the notebook.
with torch.no_grad():
    q_emb = embed_model.encode_queries([query])[0].float().cpu().numpy().tolist()

hits = milvus_client.search(
    collection_name=COLLECTION_NAME,
    data=[q_emb],
    limit=retrieve_k,
    output_fields=["text", "page", "source", "type", "has_image", "image_b64"]
)[0]

# Stage 2: VLM cross-encoder rerank (query + doc_text + optional doc_image) (high precision).
# rerank_inputs is a list of {"question", "doc_text", "doc_image"} dicts built from hits;
# rerank_model and rerank_processor (llama-nemotron-rerank-vl-1b-v2) are loaded earlier in the notebook.
for i in range(0, len(rerank_inputs), batch_size):
    batch = rerank_inputs[i:i+batch_size]
    inputs = rerank_processor.process_queries_documents_crossencoder(batch)
    inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
    with torch.no_grad():
        logits = rerank_model(**inputs).logits.squeeze(-1).float().cpu().numpy()

# (Notebook then attaches each batch's logits as scores and sorts valid_hits descending.)
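Generation
The generation step is not reproduced in the snippets above: the top-ranked items are formatted into a context block and passed, together with the user question, to llama-3.3-nemotron-super-49b-v1.5. The following is a minimal sketch using the OpenAI-compatible endpoint on build.nvidia.com; the system prompt wording and the reranked_hits structure are illustrative, not the notebook’s exact code.
# Sketch of grounded, cited answer generation via the build.nvidia.com OpenAI-compatible API.
# reranked_hits is assumed to be the sorted hit list produced by the reranking stage above.
from openai import OpenAI

llm = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key=NVIDIA_API_KEY)

context = "\n\n".join(
    f"[Source: page {h['entity']['page']}, type {h['entity']['type']}]\n{h['entity']['text']}"
    for h in reranked_hits[:5]
)

system_prompt = (
    "Answer using ONLY the provided sources. Cite the page for every claim "
    "(for example, 'page 12'). If the sources do not contain the answer, say so."
)

response = llm.chat.completions.create(
    model="nvidia/llama-3.3-nemotron-super-49b-v1.5",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Sources:\n{context}\n\nQuestion: {query}"},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)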
What are the next steps for optimizing retrieval?
With your intelligent document processing pipeline live, the path to production is wide open. The power of this setup lies in its flexibility. Try connecting new data sources to the NeMo Retriever library or refine your retrieval accuracy with specialized NIM microservices.
As your document library grows, you’ll find that this architecture serves as a scalable foundation for building multi-agent systems that understand the nuances of your enterprise knowledge. By pairing frontier models with NVIDIA Nemotron via an LLM router, you can sustain this high performance while optimizing for cost and efficiency. You can also read how Justt leveraged Nemotron to reduce extraction error rates by 25%, increasing the reliability of financial chargeback analysis for its customers.
Join the community of developers building with the NVIDIA Blueprint for Enterprise RAG—trusted by a dozen industry-leading AI Data Platform providers, available on build.nvidia.com, GitHub, and the NGC catalog.
Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.