Trillions of PDF files are generated every year, each file likely consisting of multiple pages filled with various content types, including text, images, charts, and tables. This goldmine of data can only be used as quickly as humans can read and understand it.
But with generative AI and retrieval-augmented generation (RAG), this untapped data can be used to uncover business insights that help employees work more efficiently and lower costs.
Imagine being able to accurately extract the knowledge contained in massive volumes of enterprise data—effectively talking to the data—to quickly make your digital human an expert on any topic. In turn, this enables your employees to make smarter decisions faster.
In this post, we show how the NVIDIA AI Blueprint for multimodal PDF data extraction combines NVIDIA NeMo Retriever and NVIDIA NIM microservices with reference code and documentation to do exactly that.
Tackling the challenge of complex data extraction
PDFs are content-rich documents that store refined information expressed across modalities to make it more concise and digestible. For example, a PDF might include a mixture of text, tables, charts, plots, and diagrams used to convey complex information. From the lens of information retrieval, each of these modalities presents unique challenges.
To build pipelines for solving these challenges, you can use the following NVIDIA NIM microservices:
- PDF Ingestion NIM microservices
- nv-yolox-structured-image: A fine-tuned object detection model to detect charts, plots, and tables in PDFs.
- DePlot: A popular community pix2struct model that generates textual descriptions of charts.
- CACHED: An object detection model used to identify various elements in graphs.
- PaddleOCR: An optical character recognition (OCR) model to transcribe text from tables and charts.
- NVIDIA NeMo Retriever NIM microservices
- nv-embedqa-e5-v5: A popular community base-embedding model optimized for text question-answering retrieval.
- nv-rerankqa-mistral-4b-v3: A popular community base model fine-tuned for text reranking for high-accuracy question answering.
For more information, see An Easy Introduction to Multimodal Retrieval-Augmented Generation.
Multimodal retrieval blueprint for RAG on PDFs
Building a multimodal PDF data extraction pipeline involves two key steps:
- Ingest documents with multimodal data.
- Retrieve relevant context based on a user query.
Ingest documents with multimodal data
This is the first half of the workflow, which effectively extracts information and makes it available for retrieval. This involves the following steps:
First, parse the PDFs to separate out the modalities (text, images, charts, tables, plots, and other diagrams). Text is extracted as structured JSON, while each page of the document is rendered as an image for downstream visual processing.
Next, extract textual metadata from charts and tables, using NIM microservices to accurately extract information from images:
- nv-yolox-structured-image: Identify the charts and tables in the PDF
- DePlot, CACHED, and PaddleOCR: Extract information from charts. DePlot transcribes the chart content, while CACHED and PaddleOCR extract important additional metadata about the graph.
- PaddleOCR: Extract text information from tables, maintaining the reading order of the table.
Finally, filter and chunk the extracted information, then create a vector store. The extracted information is filtered to remove duplicates and broken down into appropriately sized chunks. The NeMo Retriever embedding NIM microservice then converts the chunks into embeddings and stores them in a vector store.
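The filtering and chunking step above can be sketched in a few lines of Python. This is a minimal illustration, not the blueprint's implementation: the hash-based deduplication and the character-level sliding window with overlap are common, simple choices standing in for whatever filtering and chunking strategy the pipeline actually uses.

```python
import hashlib

def dedupe_and_chunk(texts, chunk_size=512, overlap=64):
    """Drop exact-duplicate snippets, then split the rest into
    overlapping character chunks ready for embedding."""
    seen = set()
    chunks = []
    for text in texts:
        # Hash the normalized text to skip exact duplicates,
        # e.g., repeated page headers or footers.
        digest = hashlib.sha256(text.strip().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Sliding window with overlap so context at chunk
        # boundaries is not lost.
        step = chunk_size - overlap
        for start in range(0, max(len(text) - overlap, 1), step):
            chunks.append(text[start:start + chunk_size])
    return chunks
```

Each returned chunk would then be sent to the embedding NIM microservice and its vector written to the vector store.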
Retrieve relevant context based on a user query
When a user submits a query, the relevant information is retrieved from the vast repository of the ingested documents. This is achieved as follows:
- The NeMo Retriever embedding NIM microservice embeds the user query, which is then used in a vector similarity search to retrieve the most relevant chunks from the vector store.
- The NeMo Retriever reranking NIM microservice acts as a layer of refinement, carefully evaluating and re-ranking the results to ensure the most accurate and useful chunks are used for responding to the query.
- With the most pertinent information at hand, the LLM NIM microservice generates a response that is informed, accurate, and contextually relevant.
By using the comprehensive knowledge base built from the ingested documents, this workflow enables users to access precise and relevant information, providing valuable insights and answers to their queries.
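The core of the retrieval step is a vector similarity search over the stored chunk embeddings. The sketch below shows that search in plain Python under two simplifying assumptions: the embedding NIM call is omitted (the query vector is taken as given), and brute-force cosine similarity stands in for the vector store's index. In the full workflow, the reranking NIM microservice would then rescore the returned candidates.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, top_k=3):
    """store: list of (chunk_text, embedding) pairs.
    Returns the top_k chunk texts by cosine similarity."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]
```

A production vector store replaces the brute-force scan with an approximate nearest-neighbor index, but the ranking principle is the same.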
Building a cost-effective enterprise-grade RAG pipeline
Using NIM microservices to create a multimodal PDF data extraction pipeline offers two key benefits: cost and stability.
Cost has two considerations:
- Time to market: NVIDIA NIM microservices are designed to be easy-to-use, scalable model inference solutions, enabling enterprise application developers to focus on their application logic rather than spending cycles building and scaling out infrastructure. NIM microservices are containerized solutions that come with industry-standard APIs and Helm charts for scaling.
- Cost of deployment: NIM uses the full suite of NVIDIA AI Enterprise software to accelerate model inference, maximizing the value enterprises can derive from their models and in turn reducing the cost of deploying the pipelines at scale. Figure 2 demonstrates the improvements in accuracy and throughput achieved in testing this ingestion and extraction pipeline.
Multimodal PDF retrieval accuracy, evaluated on a publicly available dataset of PDFs consisting of text, charts, and tables, with NIM-On (nv-yolox-structured-image-v1, DePlot, CACHED, PaddleOCR, nv-embedqa-e5-v5, nv-rerankqa-mistral-4b-v3) compared with NIM-Off (open-source alternatives) on 2x A100 GPUs.
Multimodal PDF ingestion throughput in pages per second, evaluated on the same dataset, with NIM-On compared with NIM-Off open-source alternatives running on multithreaded CPUs.
NIM microservices are part of the NVIDIA AI Enterprise license, which offers API stability, security patches, quality assurance, and support for a smooth transition from prototype to production for enterprises that run their businesses on AI (Figure 3).
Uncovering intelligence in enterprise data
To enable enterprises to make the most of their troves of data, NVIDIA is partnering with data and storage platform partners including Box, Cloudera, Cohesity, DataStax, Dropbox, and Nexla.
Cloudera
“With the integration of NVIDIA NIM microservices in the Cloudera AI Inference service (available now as Tech Preview), companies can match the exabytes of private data managed in Cloudera with the high-performance models powering RAG use cases,” said Priyank Patel, vice president of enterprise AI products at Cloudera.
“Our collaboration with NVIDIA enables best-in-class AI platform capabilities for enterprises wherever they choose to run their AI, on-premises and on cloud.”
Cohesity
“To unlock the full potential of their proprietary data for AI applications, enterprises must efficiently process and analyze vast amounts of information stored in their backups and archives,” said Greg Statton, CTO of Data & AI at Cohesity.
“The NeMo Retriever multimodal PDF workflow has the potential to add generative AI intelligence to our customers’ data backups and archives, enabling them to extract valuable insights from millions of documents quickly and accurately. Bringing together this workflow with Cohesity Gaia can allow our customers to focus on innovation and strategic decision-making rather than grappling with complex data integration challenges.”
DataStax
“Unlocking value from proprietary enterprise data for AI applications requires ingesting and extracting knowledge from millions of structured and unstructured documents,” said Ed Anuff, chief product officer at DataStax.
“We’re teaming with NVIDIA to leverage the speed and scale of accelerated computing and the NeMo Retriever data extraction workflow for PDFs, with DataStax AstraDB and DataStax Hyper-Converged Database to enable customers to focus on innovation rather than complex data integration challenges.”
Dropbox
“Expanding beyond text retrieval to tables and images can enable customers to unlock insights across their cloud content,” said Manik Singh, general manager at Dropbox.
“We are evaluating the NeMo Retriever multimodal PDF extraction workflow as an option to explore bringing new generative AI capabilities to help our customers uncover these valuable insights.”
Nexla
“Scaling generative AI demos to production-grade solutions is a big challenge for enterprises. Our collaboration can address this with the integration of NVIDIA NIM in Nexla’s no-code/low-code platform for Document ETL, with the potential to scale multimodal ingestion across millions of documents in enterprise systems including SharePoint, SFTP, S3, network drives, Dropbox, and more,” said Saket Saurabh, CEO and co-founder of Nexla.
“Nexla will support NIM in both cloud and private data center environments, across the full capability set including embedding generation, model execution, reasoning, and retrieval solutions to help customers accelerate their AI roadmap,” said Saurabh.
Get started
Experience the NVIDIA AI Blueprint for multimodal PDF data extraction with our interactive demo in the NVIDIA API catalog. Apply for early access to preview this workflow blueprint using open-source code, customization instructions, and a Helm chart for deployment.
Join developers from around the globe in building a RAG application, elevating your skills, and competing for exciting prizes by registering for the NVIDIA and LlamaIndex developer contest.