Trillions of PDF files are generated every year, each file likely consisting of multiple pages filled with various content types, including text, images, charts, and tables. This goldmine of data can only be used as quickly as humans can read and understand it.
But with generative AI and retrieval-augmented generation (RAG), this untapped data can be used to uncover business insights that help employees work more efficiently and lower costs.
Imagine being able to accurately extract the knowledge contained in massive volumes of enterprise data—effectively talking to the data—to quickly make your digital human an expert on any topic. In turn, this enables your employees to make smarter decisions faster.
In this post, we show how the NVIDIA AI Blueprint for multimodal PDF data extraction combines NVIDIA NeMo Retriever and NVIDIA NIM microservices with reference code and documentation to do exactly that.
Tackling the challenge of complex data extraction
PDFs are content-rich documents that store refined information expressed across modalities to make it more concise and digestible. For example, a PDF might include a mixture of text, tables, charts, plots, and diagrams used to convey complex information. From the lens of information retrieval, each of these modalities presents unique challenges.
To build pipelines for solving these challenges, you can use the following NVIDIA NIM microservices:
- PDF Ingestion NIM microservices
- nv-yolox-structured-image: A fine-tuned object detection model to detect charts, plots, and tables in PDFs.
- DePlot: A popular community pix2struct model that generates textual descriptions of charts.
- CACHED: An object detection model used to identify various elements in graphs.
- PaddleOCR: An optical character recognition (OCR) model to transcribe text from tables and charts.
- NVIDIA NeMo Retriever NIM microservices
- nv-embedqa-e5-v5: A popular community base-embedding model optimized for text question-answering retrieval.
- nv-rerankqa-mistral-4b-v3: A popular community base model fine-tuned for text reranking for high-accuracy question answering.
For more information, see An Easy Introduction to Multimodal Retrieval-Augmented Generation.
Multimodal retrieval blueprint for RAG on PDFs
Building a multimodal PDF data extraction pipeline involves two key steps:
- Ingest documents with multimodal data.
- Retrieve relevant context based on a user query.
Ingest documents with multimodal data
This is the first half of the workflow, which effectively extracts information and makes it available for retrieval. This involves the following steps:
First, parse the PDFs to separate out the modalities (text, images, charts, tables, plots, and other diagrams). Text is extracted as structured JSON, while each page of the document is rendered as an image for downstream visual processing.
Next, extract textual metadata from charts and tables, using NIM microservices to accurately extract information from images:
- nv-yolox-structured-image: Identify the charts and tables in the PDF
- DePlot, CACHED, and PaddleOCR: Extract information from charts. DePlot transcribes the chart content, while CACHED and PaddleOCR extract important additional metadata about the graph.
- PaddleOCR: Extract text information from tables, maintaining the reading order of the table.
Finally, filter and chunk the extracted information, then create a vector store. The extracted information is filtered to remove duplicates and broken down into appropriately sized chunks. The NeMo Retriever embedding NIM microservice then converts the chunks into embeddings and stores them in a vector store.
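The filtering and chunking step above can be sketched in a few lines of Python. This is a minimal illustration, not the blueprint's implementation: the hash-based deduplication and the character-level sliding window with overlap are common, simple choices standing in for whatever filtering and chunking strategy the pipeline actually uses.

```python
import hashlib

def dedupe_and_chunk(texts, chunk_size=512, overlap=64):
    """Drop exact-duplicate snippets, then split the rest into
    overlapping character chunks ready for embedding."""
    seen = set()
    chunks = []
    for text in texts:
        # Hash the normalized text to skip exact duplicates,
        # e.g., repeated page headers or footers.
        digest = hashlib.sha256(text.strip().encode()).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        # Sliding window with overlap so context at chunk
        # boundaries is not lost.
        step = chunk_size - overlap
        for start in range(0, max(len(text) - overlap, 1), step):
            chunks.append(text[start:start + chunk_size])
    return chunks
```

Each returned chunk would then be sent to the embedding NIM microservice and its vector written to the vector store.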
Retrieve relevant context based on a user query
When a user submits a query, the relevant information is retrieved from the vast repository of the ingested documents. This is achieved as follows:
- The NeMo Retriever embedding NIM microservice embeds the user query, which is then used in a vector similarity search to retrieve the most relevant chunks from the vector store.
- The NeMo Retriever reranking NIM microservice acts as a layer of refinement, carefully evaluating and re-ranking the results to ensure the most accurate and useful chunks are used for responding to the query.
- With the most pertinent information at hand, the LLM NIM microservice generates a response that is informed, accurate, and contextually relevant.
By using the comprehensive knowledge base built from the ingested documents, this workflow enables users to access precise and relevant information, providing valuable insights and answers to their queries.
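The core of the retrieval step is a vector similarity search over the stored chunk embeddings. The sketch below shows that search in plain Python under two simplifying assumptions: the embedding NIM call is omitted (the query vector is taken as given), and brute-force cosine similarity stands in for the vector store's index. In the full workflow, the reranking NIM microservice would then rescore the returned candidates.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, top_k=3):
    """store: list of (chunk_text, embedding) pairs.
    Returns the top_k chunk texts by cosine similarity."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]
```

A production vector store replaces the brute-force scan with an approximate nearest-neighbor index, but the ranking principle is the same.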
Building a cost-effective enterprise-grade RAG pipeline
Using NIM microservices to create a multimodal PDF data extraction pipeline offers two key benefits: cost and stability.
Cost has two considerations:
- Time to market: NVIDIA NIM microservices are designed to be easy-to-use, scalable model inference solutions, enabling enterprise application developers to focus on their application logic rather than spending cycles building and scaling out infrastructure. NIM microservices are containerized solutions that come with industry-standard APIs and Helm charts for scaling.
- Cost of deployment: NIM uses the full suite of NVIDIA AI Enterprise software to accelerate model inference, maximizing the value enterprises can derive from their models and in turn reducing the cost of deploying the pipelines at scale. Figure 2 demonstrates the improvements in accuracy and throughput achieved in testing this ingestion and extraction pipeline.
Multimodal PDF retrieval accuracy, evaluated on a publicly available dataset of PDFs consisting of text, charts, and tables, with NIM-On (nv-yolox-structured-image-v1, DePlot, CACHED, PaddleOCR, nv-embedqa-e5-v5, nv-rerankqa-mistral-4b-v3) compared with NIM-Off (open-source alternatives) on 2x A100 GPUs.
Multimodal PDF ingestion throughput in pages per second, evaluated on the same dataset, with NIM-On compared with NIM-Off open-source alternatives running on multithreaded CPUs.
NIM microservices are part of the NVIDIA AI Enterprise license, which offers API stability, security patches, quality assurance, and support for a smooth transition from prototype to production for enterprises that run their businesses on AI (Figure 3).
Uncovering intelligence in enterprise data
To enable enterprises to make the most of their troves of data, NVIDIA is partnering with data and storage platform partners including Box, Cloudera, Cohesity, DataStax, Dropbox, and Nexla.
Cloudera
“With the integration of NVIDIA NIM microservices in the Cloudera AI Inference service (available now as Tech Preview), companies can match the exabytes of private data managed in Cloudera with the high-performance models powering RAG use cases,” said Priyank Patel, vice president of enterprise AI products at Cloudera.
“Our collaboration with NVIDIA enables best-in-class AI platform capabilities for enterprises wherever they choose to run their AI, on-premises and on cloud.”
Cohesity
“To unlock the full potential of their proprietary data for AI applications, enterprises must efficiently process and analyze vast amounts of information stored in their backups and archives,” said Greg Statton, CTO of Data & AI at Cohesity.
“The NeMo Retriever multimodal PDF workflow has the potential to add generative AI intelligence to our customers’ data backups and archives, enabling them to extract valuable insights from millions of documents quickly and accurately. Bringing together this workflow with Cohesity Gaia can allow our customers to focus on innovation and strategic decision-making rather than grappling with complex data integration challenges.”
DataStax
“Unlocking value from proprietary enterprise data for AI applications requires ingesting and extracting knowledge from millions of structured and unstructured documents,” said Ed Anuff, chief product officer at DataStax.
“We’re teaming with NVIDIA to leverage the speed and scale of accelerated computing and the NeMo Retriever data extraction workflow for PDFs, with DataStax AstraDB and DataStax Hyper-Converged Database to enable customers to focus on innovation rather than complex data integration challenges.”
Dropbox
“Expanding beyond text retrieval to tables and images can enable customers to unlock insights across their cloud content,” said Manik Singh, general manager at Dropbox.
“We are evaluating the NeMo Retriever multimodal PDF extraction workflow as an option to explore bringing new generative AI capabilities to help our customers uncover these valuable insights.”
Nexla
“Scaling generative AI demos to production-grade solutions is a big challenge for enterprises. Our collaboration can address this with the integration of NVIDIA NIM in Nexla’s no-code/low-code platform for Document ETL, with the potential to scale multimodal ingestion across millions of documents in enterprise systems including SharePoint, SFTP, S3, network drives, Dropbox, and more,” said Saket Saurabh, CEO and co-founder of Nexla.
“Nexla will support NIM in both cloud and private data center environments, across the full capability set including embedding generation, model execution, reasoning, and retrieval solutions to help customers accelerate their AI roadmap,” said Saurabh.
Get started
Experience the NVIDIA AI Blueprint for multimodal PDF data extraction with our interactive demo in the NVIDIA API catalog. Apply for early access to preview this workflow blueprint using open-source code, customization instructions, and a Helm chart for deployment.
Join developers from around the globe in building a RAG application, elevating your skills, and competing for exciting prizes by registering for the NVIDIA and LlamaIndex developer contest.