Enterprises generate and store vast amounts of unstructured data in documents like research reports, business contracts, financial statements, and technical manuals. Extracting meaningful insights from this data remains a challenge for traditional optical character recognition (OCR) technologies that struggle with complex layouts, structural variability, and maintaining continuity across pages.
Accurately classifying page elements like headers, footers, and body content is essential for preserving structure across multi-page documents. Tables, charts, and mathematical formulas, as well as nested content, also require a structural understanding beyond basic text recognition. Plus, wide-ranging document densities—from large reports to formatted letters—further complicate OCR processing. These challenges highlight the need for layout-aware, intelligent models that understand documents and reliably preserve meaning, structure, and reading order at scale.
A transformer-based VLM for high-precision document understanding
NVIDIA NeMo Retriever Parse overcomes the shortcomings of OCR technologies. Designed to tackle the hardest aspects of document intelligence, NeMo Retriever Parse is an optimized model built on cutting-edge vision language model (VLM) technology.
It delivers advanced text and table extraction, and document semantic understanding with spatial grounding, transforming structured and unstructured documents into actionable data. It’s part of the NeMo Retriever family of microservices for building multimodal ingestion and retrieval pipelines with high accuracy and maximum data privacy.
At its core, NeMo Retriever Parse is a transformer-based vision-encoder-decoder model designed for high-precision document understanding. Its VLM architecture enables seamless extraction of structured text while preserving document layout, semantic classes, and reading order.
Key capabilities include:
- Accurate text and formulae extraction in reading order.
- Spatial localization and classification of document elements, such as titles, section headers, text, list items, page headers, page footers, captions, tables, figures, formulas, bibliography, table of contents, and footnotes.
- Support for plain text and markdown output formats.
- Seamless integration with enterprise retrieval pipelines for improved searchability and organization.
By bridging raw documents with intelligent AI-driven processing, NeMo Retriever Parse can improve how businesses and researchers interact with their data.
Transforming document AI to enhance downstream retrieval pipelines
The digital world thrives on structured knowledge. Whether it’s scientific research, legal contracts, or enterprise reports, document intelligence is crucial for information accessibility and decision-making. NeMo Retriever Parse transforms document AI by:
- Improving retrieval accuracy: Enhances retrieval pipelines by accurately classifying and segmenting document components. NeMo Retriever Parse uses bounding boxes to retain document layout and classify content types (e.g., headers, paragraphs, captions), ensuring structured, context-aware text extraction.
- Extracting structured content: Boosts large language model (LLM) and VLM accuracy with high-quality, structured text extraction. NeMo Retriever Parse enriches training datasets and inference pipelines by accurately extracting and formatting semantically rich content, including text, tables, and structural elements.
- Processing documents with multimodal intelligence: Handles PDFs, PowerPoint presentations, and other document formats, unlocking new efficiencies for AI-driven extraction of text and tables and for understanding document features.
Technical overview
The model is built upon a vision transformer (ViT-H) vision encoder with an mBART-based decoder, optimized for efficiency and accuracy. Here’s what makes it unique:
Model architecture
NeMo Retriever Parse is a 900M parameter model built using a 600M parameter ViT-H model for encoding visual elements, with a 250M parameter mBART-based decoder, optimized for both efficiency and accuracy. Key architectural features include:
- NVIDIA C-RADIO framework for high-performance vision-language modeling.
- Adaptive compression layers that reduce the latent space from 13,184 tokens to 3,200 tokens.
- 10-block mBART transformer decoder for structured text reconstruction.
- Galactica-based tokenizer for high-quality document tokenization.
Unlike other approaches that rely on lightweight encoders and heavy decoders, NeMo Retriever Parse uses a heavy vision encoder and a light decoder. This enables the model to deeply understand complex document layouts and semantics for fast, efficient extraction in an autoregressive manner.
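To make the encoder-heavy, decoder-light layout and the latent-space compression concrete, here is a minimal, hypothetical PyTorch sketch. The layer counts, hidden sizes, class names, and the use of learned-query cross-attention for compression are illustrative assumptions scaled down for quick execution; they are not the actual NeMo Retriever Parse implementation (which uses a ViT-H encoder, C-RADIO, and a 10-block mBART decoder).

```python
# Hypothetical sketch of a heavy vision encoder, a learned compression step
# (13,184 -> 3,200 latent tokens in the real model), and a light decoder.
# All dimensions here are scaled down and purely illustrative.
import torch
import torch.nn as nn


class LatentCompressor(nn.Module):
    """Cross-attention from a fixed set of learned queries to the encoder
    output, shrinking the sequence length to `num_latents` tokens."""

    def __init__(self, dim: int, num_latents: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, encoder_tokens: torch.Tensor) -> torch.Tensor:
        batch = encoder_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.attn(q, encoder_tokens, encoder_tokens)
        return compressed  # (batch, num_latents, dim)


class ParseLikeModel(nn.Module):
    """Heavy vision encoder, compression layer, light autoregressive decoder."""

    def __init__(self, dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)  # "heavy" side
        self.compress = LatentCompressor(dim, num_latents=320)          # 3,200 in the real model
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)   # "light" side
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, patch_tokens: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        memory = self.compress(self.encoder(patch_tokens))
        # Causal masking is omitted here for brevity.
        hidden = self.decoder(self.embed(text_ids), memory)
        return self.lm_head(hidden)  # next-token logits


model = ParseLikeModel()
patches = torch.randn(1, 1318, 256)       # stand-in for ViT patch embeddings
text = torch.randint(0, 1000, (1, 16))    # previously generated tokens
print(model(patches, text).shape)          # torch.Size([1, 16, 1000])
```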

Tokenization
NeMo Retriever Parse distinguishes itself from traditional document processing pipelines by adopting an end-to-end methodology that integrates text extraction, layout analysis, and semantic classification within a single VLM architecture.
A key technical innovation is its unified tokenization scheme. The underlying tokenizer, specialized for the text domain, is augmented with dedicated special tokens, enabling NeMo Retriever Parse to represent not only the extracted text but also the corresponding bounding box coordinates and semantic classes.
These spatial (<x_{coordinate}>, <y_{coordinate}>) tokens, representing discrete coordinates predicted within a normalized grid relative to the input image dimensions, and semantic (<class_{category}>) tokens are directly interleaved within the output sequence, ordered according to the document’s canonical reading flow. This enables NeMo Retriever Parse to generate a rich, structured output stream containing textual, spatial, and semantic information simultaneously, departing from multi-stage or separate output approaches.
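A rough illustration of how such an interleaved sequence can be built is sketched below. The grid resolution, the exact token spellings, and the helper functions (`quantize`, `element_to_tokens`) are illustrative assumptions, not the model's actual vocabulary; they only mirror the described idea of quantized coordinate tokens and class tokens woven into the text stream in reading order.

```python
# Illustrative sketch of an interleaved output sequence: quantized bounding-box
# tokens, the element text, and a semantic class token, in reading order.
# Token names and the 1,024-bin grid are assumptions for illustration only.

GRID_BINS = 1024  # hypothetical number of discrete coordinate bins


def quantize(value: float, image_extent: int) -> int:
    """Map an absolute pixel coordinate to a discrete grid bin."""
    bin_id = int(value / image_extent * (GRID_BINS - 1))
    return max(0, min(GRID_BINS - 1, bin_id))


def element_to_tokens(text: str, bbox: tuple, cls: str,
                      width: int, height: int) -> list[str]:
    """Serialize one document element as <x><y><x><y> text <class>."""
    x1, y1, x2, y2 = bbox
    coord_tokens = [
        f"<x_{quantize(x1, width)}>", f"<y_{quantize(y1, height)}>",
        f"<x_{quantize(x2, width)}>", f"<y_{quantize(y2, height)}>",
    ]
    return coord_tokens + [text, f"<class_{cls}>"]


# Two elements in reading order: a section header followed by a paragraph.
page_w, page_h = 1700, 2200
sequence = (
    element_to_tokens("1. Introduction", (150, 200, 820, 260), "Section-header", page_w, page_h)
    + element_to_tokens("Transformers have become ...", (150, 300, 1550, 900), "Text", page_w, page_h)
)
print(" ".join(sequence))
```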
Training
The training of NeMo Retriever Parse employs a two-step regimen designed to foster its versatile capabilities. It first undergoes large-scale pre-training on arXiv-5M, a high-information dataset providing rich annotations (formatted text, bounding boxes, semantic classes).
This is followed by fine-tuning on a diverse corpus, including arXiv-5M, human-annotated samples, and publicly available datasets often having only partial annotations. Strategic blending during fine-tuning is critical: the prompt-controlled target output format (e.g., text-only, text+bbox, text+bbox+class) is dynamically adjusted based on dataset annotation availability. This teaches the model to handle varying information density requirements, enhancing robustness across diverse documents and output specifications.
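One way such prompt-controlled blending could be implemented is sketched below. The prompt tokens, sample fields, and target layouts are placeholders assumed for illustration; the point is that each sample's target is built only from the annotations it actually has, and the prompt tells the decoder which output format to produce.

```python
# Hypothetical sketch of prompt-controlled target formats during fine-tuning.
# Prompt strings, field names, and target layouts are illustrative assumptions.
import random

SAMPLES = [
    {"text": "Revenue grew 12%.", "bbox": (80, 40, 620, 90), "cls": "Text"},
    {"text": "Quarterly Report"},                                 # text-only annotation
    {"text": "Figure 3: Latency", "bbox": (60, 700, 900, 740)},   # no class label
]


def build_example(sample: dict) -> tuple[str, str]:
    """Pick the richest format the sample supports and emit (prompt, target)."""
    if "bbox" in sample and "cls" in sample:
        fmt = "<extract_text_bbox_class>"
        x1, y1, x2, y2 = sample["bbox"]
        target = f"<x_{x1}><y_{y1}><x_{x2}><y_{y2}> {sample['text']} <class_{sample['cls']}>"
    elif "bbox" in sample:
        fmt = "<extract_text_bbox>"
        x1, y1, x2, y2 = sample["bbox"]
        target = f"<x_{x1}><y_{y1}><x_{x2}><y_{y2}> {sample['text']}"
    else:
        fmt = "<extract_text>"
        target = sample["text"]
    return fmt, target


for sample in random.sample(SAMPLES, k=len(SAMPLES)):
    prompt, target = build_example(sample)
    print(prompt, "->", target)
```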
Lastly, multi-token training (MTT) is integrated. By training the decoder to predict ‘n’ subsequent tokens per step, this approach compels the model’s internal representation to develop a more robust predictive state that effectively tracks the dependencies required for structured sequence generation.
This includes implicitly tracking the expected next token, which is crucial for maintaining the precise interleaving and canonical reading order of text, spatial, and semantic tokens within the output stream. This enhanced internal tracking significantly improves the model’s ability to follow the document structure and maintain coherence compared to conventional single-token prediction.
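As a minimal sketch of a multi-token prediction objective, the snippet below adds extra prediction heads so that each decoder position is also trained to predict tokens two and three steps ahead. The head-per-offset formulation and the uniform loss weighting are assumptions made for illustration; the blog describes the general idea of predicting n subsequent tokens per step, not this exact recipe.

```python
# Minimal sketch of a multi-token prediction loss: alongside the usual
# next-token head, extra heads predict tokens 2..n steps ahead from the same
# hidden state. Head count and loss weighting are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, n_future = 1000, 256, 3
heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_future))

hidden = torch.randn(2, 32, dim)            # decoder hidden states (batch, seq, dim)
targets = torch.randint(0, vocab, (2, 32))  # ground-truth token ids

loss = 0.0
for k, head in enumerate(heads, start=1):
    # Predict the token k steps ahead; drop positions without a valid target.
    logits = head(hidden[:, :-k, :])                      # (batch, seq-k, vocab)
    future = targets[:, k:]                               # (batch, seq-k)
    loss = loss + F.cross_entropy(logits.reshape(-1, vocab), future.reshape(-1))

loss = loss / n_future
print(float(loss))
```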
Input and output properties
NeMo Retriever Parse processes RGB images as input. The output consists of structured text with bounding boxes and class attributes, enabling comprehensive document understanding.
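For downstream use, the interleaved output stream can be decoded back into per-element records. The regular expression and token spellings below follow the illustrative format from the earlier sketches and are assumptions, not a documented output specification.

```python
# Illustrative parser for an interleaved output stream of the form
# <x_..><y_..><x_..><y_..> text <class_..>. The token spellings are assumed.
import re

raw = (
    "<x_90><y_93><x_493><y_121> 1. Introduction <class_Section-header> "
    "<x_90><y_140><x_930><y_420> Transformers have become ... <class_Text>"
)

pattern = re.compile(
    r"<x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)>\s*(.*?)\s*<class_([\w-]+)>"
)

records = [
    {
        "bbox": tuple(int(g) for g in m.group(1, 2, 3, 4)),
        "text": m.group(5),
        "class": m.group(6),
    }
    for m in pattern.finditer(raw)
]

for rec in records:
    print(rec["class"], rec["bbox"], "-", rec["text"])
```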
Training and accuracy evaluation
NeMo Retriever Parse has been rigorously trained using human-labeled, synthetic, and auto-labeled datasets, ensuring robust accuracy across diverse document types. Extensive benchmarking on both public and internal datasets demonstrates its effectiveness in real-world applications.
Try NeMo Retriever Parse in the NVIDIA API catalog.
Text extraction benchmark
For the text extraction task, NeMo Retriever Parse was evaluated on two key benchmarks that assess the quality and accuracy across various document types and layouts: the General OCR Theory (GOT) Dense OCR Benchmark and the NVIDIA Internal Document OCR Benchmark.
The evaluation metrics employed include:
- F1 score, which balances precision and recall.
- 100 − normalized edit distance (NED), which evaluates the accuracy of the textual reading order.
- METEOR, which accounts for alignment, stemming, and synonyms.
- BLEU, which measures n-gram overlap.
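To make two of these metrics concrete, the toy sketch below computes a word-level F1 score and a character-level normalized edit distance for a sample prediction. METEOR and BLEU are typically computed with standard NLP toolkits and are omitted here for brevity; the tokenization choices are illustrative, not the benchmark's exact protocol.

```python
# Toy illustration of two text-extraction metrics: word-level F1 and
# normalized edit distance (NED) between predicted and reference text.
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def ned(pred: str, ref: str) -> float:
    """Edit distance normalized by the longer string (0 = identical)."""
    return levenshtein(pred, ref) / max(len(pred), len(ref), 1)


def word_f1(pred: str, ref: str) -> float:
    """F1 over word multisets, balancing precision and recall."""
    p, r = Counter(pred.split()), Counter(ref.split())
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


reference = "Net revenue increased 12% year over year."
prediction = "Net revenue increased 12% year over year ."
print(f"NED: {ned(prediction, reference):.4f}  F1: {word_f1(prediction, reference):.4f}")
```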
NeMo Retriever Parse demonstrates exceptional performance in text extraction across the GOT Dense OCR Benchmark and the NVIDIA Internal Document OCR Benchmark. On the GOT benchmark, which involves densely packed, complexly formatted text in high-resolution documents, NeMo Retriever Parse achieves near-perfect scores across all fidelity metrics, showcasing its ability to handle intricate typeset content.


Table extraction benchmark
For the table extraction task, NeMo Retriever Parse was evaluated on two established benchmarks: PubTabNet and RD-TableBench.
PubTabNet
PubTabNet is a large dataset for image-based table recognition, containing over 568,000 images of tables extracted from scientific publications. Each table image is annotated with its corresponding HTML representation. The benchmark evaluates models on their ability to recognize and reconstruct table structures, using metrics like TEDS and S-TEDS. Here, TEDS measures table recognition accuracy by converting predicted LaTeX tables to HTML and computing the normalized tree edit distance between predicted and ground-truth table trees; S-TEDS applies the same measure to the table structure only, ignoring cell content.
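For reference, the TEDS score between a predicted table tree \(T_p\) and a ground-truth tree \(T_g\), as defined in the PubTabNet work, is a normalized tree edit distance (scores are often reported scaled to 0-100); S-TEDS uses the same formula after discarding cell content:

```latex
\mathrm{TEDS}(T_p, T_g) = 1 - \frac{\mathrm{EditDist}(T_p, T_g)}{\max\bigl(|T_p|,\, |T_g|\bigr)}
```

Here \(|T|\) denotes the number of nodes in tree \(T\).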

NeMo Retriever Parse achieved a TEDS score of 80.20 and an S-TEDS score of 92.20, substantially surpassing a popular table extraction model on this benchmark. These figures indicate NeMo Retriever Parse’s enhanced capability in both accurately recognizing table content and precisely reconstructing its underlying structure.
RD-TableBench
RD-TableBench is an open benchmark designed to evaluate extraction accuracy for complex tables in documents. It features 1,000 manually annotated images covering challenging cases such as scanned tables, handwritten content, multiple languages, and merged cells, with accuracy measured using hierarchical alignment and Levenshtein distance.
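As a rough illustration of Levenshtein-style cell scoring (the hierarchical alignment step is assumed already done and omitted), the sketch below compares aligned cells with a normalized similarity ratio. This is an assumption about the general approach, not the benchmark's reference implementation; `difflib`'s ratio stands in for a normalized Levenshtein similarity.

```python
# Rough sketch of cell-level similarity scoring for aligned table cells.
from difflib import SequenceMatcher

ground_truth = [["Region", "Q1", "Q2"], ["EMEA", "1.2", "1.4"]]
prediction   = [["Region", "Q1", "Q2"], ["EMEA", "1.2", "l.4"]]  # OCR confusion: l vs 1

scores = []
for gt_row, pred_row in zip(ground_truth, prediction):
    for gt_cell, pred_cell in zip(gt_row, pred_row):
        scores.append(SequenceMatcher(None, gt_cell, pred_cell).ratio())

print(f"mean cell similarity: {sum(scores) / len(scores):.3f}")
```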

NeMo Retriever Parse shows a significant advantage in table extraction accuracy on RD-TableBench when compared to a popular document extractor. This superior accuracy underscores NeMo Retriever Parse’s enhanced capability in correctly extracting both content and structure, especially from the challenging and diverse table formats included in RD-TableBench.
Key takeaways
NVIDIA NeMo Retriever Parse is a VLM-based OCR solution that equips enterprises with cutting-edge technology to handle complex challenges in document understanding and extract insights.
- Near-lossless text extraction: NeMo Retriever Parse demonstrates near-lossless text extraction with a minimal edit distance and high semantic fidelity, as evidenced by the metrics.
- Accuracy: The overall accuracy of NeMo Retriever Parse is highly competitive, given its comprehensive balance between text and table extraction fidelity.
- Superior table extraction: In table extraction, particularly on large-scale benchmarks such as PubTabNet, it outperforms the closest competitor by a notable margin, reinforcing its position as the optimal solution for complex document analysis tasks.
- Structured document segmentation: By predicting semantic classes (e.g., headers, footers, list items), the model preserves reading order and hierarchy across multi-page, multi-column documents, enabling coherent, structured outputs for retrievers and LLMs.
By closely examining these benchmarks, researchers, developers, and other technical practitioners can conclude that NeMo Retriever Parse offers balanced, high accuracy for both text and table extraction, making it a strong choice for mission-critical document processing workflows.
Looking ahead
NeMo Retriever Parse is more than just a text extraction model—it’s a step toward the future of document AI. By seamlessly bridging the gap between raw documents and intelligent AI systems, it empowers organizations to extract, structure, and utilize information with greater efficiency. Currently focused on English, it’s being expanded to support Chinese and handwritten documents for broader applicability. Extending context length will enable deeper and more advanced document understanding.
Try NVIDIA NeMo Retriever Parse VLM to advance your document intelligence.
Download VLM NIM from the NGC Catalog.
Contributors: Ilia Karmanov, Amala Sanjay Deshmukh, Lukas Voegtle, Philipp Fischer, Kateryna Chumachenko, Timo Roman, Jarno Seppänen, Andrew Tao, Karan Sapra