Enterprises generate and store vast amounts of unstructured data in documents like research reports, business contracts, financial statements, and technical manuals. Extracting meaningful insights from this data remains a challenge for traditional optical character recognition (OCR) technologies that struggle with complex layouts, structural variability, and maintaining continuity across pages.
Accurately classifying page elements like headers, footers, and body content is essential for preserving structure across multi-page documents. Tables, charts, and mathematical formulas, as well as nested content, also require a structural understanding beyond basic text recognition. Plus, wide-ranging document densities—from large reports to formatted letters—further complicate OCR processing. These challenges highlight the need for layout-aware, intelligent models that understand documents and reliably preserve meaning, structure, and reading order at scale.
A transformer-based VLM for high-precision document understanding
NVIDIA NeMo Retriever Parse overcomes the shortcomings of OCR technologies. Designed to tackle the hardest aspects of document intelligence, NeMo Retriever Parse is an optimized model built on cutting-edge vision language model (VLM) technology.
It delivers advanced text and table extraction, and document semantic understanding with spatial grounding, transforming structured and unstructured documents into actionable data. It’s part of the NeMo Retriever family of microservices for building multimodal ingestion and retrieval pipelines with high accuracy and maximum data privacy.
At its core, NeMo Retriever Parse is a transformer-based vision-encoder-decoder model designed for high-precision document understanding. Its VLM architecture enables seamless extraction of structured text while preserving document layout, semantic classes, and reading order.
Key capabilities include:
- Accurate text and formulae extraction in reading order.
- Spatial localization and classification of document elements, such as titles, section headers, text, list items, page headers, page footers, captions, tables, figures, formulas, bibliography, table of contents, and footnotes.
- Support for plain text and markdown output formats.
- Seamless integration with enterprise retrieval pipelines for improved searchability and organization.
By bridging raw documents with intelligent AI-driven processing, NeMo Retriever Parse can improve how businesses and researchers interact with their data.
Transforming document AI to enhance downstream retrieval pipelines
The digital world thrives on structured knowledge. Whether it’s scientific research, legal contracts, or enterprise reports, document intelligence is crucial for information accessibility and decision-making. NeMo Retriever Parse transforms document AI by:
- Improving retrieval accuracy: Enhances retrieval pipelines by accurately classifying and segmenting document components. NeMo Retriever Parse uses bounding boxes to retain document layout and classify content types (e.g., headers, paragraphs, captions), ensuring structured, context-aware text extraction.
- Extracting structured content: Boosts large language model (LLM) and VLM accuracy with high-quality, structured text extraction. NeMo Retriever Parse enriches training datasets and inference pipelines by accurately extracting and formatting semantically rich content, including text, tables, and structural elements.
- Processing documents with multimodal intelligence: Handles PDFs, PowerPoint presentations, and other document formats, unlocking new efficiencies for AI-driven extraction of text and tables and for understanding document features.
Technical overview
The model is built upon a vision transformer (ViT-H) vision encoder with an mBART-based decoder, optimized for efficiency and accuracy. Here’s what makes it unique:
Model architecture
NeMo Retriever Parse is a 900M parameter model built using a 600M parameter ViT-H model for encoding visual elements, with a 250M parameter mBART-based decoder, optimized for both efficiency and accuracy. Key architectural features include:
- NVIDIA C-RADIO framework for high-performance vision-language modeling.
- Adaptive compression layers that reduce the latent space from 13,184 tokens to 3,200 tokens.
- 10-block mBART transformer decoder for structured text reconstruction.
- Galactica-based tokenizer for high-quality document tokenization.
Unlike other approaches that rely on lightweight encoders and heavy decoders, NeMo Retriever Parse uses a heavy vision encoder and a light decoder. This enables the model to deeply understand complex document layouts and semantics for fast, efficient extraction in an autoregressive manner.
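To make the encoder-heavy, decoder-light layout and the latent-space compression concrete, here is a minimal, hypothetical PyTorch sketch. The layer counts, hidden sizes, class names, and the use of learned-query cross-attention for compression are illustrative assumptions scaled down for quick execution; they are not the actual NeMo Retriever Parse implementation (which uses a ViT-H encoder, C-RADIO, and a 10-block mBART decoder).

```python
# Hypothetical sketch of a heavy vision encoder, a learned compression step
# (13,184 -> 3,200 latent tokens in the real model), and a light decoder.
# All dimensions here are scaled down and purely illustrative.
import torch
import torch.nn as nn


class LatentCompressor(nn.Module):
    """Cross-attention from a fixed set of learned queries to the encoder
    output, shrinking the sequence length to `num_latents` tokens."""

    def __init__(self, dim: int, num_latents: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, encoder_tokens: torch.Tensor) -> torch.Tensor:
        batch = encoder_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.attn(q, encoder_tokens, encoder_tokens)
        return compressed  # (batch, num_latents, dim)


class ParseLikeModel(nn.Module):
    """Heavy vision encoder, compression layer, light autoregressive decoder."""

    def __init__(self, dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)  # "heavy" side
        self.compress = LatentCompressor(dim, num_latents=320)          # 3,200 in the real model
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)   # "light" side
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, patch_tokens: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        memory = self.compress(self.encoder(patch_tokens))
        # Causal masking is omitted here for brevity.
        hidden = self.decoder(self.embed(text_ids), memory)
        return self.lm_head(hidden)  # next-token logits


model = ParseLikeModel()
patches = torch.randn(1, 1318, 256)       # stand-in for ViT patch embeddings
text = torch.randint(0, 1000, (1, 16))    # previously generated tokens
print(model(patches, text).shape)          # torch.Size([1, 16, 1000])
```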

Tokenization
NeMo Retriever Parse distinguishes itself from traditional document processing pipelines by adopting an end-to-end methodology that integrates text extraction, layout analysis, and semantic classification within a single VLM architecture.
A key technical innovation is its unified tokenization scheme. The underlying tokenizer, specialized for the text domain, is augmented with dedicated special tokens, enabling NeMo Retriever Parse to represent not only the extracted text but also the corresponding bounding box coordinates and semantic classes.
These spatial (<x_{coordinate}>, <y_{coordinate}>) tokens, representing discrete coordinates predicted within a normalized grid relative to the input image dimensions, and semantic (<class_{category}>) tokens are directly interleaved within the output sequence, ordered according to the document’s canonical reading flow. This enables NeMo Retriever Parse to generate a rich, structured output stream containing textual, spatial, and semantic information simultaneously, departing from multi-stage or separate output approaches.
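A rough illustration of how such an interleaved sequence can be built is sketched below. The grid resolution, the exact token spellings, and the helper functions (`quantize`, `element_to_tokens`) are illustrative assumptions, not the model's actual vocabulary; they only mirror the described idea of quantized coordinate tokens and class tokens woven into the text stream in reading order.

```python
# Illustrative sketch of an interleaved output sequence: quantized bounding-box
# tokens, the element text, and a semantic class token, in reading order.
# Token names and the 1,024-bin grid are assumptions for illustration only.

GRID_BINS = 1024  # hypothetical number of discrete coordinate bins


def quantize(value: float, image_extent: int) -> int:
    """Map an absolute pixel coordinate to a discrete grid bin."""
    bin_id = int(value / image_extent * (GRID_BINS - 1))
    return max(0, min(GRID_BINS - 1, bin_id))


def element_to_tokens(text: str, bbox: tuple, cls: str,
                      width: int, height: int) -> list[str]:
    """Serialize one document element as <x><y><x><y> text <class>."""
    x1, y1, x2, y2 = bbox
    coord_tokens = [
        f"<x_{quantize(x1, width)}>", f"<y_{quantize(y1, height)}>",
        f"<x_{quantize(x2, width)}>", f"<y_{quantize(y2, height)}>",
    ]
    return coord_tokens + [text, f"<class_{cls}>"]


# Two elements in reading order: a section header followed by a paragraph.
page_w, page_h = 1700, 2200
sequence = (
    element_to_tokens("1. Introduction", (150, 200, 820, 260), "Section-header", page_w, page_h)
    + element_to_tokens("Transformers have become ...", (150, 300, 1550, 900), "Text", page_w, page_h)
)
print(" ".join(sequence))
```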
Training
The training of NeMo Retriever Parse employs a two-step regimen designed to foster its versatile capabilities. It first undergoes large-scale pre-training on arXiv-5M, a high-information dataset providing rich annotations (formatted text, bounding boxes, semantic classes).
This is followed by fine-tuning on a diverse corpus, including arXiv-5M, human-annotated samples, and publicly available datasets often having only partial annotations. Strategic blending during fine-tuning is critical: the prompt-controlled target output format (e.g., text-only, text+bbox, text+bbox+class) is dynamically adjusted based on dataset annotation availability. This teaches the model to handle varying information density requirements, enhancing robustness across diverse documents and output specifications.
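One way such prompt-controlled blending could be implemented is sketched below. The prompt tokens, sample fields, and target layouts are placeholders assumed for illustration; the point is that each sample's target is built only from the annotations it actually has, and the prompt tells the decoder which output format to produce.

```python
# Hypothetical sketch of prompt-controlled target formats during fine-tuning.
# Prompt strings, field names, and target layouts are illustrative assumptions.
import random

SAMPLES = [
    {"text": "Revenue grew 12%.", "bbox": (80, 40, 620, 90), "cls": "Text"},
    {"text": "Quarterly Report"},                                 # text-only annotation
    {"text": "Figure 3: Latency", "bbox": (60, 700, 900, 740)},   # no class label
]


def build_example(sample: dict) -> tuple[str, str]:
    """Pick the richest format the sample supports and emit (prompt, target)."""
    if "bbox" in sample and "cls" in sample:
        fmt = "<extract_text_bbox_class>"
        x1, y1, x2, y2 = sample["bbox"]
        target = f"<x_{x1}><y_{y1}><x_{x2}><y_{y2}> {sample['text']} <class_{sample['cls']}>"
    elif "bbox" in sample:
        fmt = "<extract_text_bbox>"
        x1, y1, x2, y2 = sample["bbox"]
        target = f"<x_{x1}><y_{y1}><x_{x2}><y_{y2}> {sample['text']}"
    else:
        fmt = "<extract_text>"
        target = sample["text"]
    return fmt, target


for sample in random.sample(SAMPLES, k=len(SAMPLES)):
    prompt, target = build_example(sample)
    print(prompt, "->", target)
```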
Lastly, multi-token training (MTT) is integrated. By training the decoder to predict ‘n’ subsequent tokens per step, this approach compels the model’s internal representation to develop a more robust predictive state that effectively tracks the dependencies required for structured sequence generation.
This includes implicitly tracking the expected next token, which is crucial for maintaining the precise interleaving and canonical reading order of text, spatial, and semantic tokens within the output stream. This enhanced internal tracking significantly improves the model’s ability to follow the document structure and maintain coherence compared to conventional single-token prediction.
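As a minimal sketch of a multi-token prediction objective, the snippet below adds extra prediction heads so that each decoder position is also trained to predict tokens two and three steps ahead. The head-per-offset formulation and the uniform loss weighting are assumptions made for illustration; the blog describes the general idea of predicting n subsequent tokens per step, not this exact recipe.

```python
# Minimal sketch of a multi-token prediction loss: alongside the usual
# next-token head, extra heads predict tokens 2..n steps ahead from the same
# hidden state. Head count and loss weighting are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab, dim, n_future = 1000, 256, 3
heads = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_future))

hidden = torch.randn(2, 32, dim)            # decoder hidden states (batch, seq, dim)
targets = torch.randint(0, vocab, (2, 32))  # ground-truth token ids

loss = 0.0
for k, head in enumerate(heads, start=1):
    # Predict the token k steps ahead; drop positions without a valid target.
    logits = head(hidden[:, :-k, :])                      # (batch, seq-k, vocab)
    future = targets[:, k:]                               # (batch, seq-k)
    loss = loss + F.cross_entropy(logits.reshape(-1, vocab), future.reshape(-1))

loss = loss / n_future
print(float(loss))
```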
Input and output properties
NeMo Retriever Parse processes RGB images as input. The output consists of structured text with bounding boxes and class attributes, enabling comprehensive document understanding.
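For downstream use, the interleaved output stream can be decoded back into per-element records. The regular expression and token spellings below follow the illustrative format from the earlier sketches and are assumptions, not a documented output specification.

```python
# Illustrative parser for an interleaved output stream of the form
# <x_..><y_..><x_..><y_..> text <class_..>. The token spellings are assumed.
import re

raw = (
    "<x_90><y_93><x_493><y_121> 1. Introduction <class_Section-header> "
    "<x_90><y_140><x_930><y_420> Transformers have become ... <class_Text>"
)

pattern = re.compile(
    r"<x_(\d+)><y_(\d+)><x_(\d+)><y_(\d+)>\s*(.*?)\s*<class_([\w-]+)>"
)

records = [
    {
        "bbox": tuple(int(g) for g in m.group(1, 2, 3, 4)),
        "text": m.group(5),
        "class": m.group(6),
    }
    for m in pattern.finditer(raw)
]

for rec in records:
    print(rec["class"], rec["bbox"], "-", rec["text"])
```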
Training and accuracy evaluation
NeMo Retriever Parse has been rigorously trained using human-labeled, synthetic, and auto-labeled datasets, ensuring robust accuracy across diverse document types. Extensive benchmarking on both public and internal datasets demonstrates its effectiveness in real-world applications.
Try NeMo Retriever Parse in the NVIDIA API catalog.
Text extraction benchmark
For the text extraction task, NeMo Retriever Parse was evaluated on two key benchmarks that assess the quality and accuracy across various document types and layouts: the General OCR Theory (GOT) Dense OCR Benchmark and the NVIDIA Internal Document OCR Benchmark.
The evaluation metrics employed include:
- F1 score, which balances precision and recall.
- 100 − normalized edit distance (NED), which evaluates the accuracy of the textual reading order.
- METEOR, which accounts for alignment, stemming, and synonyms.
- BLEU, which measures n-gram overlap.
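To make two of these metrics concrete, the toy sketch below computes a word-level F1 score and a character-level normalized edit distance for a sample prediction. METEOR and BLEU are typically computed with standard NLP toolkits and are omitted here for brevity; the tokenization choices are illustrative, not the benchmark's exact protocol.

```python
# Toy illustration of two text-extraction metrics: word-level F1 and
# normalized edit distance (NED) between predicted and reference text.
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def ned(pred: str, ref: str) -> float:
    """Edit distance normalized by the longer string (0 = identical)."""
    return levenshtein(pred, ref) / max(len(pred), len(ref), 1)


def word_f1(pred: str, ref: str) -> float:
    """F1 over word multisets, balancing precision and recall."""
    p, r = Counter(pred.split()), Counter(ref.split())
    overlap = sum((p & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(p.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


reference = "Net revenue increased 12% year over year."
prediction = "Net revenue increased 12% year over year ."
print(f"NED: {ned(prediction, reference):.4f}  F1: {word_f1(prediction, reference):.4f}")
```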
NeMo Retriever Parse demonstrates exceptional performance in text extraction across the GOT Dense OCR Benchmark and the NVIDIA Internal Document OCR Benchmark. On the GOT benchmark, which involves densely packed, complexly formatted text in high-resolution documents, NeMo Retriever Parse achieves near-perfect scores across all fidelity metrics, showcasing its ability to handle intricate typeset content.


Table extraction benchmark
For the table extraction task, NeMo Retriever Parse was evaluated on two established benchmarks: PubTabNet and RD-TableBench.
PubTabNet
PubTabNet is a large dataset for image-based table recognition, containing over 568,000 images of tables extracted from scientific publications. Each table image is annotated with its corresponding HTML representation. The benchmark evaluates models on their ability to recognize and reconstruct table structures, using metrics like TEDS and S-TEDS. Here, TEDS measures table recognition accuracy by converting predicted LaTeX tables to HTML and computing the normalized tree edit distance between predicted and ground-truth table trees; S-TEDS applies the same measure to the table structure only, ignoring cell content.
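For reference, the TEDS score between a predicted table tree \(T_p\) and a ground-truth tree \(T_g\), as defined in the PubTabNet work, is a normalized tree edit distance (scores are often reported scaled to 0-100); S-TEDS uses the same formula after discarding cell content:

```latex
\mathrm{TEDS}(T_p, T_g) = 1 - \frac{\mathrm{EditDist}(T_p, T_g)}{\max\bigl(|T_p|,\, |T_g|\bigr)}
```

Here \(|T|\) denotes the number of nodes in tree \(T\).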

NeMo Retriever Parse achieved a TEDS score of 80.20 and an S-TEDS score of 92.20, substantially surpassing a popular table extraction model on this benchmark. These figures indicate NeMo Retriever Parse’s enhanced capability in both accurately recognizing table content and precisely reconstructing its underlying structure.
RD-TableBench
RD-TableBench is an open benchmark designed to evaluate extraction accuracy for complex tables in documents. It features 1,000 manually annotated images covering challenging cases such as scanned tables, handwritten content, multiple languages, and merged cells, with accuracy measured using hierarchical alignment and Levenshtein distance.
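As a rough illustration of Levenshtein-style cell scoring (the hierarchical alignment step is assumed already done and omitted), the sketch below compares aligned cells with a normalized similarity ratio. This is an assumption about the general approach, not the benchmark's reference implementation; `difflib`'s ratio stands in for a normalized Levenshtein similarity.

```python
# Rough sketch of cell-level similarity scoring for aligned table cells.
from difflib import SequenceMatcher

ground_truth = [["Region", "Q1", "Q2"], ["EMEA", "1.2", "1.4"]]
prediction   = [["Region", "Q1", "Q2"], ["EMEA", "1.2", "l.4"]]  # OCR confusion: l vs 1

scores = []
for gt_row, pred_row in zip(ground_truth, prediction):
    for gt_cell, pred_cell in zip(gt_row, pred_row):
        scores.append(SequenceMatcher(None, gt_cell, pred_cell).ratio())

print(f"mean cell similarity: {sum(scores) / len(scores):.3f}")
```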

NeMo Retriever Parse shows a significant advantage in table extraction accuracy on RD-TableBench when compared to a popular document extractor. This superior accuracy underscores NeMo Retriever Parse’s enhanced capability in correctly extracting both content and structure, especially from the challenging and diverse table formats included in RD-TableBench.
Key takeaways
NVIDIA NeMo Retriever Parse is a VLM-based OCR solution that equips enterprises with cutting-edge technology to handle complex challenges in document understanding and extract insights.
- Near-lossless text extraction: NeMo Retriever Parse demonstrates near-lossless text extraction with a minimal edit distance and high semantic fidelity, as evidenced by the metrics.
- Accuracy: The overall accuracy of NeMo Retriever Parse is highly competitive, given its comprehensive balance between text and table extraction fidelity.
- Superior table extraction: In table extraction, particularly on large-scale benchmarks such as PubTabNet, it outperforms the closest competitor by a notable margin, reinforcing its position as the optimal solution for complex document analysis tasks.
- Structured document segmentation: By predicting semantic classes (e.g., headers, footers, list items), the model preserves reading order and hierarchy across multi-page, multi-column documents, enabling coherent, structured outputs for retrievers and LLMs.
By closely examining these benchmarks, researchers, developers, and other technical practitioners can conclude that NeMo Retriever Parse offers balanced, high accuracy for both text and table extraction, making it a strong choice for mission-critical document processing workflows.
Looking ahead
NeMo Retriever Parse is more than just a text extraction model—it’s a step toward the future of document AI. By seamlessly bridging the gap between raw documents and intelligent AI systems, it empowers organizations to extract, structure, and utilize information with greater efficiency. Currently focused on English, it’s being expanded to support Chinese and handwritten documents for broader applicability. Extending context length will enable deeper and more advanced document understanding.
Try NVIDIA NeMo Retriever Parse VLM to advance your document intelligence.
Download VLM NIM from the NGC Catalog.
Contributors: Ilia Karmanov, Amala Sanjay Deshmukh, Lukas Voegtle, Philipp Fischer, Kateryna Chumachenko, Timo Roman, Jarno Seppänen, Andrew Tao, Karan Sapra