Generative AI

New NVIDIA Llama Nemotron Nano Vision Language Model Tops OCR Benchmark for Accuracy

An illustration for NVIDIA Llama Nemotron Nano VL.

Documents such as PDFs, graphs, charts, and dashboards are rich sources of data that, when extracted and organized, provide informative decision-making insights. From automating financial statement processing to improving business intelligence workflows, intelligent document processing is becoming a core component of AI solutions in enterprises. 

Organizations can accelerate the AI development process with NVIDIA Llama Nemotron Nano VL. This multimodal vision language model reads, understands, and analyzes many document types with high precision and efficiency. 

Setting a new benchmark for document understanding, this production-ready model is designed for scalable AI agents that read and extract insights from multimodal documents with unmatched speed, bringing vision language models (VLMs) to the forefront of enterprise data processing.

Introducing Llama Nemotron Nano VL for best-in-class document understanding

Llama Nemotron Nano VL, the newest member of the NVIDIA Nemotron family, is an advanced AI model designed specifically for advanced intelligent document processing and understanding. Available as an NVIDIA NIM API and for download from Hugging Face, this model extracts diverse information from complex documents, such as PDFs, graphs, charts, tables, diagrams, and dashboards, with precision—all on a single GPU.

By integrating cutting-edge multi-modal capabilities, Llama Nemotron Nano VL excels at multi-image understanding, specializing in intelligent document processing to ensure enterprises can quickly surface critical insights from their business documents.

Whether it’s answering questions, extracting tables, or understanding visual elements like diagrams, Llama Nemotron Nano VL is optimized to handle a wide range of document-level understanding tasks, including:

  • Question answering (Q/A)
  • Text and table processing
  • Chart and graph parsing
  • Infographic and diagram interpretation

With the efficiency focus of this model, enterprises can deploy sophisticated document understanding systems without incurring high infrastructure costs.

Achieve high accuracy document intelligence with VLMs

The value of Llama Nemotron Nano VL is proven through rigorous benchmark testing, particularly with OCRBench v2. This comprehensive benchmark tests optical character recognition (OCR) and document understanding across a broad range of real-world scenarios.

OCRBench v2 closely mirrors documents commonly found in the finance, healthcare, legal, and government sectors that enterprises process daily, such as invoices, receipts, and contracts. These results are highly relevant for businesses seeking automation in document analysis and demonstrate the exceptional accuracy of Llama Nemotron Nano VL in text spotting, element parsing, and table extraction.

OCRBench v2 benchmark dataset encompasses the following capabilities and associated tasks in Figure 1.

Diagram showing eight testable text-reading capabilities in OCRBench v2. The figure maps each capability to its associated tasks, illustrating the distinct categories of text-reading skills evaluated by the benchmark.
Figure 1.  Overview of eight text-reading capabilities and tasks in OCRBenchV2, with each color indicating a capability type. Image from Chiang et al., LLM-as-a-Judge arXiv:2501.00321

Benchmark results: a new standard for intelligent document processing

The Llama Nemotron Nano VL OCRBench V2 benchmark results reflect the performance of NVIDIA open source models enhanced by NVIDIA tools and expertise for delivering cutting-edge AI technologies. Customizing Llama-3.1 8B with NeMo Retriever Parse data, and adding the C-RADIO vision transformer, enables Llama Nemotron Nano VL to excel at parsing text and extracting meaningful insights from complex visual layouts. By combining these technologies, Llama Nemotron Nano VL delivers high performance in intelligent document processing, making it a powerful tool for enterprises looking to automate and scale their document processing operations.

The OCRBenchV2 leaderboard showing that Llama Nemotron Nano VL performs better than other models.
Figure 2.  OCRBenchV2 leaderboard showing how Llama Nemotron Nano VL performs for text recognition, text referring, and text spotting

OCRBench v2 and OCR evaluation

OCRBench v2 is an advanced benchmark that tests OCR and document understanding capabilities in VLMs. Its comprehensive evaluation framework ensures that models are rigorously tested on tasks that resonate with real-world enterprise use cases, such as:

  • Invoice and receipt processing
  • Compliance document analysis
  • Contract and legal document review
  • Banking and financial statement automation
  • Healthcare and insurance document processing
  • Finance statements, trend analysis

OCRBench v2’s dataset includes 10,000 human-verified question-answer pairs for a nuanced assessment of model performance across many document types. With 31 real-world scenarios covered, OCRBench v2 ensures that the models tested on it can handle the diverse and complex challenges typically faced in enterprise document processing workflows.

Industry-leading performance built on best-in-class NVIDIA research

The first NVIDIA Nemotron VLM is the result of years of effort from NVIDIA research. Several key factors, including the following, contribute to the Llama Nemotron Nano VL’s industry-leading performance.

  • High-quality data for document intelligence, which builds on NeMo Retriever Parse, a VLM-based OCR solution. This model provides capabilities in text and table parsing, along with grounding, enabling the Llama Nemotron Nano VL to perform at an industry-leading level in document understanding tasks.
  • High-quality multi-modal datasets are critical for Llama Nemotron Nano VL to perform well on document understanding and also work as a general VLM. For the VLM to generalize to the real world, we build upon high-quality datasets and tools developed by VILA, Eagle, and NVLM research teams. 
  • Efficient infrastructure is critical for training foundational models. Llama Nemotron Nano VL was trained using both NVIDIA Megatron modeling and Energon dataloader technology.
  • Strong foundational vision encoding based on the C-RADIO v2 vision encoder. This is a cutting-edge vision transformer developed using advanced multi-teacher distillation techniques. This approach combines the strengths of several leading AI models to create an efficient and robust system that excels at understanding complex visual content. C-RADIO v2 is designed to handle high-resolution images, diagrams, charts, and tables—even when quality varies—ensuring reliable extraction of visual information from complex documents. 

Llama Nemotron Nano VL excels in tasks like text recognition and visual reasoning, and demonstrates advanced chart and diagram understanding capabilities. It surpasses competing VLMs on critical document-oriented tasks such as chart comprehension, diagram reasoning, and OCR, underscoring its robust performance in complex document analysis. For businesses, this means faster, more accurate document processing at scale.

Top intelligent document processing use cases for Llama Nemotron Nano VL

Llama Nemotron Nano VL is designed for use cases that require advanced document understanding across many industries. Whether your goal is automating document processing or enhancing business analytics, this model delivers the performance required to build production-ready solutions. 

Key use cases include:

Use CaseImpact of advanced intelligent document processing
Invoice and receipt processingAutomating the extraction of key data points like line items, totals, and dates from invoices and receipts for accounting, expense management, and enterprise resource planning (ERP) integration.
Compliance and identity document analysisExtracting structured data from documents like passports, ID cards, and tax forms for know your customer (KYC) and regulatory compliance.
Contract and legal document reviewParsing contracts and legal agreements to identify key clauses, obligations, and dates for risk assessment and contract management.
Healthcare and insurance automationProcessing medical records and insurance forms to extract patient data, claim information, and policy details for healthcare administration and insurance claims.
Customer serviceSummarizing charts and dashboards, extracting the right content from a long product manual, explaining assembly steps, and correlating text to visual features such as graphs in the dashboard.
Scientific and technical document parsingExtracting tables, diagrams, and formulas from scientific papers and technical reports to aid in research and knowledge management.
Banking and financial statement automationAutomating data extraction from bank statements, mortgage forms, and pay stubs for financial analysis and loan processing.
Retail catalog managementSummarizing charts and dashboards, extracting the right content from a long product manual, explaining assembly steps, and correlating text to visual features such as graphs in the dashboard.
Table 2. Key use cases of Llama Nemotron Nano VL

Get started with Llama Nemotron Nano VL

The release of Llama Nemotron Nano VL represents a breakthrough in intelligent document processing, providing developers with the tools they need to automate document processing at scale. With benchmark-breaking performance on OCRBench v2, advanced VLM capabilities, and industry-leading efficiency, this model is the ideal solution for enterprises looking to harness AI in their document workflows.

Get started using Llama Nemotron Nano VL for your own AI applications, with the following resources:

Discuss (0)

Tags