Agentic AI / Generative AI

New NVIDIA Llama Nemotron Nano Vision Language Model Tops OCR Benchmark for Accuracy

An illustration for NVIDIA Llama Nemotron Nano VL.

Jun 03, 2025

By Amanda Saunders, Padmavathy Subramanian, Annie Surla, Karan Sapra and Andrew Tao

Discuss (0)

AI-Generated Summary

Dislike

Llama Nemotron Nano VL is a multimodal vision language model developed by NVIDIA that can read, understand, and analyze various document types with high precision and efficiency, making it a powerful tool for enterprises to automate document processing.
The model excels in tasks such as question answering, text and table processing, chart and graph parsing, and infographic and diagram interpretation, and has achieved industry-leading performance on the OCRBench v2 benchmark.
Llama Nemotron Nano VL can be used for various use cases, including invoice and receipt processing, compliance document analysis, contract and legal document review, and healthcare and insurance automation, and is available as an NVIDIA NIM API and for download from Hugging Face.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Documents such as PDFs, graphs, charts, and dashboards are rich sources of data that, when extracted and organized, provide informative decision-making insights. From automating financial statement processing to improving business intelligence workflows, intelligent document processing is becoming a core component of AI solutions in enterprises.

Organizations can accelerate the AI development process with NVIDIA Llama Nemotron Nano VL. This multimodal vision language model reads, understands, and analyzes many document types with high precision and efficiency.

Setting a new benchmark for document understanding, this production-ready model is designed for scalable AI agents that read and extract insights from multimodal documents with unmatched speed, bringing vision language models (VLMs) to the forefront of enterprise data processing.

Introducing Llama Nemotron Nano VL for best-in-class document understanding

Llama Nemotron Nano VL, the newest member of the NVIDIA Nemotron family, is an advanced AI model designed specifically for advanced intelligent document processing and understanding. Available as an NVIDIA NIM API and for download from Hugging Face, this model extracts diverse information from complex documents, such as PDFs, graphs, charts, tables, diagrams, and dashboards, with precision—all on a single GPU.

By integrating cutting-edge multi-modal capabilities, Llama Nemotron Nano VL excels at multi-image understanding, specializing in intelligent document processing to ensure enterprises can quickly surface critical insights from their business documents.

Whether it’s answering questions, extracting tables, or understanding visual elements like diagrams, Llama Nemotron Nano VL is optimized to handle a wide range of document-level understanding tasks, including:

Question answering (Q/A)
Text and table processing
Chart and graph parsing
Infographic and diagram interpretation

With the efficiency focus of this model, enterprises can deploy sophisticated document understanding systems without incurring high infrastructure costs.

Achieve high accuracy document intelligence with VLMs

The value of Llama Nemotron Nano VL is proven through rigorous benchmark testing, particularly with OCRBench v2. This comprehensive benchmark tests optical character recognition (OCR) and document understanding across a broad range of real-world scenarios.

OCRBench v2 closely mirrors documents commonly found in the finance, healthcare, legal, and government sectors that enterprises process daily, such as invoices, receipts, and contracts. These results are highly relevant for businesses seeking automation in document analysis and demonstrate the exceptional accuracy of Llama Nemotron Nano VL in text spotting, element parsing, and table extraction.

OCRBench v2 benchmark dataset encompasses the following capabilities and associated tasks in Figure 1.

Benchmark results: a new standard for intelligent document processing

The Llama Nemotron Nano VL OCRBench V2 benchmark results reflect the performance of NVIDIA open source models enhanced by NVIDIA tools and expertise for delivering cutting-edge AI technologies. Customizing Llama-3.1 8B with NeMo Retriever Parse data, and adding the C-RADIO vision transformer, enables Llama Nemotron Nano VL to excel at parsing text and extracting meaningful insights from complex visual layouts. By combining these technologies, Llama Nemotron Nano VL delivers high performance in intelligent document processing, making it a powerful tool for enterprises looking to automate and scale their document processing operations.

OCRBench v2 and OCR evaluation

OCRBench v2 is an advanced benchmark that tests OCR and document understanding capabilities in VLMs. Its comprehensive evaluation framework ensures that models are rigorously tested on tasks that resonate with real-world enterprise use cases, such as:

Invoice and receipt processing
Compliance document analysis
Contract and legal document review
Banking and financial statement automation
Healthcare and insurance document processing
Finance statements, trend analysis

OCRBench v2’s dataset includes 10,000 human-verified question-answer pairs for a nuanced assessment of model performance across many document types. With 31 real-world scenarios covered, OCRBench v2 ensures that the models tested on it can handle the diverse and complex challenges typically faced in enterprise document processing workflows.

Industry-leading performance built on best-in-class NVIDIA research

The first NVIDIA Nemotron VLM is the result of years of effort from NVIDIA research. Several key factors, including the following, contribute to the Llama Nemotron Nano VL’s industry-leading performance.

High-quality data for document intelligence, which builds on NeMo Retriever Parse, a VLM-based OCR solution. This model provides capabilities in text and table parsing, along with grounding, enabling the Llama Nemotron Nano VL to perform at an industry-leading level in document understanding tasks.
High-quality multi-modal datasets are critical for Llama Nemotron Nano VL to perform well on document understanding and also work as a general VLM. For the VLM to generalize to the real world, we build upon high-quality datasets and tools developed by VILA, Eagle, and NVLM research teams.
Efficient infrastructure is critical for training foundational models. Llama Nemotron Nano VL was trained using both NVIDIA Megatron modeling and Energon dataloader technology.
Strong foundational vision encoding based on the C-RADIO v2 vision encoder. This is a cutting-edge vision transformer developed using advanced multi-teacher distillation techniques. This approach combines the strengths of several leading AI models to create an efficient and robust system that excels at understanding complex visual content. C-RADIO v2 is designed to handle high-resolution images, diagrams, charts, and tables—even when quality varies—ensuring reliable extraction of visual information from complex documents.

Llama Nemotron Nano VL excels in tasks like text recognition and visual reasoning, and demonstrates advanced chart and diagram understanding capabilities. It surpasses competing VLMs on critical document-oriented tasks such as chart comprehension, diagram reasoning, and OCR, underscoring its robust performance in complex document analysis. For businesses, this means faster, more accurate document processing at scale.

Get started with Llama Nemotron Nano VL

The release of Llama Nemotron Nano VL represents a breakthrough in intelligent document processing, providing developers with the tools they need to automate document processing at scale. With benchmark-breaking performance on OCRBench v2, advanced VLM capabilities, and industry-leading efficiency, this model is the ideal solution for enterprises looking to harness AI in their document workflows.

Get started using Llama Nemotron Nano VL for your own AI applications, with the following resources:

Llama Nemotron Nano VL NIM API preview: Dive into the capabilities of Llama Nemotron Nano VL by exploring the API preview on build.nvidia.com.
Hands-on notebook for invoice and receipt intelligent document processing: Start building your document understanding solutions with a practical, hands-on notebook that demonstrates how to extract information from invoices and receipts.

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.

Visit our Nemotron developer page for all the essentials you need to get started with the most open, smartest-per-compute reasoning model.
Explore new open Nemotron models and datasets on Hugging Face and NIM microservices and Blueprints on build.nvidia.com.
Share your ideas and vote on features to help shape the future of Nemotron.
Tune into upcoming Nemotron livestreams and connect with the NVIDIA Developer community through the Nemotron developer forum and the Nemotron channel on Discord.

Browse video tutorials and livestreams to get the most out of NVIDIA Nemotron.

Discuss (0)

About the Authors

About Amanda Saunders
Amanda Saunders is a seasoned leader in generative AI product marketing at NVIDIA. Within the Enterprise AI product group, her focus is centered on crafting compelling narratives for generative AI and LLM solutions such as NVIDIA NeMo. She focuses on helping customers understand how to deploy generative AI in production and leverage the latest state of the art tools to deliver the high performance solutions. Prior to NVIDIA, Amanda worked with numerous cloud, networking, and desktop virtualization products to show the value that these solutions bring to the enterprise.

View all posts by Amanda Saunders

About Padmavathy Subramanian
Padmavathy Subramanian is a director of generative AI products at NVIDIA, building vision language models. She previously worked in NVIDIA DRIVE AV SW products and intelligent video analytics platform in NVIDIA autonomous machines products.

View all posts by Padmavathy Subramanian

About Annie Surla
Annie Surla is a Developer Advocate Engineer at NVIDIA responsible for developing and presenting a wide range of deep learning software products. She comes with experience working in deep learning applications including vision and NLP. She holds a master’s degree in Engineering Management from Duke University.

View all posts by Annie Surla

About Karan Sapra
Karan Sapra is a senior research scientist on the Applied Deep Learning Research team at NVIDIA. He graduated with his Ph.D. in Computer Engineering from Clemson University. His research interests include deep learning, graph theory, and genomic networks. He has also previously worked on research in the field P2P networks, cloud computing, and high-performance computing.

View all posts by Karan Sapra

About Andrew Tao
Andrew Tao is a distinguished engineer and manager of the Computer Vision team in the Applied Deep Learning Research group at NVIDIA. He received his masters degree in Electrical Engineering from Stanford University in 1992 with an emphasis on Computer Architecture. At NVIDIA, he has worked as a CPU hardware engineer, GPU hardware engineer and architect, and director of applied architecture. Previously, he led a number of computer vision teams in the automotive sector.

View all posts by Andrew Tao

Use Case	Impact of advanced intelligent document processing
Invoice and receipt processing	Automating the extraction of key data points like line items, totals, and dates from invoices and receipts for accounting, expense management, and enterprise resource planning (ERP) integration.
Compliance and identity document analysis	Extracting structured data from documents like passports, ID cards, and tax forms for know your customer (KYC) and regulatory compliance.
Contract and legal document review	Parsing contracts and legal agreements to identify key clauses, obligations, and dates for risk assessment and contract management.
Healthcare and insurance automation	Processing medical records and insurance forms to extract patient data, claim information, and policy details for healthcare administration and insurance claims.
Customer service	Summarizing charts and dashboards, extracting the right content from a long product manual, explaining assembly steps, and correlating text to visual features such as graphs in the dashboard.
Scientific and technical document parsing	Extracting tables, diagrams, and formulas from scientific papers and technical reports to aid in research and knowledge management.
Banking and financial statement automation	Automating data extraction from bank statements, mortgage forms, and pay stubs for financial analysis and loan processing.
Retail catalog management	Summarizing charts and dashboards, extracting the right content from a long product manual, explaining assembly steps, and correlating text to visual features such as graphs in the dashboard.

New NVIDIA Llama Nemotron Nano Vision Language Model Tops OCR Benchmark for Accuracy

Introducing Llama Nemotron Nano VL for best-in-class document understanding

Achieve high accuracy document intelligence with VLMs

Benchmark results: a new standard for intelligent document processing

OCRBench v2 and OCR evaluation

Industry-leading performance built on best-in-class NVIDIA research

Top intelligent document processing use cases for Llama Nemotron Nano VL

Get started with Llama Nemotron Nano VL

Tags

About the Authors

New NVIDIA Llama Nemotron Nano Vision Language Model Tops OCR Benchmark for Accuracy

Introducing Llama Nemotron Nano VL for best-in-class document understanding

Achieve high accuracy document intelligence with VLMs

Benchmark results: a new standard for intelligent document processing

OCRBench v2 and OCR evaluation

Industry-leading performance built on best-in-class NVIDIA research

Top intelligent document processing use cases for Llama Nemotron Nano VL

Get started with Llama Nemotron Nano VL

Tags

About the Authors

Comments