Develop Specialized AI Agents with New NVIDIA Nemotron Vision, RAG, and Guardrail Models 

Agentic AI is an ecosystem in which specialized language and vision models work together to handle planning, reasoning, retrieval, and safety guardrailing.

Developers need specialized AI agents for domain-specific workflows, real-world deployment, and compliance. Building specialized AI requires four critical ingredients: open models that can be fine-tuned, robust datasets, recipes for optimum model accuracy and compute, and efficient inference for deploying them at scale.

At NVIDIA GTC DC, we’re unveiling reasoning, vision-language, retrieval-augmented generation (RAG), and safety models with open data and recipes that deliver accuracy, compute efficiency, and openness.

This blog post covers the features and performance of the new Nemotron models, along with tutorials on using them to build multimodal agents, RAG pipelines, and AI with content safety.

The image shows the new NVIDIA Nemotron models launched at GTC DC. This includes models for document intelligence, video understanding, multilingual content safety, and information retrieval.
Figure 1. New Nemotron models for document intelligence, video understanding, multilingual content safety, and information retrieval

Enable agents to think efficiently with NVIDIA Nemotron Nano 3

NVIDIA Nemotron Nano 3 is an efficient and accurate 32B-parameter mixture-of-experts (MoE) model with 3.6B active parameters, designed for developers building specialized agentic AI systems. Available soon, this model delivers higher throughput than similarly sized dense models, enabling it to explore a larger search space, self-reflect more effectively, and achieve higher accuracy across scientific reasoning, coding, math, and tool-calling benchmarks. Additionally, the MoE architecture reduces compute costs and latency.
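To make the sparsity concrete, the short calculation below works out what share of the weights is active per token, using only the two parameter counts quoted above. The compute estimate assumes the common rule of thumb of roughly 2 FLOPs per active parameter per token. That rule is an approximation that ignores attention and routing overhead, and it is not an NVIDIA-published figure.

```python
# Parameter counts quoted in the text above.
TOTAL_PARAMS = 32e9      # total parameters in the MoE model
ACTIVE_PARAMS = 3.6e9    # parameters active for each token


def active_fraction(active: float, total: float) -> float:
    """Share of the model's weights that participate in one token's forward pass."""
    return active / total


def approx_forward_flops_per_token(active: float) -> float:
    """Rule-of-thumb estimate (assumption): ~2 FLOPs per active parameter per token."""
    return 2.0 * active


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
moe_flops = approx_forward_flops_per_token(ACTIVE_PARAMS)
# Hypothetical dense model with the same total parameter count, for comparison.
dense_flops = approx_forward_flops_per_token(TOTAL_PARAMS)

print(f"Active fraction: {fraction:.2%}")
print(f"Approx. compute ratio vs. dense 32B: {moe_flops / dense_flops:.4f}")
```

Under these assumptions, each token touches only about a ninth of the weights, which is the intuition behind the throughput and cost claims above.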

Add multimodal understanding and reasoning with NVIDIA Nemotron Nano 2 VL

NVIDIA Nemotron Nano 2 VL, a leading model on OCRBenchV2, is an open 12B multimodal reasoning model for document intelligence and video understanding. It enables AI assistants to extract, interpret, and act on information across text, images, tables, and videos. This makes the model valuable for agents focused on data analysis, document processing, and visual understanding in applications like generating reports, curating videos, and dense captioning for media asset management and retrieval-augmented search. 

Video 1. Building multimodal AI agents for document and video intelligence using NVIDIA Nemotron VLMs

At its core, this vision-language model (VLM) features a hybrid Mamba-Transformer architecture that delivers on-par accuracy, high token throughput, and low latency for efficient large-scale reasoning across visual and text tasks. The model is trained on the Nemotron VLM Dataset V2, which contains over 11M high-quality samples covering tasks such as image Q&A, OCR, dense captioning, video Q&A, and multi-image reasoning. Read more about the dataset. We used FP8 precision for speed and context parallelism to manage longer inputs, leading to greater efficiency and accuracy for video and long-document tasks.

The bar chart shows accuracy of Nemotron Nano VL and Nemotron Nano 2 VL models across visual benchmarks for multi-image understanding, document intelligence, and video captioning.
Figure 2. Nemotron Nano 2 VL delivers improved accuracy across visual benchmarks for multi-image understanding, document intelligence, and video captioning

This model introduces the Efficient Video Sampling (EVS) method, which identifies and prunes temporally static patches in video sequences. By reducing token redundancy while preserving essential semantics, EVS lets the model process longer clips and return results faster.

The line graph shows accuracy of two video benchmarks across various levels of tokens dropped with EVS. The graphs stay largely flat in terms of accuracy and slope down slightly after 50% token drops.
Figure 3. EVS enables Nemotron Nano 2 VL to achieve up to 2.5x higher throughput without sacrificing accuracy
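The idea of pruning temporally static patches can be illustrated with a deliberately simplified sketch. It is not the EVS implementation. It assumes each frame is a list of flattened patches in a fixed spatial order, and it treats a patch as static when its mean absolute difference from the same position in the previous frame falls at or below a threshold. The function names and the threshold value are hypothetical.

```python
from typing import List, Tuple

Patch = List[float]          # flattened pixel/feature values of one patch
Frame = List[Patch]          # patches in a fixed spatial order


def patch_change(a: Patch, b: Patch) -> float:
    """Mean absolute difference between two same-sized patches."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def prune_static_patches(frames: List[Frame], threshold: float) -> List[Tuple[int, int]]:
    """Return (frame_index, patch_index) pairs to keep.

    Every patch of the first frame is kept. In later frames, a patch is kept
    only if it changed by more than `threshold` relative to the patch at the
    same position in the previous frame.
    """
    kept: List[Tuple[int, int]] = []
    for t, frame in enumerate(frames):
        for p, patch in enumerate(frame):
            if t == 0 or patch_change(patch, frames[t - 1][p]) > threshold:
                kept.append((t, p))
    return kept


# Toy clip: 3 frames x 2 patches. Patch 0 is a static background; patch 1 moves.
video = [
    [[0.0, 0.0], [0.1, 0.1]],
    [[0.0, 0.0], [0.5, 0.5]],
    [[0.0, 0.0], [0.9, 0.9]],
]
kept = prune_static_patches(video, threshold=0.05)
total = sum(len(f) for f in video)
print(f"kept {len(kept)} of {total} patches")
```

In this toy clip the static background patch is sent once and the moving patch is sent in every frame, so fewer tokens reach the model while the changing content is preserved.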

Available in FP4, FP8, and BF16 precision, this model is supported by the vLLM and TensorRT-LLM (TRT-LLM) inference engines and is available as an NVIDIA NIM. Developers can use the NVIDIA AI Blueprint for video search and summarization (VSS) to analyze long videos, and NVIDIA NeMo to curate multimodal datasets and customize or build their own models. The technical report also guides developers in building custom, optimized models with Nemotron techniques.

Improve document intelligence with NVIDIA Nemotron Parse 1.1

We’re also releasing NVIDIA Nemotron Parse 1.1, a compact 1B parameter VLM-based document parser for enhanced document intelligence. Given an image, this model extracts structured text and tables with bounding boxes and semantic classes, enabling downstream applications such as improved retriever accuracy, richer large language model (LLM) training data, and improved document processing pipelines.
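A downstream pipeline might consume this kind of output as shown in the sketch below. The `ParsedElement` fields, class names, and sample values are illustrative assumptions and do not reflect the model's actual output schema. The reading-order sort also assumes a simple single-column page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedElement:
    """One extracted element. Field names are illustrative, not the model's schema."""
    semantic_class: str                       # e.g. "Title", "Text", "Table"
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)
    text: str


def reading_order(elements: List[ParsedElement]) -> List[ParsedElement]:
    """Sort top-to-bottom, then left-to-right, using the box's top-left corner."""
    return sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))


def select_class(elements: List[ParsedElement], semantic_class: str) -> List[ParsedElement]:
    """Keep only elements of one semantic class, e.g. tables for a table index."""
    return [e for e in elements if e.semantic_class == semantic_class]


# Hypothetical parser output for a single page image.
page = [
    ParsedElement("Table", (50, 300, 550, 500), "| Q1 | Q2 |"),
    ParsedElement("Title", (50, 40, 550, 80), "Quarterly Report"),
    ParsedElement("Text", (50, 100, 550, 280), "Revenue grew in both quarters."),
]

ordered = reading_order(page)
tables = select_class(page, "Table")
print([e.semantic_class for e in ordered])
print(len(tables))
```

Keeping the bounding box and semantic class alongside the text lets a retriever index tables separately from body text, or lets a curation step rebuild the page in reading order.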

The bar chart compares the accuracy of Nemotron Parse 1.1 with that of a leading, popular open model. The Nemotron model delivers significant accuracy improvements on the PubTabNet benchmark, which is designed to evaluate image-based table recognition.
Figure 4. Nemotron Parse 1.1 delivers leading accuracy on the PubTabNet benchmark for image-based table recognition

Nemotron Parse delivers comprehensive text, tables, and layout understanding for use in retriever and curator workflows. Its extraction datasets and structured outputs support both LLM and VLM training, and boost inference accuracy for VLMs at runtime.

Ground agents with open RAG models

NVIDIA Nemotron RAG is a suite of models for building RAG pipelines and real-time business insights. It ensures data privacy and connects securely to proprietary data across environments, supporting enterprise-grade retrieval. As a core component of NVIDIA AI-Q and the NVIDIA RAG Blueprint, Nemotron RAG provides a scalable and production-ready foundation for intelligent, retrieval-based AI applications.

It enables the development of a wide range of applications—from multi-agent systems where AI agents perceive, plan, and act to achieve complex goals, to generative co-pilots powered by specialized large language models that assist with IT support, HR operations, and customer service. It also supports AI assistants that interact naturally with developers using company data and summarization tools that create written reports or visual media highlights.

The embedding models have consistently led industry leaderboards such as ViDoRe and MTEB for visual and multimodal retrieval, and MMTEB for multilingual text retrieval, making them well suited for building best-in-class RAG pipelines. The new models are now available on Hugging Face.
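The retrieval step that embedding models serve in a RAG pipeline can be sketched as a nearest-neighbor search by cosine similarity. The example below is a minimal illustration that uses toy 3-dimensional vectors in place of real embedding-model outputs. The document IDs, vectors, and function names are all hypothetical, and no Nemotron API is called.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Score every document against the query and return the k most similar."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for embedding-model outputs (assumption).
index = {
    "hr_policy": [0.9, 0.1, 0.0],
    "it_runbook": [0.1, 0.9, 0.1],
    "sales_deck": [0.0, 0.2, 0.9],
}
query = [0.2, 0.8, 0.0]   # pretend embedding of an IT support question

results = top_k(query, index, k=2)
print([doc_id for doc_id, _ in results])
```

In a real pipeline the retrieved passages would be passed to the LLM as context. A brute-force scan like this is only practical for small collections, and larger ones typically use a vector database.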

Video 2. Developing custom AI agents powered with information retrieval using NVIDIA Nemotron RAG

Make AI safer with the Llama 3.1 Nemotron Safety Guard

As developers build agentic AI systems that can reason, retrieve, and act autonomously, safety becomes essential to prevent harmful or unintended behavior. LLMs can be misused, prompted into unsafe outputs, or miss cultural nuance—especially in non-English contexts—making reliable moderation models critical to responsible development.

The new Llama 3.1 Nemotron Safety Guard 8B V3 is a multilingual content safety model. It’s fine-tuned on the Nemotron Safety Guard dataset, a culturally diverse dataset with more than 386K samples covering 23 regionally adapted safety categories, including examples of adversarial and jailbreak prompts within each category.

The model detects unsafe or policy-violating content in both prompts and responses across 23 safety categories and nine languages, including Arabic, Hindi, and Japanese. Figure 5 compares the model’s performance on a per-language basis.

Bar chart comparing Llama 3.1 Nemotron Safety Guard’s performance across multiple languages.
Figure 5. A comparison of the Llama 3.1 Nemotron Safety Guard model performance across languages

The model achieves 84.2% harmful content classification accuracy with minimal latency, as seen in Figure 6. Two novel techniques power its performance: 1) LLM-driven cultural adaptation, which aligns prompts and responses with local idioms and sensitivities, and 2) consistency filtering, which removes noisy or misaligned samples for high-quality fine-tuning.

Bar chart showing average scores of 4 safety models being tested across 8 datasets, 23 safety categories, and 8 languages and their average harmful content classification accuracy.
Figure 6. In benchmark testing across eight datasets, the Llama 3.1 Nemotron Safety Guard model delivers best-in-class performance across 23 safety categories

Lightweight and deployable on a single GPU or as an NVIDIA NIM, it integrates with NeMo Guardrails for real-time, multilingual content safety in agentic AI pipelines. Explore the model and dataset on Hugging Face or build.nvidia.com to start building safer, globally aligned AI systems.
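The pattern of checking both prompts and responses can be sketched as a thin wrapper around generation. This is a generic illustration, not the NeMo Guardrails API. The `guarded_generate` function, the refusal message, and both stubs are hypothetical stand-ins for a real safety model and LLM.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that request."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
) -> str:
    """Call `generate` only if the prompt passes; return its output only if that passes too."""
    if is_unsafe(prompt):            # input check
        return REFUSAL
    response = generate(prompt)
    if is_unsafe(response):          # output check
        return REFUSAL
    return response


# Stand-ins for illustration only: a real deployment would call a safety
# model and an LLM here instead of these hypothetical stubs.
def stub_is_unsafe(text: str) -> bool:
    return "build a weapon" in text.lower()


def stub_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


print(guarded_generate("Summarize this report", stub_generate, stub_is_unsafe))
print(guarded_generate("How do I build a weapon?", stub_generate, stub_is_unsafe))
```

Passing the classifier and generator in as callables keeps the control flow independent of any particular model, so the stubs can be swapped for real endpoints without changing the wrapper.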

Video 3. Power AI with culturally-aware LLM guardrails using Nemotron Safety Guard

Evaluate your models and optimize AI agents with NVIDIA NeMo

To ensure LLM capabilities are measured reliably, the NVIDIA NeMo Evaluator SDK was recently open sourced. This SDK enables reproducible benchmarking, giving developers confidence in real-world performance beyond reported scores. 

NeMo Evaluator can now also assess models on dynamic, interactive workflows with support for ProfBench, a benchmark suite designed to evaluate agentic AI behaviors, including multi-step reasoning and tool usage. 

With standardized evaluation setups now open source, developers can benchmark performance, validate outputs, and compare models under consistent conditions.

NeMo Agent Toolkit is an open-source framework integrated with industry standards like MCP and compatible with other frameworks, including Semantic Kernel, Google ADK, LangChain, and CrewAI. The toolkit’s new Agent Optimizer feature automatically tunes key hyperparameters—LLM type, temperature, max tokens—and optimizes for accuracy, groundedness, latency, token usage, and custom metrics. This reduces trial-and-error and accelerates agent, tool, and workflow development. 
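Tuning hyperparameters against a combined metric can be illustrated with a plain exhaustive grid search. This is a generic sketch and makes no claim about how Agent Optimizer works internally. The `grid_search` helper, the search space, and the scoring function with its made-up numbers are all hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def grid_search(
    search_space: Dict[str, List[float]],
    score: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Try every combination in `search_space` and return the best-scoring one."""
    names = list(search_space)
    best_config: Config = {}
    best_score = float("-inf")
    for values in product(*(search_space[n] for n in names)):
        config = dict(zip(names, values))
        s = score(config)
        if s > best_score:
            best_config, best_score = config, s
    return best_config, best_score


# Hypothetical objective: reward accuracy, penalize token usage. The numbers
# are made up so the example runs without calling any model.
def toy_score(config: Config) -> float:
    accuracy = 0.9 - abs(config["temperature"] - 0.2)   # pretend 0.2 is ideal
    token_cost = config["max_tokens"] / 10_000
    return accuracy - token_cost


space = {"temperature": [0.0, 0.2, 0.7], "max_tokens": [256, 1024]}
best, best_value = grid_search(space, toy_score)
print(best, round(best_value, 4))
```

In practice the scoring function would run the agent on an evaluation set and combine metrics such as accuracy, groundedness, latency, and token usage. Grid search becomes expensive as the number of parameters grows, which is one reason to automate the search.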

Try it now with our GitHub notebook.

Start building your AI with Nemotron now 

In this blog post, we’ve introduced the newest members of the Nemotron family and a small sample of what is possible with them.

To get started, download the Nemotron models and datasets from Hugging Face. 

Nemotron Nano 2 VL is also hosted by inference providers including Baseten, Deep Infra, Fireworks, Hyperbolic, Nebius, and Replicate to provide an efficient path from development to production for agentic AI.

You can also evaluate the NVIDIA-hosted API endpoints on build.nvidia.com and OpenRouter.

Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
