Agentic AI is an ecosystem where specialized language and vision models work together. They handle planning, reasoning, retrieval, and safety guardrailing.
Developers need specialized AI agents for domain-specific workflows, real-world deployment, and compliance. Building specialized AI requires four critical ingredients: open models that can be fine-tuned, robust datasets, recipes for optimum model accuracy and compute, and efficient inference for deploying them at scale.
At NVIDIA GTC DC, we’re unveiling reasoning, vision-language, retrieval-augmented generation (RAG), and safety models with open data and recipes that deliver accuracy, compute efficiency, and openness.
This post covers the features and performance of the new Nemotron models, along with tutorials on using them to build multimodal agents, RAG pipelines, and AI with content safety.

Enable agents to think efficiently with NVIDIA Nemotron Nano 3
NVIDIA Nemotron Nano 3 is an efficient and accurate 32B-parameter mixture-of-experts (MoE) model with 3.6B active parameters, designed for developers building specialized agentic AI systems. Available soon, this model delivers higher throughput than similarly sized dense models, which lets it explore a larger search space, self-reflect more effectively, and reach higher accuracy across scientific reasoning, coding, math, and tool-calling benchmarks. The MoE architecture also reduces compute costs and latency.
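To make the gap between total and active parameters concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind MoE layers. The expert count, the value of `k`, and the router logits are illustrative assumptions, not the actual Nemotron Nano 3 configuration; only the 32B and 3.6B figures come from the text above.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Only the selected experts run for this token, which is why an MoE
    model uses a fraction of its total parameters per token.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def active_fraction(total_params_b, active_params_b):
    """Share of parameters used per token, given sizes in billions."""
    return active_params_b / total_params_b

# Figures from the text: 32B total parameters, 3.6B active.
frac = active_fraction(32.0, 3.6)  # 0.1125, i.e. about 11% of the weights per token

# Hypothetical router scores for 8 experts; choose the top 2.
chosen = route_top_k([0.1, 2.0, -1.0, 1.5, 0.3, -0.2, 0.9, 0.0], k=2)
```

With these illustrative logits, experts 1 and 3 are selected, and their renormalized weights sum to 1.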
Add multimodal understanding and reasoning with NVIDIA Nemotron Nano 2 VL
NVIDIA Nemotron Nano 2 VL, a leading model on OCRBenchV2, is an open 12B multimodal reasoning model for document intelligence and video understanding. It enables AI assistants to extract, interpret, and act on information across text, images, tables, and videos. This makes the model valuable for agents focused on data analysis, document processing, and visual understanding in applications like generating reports, curating videos, and dense captioning for media asset management and retrieval-augmented search.
At its core, this vision-language model (VLM) features a hybrid Mamba-Transformer architecture that delivers on-par accuracy, high token throughput, and low latency for efficient large-scale reasoning over visual and text tasks. The model is trained on the Nemotron VLM Dataset V2, which has over 11M high-quality samples covering tasks such as image Q&A, OCR, dense captioning, video Q&A, and multi-image reasoning. Read more about the dataset. We used FP8 for speed and context parallelism to handle longer inputs, which improves efficiency and accuracy on video and long-document tasks.

This model introduces Efficient Video Sampling (EVS), a method that identifies and prunes temporally static patches in video sequences. By reducing token redundancy while preserving essential semantics, EVS lets the model process longer clips and return results faster.
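The idea of pruning temporally static patches can be illustrated with a small sketch. This is not the EVS implementation: the comparison against the previous frame, the mean-absolute-difference metric, the `threshold` value, and the function name `prune_static_patches` are all assumptions chosen for clarity.

```python
def prune_static_patches(frames, threshold=0.05):
    """Return (frame_index, patch_index) pairs worth keeping.

    frames: list of frames; each frame is a list of patches; each patch
    is a list of floats (for example, flattened pixel or feature values).
    Every patch in the first frame is kept. A later patch is kept only
    if it differs enough from the same position in the previous frame.
    """
    kept = []
    for t, frame in enumerate(frames):
        for p, patch in enumerate(frame):
            if t == 0:
                kept.append((t, p))
                continue
            prev = frames[t - 1][p]
            # Mean absolute difference against the previous frame's patch.
            diff = sum(abs(a - b) for a, b in zip(patch, prev)) / len(patch)
            if diff > threshold:
                kept.append((t, p))
    return kept

# Toy video: 3 frames, 2 patches per frame, 2 values per patch.
video = [
    [[0.0, 0.0], [1.0, 1.0]],  # frame 0: always kept
    [[0.0, 0.0], [0.5, 0.5]],  # frame 1: only patch 1 changed
    [[0.0, 0.0], [0.5, 0.5]],  # frame 2: nothing changed
]
kept = prune_static_patches(video)
```

In this toy clip, 3 of the 6 patches survive, so a downstream model would see half as many visual tokens while still receiving every patch that changed.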

Quantized for FP4, FP8, and BF16, this model is supported by the vLLM and TRT-LLM inference engines and is available as an NVIDIA NIM. Developers can use the NVIDIA AI Blueprint for video search and summarization (VSS) to analyze long videos, and NVIDIA NeMo to curate multimodal datasets and customize or build their own models. The technical report also guides developers in building custom, optimized models with Nemotron techniques.
Improve document intelligence with NVIDIA Nemotron Parse 1.1
We’re also releasing NVIDIA Nemotron Parse 1.1, a compact 1B parameter VLM-based document parser for enhanced document intelligence. Given an image, this model extracts structured text and tables with bounding boxes and semantic classes, enabling downstream applications such as improved retriever accuracy, richer large language model (LLM) training data, and improved document processing pipelines.

Nemotron Parse delivers comprehensive text, tables, and layout understanding for use in retriever and curator workflows. Its extraction datasets and structured outputs support both LLM and VLM training, and boost inference accuracy for VLMs at runtime.
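As a rough illustration of how structured parser output can feed a retriever, the sketch below turns parsed elements into text chunks with metadata. The `ParsedElement` schema, the bounding-box layout, and the class names (`title`, `text`, `table`, `header`, `footer`) are hypothetical and do not reflect the actual Nemotron Parse output format.

```python
from dataclasses import dataclass

@dataclass
class ParsedElement:
    """Hypothetical shape of one parsed element (not the real schema)."""
    text: str
    bbox: tuple          # assumed (x0, y0, x1, y1) in page coordinates
    semantic_class: str  # assumed labels such as "title", "text", "table"

def to_retriever_chunks(elements, skip_classes=("header", "footer")):
    """Order elements top-to-bottom, left-to-right, and drop page furniture."""
    ordered = sorted(elements, key=lambda e: (e.bbox[1], e.bbox[0]))
    chunks = []
    for e in ordered:
        if e.semantic_class in skip_classes:
            continue
        chunks.append({
            "text": e.text,
            "metadata": {"class": e.semantic_class, "bbox": e.bbox},
        })
    return chunks

page = [
    ParsedElement("Page 3", (0, 980, 100, 1000), "footer"),
    ParsedElement("Q3 revenue by region", (50, 40, 500, 80), "title"),
    ParsedElement("| Region | Revenue |", (50, 120, 500, 300), "table"),
]
chunks = to_retriever_chunks(page)
```

Keeping the semantic class and bounding box as metadata lets a retriever filter by element type (for example, tables only) or point users back to the source region of the page.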
Ground agents with open RAG models
NVIDIA Nemotron RAG is a suite of models for building RAG pipelines and real-time business insights. It ensures data privacy and connects securely to proprietary data across environments, supporting enterprise-grade retrieval. As a core component of NVIDIA AI-Q and the NVIDIA RAG Blueprint, Nemotron RAG provides a scalable and production-ready foundation for intelligent, retrieval-based AI applications.
It enables the development of a wide range of applications—from multi-agent systems where AI agents perceive, plan, and act to achieve complex goals, to generative co-pilots powered by specialized large language models that assist with IT support, HR operations, and customer service. It also supports AI assistants that interact naturally with developers using company data and summarization tools that create written reports or visual media highlights.
The embedding models have consistently led industry leaderboards such as ViDoRe and MTEB for visual and multimodal retrieval and MMTEB for multilingual text retrieval, making them well suited for building best-in-class RAG pipelines. The new models are now available on Hugging Face.
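The retrieval step at the heart of a RAG pipeline can be sketched as a nearest-neighbor search over embeddings. The three-dimensional vectors and document IDs below are toy placeholders; in practice the vectors would come from an embedding model, and a vector database would replace the in-memory dictionary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, top_k=2):
    """Return the IDs of the top_k documents most similar to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Toy index: hypothetical document IDs mapped to made-up embeddings.
index = {
    "hr_policy": [0.9, 0.1, 0.0],
    "it_runbook": [0.1, 0.9, 0.2],
    "sales_report": [0.0, 0.2, 0.9],
}

# A query embedding that sits closest to the IT runbook.
hits = retrieve([0.2, 0.8, 0.1], index, top_k=2)
```

The retrieved documents would then be passed to the generator model as context, which is what grounds the agent's answer in proprietary data.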
Make AI safer with the Llama 3.1 Nemotron Safety Guard
As developers build agentic AI systems that can reason, retrieve, and act autonomously, safety becomes essential to prevent harmful or unintended behavior. LLMs can be misused, prompted into unsafe outputs, or miss cultural nuance—especially in non-English contexts—making reliable moderation models critical to responsible development.
The new Llama 3.1 Nemotron Safety Guard 8B V3 is a multilingual content safety model. It’s fine-tuned on the Nemotron Safety Guard dataset, a culturally diverse dataset with more than 386K samples covering 23 regionally adapted safety categories, including examples of adversarial and jailbreak prompts within each category.
The model detects unsafe or policy-violating content in both prompts and responses across 23 safety categories and nine languages, such as Arabic, Hindi, and Japanese. Figure 4 illustrates our model’s performance comparison on a per-language basis.

The model achieves 84.2% harmful content classification accuracy with minimal latency, as seen in Figure 5. Two novel techniques power its performance: 1) LLM-driven cultural adaptation aligns prompts and responses with local idioms and sensitivities, and 2) consistency filtering removes noisy or misaligned samples for high-quality fine-tuning.
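One plausible reading of consistency filtering is sketched below: a culturally adapted sample is kept only when its safety label still matches the label of the source sample it was derived from. This is an assumption for illustration, not the documented procedure, and the field names (`source_label`, `adapted_label`) are hypothetical.

```python
def consistency_filter(samples):
    """Keep samples whose adapted label agrees with the source label.

    A mismatch suggests the adaptation changed the meaning of the
    prompt or response, so the sample is treated as noisy and dropped.
    """
    return [s for s in samples if s["adapted_label"] == s["source_label"]]

# Hypothetical samples after LLM-driven cultural adaptation.
samples = [
    {"id": 1, "source_label": "unsafe", "adapted_label": "unsafe"},
    {"id": 2, "source_label": "unsafe", "adapted_label": "safe"},   # drifted
    {"id": 3, "source_label": "safe", "adapted_label": "safe"},
]
clean = consistency_filter(samples)
```

Here sample 2 is removed because its label flipped during adaptation, leaving two consistent samples for fine-tuning.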

Lightweight and deployable on a single GPU or as an NVIDIA NIM, it integrates with NeMo Guardrails for real-time, multilingual content safety in agentic AI pipelines. Explore the model and dataset on Hugging Face or build.nvidia.com to start building safer, globally aligned AI systems.
Evaluate your models and optimize AI agents with NVIDIA NeMo
To ensure LLM capabilities are measured reliably, the NVIDIA NeMo Evaluator SDK was recently open sourced. This SDK enables reproducible benchmarking, giving developers confidence in real-world performance beyond reported scores.
NeMo Evaluator can now also assess models on dynamic, interactive workflows with support for ProfBench, a benchmark suite designed to evaluate agentic AI behaviors, including multi-step reasoning and tool usage.
With standardized evaluation setups now open source, developers can benchmark performance, validate outputs, and compare models under consistent conditions.
NeMo Agent Toolkit is an open-source framework integrated with industry standards like MCP and compatible with other frameworks, including Semantic Kernel, Google ADK, LangChain, and CrewAI. The toolkit’s new Agent Optimizer feature automatically tunes key hyperparameters—LLM type, temperature, max tokens—and optimizes for accuracy, groundedness, latency, token usage, and custom metrics. This reduces trial-and-error and accelerates agent, tool, and workflow development.
Try it now with our GitHub notebook.
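To show the kind of search that hyperparameter tuning automates, here is a minimal grid-search sketch over LLM type, temperature, and max tokens. It does not use the NeMo Agent Toolkit API. The `evaluate` function is a deterministic stand-in for a real agent evaluation, and the model names, accuracy values, latency values, and `latency_weight` are all made up for illustration.

```python
from itertools import product

# Hypothetical per-model (base accuracy, seconds per 256 tokens).
MODEL_PROFILES = {
    "small-model": (0.70, 0.2),
    "large-model": (0.85, 0.6),
}

def evaluate(config):
    """Stand-in for running an agent benchmark; returns fake metrics."""
    base_acc, sec_per_256 = MODEL_PROFILES[config["llm"]]
    accuracy = base_acc - 0.1 * config["temperature"]
    latency = sec_per_256 * config["max_tokens"] / 256
    return {"accuracy": accuracy, "latency": latency}

def score(metrics, latency_weight=0.1):
    """Single objective: reward accuracy, penalize latency."""
    return metrics["accuracy"] - latency_weight * metrics["latency"]

def grid_search(space):
    """Try every combination in the search space and return the best config."""
    keys = list(space)
    candidates = [dict(zip(keys, values)) for values in product(*space.values())]
    return max(candidates, key=lambda c: score(evaluate(c)))

search_space = {
    "llm": ["small-model", "large-model"],
    "temperature": [0.0, 0.7],
    "max_tokens": [256, 512],
}
best = grid_search(search_space)
```

With these made-up numbers, the search picks the larger model at temperature 0.0 with 256 max tokens. A real optimizer would replace `evaluate` with actual agent runs and could add metrics such as groundedness or token usage to the objective.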
Start building your AI with Nemotron now
In this blog post, we’ve introduced the newest members of the Nemotron family and a small sample of what is possible with them.
To get started, download the Nemotron models and datasets from Hugging Face.
Nemotron Nano 2 VL is also hosted by inference providers including Baseten, Deep Infra, Fireworks, Hyperbolic, Nebius, and Replicate to provide an efficient path from development to production for agentic AI.
You can also evaluate the NVIDIA-hosted API endpoints on build.nvidia.com and OpenRouter.
Stay up to date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
- Share your ideas and vote on features to help shape the future of Nemotron.
- Tune in to upcoming Nemotron livestreams and connect with the NVIDIA Developer community through the Nemotron developer forum and the Nemotron channel on Discord.
- Browse video tutorials and livestreams to get the most out of NVIDIA Nemotron.