Agentic AI / Generative AI

Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety

Mar 24, 2026

By Chintan Patel, Maryam Motamedi, Chris Alexiuk, Moon Chung and Isabel Hulseman

Discuss (0)

AI-Generated Summary

Dislike

At GTC 2026, NVIDIA introduced the Nemotron 3 familya unified stack of specialized models including Nemotron 3 Super for long-context reasoning, Nemotron 3 Content Safety for multimodal moderation, VoiceChat for real-time speech interaction, and Nano Omni (upcoming) for enterprise-grade multimodal understanding, all designed for scalable agentic AI systems.
Nemotron 3 Super employs a hybrid Mamba-Transformer MoE architecture with NVFP4 precision on Blackwell GPUs, achieving high throughput and efficiency for multi-agent tasks, while Nemotron 3 Content Safety delivers low-latency, accurate safety moderation across multiple languages and modalities.
NVIDIA NeMo tools, such as the NeMo Evaluator and Agent Toolkit, enable robust benchmarking and end-to-end optimization of agentic AI systems, allowing developers to build, evaluate, and deploy scalable, trustworthy digital assistants with open models and recipes.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale, developers need models that can understand real-world multimodal data, converse naturally with users globally, and operate safely across languages and modalities.

At GTC 2026, NVIDIA introduced a new generation of NVIDIA Nemotron models designed to work together as a unified agentic stack:

NVIDIA Nemotron 3 Super for long-context reasoning and agentic tasks
NVIDIA Nemotron 3 Ultra (coming soon) for highest reasoning accuracy and efficiency among open frontier models
NVIDIA Nemotron 3 Content Safety for multimodal, multilingual content moderation
NVIDIA Nemotron 3 VoiceChat (in early access) for low latency, natural, full-duplex voice interactions
NVIDIA Nemotron 3 Nano Omni (coming soon) for enterprise-grade multimodal understanding
NVIDIA Nemotron RAG for generating embeddings for image and text modalities with NVIDIA Llama Nemotron Embed VL and for reordering image-or-text candidates when relevance depends on visual content with NVIDIA Llama Nemotron Rerank VL

Together with open data, training recipes, and NVIDIA NeMo tools, the Nemotron family of models provides an end-to-end toolkit to build, evaluate, and optimize production-grade agentic AI systems.

This blog explores the latest Nemotron 3 models, their performance, and how developers can use them to build scalable, multimodal, and real-time AI agents.

Power multi-agent systems with NVIDIA Nemotron 3 Super

Multi-agent systems suffer from “context explosion” with massive token histories 15x that of standard chat and a “thinking tax” with chain-of-thought reasoning for every decision. NVIDIA Nemotron 3 Super is an open hybrid mixture-of-experts (MoE) model that activates just 12B parameters per pass, delivering high accuracy and efficiency for a fraction of the compute.

A hybrid architecture with Mamba and Transformer layers, multi‑token prediction, and NVFP4 precision on NVIDIA Blackwell GPUs delivers up to 5x higher throughput than the previous generation while reducing memory footprint and cost. A configurable “thinking budget” lets developers bound chain‑of‑thought to keep latency and spend predictable, even for continuous agent workloads.

With a 1M-token context window and reinforcement learning across 10+ environments, Nemotron 3 Super excels at coding, math, instruction following, and function-calling, making it ideal for multi-agent applications—with significantly higher throughput on Blackwell when running in NVFP4.

Nemotron 3 Super uses latent MoE to call four expert specialists for the inference cost of only one, compressing tokens before they reach the experts.

External evaluations back this up. On the Artificial Analysis Intelligence Index for open‑weight models under 250B parameters, Nemotron 3 Super NVFP4 ranks among the top models, matching the highest intelligence scores from leading alternatives.

In the intelligence‑versus‑efficiency plot, Nemotron 3 Super lands in the most attractive upper‑right quadrant—combining strong task performance with high output throughput per GPU—making it a compelling choice for cost‑sensitive production agents.

Nemotron 3 Super—with open weights, open training data, and open development recipes—is ideal for software development, deep research, cybersecurity, and the financial services industry.

Keep agents safe with Nemotron 3 Content Safety

As agents expand from text‑only to multimodal workflows, safety guardrails must evolve across inputs, retrieval, and outputs. They must also be applicable in use cases like enterprise copilots and user-generated content (think dating apps or social media), and detect prompt injection in agentic systems such as healthcare, where self-harm is a concern.

Nemotron 3 Content Safety is a compact 4B‑parameter multimodal safety model that detects unsafe or sensitive content across text and images. Built on the Gemma‑3‑4B backbone with an adapter‑based classification head, it delivers high‑accuracy safety classification at low latency that’s ideal for production agentic pipelines. It fuses visual and language features to produce a simple safe/unsafe decision, with optional granular category labels. A quick keyword toggle lets developers choose between fast binary classification and full taxonomy reporting, supporting both low‑latency paths and deeper inspection.

On a suite of multimodal, multilingual safety benchmarks, Nemotron 3 Content Safety reaches approximately 84% accuracy, outperforming alternative safety models across the same tasks while keeping latency low enough for in‑line moderation in production pipelines.

The model uses the same 23‑category taxonomy as Aegis 1–3, covering classes such as hate, harassment, violence, sexual content, plagiarism, and unauthorized advice. Trained on high‑quality Aegis datasets and human‑annotated real‑world images—rather than primarily synthetic data—the model performs strongly across multimodal benchmarks in its 12 supported languages, with solid zero‑shot generalization beyond them.

Natural conversations with Nemotron 3 VoiceChat

Traditional voice AI relies on cascaded pipelines, automatic speech recognition (ASR), a large language model (LLM), and text-to-speech (TTS)—all of which introduce latency, complexity, and multiple points of failure.

Nemotron 3 VoiceChat is a 12B-parameter end-to-end speech model for full-duplex, real-time conversational AI, currently in early access. Unlike cascaded stacks, VoiceChat directly analyzes audio input and generates audio output in a unified and streaming LLM architecture. Using this single model eliminates multi-model orchestration. Built on the Nemotron Nano v2 LLM backbone with Nemotron speech (Parakeet encoder) and TTS decoder, VoiceChat delivers natural, interruptible conversations with low latency.

This model, in its early-access stage, has landed in the most attractive upper right quadrant of the Artificial Analysis Speech to Speech leaderboard. The graphic below plots conversational dynamics against speech reasoning performance, where Nemotron 3 VoiceChat lands in the highlighted upper‑right quadrant, alongside NVIDIA PersonaPlex, a full duplex, 7B-parameter research model. This means developers get both responsive turn‑taking behavior and strong reasoning over audio; both are critical for assistants that must sound natural and stay on task.

With a streamlined end-to-end pipeline, VoiceChat targets sub-300ms end-to-end latency, processing 80ms audio chunks faster than real-time. A single model means fewer points of failure, reduced technical debt, and easier deployment for conversational agents in healthcare, financial services, telecommunications, gaming, and more.

Understand the world with NVIDIA Nemotron 3 Omni

Agentic systems increasingly need to understand real-world data in different formats: video, audio, documents, UI screens, and reason across modalities. Existing solutions are either closed source or face compliance challenges for global enterprise deployment.

NVIDIA Nemotron 3 Nano Omni is the first open, production-ready native omni-understanding foundation model delivering high-context video reasoning enhanced through audio transcription. Nano Omni is powered by NVIDIA Nemotron speech (Parakeet encoder), state-of-the-art optical character recognition (OCR) reasoning with a Nemotron 3 Nano language backbone, and NVIDIA’s first GUI-trained system for real agentic applications.

The architecture uses 3D convolution layers (Conv3D) for efficient handling of temporal-spatial data in video, and efficient video sampling (EVS) enables processing of longer videos at the same computational cost by identifying and pruning temporally static patches. Stay tuned for release updates about this model.

Improve multimodal search with Llama Nemotron Embed VL and Rerank VL

Agentic RAG pipelines rely on retrieval to ground generation on evidence, not just prompts. But enterprise data lives in PDFs with charts, scanned contracts, tables, and slide decks—formats that text-only retrieval misses entirely.

Llama Nemotron Embed VL and Llama Nemotron Rerank VL are compact multimodal models that enable accurate visual document retrieval while remaining compatible with standard vector databases. On the ViDoRe V3/MTEB Pareto curve, which plots retrieval accuracy versus tokens processed per second on a single NVIDIA H100 GPU, Llama Nemotron Embed VL occupies the Pareto frontier. It delivers competitive or better accuracy at high throughput relative to both open and commercial alternatives.

Llama Nemotron Embed VL is a 1.7B-parameter dense embedding model that encodes page images and text into a single-dimensional vector, with support for Matryoshka embeddings. Built on NVIDIA Eagle—a frontier vision-language model with a Llama 3.2 1B backbone and SigLip2 400M vision encoder—it uses contrastive learning for query-document similarity and enables millisecond-latency search with standard vector databases.

Llama Nemotron Rerank VL is a 1.7B-parameter cross-encoder reranker that scores query-page relevance. When paired with the Llama Nemotron Embed VL model, it further increases accuracy by reranking retrieved text chunks and images.

Evaluate and optimize with NVIDIA NeMo

Building production agents requires not only strong models but also robust tools for evaluation and optimization. NVIDIA NeMo provides tools to evaluate, compare, and tune agentic systems:

NVIDIA NeMo Evaluator, enables robust, reproducible benchmarking with support for agentic evaluation. By providing standardized evaluation setups, developers can benchmark performance, validate outputs, and compare models under consistent conditions.
NVIDIA NeMo Agent Toolkit is an open source framework for profiling and optimizing agentic systems end-to-end. Bring agents from LangChain, AutoGen, AWS Strands, or other frameworks—without code changes—and get visibility into latency bottlenecks, token costs, and orchestration overhead to ship performant agents at scale.

Start building with Nemotron

Agentic AI is a shift from systems that respond to systems that act. It is a coordinated stack of models, tools, memory, and guardrails that can plan, execute, critique, and adapt. If it’s just a bigger model in the same chat window, it’s not agentic.

The Nemotron family of models, released under the NVIDIA permissive open model licenses, is built for this multi‑model reality. Nemotron 3 Super anchors long‑context reasoning and planning. Nemotron 3 Content Safety watches every step, moderating multimodal inputs, retrieved content, and outputs. Nemotron 3 VoiceChat turns that intelligence into full‑duplex, real‑time conversations. Nemotron 3 Nano Omni (coming soon) gives agents eyes and ears across video, audio, documents, charts, and GUIs. Around them, NeMo tools add retrieval, tool‑calling, evaluation, and judge models so agents can score their own work and improve.

Efficiency is the hidden requirement that makes production viable. Real agents make dozens or hundreds of model calls per task, so Nemotron models are right‑sized and optimized for throughput, latency, and cost. And because they’re open and customizable, teams can tune behaviors, align to their own data, and deploy where their security and compliance teams need them.

With Nemotron and NVIDIA NeMo, you’re getting the building blocks for trustworthy, repeatable, and scalable digital assistants for your production agentic systems.

Get started today:

Download the Nemotron models and datasets from Hugging Face.
Preview and access Nemotron Super here.
Access Nemotron 3 Content Safety here.
Preview and apply for early access to Nemotron 3 VoiceChat here.
Evaluate with NVIDIA NeMo Evaluator
Optimize with NeMo Agent Toolkit.
Evaluate NVIDIA-hosted API endpoints on build.nvidia.com and OpenRouter.

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.

Visit the Nemotron developer page for resources to get started. Explore open Nemotron models and datasets on Hugging Face and Blueprints on build.nvidia.com .

Engage with Nemotron livestreams, tutorials, and the developer community on the NVIDIA forum and Discord.

Discuss (0)

About the Authors

About Chintan Patel
Chintan Patel is a senior product manager at NVIDIA focused on bringing GPU-accelerated solutions to the HPC community. He leads the management and offering of the HPC application containers on the NVIDIA GPU Cloud registry. Prior to NVIDIA, he held product management, marketing and engineering positions at Micrel, Inc. He holds an MBA from Santa Clara University and a bachelor's degree in electrical engineering and computer science from UC Berkeley.

View all posts by Chintan Patel

About Maryam Motamedi
Maryam Motamedi is a product marketing lead for AI software at NVIDIA. She brings decades of cross-industry experience in media/AdTech, streaming, retail, and telecom. Maryam specializes in translating cutting-edge technology into real-world solutions, helping developers and enterprises build AI-powered applications that redefine how we connect, work, and interact.

View all posts by Maryam Motamedi

About Chris Alexiuk
Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.

View all posts by Chris Alexiuk

About Moon Chung
Moon Chung is a senior product marketing manager at NVIDIA specializing in Enterprise AI. She has previously worked for Meta and Adobe, focusing on product strategy, product development, and go-to-market strategy. Moon holds an MBA degree from Duke University’s Fuqua School of Business.

View all posts by Moon Chung

About Isabel Hulseman
Isabel Hulseman is a product marketing manager at NVIDIA, where she focuses on enterprise AI software and the rapidly evolving field of agentic AI. With over five years at NVIDIA and an MBA in Marketing, she specializes in go-to-market strategy and translating complex AI technologies into clear, actionable value for developers and enterprise teams. Her work is centered on enabling organizations to build AI agents that are specialized, well-governed, and ready for real-world deployment.

View all posts by Isabel Hulseman