Agentic AI, the next wave of generative AI, is a paradigm shift with the potential to revolutionize industries by enabling AI systems to act autonomously and achieve complex goals. Agentic AI combines the power of large language models (LLMs) with advanced reasoning and planning capabilities, opening a world of possibilities across industries, from healthcare and finance to manufacturing and logistics.
An agentic AI system combines perception, reasoning, and action to interact with its environment effectively. It gathers information from databases and external sources, analyzes goals, and develops strategies to achieve them.
The system’s action module executes decisions, while a memory of past interactions is retained to support long-term tasks and personalized responses. With multi-agent collaboration, agents can share information and coordinate efficiently on complex tasks.
AI agents are also equipped with feedback mechanisms, creating a data flywheel: data generated from their interactions is fed back into the system to enhance the models. Over time, this improves operational efficiency and drives better decisions.
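To make this loop concrete, here is a minimal sketch of the perceive-reason-act cycle with memory; all class and method names are illustrative, not part of any NVIDIA API:

```python
# Minimal sketch of an agentic loop: perception, reasoning, action, and memory.
# All names here are illustrative; this is not an NVIDIA API.
from dataclasses import dataclass, field

@dataclass
class Agent:
    goal: str
    memory: list = field(default_factory=list)  # retains past interactions

    def perceive(self, sources):
        """Gather information from databases and external sources."""
        return [src.query(self.goal) for src in sources]

    def reason(self, observations):
        """Analyze the goal and develop a strategy (e.g., via an LLM call)."""
        return f"plan for '{self.goal}' given {len(observations)} observations"

    def act(self, plan):
        """Execute the decision and record it for long-term context."""
        result = f"executed: {plan}"
        self.memory.append(result)  # feedback loop: interactions enrich future runs
        return result
```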
At the core of these systems are foundation models that provide capabilities such as language understanding, decision-making, reasoning, and instruction following.
Leading LLMs for agentic AI
Today, NVIDIA announced the Llama Nemotron family of agentic AI models, which provide the highest accuracy on a wide range of agentic tasks, exceptional compute efficiency, and an open license for enterprise use. In this post, we dive deeper into how this model family achieves leading accuracy across a diverse range of agentic AI tasks.
NVIDIA has been developing leaderboard-topping models in their size categories for various benchmarks, including Arena Hard for chat, IFEval for instruction following, and BFCL for function calling, by leveraging NVIDIA tuning techniques.
Simplify and accelerate agentic AI systems to market
NVIDIA is simplifying the development of AI agents by unifying the strengths of these models to provide a single model that supports a diverse range of tasks. Llama Nemotron excels across key agentic tasks, so that a single model can streamline the engineering process by replacing multiple specialized models.
These models can be easily customized with proprietary data to meet specific domain and task requirements using NVIDIA NeMo, aligned to follow instructions and generate human-preferred responses using NeMo Aligner, and used to quickly develop AI agents with NVIDIA AI Blueprints, which use NVIDIA NIM and NeMo microservices as building blocks.
The models are also available as portable NIM microservices, providing maximum inference throughput on NVIDIA-accelerated infrastructure.
Optimized for compute efficiency
The Llama Nemotron family is optimized for various compute resources, ensuring optimal performance across different environments:
- Nano: A model optimized for accuracy and performance on NVIDIA RTX AI PCs and workstations, enabling agentic workflows for PC application developers.
- Super: A high-accuracy model offering exceptional throughput on a single GPU.
- Ultra: The highest-accuracy model, designed for data-center-scale applications demanding the highest performance.
Other integral models for agentic AI systems include the following:
- Retrieval: The MTEB leaderboard-topping embedding and reranking models enable the most precise and context-aware responses grounded in proprietary data.
- Reward: The top RewardBench leaderboard model, llama-3.1-nemotron-70b-reward, provides superior capabilities for evaluating whether LLM responses align with human preferences.
Curating high-quality data for model alignment
High-quality training data plays a critical role in the accuracy and quality of responses from a custom LLM, but robust datasets can be prohibitively expensive and difficult to create.
Synthetic data addresses these challenges by generating large-scale data that can be further curated to improve quality. NVIDIA NeMo Curator helps build high-quality multimodal training data by downloading, extracting, cleaning, filtering, deduplicating, and blending the original data at scale.
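NeMo Curator implements these stages at scale; the following plain-Python sketch illustrates the same clean, filter, and deduplicate flow on a small document list (it does not use the NeMo Curator API):

```python
# Plain-Python illustration of the curation stages NeMo Curator runs at scale:
# clean, filter, and deduplicate text records. Not the NeMo Curator API.
import hashlib
import re

def clean(text: str) -> str:
    """Strip markup remnants and normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    return re.sub(r"\s+", " ", text).strip()

def quality_filter(text: str, min_words: int = 20) -> bool:
    """Keep documents long enough to carry training signal."""
    return len(text.split()) >= min_words

def deduplicate(docs):
    """Exact dedup via content hashing; fuzzy dedup would use MinHash/LSH."""
    seen, unique = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

def curate(raw_docs):
    cleaned = [clean(d) for d in raw_docs]
    filtered = [d for d in cleaned if quality_filter(d)]
    return deduplicate(filtered)
```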
The recently published state-of-the-art NVIDIA Llama 3.1 Nemotron 70B Instruct model was aligned for human preferences using real-world and synthetic data, the NVIDIA Llama 3.1 Nemotron Reward model, and NeMo Aligner.
Achieving world-class LLM accuracy across benchmarks
NVIDIA is leveraging the Llama family, among the most popular open models, and NVIDIA customization techniques to build models with state-of-the-art accuracy for various agentic AI tasks, including instruction following, tool calling, chat, coding, and math.
The models are pruned to reduce latency and improve compute efficiency, then retrained on a high-quality dataset with distillation and alignment methods to increase accuracy across tasks. This results in smaller models with high accuracy and throughput.
The NVIDIA pruning and distillation technique used for the nvidia/Minitron-4B-Base model includes a teacher correction step, which adapts any model into a teacher model using custom training data, followed by structured pruning and knowledge distillation. For more information, see How to Prune and Distill Llama-3.1 8B to an NVIDIA Llama-3.1-Minitron 4B Model.
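As an illustration of the pruning step, the following PyTorch sketch removes low-importance hidden neurons from a transformer FFN; the importance heuristic and keep ratio are simplifying assumptions, not the Minitron recipe itself:

```python
# Simplified structured pruning of a transformer FFN in PyTorch:
# rank hidden neurons by weight importance and keep the top fraction.
# A sketch of one step in a Minitron-style recipe, not NVIDIA's implementation.
import torch
import torch.nn as nn

def prune_ffn(up_proj: nn.Linear, down_proj: nn.Linear, keep_ratio: float = 0.5):
    """up_proj: d_model -> d_ff, down_proj: d_ff -> d_model."""
    # Importance of each hidden neuron: L2 norm of its incoming weights.
    importance = up_proj.weight.norm(dim=1)          # shape: [d_ff]
    k = int(keep_ratio * importance.numel())
    keep = importance.topk(k).indices.sort().values  # preserve neuron order

    new_up = nn.Linear(up_proj.in_features, k, bias=up_proj.bias is not None)
    new_down = nn.Linear(k, down_proj.out_features, bias=down_proj.bias is not None)
    with torch.no_grad():
        new_up.weight.copy_(up_proj.weight[keep])
        new_down.weight.copy_(down_proj.weight[:, keep])
        if up_proj.bias is not None:
            new_up.bias.copy_(up_proj.bias[keep])
        if down_proj.bias is not None:
            new_down.bias.copy_(down_proj.bias)
    return new_up, new_down
```

After pruning, the smaller network is retrained with distillation against the corrected teacher to recover accuracy.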
The NVIDIA alignment recipe using NeMo Aligner helped make the model world-class, with state-of-the-art accuracy in instruction following, function calling, and math, which are essential capabilities for agentic systems.
Building efficient LLMs with neural architecture search
Agentic systems must be computationally efficient to handle complex tasks in real time. However, the substantial computational demands of LLMs can hinder their deployment in these complex systems without optimizations that carefully balance performance and resource constraints. Overcoming these challenges necessitates the development of lean, hardware-optimized model architectures that maintain high performance while ensuring practical and scalable deployment.
NVIDIA recently developed a Neural Architecture Search (NAS) methodology and associated training techniques to create transformer models specifically optimized for efficient inference.
NAS represents a transformative approach to designing LLMs for optimized performance on specific hardware platforms. While many LLMs are traditionally built with a uniform structure of repeated, identical blocks, NAS offers a more nuanced approach by exploring extensive design spaces and a wide variety of non-standard transformer blocks:
- Alternative attention mechanisms
- Diverse feed-forward network (FFN) blocks with varying levels of efficiency
- Complete elimination of certain building blocks
A central component of this methodology is block distillation, which enables the efficient training of diverse block variants by using a teacher-student framework. The teacher model provides input-output mappings that the student blocks are trained to mimic.
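A minimal sketch of this idea: a cheaper student block (here, a narrower FFN) is trained to reproduce a frozen teacher block's input-output mapping. Block definitions and sizes are illustrative:

```python
# Block distillation sketch: train a cheaper student block to mimic the
# input-output mapping of a teacher block. Blocks here are illustrative.
import torch
import torch.nn as nn

d_model, d_ff_teacher, d_ff_student = 512, 2048, 512

teacher_block = nn.Sequential(  # frozen reference block
    nn.Linear(d_model, d_ff_teacher), nn.GELU(), nn.Linear(d_ff_teacher, d_model)
)
student_block = nn.Sequential(  # cheaper candidate variant
    nn.Linear(d_model, d_ff_student), nn.GELU(), nn.Linear(d_ff_student, d_model)
)

opt = torch.optim.AdamW(student_block.parameters(), lr=1e-4)
for step in range(100):
    x = torch.randn(32, d_model)      # stand-in for real hidden states
    with torch.no_grad():
        target = teacher_block(x)     # teacher provides the mapping to mimic
    loss = nn.functional.mse_loss(student_block(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```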
An algorithm called Puzzle is used to evaluate and rank alternative architectural components, akin to assembling a puzzle where each piece represents a different block variant. This process navigates the vast design space to identify models that balance accuracy with strict constraints like memory and throughput.
- Crafting the puzzle pieces: Applying block-wise local distillation to every alternative subblock replacement in parallel and scoring its quality and inference cost to build a library of blocks.
- Assembling the puzzle architecture: Using mixed-integer programming to assemble a heterogeneous architecture that optimizes quality under constraints such as throughput, latency, and memory usage (a simplified sketch follows this list).
- Uptraining: The reassembled architecture is trained with global knowledge distillation to strengthen inter-block compatibility.
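The sketch below is a simplified stand-in for the assembly step: it selects one block variant per layer to maximize total quality under an integer cost budget, using dynamic programming in place of the mixed-integer program:

```python
# Simplified stand-in for Puzzle's assembly step: pick one block variant per
# layer to maximize total quality under a cost budget. The real method uses
# mixed-integer programming; this sketch uses dynamic programming instead.
def assemble(layers, budget):
    """layers: per-layer lists of (quality, cost) variants with integer costs.
    Returns (best_quality, chosen variant indices)."""
    NEG = float("-inf")
    dp = [(NEG, [])] * (budget + 1)
    dp[0] = (0.0, [])
    for variants in layers:
        nxt = [(NEG, [])] * (budget + 1)
        for spent, (q, picks) in enumerate(dp):
            if q == NEG:
                continue
            for i, (vq, vc) in enumerate(variants):
                if spent + vc <= budget and q + vq > nxt[spent + vc][0]:
                    nxt[spent + vc] = (q + vq, picks + [i])
        dp = nxt
    return max(dp, key=lambda s: s[0])

# Example: 3 layers, each with (quality, cost) variants, e.g. full attention,
# a cheaper attention block, and an identity (block removed entirely).
layers = [[(1.0, 4), (0.8, 2), (0.3, 0)]] * 3
print(assemble(layers, budget=6))  # -> (2.4, [1, 1, 1])
```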
By incorporating knowledge-distillation (KD) loss during both scoring and training, the methodology reduces the accuracy gap between optimized models and their reference counterparts while requiring only a fraction of the training costs.
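A common form of the KD loss blends temperature-scaled soft targets from the teacher with hard-label cross-entropy; the exact loss used in this methodology may differ, but the standard formulation looks like this:

```python
# Generic knowledge-distillation loss: KL divergence between temperature-
# scaled teacher and student logits, blended with hard-label cross-entropy.
# The exact loss in NVIDIA's recipe may differ; this is the standard form.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # rescale to keep gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```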
After a series of advanced fine-tuning steps with NVIDIA NeMo Aligner, the resulting models offer responses that align with human preferences, generate significant inference throughput speedups on the target NVIDIA GPUs, and provide world-class performance in areas relevant to agentic workloads.
NeMo Aligner is a scalable and efficient toolkit for model alignment that features state-of-the-art algorithms such as reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and SteerLM.
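Of these, DPO has a particularly compact objective: it rewards the policy for preferring the chosen response over the rejected one by a wider margin than a frozen reference model does. A generic sketch, not the NeMo Aligner implementation:

```python
# Generic direct preference optimization (DPO) loss: log-sigmoid of the
# policy's preference margin over a frozen reference model. Inputs are the
# summed log-probabilities of the chosen and rejected responses.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    margin = (policy_chosen_logp - ref_chosen_logp) - \
             (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(beta * margin).mean()
```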
This approach led to the development of the Llama Nemotron Super and Llama Nemotron Ultra models.
Open LLMs
Llama Nemotron models offer a commercially viable solution under the NVIDIA open license, which enables enterprises to customize these models to align with their use cases and requirements while maintaining control over their data.
The open license also provides the flexibility to deploy these powerful models across diverse environments, whether on-premises, in the cloud, or at the edge, ensuring that businesses can leverage the benefits of Llama Nemotron models in the context that best suits their operational needs and strategic goals.
Getting started
Simplify the development and deployment of custom AI agents that can reason, plan, and take action with new NVIDIA AI Blueprints for agentic AI.
Sign up to get notified when the new Llama Nemotron models are available as NIM microservices with API endpoints. They will be available for download from NVIDIA NGC and Hugging Face, and customizable with NVIDIA NeMo.
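NIM microservices expose an OpenAI-compatible API, so invoking a model once the endpoints are live typically looks like the following; the model identifier below is a placeholder until the Llama Nemotron endpoints are published:

```python
# Calling a NIM endpoint through its OpenAI-compatible API. The model name
# below is a placeholder; use the identifier published when the Llama
# Nemotron NIM microservices become available.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # hosted NIM API gateway
    api_key=os.environ["NVIDIA_API_KEY"],
)

completion = client.chat.completions.create(
    model="nvidia/llama-nemotron-super",  # placeholder model identifier
    messages=[{"role": "user",
               "content": "Plan the steps to resolve a customer support ticket."}],
    temperature=0.2,
)
print(completion.choices[0].message.content)
```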