As enterprise adoption of agentic AI accelerates, teams face a growing challenge of scaling intelligent applications while managing inference costs. Large language models (LLMs) offer strong performance but come with substantial computational demands, often resulting in high latency and costs.
At the same time, many development workflows—such as evaluation, data curation, and fine-tuning—remain largely manual. These processes are time consuming, difficult to automate, and don’t scale effectively.
Adding to the complexity, AI agents increasingly rely on multiple specialized models for tasks like reasoning, tool routing, and summarization. Each of these components has distinct performance characteristics and optimization requirements—making it difficult to evaluate and tune them individually at scale.
To address this, NVIDIA is introducing the NVIDIA AI Blueprint for building data flywheels, a reference architecture built on NVIDIA NeMo microservices. This blueprint enables teams to continuously distill LLMs into smaller, cheaper, and faster models without compromising accuracy, using real-world production traffic from AI agent interactions. It automates the execution of structured experiments that explore the space of available models, surfacing promising, efficient candidates for production promotion or deeper manual evaluation.
This blog unpacks the Data Flywheel Blueprint: how it works, how to apply it to a real-world use case around agentic tool calling, and how to easily configure it to build data flywheels for your own agentic AI workflows. This demo notebook puts the flywheel into action on a pre-curated customer service agent dataset.
How does it work?
This version of the blueprint is designed to help teams replicate the capabilities of large foundation models (e.g., 70B parameters) with smaller, more efficient alternatives. By continuously benchmarking existing and newly released models against production tasks, fine-tuning promising candidates, and surfacing top-performing smaller models, the system enables teams to reduce latency and inference costs without compromising model accuracy.

At the heart of the Data Flywheel Blueprint is the Flywheel Orchestrator Service, a unified control plane that abstracts the complexity of interacting directly with NeMo microservices. Acting as the brain of the flywheel system, the Orchestrator API coordinates the data flywheel job by leveraging a suite of modular NeMo microservices:
- NVIDIA NeMo Customizer to handle lightweight LoRA-based fine-tuning
- NVIDIA NeMo Evaluator to automate evaluations using LLM-as-a-judge scoring
- NeMo Datastore to manage structured datasets and artifacts
- NeMo Deployment Manager to spin up and serve candidate models (NIMs) dynamically for inference
Here’s how the flywheel works, step by step:
Step 1: Log ingestion
Production prompt/response logs from the teacher base NIM (e.g., a large 70B model) are ingested into an Elasticsearch index. These logs follow an OpenAI-compatible format, as in the sketch below.
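For illustration, a single ingested record might look like the following. The `messages`, `tools`, and `tool_calls` fields follow the standard OpenAI chat-completions schema; the `request`/`response` wrapper and the exact field set shown here are assumptions for illustration:

```json
{
  "request": {
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
      {"role": "user", "content": "Where is my order 8812?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_order_status",
          "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"]
          }
        }
      }
    ]
  },
  "response": {
    "choices": [
      {
        "message": {
          "role": "assistant",
          "tool_calls": [
            {
              "type": "function",
              "function": {
                "name": "get_order_status",
                "arguments": "{\"order_id\": \"8812\"}"
              }
            }
          ]
        }
      }
    ]
  }
}
```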
Step 2: Tagging for partitioning
Each log is tagged with metadata such as `workload_id`, enabling the system to isolate and process data by task for each agent node.
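Continuing the sketch above, the partitioning metadata might be stored alongside each record. The `workload_id` and `client_id` names mirror the job parameters shown later in this post; the exact tagging schema is an assumption:

```json
{
  "workload_id": "tool_router",
  "client_id": "support-app",
  "timestamp": "2025-06-01T12:00:00Z"
}
```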
Step 3: Dataset creation
The orchestrator de-duplicates the logs and transforms them into task-aligned datasets for training and evaluation. Notably, these datasets do not rely on any external ground-truth labels—only on the original model’s responses—which eliminates the need for manual labeling. These datasets are stored in the NeMo Datastore.
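Conceptually, a de-duplicated log like the one sketched in Step 1 becomes a training record that pairs the original prompt with the teacher's tool call as the target. The exact dataset schema may differ; this is a hypothetical sketch:

```json
{
  "messages": [
    {"role": "user", "content": "Where is my order 8812?"},
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_order_status",
            "arguments": "{\"order_id\": \"8812\"}"
          }
        }
      ]
    }
  ]
}
```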
Step 4: Fine-tuning jobs
Using NeMo Customizer, supervised fine-tuning jobs are launched with LoRA adapters. Each job distills knowledge from the larger teacher model into smaller, task-specific candidates, training on the datasets stored in the NeMo Datastore rather than on handcrafted data.
Step 5: Evaluation runs
The NeMo Evaluator benchmarks multiple candidate NIM models using three evaluation methods:
- `base-eval`: Zero-shot baseline on raw logs
- `icl-eval`: Few-shot, in-context evaluation
- `customized-eval`: Evaluation after fine-tuning
Step 6: Scoring and aggregation
Model outputs are scored using the NeMo Evaluator, including LLM-as-a-judge, and results are reported back through the Orchestrator API.
Step 7: Review and promotion
Developers and admins can programmatically access metrics, download artifacts, launch follow-up experiments, or promote top-performing candidates to production to replace the larger NIM.
This entire loop can be scheduled or triggered on-demand, creating an automated, scalable system that continuously and progressively surfaces smaller, faster, and more cost-efficient models—while preserving the accuracy of the larger baseline model.
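For example, a recurring run can be as simple as a cron entry that calls the blueprint's job-launch endpoint (shown in full later in this post); the schedule and host below are illustrative:

```bash
# Illustrative cron entry: launch a flywheel job every Sunday at 02:00
0 2 * * 0 curl -X POST http://localhost:8000/api/jobs -H "Content-Type: application/json" -d '{"workload_id": "tool_router", "client_id": "support-app"}'
```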
Adopting the Data Flywheel Blueprint for agentic tool-calling
To demonstrate the value and use of the Data Flywheel Blueprint, we applied it to a high-impact use case: agentic tool calling. This is particularly relevant for production AI agents that must interface reliably with external systems via structured API calls.
We built upon the NVIDIA AI Blueprint for AI virtual assistants, a multi-tool agent designed for customer support tasks like product Q&A, order tracking, returns, and casual conversation. Initially, this agent was powered by a large Llama-3.3-70B-instruct model that performed tool-calling with high accuracy. We wanted to tune a smaller model to match the accuracy of the larger one while reducing costs. In production, this would be driven by real customer interactions to continuously improve tool-calling performance.
To simulate production traffic, we synthetically generated natural language queries and captured the agent’s tool-calling behavior in OpenAI-compatible request-response logs. These logs became the basis for both training and evaluation datasets.
Using this setup, we ran three optimization experiments with the Data Flywheel Blueprint to assess the tool-calling performance of smaller candidate models:
- Zero-shot prompting (`base-eval`): The models (including the larger base NIM) were evaluated on production-like prompts without any prior examples or customization. This serves as a gold standard, capturing the original behavior of the large model and providing a reference point against which smaller models are benchmarked.
- In-context learning (`icl-eval`): Few-shot examples were prepended to each prompt. These examples came from actual production traffic and were automatically sampled and formatted by the flywheel based on the ICL configuration (config.yaml) that comes with the blueprint. This approach tests how much a model can improve through contextual prompting alone.
- Supervised fine-tuning with LoRA (`customized-eval`): Using the `training_config` and `lora_config` sections, we fine-tuned models with LoRA adapters, training on curated, task-specific datasets derived from production logs. Those models were then evaluated to measure gains over `base-eval` and `icl-eval`.
The NeMo Evaluator, using LLM-as-a-judge capabilities, automatically scored all model outputs without requiring human labels. Each run provided structured metrics like `function_name_and_args_accuracy` and `tool_calling_correctness`, which were logged and surfaced through the Orchestrator API for review and comparison. The whole process is fully automated, so developers can programmatically access evaluation reports and promote top-performing candidates to replace the original NIM.
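For instance, results might be pulled with a call along these lines. The exact route is an assumption (the blueprint's documented POST endpoint is `/api/jobs`, so a per-job GET is sketched here); consult the repo's API reference for the authoritative interface:

```bash
# Hypothetical: retrieve evaluation results for a completed flywheel job
curl -X GET http://localhost:8000/api/jobs/<job-id> \
  -H "Content-Type: application/json"
```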
The outcome: A fine-tuned Llama-3.2-1B model using SFT-LoRA achieved 98% of the tool-calling accuracy of the original 70B model. This was enabled by automated, repeated cycles of fine-tuning and evaluation, as more traffic data flowed through the blueprint.
The result is a dramatically lighter model that retains accuracy while offering substantial improvements in latency and cost. Notably, the optimized Llama-3.2-1B model requires only one GPU to serve, compared to two GPUs needed for the original Llama-3.3-70B.
Check out this demo video on how to launch this blueprint and put these experiments into action.
Configuring and running the blueprint
To get the Data Flywheel Blueprint up and running, you'll first need to set up your environment and deploy the required services. Detailed instructions can be found in the GitHub repo README, but the steps broadly involve:
- Generating a personal API key to deploy NeMo microservices, access models hosted as NIMs, and download models on-premises
- Deploying the NeMo microservices platform
- Installing and configuring the Data Flywheel Orchestrator
Once the environment is ready, you'll configure your models and workflows using a `config.yaml` file. This file defines everything from which NIM models to deploy to configurations for fine-tuning (e.g., LoRA) and evaluation (e.g., ICL, LLM-as-a-judge).
Note: This file is loaded when the system starts, and its settings remain static during a flywheel run. To update anything, you must stop the services, modify the YAML, and redeploy.
Below are the essential steps and configuration blocks you’ll need to review or modify to launch your flywheel job.
Model settings
Specify the NIMs to be deployed and evaluated. You can set the model name, context length, GPU count, and other options such as whether customization is enabled. See the list of models that NeMo Customizer supports for fine-tuning.
```yaml
nims:
  - model_name: "meta/llama-3.2-1b-instruct"
    context_length: 32768
    gpus: 1
    pvc_size: 25Gi
    tag: "1.8.3"
    customization_enabled: true
```
Fine-tuning
Specify the fine-tuning parameters. These include training type, fine-tuning type, batch size, and the number of training epochs.
```yaml
training_config:
  training_type: "sft"
  finetuning_type: "lora"
  epochs: 2
  batch_size: 16
  learning_rate: 0.0001

lora_config:
  adapter_dim: 32
  adapter_dropout: 0.1
```
In-context learning (ICL) settings
Configure how many few-shot examples are used, and the context-window budget available for them, when evaluating models in ICL mode.
```yaml
icl_config:
  max_context_length: 32768
  reserved_tokens: 4096
  max_examples: 3
  min_examples: 1
```
With the defaults above, the budget left for few-shot examples is presumably `max_context_length` minus `reserved_tokens` (32,768 - 4,096 = 28,672 tokens), with the reserved portion set aside for the rest of the prompt and the response.
Evaluation settings
Controls how data is split into validation and evaluation sets for each evaluation job.
```yaml
data_split_config:
  eval_size: 20
  val_ratio: 0.1
  min_total_records: 50
  random_seed: null
  limit: null
```
- `eval_size`: Number of examples used for evaluation
- `val_ratio`: Ratio of data used for validation
Launching the flywheel job
Once configured, launch the job via a simple API call to the microservice.
```bash
# client_id: Identifier of the application or deployment that generated traffic
# workload_id: Stable identifier for the logical task / route / agent node
curl -X POST http://localhost:8000/api/jobs \
  -H "Content-Type: application/json" \
  -d '{"workload_id": "tool_router", "client_id": "support-app"}'
```
A successful run returns tool-calling accuracy metrics that can be used to compare performance across models:
"scores": {
"function_name_and_args_accuracy": 0.95,
"tool_calling_correctness": 1
}
Extending the blueprint to custom workflows
The blueprint is a reference workflow that can be easily customized to build data flywheels for any downstream task. NVIDIA has already seen early adoption across its partner ecosystem.
- Weights & Biases offers a custom version of this data flywheel blueprint on the NVIDIA API catalog augmented with tools for agent traceability and observability, model experiment tracking, evaluation, and reporting.
- Iguazio, a machine learning firm acquired by QuantumBlack, AI by McKinsey, has adapted the blueprint to build its own custom data flywheel with AI orchestration and monitoring components to power its AI platform. This is also available as a partner example on the NVIDIA API catalog.
- Amdocs has integrated this blueprint into its amAIz platform, embedding LLM fine-tuning and evaluation directly into the CI/CD pipeline. With added automation and enhancements, this enables Amdocs to continuously improve its agents' accuracy and performance as new base models emerge, while also catching potential issues early in the development cycle.
- EY is integrating the blueprint to enhance the EY.ai Agentic Platform with real-time model optimization, which enables self-improving, cost-efficient agents across tax, risk, and finance domains.
- VAST is designing its own data flywheels for custom use cases by integrating the VAST AI Operating System with NVIDIA's Data Flywheel Blueprint. The integration will enable real-time data collection, enrichment, and feedback across multimodal sources, accelerating the delivery of intelligent AI pipelines for industries such as finance, healthcare, and scientific research.
Build a data flywheel for your use case
Explore the NVIDIA AI Blueprint for data flywheels on the NVIDIA API catalog, and dive into set-up guides, implementation details, and tutorials. Follow the video tutorial for a hands-on walkthrough on building the flywheel for the agentic tool-calling use case covered in the blog.
Developers building agentic workflows with the new NVIDIA NeMo Agent toolkit can seamlessly build data flywheels around their agents using this blueprint and integrate the toolkit’s evaluation and profiling capabilities.
Join our team on June 18 for a live webinar where our experts break down how the NVIDIA NIM and NeMo microservices power data flywheels. Interact with the NVIDIA blueprint product team at our upcoming livestream Q&A session on June 26 to dive into building data flywheels with ease using this latest blueprint.