
Accelerating Federated Learning Research with AI Agents and NVIDIA FLARE Auto-FL

Jun 09, 2026

By Holger Roth, Ziyue Xu, Chester Chen and Peter Cnudde


AI-Generated Summary


NVIDIA FLARE Auto-FL automates federated learning research by constraining agent actions through a control plane (program.md), enforcing fixed benchmark contracts, and using an experiment ledger to ensure reproducibility, comparability, and protocol stability.
The system supports bounded mutation of FL strategies (e.g., FedAvg, FedOpt, SCAFFOLD, FedProx, bounded architecture search), integrates literature-grounded recovery to escape local optima, and packages all experiment components, including harnesses, mutation schemas, and reporting utilities, in a single workflow.
Auto-FL's modular design enables adaptation to diverse FL tasks and datasets, demonstrated through both CIFAR-10 and federated visual language model experiments, while maintaining rigorous control over mutation scope, evaluation metrics, and experiment documentation.

AI-generated content may summarize information incompletely. Verify important information.

Federated learning (FL) research often begins with a deceptively simple question: What should we try next? A new aggregation rule, a FedProx coefficient, a server optimizer setting, a SCAFFOLD variant, or a model architecture tweak may all look promising before an experiment starts.

After the run finishes, the harder questions begin: Did the change actually improve the metric? Was the comparison fair? Was the lift worth the runtime? Should the idea be kept, narrowed, or discarded?

This post introduces a new NVIDIA FLARE example that shows how bounded AI agent actions, fixed benchmark contracts, experiment ledgers, literature-grounded recovery, and reproducible reporting can help FL researchers evaluate more ideas more quickly.

What is Auto-FL in NVIDIA FLARE?

NVIDIA FLARE Auto-FL is an automated, AI-driven research loop designed to test and optimize federated learning strategies.

The idea is straightforward: start with a comparable benchmark task, give the agent a clear research control plane, set a fixed training budget, constrain the mutation surface, and record every result in an experiment ledger. From there, the agent can autonomously iterate through candidate FL strategies while preserving the FLARE Client API and Recipe API contracts.

Rather than handing an agent an open-ended research problem, Auto-FL begins with a fair, comparable benchmark: a bounded FL simulation with a fixed training budget and consistent scoring. From that shared baseline, the agent can explore candidate FL strategies within a structured workflow that maintains the protocol’s stability, keeps comparisons measurable, and traces results.

A useful agent-led experiment loop should be constrained enough to avoid breaking the FL contract, measurable enough to compare ideas, stable enough for long-running autonomous campaigns, and detailed enough to turn a completed Auto-FL campaign into a reproducible, sourced report—not just a directory full of logs.

Figure 1 shows the progress of an Auto-FL campaign run on the NVIDIA FLARE CIFAR-10 simulation harness. Each point represents a candidate run recorded in the experiment ledger: gray points are discarded runs, blue points are active candidates, and green points are kept runs. The green step line tracks the best observed cross-site evaluation score over time, and purple lines mark logged literature-review events.

How does Auto-FL make the research loop explicit?

Coding agents are useful for quickly making complex code changes. FL experiments differ from ordinary local model tuning because the correctness of the experiment depends on a contract among the server, clients, model updates, metadata, data splits, and evaluation logic. A candidate can raise the reported score while quietly changing what is being compared—for example, by altering the evaluation data, model capacity, communication budget, local compute, or server-client update semantics.

Auto-FL makes the research loop explicit. The agent begins with program.md, which acts as the control plane. It then proposes a bounded change, runs the same benchmark budget, extracts a comparable score, appends the result to results.tsv, and uses the ledger to decide which candidate run to keep or discard. The human can interrupt the campaign at any point and analyze the experiment history.
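The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code that ships with the example: `run_candidate` is a hypothetical stand-in for launching the fixed benchmark, the column names are loosely based on the ledger fields listed in this post, and the keep rule (strictly beat the best score so far) is an assumed policy.

```python
import csv
import io

# Hypothetical column names, loosely based on the ledger fields this post lists.
FIELDS = ["candidate", "score", "runtime_s", "status", "description"]


def append_result(ledger, row):
    """Append one run to a TSV ledger held in a seekable text stream."""
    writer = csv.DictWriter(ledger, fieldnames=FIELDS, delimiter="\t")
    if ledger.tell() == 0:  # empty ledger: write the header first
        writer.writeheader()
    writer.writerow(row)


def run_campaign(candidates, run_candidate, ledger, baseline_score):
    """Run every candidate under the same budget and record each outcome."""
    best = baseline_score
    for name, description in candidates:
        # run_candidate is a stand-in for launching the fixed FL benchmark.
        score, runtime_s = run_candidate(name)
        status = "keep" if score > best else "discard"
        if status == "keep":
            best = score
        append_result(ledger, {
            "candidate": name,
            "score": score,
            "runtime_s": runtime_s,
            "status": status,
            "description": description,
        })
    return best


if __name__ == "__main__":
    # Made-up scores purely to exercise the loop; not results from the post.
    fake_runs = {
        "lr_0.05": (0.70, 10.0),
        "median_agg": (0.65, 11.0),
        "fedprox_mu_0.01": (0.74, 12.0),
    }
    ledger = io.StringIO()
    candidates = [(name, "illustrative candidate") for name in fake_runs]
    best = run_campaign(candidates, fake_runs.get, ledger, baseline_score=0.68)
    print(ledger.getvalue())
    print("best score:", best)
```

Note that discarded runs are still written to the ledger. Only the running best changes when a candidate is kept, which is what keeps the history useful for later review.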

What components does Auto-FL provide?

Auto-FL packages the components needed to run that operating model in a single place. It includes a ready-to-run experimental harness within a task profile: FLARE baseline recipes in job.py, a Client API training loop in client.py, custom FL aggregation hooks, additional model and training utilities, and mutation guardrails. The package also includes run scripts, plotting utilities, templates, and a reporting skill for completed campaigns.

A task profile can define a supported strategy surface with FedAvg, FedOpt-style server updates, FedAdam, SCAFFOLD, median aggregation, and FedProx hooks. Auto-FL can also support bounded architecture search. That matters because architecture search can otherwise turn a comparison of federated algorithms into an uncontrolled model-capacity comparison.
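To make two of these aggregation choices concrete, here is a small standard-library sketch of a sample-weighted average (the FedAvg-style rule) next to a coordinate-wise median. It is not the implementation in `custom_aggregators.py`, whose contents are not shown in this post; the function names and the flat-list representation of model parameters are assumptions made for illustration.

```python
from statistics import median


def fedavg(updates, num_samples):
    """Sample-weighted average of client parameter vectors (FedAvg-style)."""
    total = sum(num_samples)
    dim = len(updates[0])
    return [
        sum(u[i] * n for u, n in zip(updates, num_samples)) / total
        for i in range(dim)
    ]


def coordinate_median(updates):
    """Coordinate-wise median across clients; less sensitive to outliers."""
    return [median(column) for column in zip(*updates)]


if __name__ == "__main__":
    # Three clients, two parameters each; the third client is an outlier
    # in the first coordinate and holds twice as many samples.
    updates = [[1.0, 2.0], [3.0, 4.0], [100.0, 6.0]]
    print(fedavg(updates, num_samples=[1, 1, 2]))
    print(coordinate_median(updates))
```

The weighted average follows the outlier client, while the median ignores it. Swapping one rule for the other is the kind of bounded change that leaves the rest of the benchmark contract untouched.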

Component	Category	Role
`program.md`	Main entry point	Agent-facing research control plane
`job.py` and `client.py`	Task profile	FLARE Recipe API and Client API harness for FL experimentation
`custom_aggregators.py`	Task profile	FedAvg, FedOpt/FedAdam, SCAFFOLD, median, and related hooks
`mutation_schema.yaml`	Task profile	Bounded mutation surface for agent changes
`results.tsv`	Ledger	Experiment ledger for score, runtime, status, target, description, and artifacts
`plot_progress.py`	Utility	Progress plot generated from the ledger
`autofl-nvflare`	Skill	NVFlare-based Auto-FL harness that follows an autoresearch-style loop
`autofl-nvflare-report`	Skill	Post-campaign reporting flow for stopped runs

Table 1. Key components in Auto-FL
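One way to picture a bounded mutation surface is as a schema that every proposed change is checked against before a run starts. The sketch below is hypothetical: the real format of `mutation_schema.yaml` is not shown in this post, so the keys, bounds, and the `validate_mutation` helper are all assumptions, written as a plain Python dictionary to stay dependency-free.

```python
# Hypothetical schema: keys and bounds are illustrative, not taken from
# the mutation_schema.yaml that ships with the example.
SCHEMA = {
    "learning_rate": {"type": float, "min": 1e-4, "max": 1e-1},
    "aggregator": {"type": str, "choices": ["fedavg", "fedadam", "scaffold", "median"]},
    "fedprox_mu": {"type": float, "min": 0.0, "max": 1.0},
}


def validate_mutation(mutation, schema=SCHEMA):
    """Return a list of violations; an empty list means the change is in bounds."""
    errors = []
    for key, value in mutation.items():
        rule = schema.get(key)
        if rule is None:
            errors.append(f"{key}: not part of the allowed mutation surface")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{key}: expected {rule['type'].__name__}")
            continue
        if "choices" in rule and value not in rule["choices"]:
            errors.append(f"{key}: {value!r} is not an allowed choice")
        if "min" in rule and not rule["min"] <= value <= rule["max"]:
            errors.append(f"{key}: {value} is outside [{rule['min']}, {rule['max']}]")
    return errors


if __name__ == "__main__":
    print(validate_mutation({"learning_rate": 0.01, "aggregator": "median"}))
    # num_rounds is part of the fixed budget in this sketch, so it is rejected.
    print(validate_mutation({"num_rounds": 50}))
```

Rejecting unknown keys by default is the important design choice here: anything not explicitly listed, such as the number of rounds or the evaluation data, stays fixed.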

How does Auto-FL turn agent-led coding into a controlled experiment workflow?

The most important shift is operational. Auto-FL turns agent-led coding into a controlled experiment workflow. The agent reads the control plane, reviews the literature, proposes a candidate, mutates only the permitted surface, runs the experiment, extracts a score, records the result, and decides whether to keep, narrow, or discard the candidate.

The control plane lives in program.md. The bundled local skill files instruct the agent in the operating rules. This keeps the human in the role of research lead: define the question, set the budget, decide which mutations are allowed, and review the ledger, while the AI agent performs the repetitive work of trying bounded candidate strategies and recording the results.

Figure 2 shows the Auto-FL research loop with literature-grounded stall recovery. The workflow starts from research intent, program.md, an active task profile, a fixed budget, and a bounded mutation surface. Candidate FLARE runs append results to results.tsv; reviewed batches are kept, narrowed, discarded, or used to select the next candidate.

When progress plateaus, the workflow enters a structured literature-review loop that performs source-backed search, extracts challenge cards, filters and scores proposal cards, logs a literature event, and returns contract-safe proposals to the same bounded experiment loop.

What is the function of literature-grounded recovery?

Auto-FL tracks performance in a ledger (results.tsv). A useful campaign should not keep making small local changes after the ledger shows that a search direction has stalled. Auto-FL therefore includes a literature-grounded recovery path for that moment.

The agent uses the ledger to summarize the current best stack, recent candidates, repeated crashes, null or worse ideas, and the active mutation contract. When the run appears to plateau, the workflow shifts from local sweeps to a source-backed literature loop. The goal is to stop guessing, identify what kind of failure mode the campaign is encountering, and return with a small set of contract-safe proposals.
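The post does not spell out how a plateau is detected, so the following is only one plausible heuristic: compare the best score in a recent window of runs against the best score seen before that window. The function name, the window size, and the `min_gain` threshold are all assumptions, and crashed runs are assumed to have been filtered out of `scores` beforehand.

```python
def has_plateaued(scores, window=5, min_gain=0.002):
    """True if the last `window` runs failed to beat the earlier best by `min_gain`."""
    if len(scores) <= window:
        return False  # not enough history to judge
    best_before = max(scores[:-window])
    best_recent = max(scores[-window:])
    return best_recent < best_before + min_gain


if __name__ == "__main__":
    # Illustrative score history in ledger order; not results from the post.
    history = [0.60, 0.65, 0.70, 0.70, 0.699, 0.701, 0.70, 0.698]
    if has_plateaued(history):
        print("search has stalled: switch to the literature-review loop")
```

A threshold like `min_gain` matters because run-to-run noise can otherwise look like progress and keep the campaign in local sweeps longer than it should be.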

In the literature loop, the agent fills a structured worksheet, searches for relevant methods, extracts challenge cards, creates proposal cards, filters duplicates and previously failed ideas, and scores proposals against expected gain, implementation risk, contract safety, evidence, novelty, and runtime cost. The selected proposals then re-enter the same bounded experiment loop: mutate only the allowed surface, run under the fixed task contract, extract a comparable score, and append the result to the ledger.
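The filtering and scoring step can be sketched as follows. The criteria come from the paragraph above, but everything else is assumed for illustration: the `Proposal` fields, the 0 to 1 rating scale, the weights, and the choice to treat contract safety as a hard filter rather than a weighted term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    name: str
    expected_gain: float        # each criterion is rated from 0 to 1
    evidence: float
    novelty: float
    implementation_risk: float
    runtime_cost: float
    contract_safe: bool


# Hypothetical weights: benefits count positively, risk and cost negatively.
WEIGHTS = {
    "expected_gain": 0.4,
    "evidence": 0.2,
    "novelty": 0.1,
    "implementation_risk": -0.2,
    "runtime_cost": -0.1,
}


def score(proposal):
    """Weighted sum of the rated criteria."""
    return sum(w * getattr(proposal, field) for field, w in WEIGHTS.items())


def select(proposals, already_tried, top_k=2):
    """Drop unsafe, duplicate, or previously failed ideas, then rank the rest."""
    seen = set(already_tried)
    eligible = []
    for p in proposals:
        if p.contract_safe and p.name not in seen:
            eligible.append(p)
            seen.add(p.name)  # later duplicates of this proposal are skipped
    return sorted(eligible, key=score, reverse=True)[:top_k]
```

For example, `select(proposals, already_tried={"fedprox_mu_sweep"})` would skip any proposal with that name, any proposal flagged as not contract-safe, and any repeated proposal, and return the two highest-scoring survivors.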

What’s included in the final Auto-FL report?

After a human manually stops an Auto-FL campaign, the reporting skill is used on the experiment branch that contains results.tsv. It creates a final progress plot, writes a report, and commits the reporting artifacts.

That final report is the bridge between autonomous iteration and researcher review. It summarizes the baseline and best score, absolute and relative lift, runtime cost, final stack, crash notes, null or worse ideas, and recommended next-step experiments. In the Auto-FL loop, discarded candidates stay visible in the committed ledger, while kept code changes are committed on the experiment branch. The agent and human researcher can use that memory to avoid trying the same low-value idea again.
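Several of the headline numbers in such a report are simple functions of the ledger. The sketch below shows one way to compute them. It is not the reporting skill itself: the row keys (`score`, `status`, `runtime_s`) and the status values are hypothetical, and the score is assumed to be higher-is-better with a nonzero baseline.

```python
def summarize(baseline_score, rows):
    """Summarize a campaign from ledger rows (dicts with score, status, runtime_s)."""
    kept = [r for r in rows if r["status"] == "keep"]
    best = max((r["score"] for r in kept), default=baseline_score)
    absolute_lift = best - baseline_score
    return {
        "baseline": baseline_score,
        "best": best,
        "absolute_lift": absolute_lift,
        "relative_lift_pct": 100.0 * absolute_lift / baseline_score,
        "total_runtime_s": sum(r["runtime_s"] for r in rows),
        "num_discarded": sum(r["status"] == "discard" for r in rows),
        "num_crashed": sum(r["status"] == "crash" for r in rows),
    }


if __name__ == "__main__":
    # Illustrative rows only; not results from the post.
    rows = [
        {"score": 0.60, "status": "keep", "runtime_s": 10.0},
        {"score": 0.55, "status": "discard", "runtime_s": 12.0},
        {"score": 0.00, "status": "crash", "runtime_s": 3.0},
        {"score": 0.75, "status": "keep", "runtime_s": 11.0},
    ]
    print(summarize(0.50, rows))
```

Counting discarded and crashed runs alongside the lift makes the cost of the search visible, not just its best result.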

How can you adapt Auto-FL to your datasets and tasks?

Beyond the default CIFAR-10 simulation, the Auto-FL pattern is highly adaptable. By decoupling the primary control plane from the task profile—which specifies the dataset, metrics, and mutation constraints—researchers can apply the same autonomous experiment discipline to various model families without rebuilding the underlying harness.

To demonstrate this flexibility, the example includes a medical visual language model (VLM) task. It integrates a federated Qwen3-VL LoRA training workflow into the NVIDIA FLARE Client API and Recipe API. The setup simulates three distinct medical data sites: VQA-RAD, SLAKE, and PathVQA. Training is restricted to the LoRA adapters, and evaluation uses token-level F1.
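For readers unfamiliar with the metric, token-level F1 is commonly computed as the harmonic mean of token precision and recall between a predicted answer and a reference answer. The sketch below follows that common formulation with lowercasing and whitespace tokenization. The exact normalization used in the example is not described in this post, so treat those details as assumptions.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two strings, using lowercase whitespace tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision 2/3 and recall 1 for this pair of answers.
    print(token_f1("the left lung", "left lung"))
```

Unlike exact match, this metric gives partial credit when a free-form answer contains the reference tokens plus a few extra words, which suits open-ended visual question answering.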

Again, the task profile is intentionally bounded. It fixes the site mapping, prompt and evaluation semantics, model reference, adapter rank, data limits, number of rounds, seed policy, final evaluation clients, and runtime cap. Within this contract, the agent can explore task-safe choices such as learning rate, local optimizer steps, site-specific learning-rate scaling, gradient accumulation, FedProx-style regularization, and LoRA aggregation variants.
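As an illustration of one of these choices, FedProx adds a proximal term, (mu / 2) * ||w - w_global||^2, to each client's local objective so that local weights do not drift far from the current global model. The sketch below shows a single local SGD step with that term on plain Python lists. It is not the code from the example: the function name and the list-based weights are assumptions, and a real client would apply the same idea to framework tensors.

```python
def fedprox_step(weights, grads, global_weights, lr, mu):
    """One local SGD step with the FedProx proximal term added to the loss.

    The gradient of (mu / 2) * ||w - w_global||^2 with respect to w is
    mu * (w - w_global), which is added to the task gradient.
    """
    return [
        w - lr * (g + mu * (w - wg))
        for w, g, wg in zip(weights, grads, global_weights)
    ]


if __name__ == "__main__":
    # With zero task gradient, the step only pulls weights toward the global model.
    print(fedprox_step([1.0, 2.0], [0.0, 0.0], [0.0, 2.0], lr=0.5, mu=1.0))
```

Setting `mu` to zero recovers plain local SGD, which is why the coefficient works well as a bounded, single-parameter mutation.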

Using the same Auto-FL skills and main entry point, the agent can improve results for this task profile relative to zero-shot and baseline performance, as shown in Figure 4. Bars show token-F1 on each dataset test split. The Auto-FL gains are concentrated on the harder out-of-distribution sites rather than being uniform across datasets.

Get started with NVIDIA FLARE Auto-FL

Use the Auto-FL research example as a starting point rather than a fixed template. Start by running the baseline and inspecting the generated ledger. Then adapt the mutation surface and scoring contract to your own FL question, dataset, and task. The pattern is portable: keep the budget fixed, keep the metric comparable, and make the mutation surface explicit. To adapt the concept to other scenarios, adjust the task-specific scripts, such as client.py and job.py, and let the task profile and mutation schema define the task details.

Auto-FL with coding agents is not magic. It is a practical scaffold for asking better FL research questions faster. The value comes from the structure around the agent: a control plane, a dedicated literature-review loop, a safe mutation surface, a fixed budget, a comparable score, and a ledger that records every candidate. With those pieces in place, agents can take on much of the repetitive work of FL experimentation while preserving the comparability and reproducibility researchers need.

For more details, see our paper, Auto-FL-Research: Agentic Search for Federated Learning Algorithms.


About the Authors

About Holger Roth
Holger Roth is a principal federated learning scientist at NVIDIA, specializing in developing distributed and collaborative software and models for various industries using federated learning and analytics. He has been exploring the topic from both theoretical and practical standpoints. During the COVID-19 pandemic, he led the experimentation of a federated learning study involving twenty hospitals around the globe to train more generalizable models for predicting clinical outcomes in symptomatic patients. His other research interests include computer-assisted annotation, active learning, and natural language processing. He served as an Associate Editor for IEEE Transactions on Medical Imaging and holds a PhD from University College London, UK. In 2018, he was awarded the MICCAI Young Scientist Publication Impact Award.


About Ziyue Xu
Ziyue Xu is a senior scientist at NVIDIA. His research interests lie in the areas of medical image analysis, computer vision, and federated learning. He has been working on collaborative AI development over the years along with fellow researchers and clinicians. Ziyue received his B.S. from Tsinghua University in 2006, and M.S. and Ph.D. from the University of Iowa in 2009 and 2012, respectively. He is an IEEE Senior Member, Area Chair for major conferences, and Associate Editor for several journals, including IEEE Transactions on Medical Imaging (TMI) and International Journal of Computer Vision (IJCV).


About Chester Chen
Chester Chen is a senior manager on the federated learning engineering team at NVIDIA. He has over 20 years of experience in building and managing different types of systems and operations. Before NVIDIA, he spent six years as the director of data science engineering at GoPro, where he was in charge of data lake infrastructure, data engineering, data analytics, and machine learning applications. Before GoPro, he played many different roles, including director of engineering, technical director, and system architect, at many different big companies and small startups in Silicon Valley.


About Peter Cnudde
Peter Cnudde is director of engineering for federated learning at NVIDIA, where he works with customers, partners, and research teams to enable real-world federated learning with NVIDIA FLARE. He has more than 30 years of experience building distributed systems. Before joining NVIDIA, he was vice president of engineering at Yahoo, responsible for Yahoo's big data and machine learning platforms. Earlier in his career, he worked in wireless telecommunications, including roles at Alcatel and RF Micro Devices. He holds a master's degree in electrotechnical engineering from the University of Ghent in Belgium.
