Federated learning (FL) research often begins with a deceptively simple question: What should we try next? A new aggregation rule, a FedProx coefficient, a server optimizer setting, a SCAFFOLD variant, or a model architecture tweak may all look promising before an experiment starts.
After the run finishes, the harder questions begin: Did the change actually improve the metric? Was the comparison fair? Was the lift worth the runtime? Should the idea be kept, narrowed, or discarded?
This post introduces a new NVIDIA FLARE example that shows how bounded AI agent actions, fixed benchmark contracts, experiment ledgers, literature-grounded recovery, and reproducible reporting can help FL researchers evaluate more ideas more quickly.
What is Auto-FL in NVIDIA FLARE?
NVIDIA FLARE Auto-FL is an automated, AI-driven research loop designed to test and optimize federated learning strategies.
The idea is straightforward: start with a comparable benchmark task, give the agent a clear research control plane, set a fixed training budget, constrain the mutation surface, and record every result in an experiment ledger. From there, the agent can autonomously iterate through candidate FL strategies while preserving the FLARE Client API and Recipe API contracts.
Rather than handing an agent an open-ended research problem, Auto-FL begins with a fair, comparable benchmark: a bounded FL simulation with a fixed training budget and consistent scoring. From that shared baseline, the agent can explore candidate FL strategies within a structured workflow that maintains the protocol’s stability, keeps comparisons measurable, and traces results.
A useful agent-led experiment loop should be constrained enough to avoid breaking the FL contract, measurable enough to compare ideas, stable enough for long-running autonomous campaigns, and detailed enough to turn a completed Auto-FL campaign into a reproducible, sourced report—not just a directory full of logs.
Figure 1 shows the progress of an Auto-FL campaign from the NVIDIA FLARE CIFAR-10 simulation harness. Each point represents a candidate run recorded in the experiment ledger: gray points are discarded runs, blue points are active candidates, and green points are kept runs. The green step line tracks the best observed cross-site evaluation score over time, and purple lines indicate logged literature-review events.

How does Auto-FL make the research loop explicit?
Coding agents are useful for quickly making complex code changes. FL experiments differ from ordinary local model tuning because the correctness of the experiment depends on a contract among the server, clients, model updates, metadata, data splits, and evaluation logic. A candidate can raise the reported score while quietly changing what is being compared—for example, by altering the evaluation data, model capacity, communication budget, local compute, or server-client update semantics.
Auto-FL makes the research loop explicit. The agent begins with program.md, which acts as the control plane. It then proposes a bounded change, runs the same benchmark budget, extracts a comparable score, appends the result to results.tsv, and uses the ledger to decide which candidate run to keep or discard. The human can interrupt the campaign at any point and analyze the experiment history.
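The record-and-decide step can be sketched in a few lines. This is a minimal illustration, not the example's actual implementation: the column names, the `keep`/`discard` status labels, and the "higher score wins" rule are assumptions based on the ledger fields described in this post.

```python
import csv
import io

# Hypothetical column names, loosely following the ledger fields named in this post.
FIELDS = ["score", "runtime_s", "status", "target", "description", "artifacts"]


def best_kept_score(rows):
    """Return the best score among rows marked 'keep', or None if there are none."""
    kept = [float(r["score"]) for r in rows if r["status"] == "keep"]
    return max(kept) if kept else None


def record_candidate(rows, score, runtime_s, target, description, artifacts=""):
    """Append one candidate to the in-memory ledger and return its status.

    Assumption: a higher score is better, and a candidate is kept only if it
    beats the best kept score so far.
    """
    best = best_kept_score(rows)
    status = "keep" if best is None or score > best else "discard"
    rows.append({
        "score": score,
        "runtime_s": runtime_s,
        "status": status,
        "target": target,
        "description": description,
        "artifacts": artifacts,
    })
    return status


def to_tsv(rows):
    """Serialize the ledger as tab-separated text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Illustrative values only; these are not results from a real campaign.
ledger = []
record_candidate(ledger, 0.712, 340.0, "baseline", "FedAvg baseline")
record_candidate(ledger, 0.705, 352.0, "server_lr", "lower server learning rate")
record_candidate(ledger, 0.731, 348.0, "aggregator", "FedAdam server update")
print(to_tsv(ledger))
```

Discarded rows stay in the ledger rather than being dropped, so the history of what was tried remains available for later review.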
What components does Auto-FL provide?
Auto-FL packages the components needed to run that operating model in a single place. It includes a ready-to-run experimental harness within a task profile: FLARE baseline recipes in job.py, a Client API training loop in client.py, custom FL aggregation hooks, additional model and training utilities, and mutation guardrails. The package also includes run scripts, plotting utilities, templates, and a reporting skill for completed campaigns.
A task profile can define a supported strategy surface with FedAvg, FedOpt-style server updates, FedAdam, SCAFFOLD, median aggregation, and FedProx hooks. Auto-FL can also support bounded architecture search. That matters because architecture search can otherwise turn a comparison of federated algorithms into an uncontrolled model-capacity comparison.
| Component | Category | Role |
| --- | --- | --- |
| program.md | Main entry point | Agent-facing research control plane |
| job.py and client.py | Task profile | FLARE Recipe API and Client API harness for FL experimentation |
| custom_aggregators.py | Task profile | FedAvg, FedOpt/FedAdam, SCAFFOLD, median, and related hooks |
| mutation_schema.yaml | Task profile | Bounded mutation surface for agent changes |
| results.tsv | Ledger | Experiment ledger for score, runtime, status, target, description, and artifacts |
| plot_progress.py | Utility | Progress plot generated from the ledger |
| autofl-nvflare | Skill | NVFlare-based Auto-FL harness that follows an autoresearch-style loop |
| autofl-nvflare-report | Skill | Post-campaign reporting flow for stopped runs |
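To make the difference between two of the listed aggregation strategies concrete, here is a small sketch that compares a FedAvg-style weighted mean with a coordinate-wise median on flat lists of numbers. It is a simplified illustration only: the function names are hypothetical, and the hooks in custom_aggregators.py may be organized quite differently (for example, operating on model state dictionaries).

```python
from statistics import median


def weighted_mean(updates, weights):
    """FedAvg-style aggregation: average each coordinate, weighted per client
    (for example, by the number of local training samples)."""
    total = sum(weights)
    dim = len(updates[0])
    return [
        sum(w * u[i] for u, w in zip(updates, weights)) / total
        for i in range(dim)
    ]


def coordinate_median(updates):
    """Median aggregation: take the median of each coordinate across clients."""
    return [median(column) for column in zip(*updates)]


# Three toy client updates; the third one is an outlier.
updates = [[1.0, 2.0], [1.2, 1.8], [9.0, -7.0]]
weights = [100, 100, 100]

print("weighted mean:", weighted_mean(updates, weights))
print("coordinate median:", coordinate_median(updates))
```

The weighted mean is pulled toward the outlier client, while the coordinate-wise median stays close to the two agreeing clients. That robustness is the usual motivation for median-style aggregation.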
How does Auto-FL turn agent-led coding into a controlled experiment workflow?
The most important shift is operational. Auto-FL turns agent-led coding into a controlled experiment workflow. The agent reads the control plane, reviews the literature, proposes a candidate, mutates only the permitted surface, runs the experiment, extracts a score, records the result, and decides whether to keep, narrow, or discard the candidate.
The control plane lives in program.md. The bundled local skill files instruct the agent in the operating rules. This keeps the human in the role of research lead: define the question, set the budget, decide which mutations are allowed, and review the ledger, while the AI agent performs the repetitive work of trying bounded candidate strategies and recording the results.
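A bounded mutation surface can be enforced with a simple check before any candidate runs. The sketch below is an assumption about how such a guard could look: the parameter names, ranges, and dictionary layout are hypothetical, and the real mutation_schema.yaml may be structured differently.

```python
# Hypothetical mutation surface, written as a Python dict for illustration.
# The real mutation_schema.yaml may use different keys and structure.
ALLOWED = {
    "lr": {"min": 1e-4, "max": 1e-1},
    "local_epochs": {"min": 1, "max": 4},
    "aggregator": {"choices": ["fedavg", "fedadam", "scaffold", "median"]},
}


def validate_mutation(mutation, allowed=ALLOWED):
    """Return a list of violations; an empty list means the proposed change
    stays inside the permitted surface."""
    problems = []
    for key, value in mutation.items():
        rule = allowed.get(key)
        if rule is None:
            problems.append(f"{key}: not part of the mutation surface")
        elif "choices" in rule:
            if value not in rule["choices"]:
                problems.append(f"{key}: {value!r} is not an allowed choice")
        elif not (rule["min"] <= value <= rule["max"]):
            problems.append(f"{key}: {value!r} is outside [{rule['min']}, {rule['max']}]")
    return problems


print(validate_mutation({"lr": 0.01, "aggregator": "median"}))
print(validate_mutation({"num_rounds": 50, "lr": 1.0}))
```

In this sketch, a change to `num_rounds` is rejected because the training budget is part of the fixed contract, not the mutation surface.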
Figure 2 shows the Auto-FL research loop with literature-grounded stall recovery. The workflow starts from research intent, program.md, an active task profile, a fixed budget, and a bounded mutation surface. Candidate FLARE runs append results to results.tsv; reviewed batches are kept, narrowed, discarded, or used to select the next candidate.
When progress plateaus, the workflow enters a structured literature-review loop that performs source-backed search, extracts challenge cards, filters and scores proposal cards, logs a literature event, and returns contract-safe proposals to the same bounded experiment loop.

What is the function of literature-grounded recovery?
Auto-FL tracks performance in a ledger (results.tsv). A useful campaign should not keep making small local changes after the ledger shows that a search direction has stalled. Auto-FL includes a literature-grounded recovery path for that moment.
The agent uses the ledger to summarize the current best stack, recent candidates, repeated crashes, null or worse ideas, and the active mutation contract. When the run appears to plateau, the workflow shifts from local sweeps to a source-backed literature loop. The goal is to stop guessing, identify what kind of failure mode the campaign is encountering, and return with a small set of contract-safe proposals.
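This post does not specify how a plateau is detected. One simple criterion, sketched here as an assumption, is to check whether the most recent runs have failed to beat the earlier best score by a minimum margin. The `patience` and `min_delta` parameters are hypothetical.

```python
def has_plateaued(scores, patience=5, min_delta=0.001):
    """Return True if none of the last `patience` scores beats the best
    earlier score by at least `min_delta`.

    Assumptions: `scores` is in chronological order, higher is better, and
    crashed runs have already been filtered out.
    """
    if len(scores) <= patience:
        return False  # not enough history to judge
    best_before = max(scores[:-patience])
    best_recent = max(scores[-patience:])
    return best_recent < best_before + min_delta


# Illustrative score histories, not real campaign results.
stalled = [0.70, 0.72, 0.73, 0.729, 0.730, 0.728]
improving = [0.70, 0.71, 0.72, 0.73, 0.74, 0.75]

print(has_plateaued(stalled, patience=3))
print(has_plateaued(improving, patience=3))
```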
In the literature loop, the agent fills a structured worksheet, searches for relevant methods, extracts challenge cards, creates proposal cards, filters duplicates and previously failed ideas, and scores proposals against expected gain, implementation risk, contract safety, evidence, novelty, and runtime cost. The selected proposals then re-enter the same bounded experiment loop: mutate only the allowed surface, run under the fixed task contract, extract a comparable score, and append the result to the ledger.
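The filter-and-score step could be sketched as a weighted sum over the criteria named above. The weights, the 0 to 5 rating scale, and the card fields are illustrative assumptions, not the scoring scheme used by the Auto-FL skill.

```python
# Hypothetical weights: benefits count positively, risk and cost negatively.
WEIGHTS = {
    "expected_gain": 3,
    "contract_safety": 3,
    "evidence": 2,
    "novelty": 1,
    "implementation_risk": -2,
    "runtime_cost": -1,
}


def score_proposal(card):
    """Weighted sum of a card's ratings (each assumed to be on a 0-5 scale)."""
    return sum(weight * card[name] for name, weight in WEIGHTS.items())


def rank_proposals(cards, failed_ideas):
    """Drop duplicates and previously failed ideas, then sort best first."""
    seen = set(failed_ideas)
    fresh = []
    for card in cards:
        if card["idea"] in seen:
            continue
        seen.add(card["idea"])
        fresh.append(card)
    return sorted(fresh, key=score_proposal, reverse=True)


# Illustrative proposal cards with made-up ratings.
cards = [
    {"idea": "server momentum", "expected_gain": 4, "contract_safety": 5,
     "evidence": 4, "novelty": 2, "implementation_risk": 2, "runtime_cost": 1},
    {"idea": "client lr decay", "expected_gain": 2, "contract_safety": 5,
     "evidence": 3, "novelty": 3, "implementation_risk": 1, "runtime_cost": 1},
    {"idea": "larger model", "expected_gain": 5, "contract_safety": 1,
     "evidence": 3, "novelty": 1, "implementation_risk": 4, "runtime_cost": 5},
    {"idea": "server momentum", "expected_gain": 4, "contract_safety": 5,
     "evidence": 4, "novelty": 2, "implementation_risk": 2, "runtime_cost": 1},
]

ranked = rank_proposals(cards, failed_ideas=["larger model"])
for card in ranked:
    print(card["idea"], score_proposal(card))
```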
What’s included in the final Auto-FL report?
After a human manually stops an Auto-FL campaign, the reporting skill is used on the experiment branch that contains results.tsv. It creates a final progress plot, writes a report, and commits the reporting artifacts.
That final report is the bridge between autonomous iteration and researcher review. It summarizes the baseline and best score, absolute and relative lift, runtime cost, final stack, crash notes, null or worse ideas, and recommended next-step experiments. In the Auto-FL loop, discarded candidates stay visible in the committed ledger, while kept code changes are committed on the experiment branch. The agent and human researcher can use that memory to avoid trying the same low-value idea again.
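The headline numbers in such a report follow from simple arithmetic on the ledger. The helper below is a hypothetical sketch of that calculation, with illustrative inputs and not measured results.

```python
def summarize_campaign(baseline_score, best_score, baseline_runtime_s, best_runtime_s):
    """Compute absolute lift, relative lift, and runtime ratio versus the baseline."""
    absolute_lift = best_score - baseline_score
    return {
        "absolute_lift": absolute_lift,
        "relative_lift": absolute_lift / baseline_score,
        "runtime_ratio": best_runtime_s / baseline_runtime_s,
    }


# Illustrative inputs only.
summary = summarize_campaign(0.70, 0.77, 300.0, 330.0)
print(f"absolute lift: {summary['absolute_lift']:+.3f}")
print(f"relative lift: {summary['relative_lift']:+.1%}")
print(f"runtime ratio: {summary['runtime_ratio']:.2f}x")
```

Reporting the runtime ratio next to the lift helps answer whether an improvement was worth its cost.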
How can you adapt Auto-FL to your datasets and tasks?
Beyond the default CIFAR-10 simulation, the Auto-FL pattern is highly adaptable. By decoupling the primary control plane from the task profile—which specifies the dataset, metrics, and mutation constraints—researchers can apply the same autonomous experiment discipline to various model families without rebuilding the underlying harness.
To demonstrate this flexibility, the example also includes a medical vision language model (VLM) task that integrates a federated Qwen3-VL LoRA training workflow into the NVIDIA FLARE Client API and Recipe API. The setup simulates three distinct medical data sites: VQA-RAD, SLAKE, and PathVQA. The federated training focuses on LoRA adapters, and evaluation uses token-level F1.
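Token-level F1 is commonly computed as the harmonic mean of token precision and recall between a predicted answer and a reference answer. The sketch below uses lowercasing and whitespace tokenization as an assumption. The exact normalization used in the example's evaluation code is not described in this post and may differ.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-level F1 between two answer strings.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Two empty answers match; one empty answer scores zero.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(token_f1("left lung", "left lung"))
print(token_f1("the left lung", "left lung"))
print(token_f1("yes", "no"))
```

Unlike exact match, this metric gives partial credit when a free-text answer contains the right words plus extra tokens.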

Again, the task profile is intentionally bounded. It fixes the site mapping, prompt and evaluation semantics, model reference, adapter rank, data limits, number of rounds, seed policy, final evaluation clients, and runtime cap. Within this contract, the agent can explore task-safe choices such as learning rate, local optimizer steps, site-specific learning-rate scaling, gradient accumulation, FedProx-style regularization, and LoRA aggregation variants.
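As one example of a task-safe knob, FedProx-style regularization adds a proximal term, (mu / 2) * ||w - w_global||^2, to each client's local objective so that local weights do not drift too far from the global model. The sketch below shows that term and its gradient on flat lists of numbers. The function names are hypothetical, and this is not the example's training code.

```python
def fedprox_penalty(local_weights, global_weights, mu):
    """Proximal term: (mu / 2) * squared distance to the global weights."""
    return 0.5 * mu * sum(
        (w - g) ** 2 for w, g in zip(local_weights, global_weights)
    )


def fedprox_gradient(task_grad, local_weights, global_weights, mu):
    """Add the proximal gradient, mu * (w - w_global), to the task gradient."""
    return [
        grad + mu * (w - g)
        for grad, w, g in zip(task_grad, local_weights, global_weights)
    ]


local = [1.0, 2.0]
global_w = [0.0, 0.0]

print(fedprox_penalty(local, global_w, mu=0.5))
print(fedprox_gradient([0.1, -0.2], local, global_w, mu=0.5))
```

Setting `mu` to zero recovers plain local training, which is why the coefficient is a natural candidate for a bounded sweep.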
Using the same Auto-FL skills and main entry point, the agent can improve on both zero-shot and baseline performance for this task profile, as shown in Figure 4. Bars show token-F1 on each dataset test split. The Auto-FL gains are concentrated on the harder out-of-distribution sites rather than being uniform across datasets.

Get started with NVIDIA FLARE Auto-FL
Use the Auto-FL research example as a starting point rather than a fixed scaffold. Start by running the baseline and inspecting the generated ledger. Then adapt the mutation surface and scoring contract to your own FL question, dataset, and task. The pattern is portable: keep the budget fixed, keep the metric comparable, and make the mutation surface explicit. To adapt the concept to other scenarios, adjust the task-specific scripts, such as client.py and job.py, and let the task profile and mutation schema define the task details.
Auto-FL with coding agents is not magic. It is a practical scaffold for asking better FL research questions faster. The value comes from the structure around the agent: a control plane, a dedicated literature-review loop, a safe mutation surface, a fixed budget, a comparable score, and a ledger that records every candidate. With those pieces in place, agents can take on much of the repetitive work of FL experimentation while preserving the comparability and reproducibility researchers need.