
Train Small Orchestration Agents to Solve Big Problems


Choosing the right tool and model for a task is a challenging and ever-present engineering problem in agent design. At NVIDIA Research, we’re making fast progress toward automating it away with an approach that trains a separate model, which we call an “orchestrator”, to act as a supervisor over all the other models and tools.

The orchestrator’s job is to consider the task in the context of user preferences (do they want the result fast, cheap, with the highest level of accuracy possible, or some combination of these?) and then manage other models and call on tools in the task-solving conversation to reach the goal. Crucially, as it turns out, small models are already powerful enough to handle this burden if tuned appropriately.
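In code, you can picture the orchestrator as the agent’s control loop. The sketch below illustrates that loop under stated assumptions: Preference, next_step, and the tools registry are hypothetical names used for exposition, not part of the ToolOrchestra API.

from dataclasses import dataclass

@dataclass
class Preference:
    accuracy: float  # weight on getting the answer right
    cost: float      # weight on keeping dollar cost low
    latency: float   # weight on answering quickly

def orchestrate(task, pref, orchestrator, tools, max_turns=20):
    # Alternate between reasoning and tool calls until the task is solved
    # or the turn budget runs out (compare with Figure 1).
    history = [("user", task), ("system", f"preferences: {pref}")]
    for _ in range(max_turns):
        step = orchestrator.next_step(history)     # decide: answer, or call a tool/model
        if step.kind == "answer":
            return step.content
        result = tools[step.tool](step.arguments)  # basic tool, specialized or generalist LLM
        history.append(("tool", result))
    return None  # turn budget exhausted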

While it may be surprising to place large models under the direction of a small one, the arrangement plays to their respective strengths: precisely because of their limited size, small models are unburdened by excessive knowledge and can be trained to capture the essence of problem-solving.

To build orchestrators, we introduce ToolOrchestra, our flagship method, which involves data preparation, synthetic data generation, multi-objective reinforcement-learning training, and comprehensive evaluation of orchestration methods and models.

Figure 1. Overview of the orchestrator: given a task, it alternates between reasoning and tool calling over multiple turns, invoking basic tools, specialized LLMs, and generalist LLMs while optimizing for outcome, efficiency, and cost preference through reinforcement learning

Why train an orchestrator?

You might be wondering: “Using an orchestrator is an intriguing concept, but why should I train a model for it? Wouldn’t it be enough to just edit my agent’s prompts so it acts as an orchestrator?” The short answer is no. The reason ToolOrchestra-trained orchestrators beat other methods lies in the training objectives. During training, the orchestrator generates experimental trajectories. Some solve the problem better than others. Some reach the correct solution cheaply and quickly, while others make extensive use of expensive tools and take far longer to reach a conclusion. ToolOrchestra’s reinforcement-learning setup explicitly rewards high problem-solving accuracy, low cost, and short time-to-solution according to the cost preferences for the given problem.
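As a concrete illustration, a multi-objective reward of this kind might combine the three signals as follows. The weighting scheme and normalization constants below are illustrative assumptions, not the exact formulation from the paper; pref carries the same per-problem Preference weights as in the earlier sketch.

def trajectory_reward(solved, cost_usd, latency_s, pref):
    # Reward correct outcomes; penalize dollar cost and wall-clock latency
    # in proportion to the preference weights attached to the problem.
    outcome = 1.0 if solved else 0.0
    cost_penalty = min(cost_usd / 50.0, 1.0)       # illustrative normalizer, not from the paper
    latency_penalty = min(latency_s / 600.0, 1.0)  # illustrative normalizer, not from the paper
    return (pref.accuracy * outcome
            - pref.cost * cost_penalty
            - pref.latency * latency_penalty)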

What are the results of using an orchestrator?

To demonstrate the effectiveness of ToolOrchestra, we trained a small model, Orchestrator-8B, to tackle some of the most difficult tasks available, including problems from Humanity’s Last Exam (HLE), FRAMES, and τ²-Bench.

We then gave out-of-the-box monolithic LLMs, prompted orchestrators running on frontier LLMs, and Orchestrator-8B access to the same tools and measured their performance. The outcome is shown in Table 1: Orchestrator-8B outperforms all of its competitors, regardless of their size or advertised capabilities, while incurring the lowest cost and problem-solving latency.

| Tools | Model(s) | HLE (↑) | FRAMES (↑) | τ²-Bench (↑) | Cost ($, ↓) | Latency (↓) |
|---|---|---|---|---|---|---|
| Existing reported SOTA | GPT-5 | 35.2 | 84.2‡ | – | – | – |
| | o3 | 24.3 | 68.4 | – | – | – |
| | GPT-4o | 5.3 | 43.8 | – | – | – |
| No tool | Qwen3-8B | 3.2 | 24.2 | –* | 0.2 | 0.6 |
| | Llama-Nemotron-49B | 3.6 | 25.6 | –* | 0.4 | 1.1 |
| | Llama-3.3-70B | 3.8 | 32.4 | –* | 0.5 | 1.4 |
| | Qwen3-235B-A22B | 5.2 | 34.3 | –* | 2.6 | 3.3 |
| | Claude Opus 4.1 | 11.7 | 58.2 | –* | 27.4 | 8.2 |
| | GPT-5 | 23.4 | 66.3 | –* | 6.2 | 4.1 |
| Basic tools | Qwen3-8B | 4.7 | 26.5 | 40.7 | 1.3 | 2.2 |
| | Llama-Nemotron-49B | 6.8 | 28.2 | 23.2 | 2.5 | 3.5 |
| | Llama-3.3-70B | 4.6 | 42.3 | 17.6 | 2.8 | 4.3 |
| | Qwen3-235B-A22B | 14.0 | 39.5 | 52.9 | 12.3 | 10.2 |
| | Claude Opus 4.1 | 19.8 | 63.5 | 46.0 | 76.2 | 32.5 |
| | GPT-5 | 35.1 | 74.0 | 77.7 | 30.2 | 19.8 |
| Basic tools, specialized LLMs, generalist LLMs | Qwen3-8B | 30.6 | 68.9 | 72.3 | 27.6 | 18.3 |
| | Llama-Nemotron-49B | 25.8 | 57.9 | 66.7 | 25.6 | 17.1 |
| | Llama-3.3-70B | 19.7 | 52.4 | 55.8 | 19.7 | 13.4 |
| | Qwen3-235B-A22B | 32.8 | 74.2 | 75.6 | 29.7 | 21.2 |
| | Claude Opus 4.1 | 34.6 | 72.8 | 76.8 | 52.5 | 25.6 |
| | GPT-5 | 21.2 | 57.5 | 62.3 | 17.8 | 13.6 |
| | Orchestrator-8B | 37.1 | 76.3 | 80.2 | 9.2 | 8.2 |

*τ²-Bench requires tool calling, so models without tool access cannot be scored on it.

Table 1. A comparison of Orchestrator-8B with baselines

To drive home Orchestrator-8B’s efficiency, we measured the accuracy and cost of leading frontier models and Orchestrator-8B while restricting each system’s reasoning and acting to 10, 20, 50, and 100 conversational turns. The outcome is visualized in Figure 2: regardless of the turn limit imposed, Orchestrator-8B always outperforms its competition while maintaining a lower dollar cost.

Figure 2. Orchestrator-8B compared with several advanced LLMs in terms of cost ($) and HLE accuracy (%): Orchestrator-8B achieves higher accuracy than other models at the same cost and maintains the same quality at a lower cost
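If you want to reproduce this style of measurement for your own agent, the harness below sketches the idea: cap the number of turns, then record accuracy and total cost at each cap. Here run_agent is a hypothetical entry point standing in for your agent’s run loop, not part of our released code.

def evaluate_with_turn_caps(problems, run_agent, caps=(10, 20, 50, 100)):
    # run_agent(problem, max_turns) is assumed to return
    # (is_correct, cost_usd) for a single problem.
    results = {}
    for cap in caps:
        correct, total_cost = 0, 0.0
        for problem in problems:
            is_correct, cost_usd = run_agent(problem, max_turns=cap)
            correct += int(is_correct)
            total_cost += cost_usd
        results[cap] = {"accuracy": correct / len(problems), "cost_usd": total_cost}
    return results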

How to train an orchestrator?

To train an orchestrator for your own purposes while following the ToolOrchestra method, you’ll need a model, some data, and our training code.

To show how little is needed to build an orchestrator for challenging tasks, such as the hard benchmarks we tested Orchestrator-8B on, we used Qwen3-8B as our underlying model, generated only 552 synthetic problems, and used only 1,296 prompts in training.

Step 1: Choose the underlying model

The choice of model to train into an effective orchestrator is entirely up to you. We recommend picking the smallest language model aligned with the nature of your agent. NVIDIA Nemotron Nano, the Qwen3 family, and the xLAM family are just a few of the options.
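For example, loading a base model with Hugging Face Transformers takes only a few lines. Qwen/Qwen3-8B is the checkpoint Orchestrator-8B started from; substitute whichever small model fits your agent.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen3-8B is the base we used for Orchestrator-8B; any small model works.
model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)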

Step 2: Prepare and generate data

The good news about the data for ToolOrchestra is that you really don’t need much to get started: the method assumes that most of the data will be synthetically generated. We describe the data generation process in detail in our paper. In broad terms, you start with a description or a few examples of your agent solving problems with its preferred tools, then use large models to generate many more similar synthetic tasks.

The following is a sketch of the code that can be used to generate samples similar to the ones used to train Orchestrator-8B. 

def generate_samples(domain):
    # Each helper below is part of ToolOrchestra's data-generation pipeline.
    subjects = generate_subjects(domain)       # subject areas within the domain
    schema = generate_schema(subjects)         # a data schema covering those subjects
    data_model = generate_datamodel(schema)    # a concrete data model for the schema
    database = generate_database(domain, schema, data_model)  # a populated synthetic database
    tools = generate_tools(domain, database)   # tools that operate on the database
    tasks = generate_tasks(database, tools)    # tasks solvable with those tools
    return tasks

samples = generate_samples("retail")  # pass the domain your agent operates in
...

You can jump right in and experience the real data generation magic.

Step 3: Start training

Once equipped with your model choice and some data, you can directly use or adapt ToolOrchestra’s released code to train your own orchestrator. The following sketch can get you started; more details are in the repository README.

# RewardManager and RayTrainer ship with ToolOrchestra's released training code;
# DataLoader is the standard PyTorch loader.
from torch.utils.data import DataLoader

train_dataset = prepare_data(raw_examples, tools)  # pair raw examples with tool specs
train_dataloader = DataLoader(train_dataset)
reward_model = RewardManager(config)               # scores trajectories on outcome, cost, and latency
trainer = RayTrainer(config, reward_model)         # distributed RL training on Ray workers
trainer.init_workers()
trainer.start()
...

You can kick off your own training run and watch your orchestrator come to life! 

Step 4: Visualize your progress

ToolOrchestra’s training code supports direct logging through wandb. The following shows example visualizations from Orchestrator-8B’s runs.

Figure 3. Training loss and critic score of Orchestrator-8B: the actor policy-gradient loss decreases and stabilizes around -2.5 over 150 steps, while the critic mean score increases and plateaus around 2.0, indicating convergence
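To get similar charts for your own runs, initialize a wandb run and log the trainer’s metrics at each step. The snippet below uses placeholder values and illustrative metric names; in a real run, the numbers come from the trainer.

import random
import wandb

wandb.init(project="toolorchestra", name="orchestrator-8b-demo")
for step in range(150):
    # Placeholder values; a real run logs the trainer's actual metrics.
    wandb.log({
        "actor/pg_loss": -2.5 + random.uniform(-0.2, 0.2),
        "critic/mean_score": 2.0 + random.uniform(-0.1, 0.1),
    }, step=step)
wandb.finish()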

The benefits of orchestration

Engineering efficient, high-performance agents today involves a constant struggle to balance capability and cost. Developers must manually weigh every choice (model size, tool use, query length, reasoning depth), knowing that one wrong call can push costs skyward or compromise the quality of the result. This complexity scales unforgivingly as the number of queries grows, making cost-aware agent optimization one of the most challenging and time-intensive aspects of building real-world AI systems.

ToolOrchestra changes that. By training small orchestrators to direct large models and tools with surgical precision, calling on them only as needed, we automate this balancing act in a way that outperforms monolithic LLMs and prompted orchestrator setups across accuracy, latency, and dollar cost.

Orchestrator-8B, the model we trained to demonstrate the method, is concrete evidence that the right strategy can beat brute model-size scaling or prompt-engineering dexterity. It delivers state-of-the-art performance on hard benchmarks while using resources far more efficiently. In short, orchestration enables agents to be both powerful and nimble.

Looking ahead: The rise of compound AI systems

It has been the dominant paradigm of the AI sphere over the past few years that intelligence is first built into large foundation models by training and then specialized for real-world use cases through in-context learning. This belief is increasingly under attack, as the AI community continues to produce more and more examples of compound AI systems outperforming monolithic LLMs while being safer, faster, and more cost-effective.

ToolOrchestra represents our first step toward fundamentally intelligent compound AI systems as a paradigm emerging to replace AI monoliths. It is further aligned with our long-term position that small language models are ultimately the key to scalable agentic AI. 

To learn more, read our paper and explore ToolOrchestra’s released code.
