
How to Train Scientific Agents with Reinforcement Learning

The scientific process can be repetitive and tedious, with researchers spending hours digging through papers, managing experiment workflows, or wrangling massive multi-modal datasets. Scientific AI agents can take on much of that busywork, acting as assistants that review literature, generate hypotheses, plan experiments, submit computational jobs, orchestrate lab operations, analyze results, and summarize findings. That frees up researchers to focus on creative thinking and scientific discovery. 

But building scientific AI assistants is challenging. Agents must maintain a high-level plan over many steps of research, incorporating memory and context management. A single mistake can potentially derail a research task. Moreover, domain-specific tools are challenging for general-purpose LLMs to leverage, especially in cutting-edge research areas. Verification of results with computational or real-world data can take a long time, requiring an agent to maintain coherence over hours, days, or more. 

Available as open-source libraries within the NVIDIA NeMo framework suite, NVIDIA NeMo Gym and NeMo RL offer a unified, modular reinforcement learning stack for building reliable agentic AI across any domain, including scientific research. NeMo Gym enables developers to create realistic environments where agents can interact, learn, and solve domain-specific tasks, generating high-quality, verifiable rollout data. This training data can then be used with NeMo RL to adapt and improve these agents efficiently at scale.

Both libraries played a key role in the post-training of the latest Nemotron-3-Nano, a cost-efficient model optimized for targeted tasks, delivering high accuracy at low inference cost.

One developer using NeMo Gym and NeMo RL is Edison Scientific, which is working on automating scientific discovery. The spinoff of nonprofit research organization FutureHouse uses the infrastructure to power Aviary, a framework of scientific RL training environments spanning biology, chemistry, and related domains.

In this blog, we demonstrate how to implement agentic training environments using NeMo Gym and how to use them for training with NeMo RL. We feature Aviary as an example of a domain-specific reinforcement learning environment for science.

How reinforcement learning extends LLM capabilities for science

Not all LLMs can execute complex scientific workflows. Pre-training teaches a model to predict the next token, which builds broad knowledge but not domain skills. This foundation allows zero-shot performance on structured factual questions, such as gene–disease links, drug mechanisms, or clinical timelines. Post-training then teaches the model to follow instructions and reflect domain preferences through iterative tuning and alignment.

Post-training usually begins with supervised fine tuning (SFT), where the model learns from instruction-response pairs using a next-token prediction log-likelihood loss. This process depends heavily on high-quality expert or filtered synthetic data and is sensitive to errors. SFT is limited by the coverage of its datasets, and the loss function only rewards reproducing the reference answer, even when alternative correct outputs exist, such as different valid code implementations.

Training pipelines sometimes add reinforcement learning (RL) to expand a model’s ability to reason and act beyond supervised data. RL uses a reward function to score outputs from the model or policy during training. In reinforcement learning from human feedback (RLHF), humans rank responses based on their preference or a rubric. Reinforcement learning from AI feedback (RLAIF) removes the human preference step by using an LLM as a judge. Reinforcement learning with verifiable rewards (RLVR) uses computational checks, such as code execution, to produce objective and repeatable reward signals.
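
To make the RLVR idea concrete, here is a minimal sketch of a verifiable reward function; the function name, the code-execution check, and the expected-output comparison are hypothetical illustrations rather than part of NeMo Gym or NeMo RL.

import subprocess

def verifiable_reward(generated_code: str, expected_output: str, timeout_s: int = 10) -> float:
    """Score a model-generated script by executing it and comparing stdout to a known answer.

    Because the check is computational and repeatable, it needs no human or LLM judge.
    """
    try:
        result = subprocess.run(
            ["python", "-c", generated_code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # treat hangs as failures
    if result.returncode != 0:
        return 0.0  # crashed scripts earn no reward
    return 1.0 if result.stdout.strip() == expected_output.strip() else 0.0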

RLVR is useful for training scientific agents because it allows models to design and run experiments, evaluate outcomes, and optimize toward scientific metrics through verification design and reward shaping. Scientific RL can be run in multi-step environments where an agent takes actions, observes feedback, and continues until a task is complete. Training may use full trajectories or individual state transitions. Through RL, scientific agents can compose skills learned in pre-training and SFT to build new workflows and achieve specific scientific goals.

How NeMo Gym and NeMo RL improve agentic training and evaluation

Implementing RL for LLM agents requires a training framework and environment to define what the agent can do, what it observes, and what rewards it gets for its actions. The training framework, such as NeMo RL, runs training algorithms like group relative policy optimization (GRPO), manages compute for rollouts and verification, and orchestrates updates to the model weights. The latest NeMo RL release supports on-policy distillation, async RL, advanced RL algorithms, and end-to-end FP8 RL training.
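
To illustrate the group-relative idea behind GRPO, the sketch below normalizes each rollout's reward against the mean and standard deviation of its group of attempts at the same task. This is a simplified, hypothetical calculation for intuition, not NeMo RL's implementation.

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Compute GRPO-style advantages for one group of rollouts of the same task.

    Rollouts that beat the group's mean reward get positive advantages;
    weaker rollouts get negative advantages.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts of the same prompt, two of which passed verification.
print(group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.0])))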

An agent drives the interaction loop with the environment by taking actions and leveraging necessary tools, while the environment provides observations and rewards for actions, maintains a persistent state, and determines when a task is complete. Environments may range from a simple Python execution sandbox to a full research software stack for evaluating workflows such as molecular cloning.
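
In pseudocode, that loop looks roughly like the sketch below. The `agent` and `env` objects are placeholders for illustration, not a specific NeMo Gym or Aviary API.

def collect_rollout(agent, env, max_steps: int = 50) -> list[dict]:
    """Run one episode: the agent proposes actions (often tool calls) and the
    environment returns observations and rewards until the task completes."""
    trajectory = []
    observation, tools = env.reset()  # initial observation plus available tools
    for _ in range(max_steps):
        action = agent.act(observation, tools)  # e.g., a tool-call request
        observation, reward, done, truncated = env.step(action)
        trajectory.append({"action": action, "observation": observation, "reward": reward})
        if done or truncated:
            break
    return trajectory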

Training an AI scientist requires models that excel at many complex tasks. In practice, this means hundreds to thousands of diverse tasks across use cases like literature synthesis, hypothesis generation, experimental design, and data analysis, each requiring its own verification logic. As task diversity grows, training environment infrastructure management becomes challenging due to varied dependencies and domain-specific requirements. To address this, we created NeMo Gym—an open source framework for building RL training environments at scale.

NeMo Gym serves as the hub for RL data, environments, and reward signals used in LLM post-training. It provides the infrastructure to develop training environments, scale rollout collection, and integrate seamlessly with your preferred training framework. Environments are isolated and expose REST APIs, enabling parallel execution and scalable deployments without dependency conflicts.

NeMo Gym provides three core server abstractions. A training environment typically includes all three server types working together (a conceptual sketch follows the list below):

  • Model: Wraps OpenAI-compatible endpoints with reasoning and tool-calling support. Models can run locally or in the cloud and work with multiple backends including OpenAI, Azure, and vLLM. This abstraction separates model deployment from agent logic.
  • Resources: Provides tool implementations that can be invoked via tool calling and verification logic that measures task performance. This abstraction offloads heavy processing so agents can asynchronously call both models for inference and resources for tool execution and verification.
  • Agents: Orchestrate interactions between models and resources—routing requests, coordinating multi-turn conversations, and formatting responses consistently.
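
To show how this separation works in practice, the sketch below has an agent-style loop call an OpenAI-compatible model endpoint for the next action and post any tool calls to a separate resources service. The resources URL, route, and payload format are hypothetical illustrations, not NeMo Gym's actual API.

import requests
from openai import OpenAI

# Model server: any OpenAI-compatible endpoint (vLLM, NIM, Azure, OpenAI, ...).
model = OpenAI(base_url="http://localhost:10240/v1", api_key="EMPTY")

# Resources server: hypothetical REST route for tool execution and verification.
RESOURCES_URL = "http://localhost:8000/execute_tool"  # illustrative only

def run_turn(messages: list[dict], tools: list[dict], model_name: str) -> list[dict]:
    """One agent turn: ask the model for an action, execute any tool calls on
    the resources server, and append the results to the conversation."""
    response = model.chat.completions.create(model=model_name, messages=messages, tools=tools)
    choice = response.choices[0].message
    messages.append(choice.model_dump())  # record the assistant action
    for call in choice.tool_calls or []:
        result = requests.post(
            RESOURCES_URL,
            json={"name": call.function.name, "arguments": call.function.arguments},
        ).json()
        messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    return messages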

NeMo Gym generates rollouts and rewards from complex training environments, producing the optimization targets that RL training requires. Interoperable with existing environments, systems, and RL training frameworks, NeMo Gym lets users leverage both custom and NVIDIA-curated environments for LLM post-training. When paired with NeMo RL for training algorithms and infrastructure, the two libraries provide a scalable pipeline for agentic training and reinforcement learning.

NeMo Gym in practice: Training scientific reasoning agents at Edison Scientific

Edison Scientific is using NeMo Gym and NeMo RL to scale AI agents that automate scientific discovery. That includes Aviary, which can train agents in biology, chemistry, and related domains. Agents trained in Aviary can perform tasks such as literature research, bioinformatics data analysis, laboratory tasks like solving molecular cloning problems, and multi-step scientific problem-solving.

Aviary manages state, tool execution, rewards, and observation formatting for RL environments. Its open source repository includes environments for math, scientific literature research, and data analysis. NeMo Gym runs on top of Aviary, allowing Aviary to control its environment logic while NeMo Gym provides scalable rollout collection, additional NVIDIA-curated training environments, and integration with NeMo RL for training at scale.

Each Aviary environment implements two core methods: reset() and step(). The reset method initializes the environment, returns the first observation, and lists available tools. The step method executes an action and returns new observations, rewards, and termination or truncation signals. Actions are tool requests that may include multiple tool calls.
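
For illustration, here is a minimal sketch of an environment following that contract. The imports, base-class generics, and the `exec_tool_calls` helper are assumptions about the Aviary API and may differ slightly across versions; the task itself is a toy example.

from aviary.core import Environment, Message, Tool, ToolRequestMessage

class CalculatorEnv(Environment[None]):
    """Toy single-step environment: the agent must call a calculator tool."""
    state: None = None

    async def reset(self) -> tuple[list[Message], list[Tool]]:
        def add(a: float, b: float) -> float:
            """Add two numbers."""
            return a + b

        self.tools = [Tool.from_function(add)]
        # First observation plus the tools the agent may call.
        return [Message(content="What is 5 + 7? Use the add tool.")], self.tools

    async def step(self, action: ToolRequestMessage) -> tuple[list[Message], float, bool, bool]:
        # Execute the requested tool calls, then verify the result.
        observations = await self.exec_tool_calls(action)
        reward = 1.0 if any("12" in (obs.content or "") for obs in observations) else 0.0
        done, truncated = True, False  # single-step task
        return observations, reward, done, truncated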

Using Aviary through NeMo Gym, Edison Scientific is training a Jupyter-notebook data-analysis agent for bioinformatics tasks. At each step, the agent views the notebook and edits a cell. Notebook size can exceed the model context window, so Edison Scientific added two features to manage context growth. First, the company drops the full interaction history so the agent sees only the original instruction, all previous actions, and the current notebook. Second, it modified GRPO grouping to operate on individual steps rather than full trajectories, which allows training on transitions, reduces context length, and enables step-level reward signals.
As a testbed, Edison Scientific built a Jupyter-based data analysis environment in Aviary, integrated it with NeMo Gym, and introduced a benchmark of verifiable bioinformatics questions called BixBench.
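
The sketch below shows one possible way to implement that step-level grouping (hypothetical code, not Edison Scientific's implementation): rewards are grouped by a task-and-step key rather than by full trajectory before computing group-relative advantages.

from collections import defaultdict
import numpy as np

def step_level_advantages(transitions: list[dict], eps: float = 1e-6) -> list[float]:
    """Group transitions by (task_id, step_idx) and normalize rewards within each
    group, so every notebook-editing step receives its own learning signal."""
    groups = defaultdict(list)
    for i, t in enumerate(transitions):
        groups[(t["task_id"], t["step_idx"])].append(i)

    advantages = [0.0] * len(transitions)
    for indices in groups.values():
        rewards = np.array([transitions[i]["reward"] for i in indices])
        normalized = (rewards - rewards.mean()) / (rewards.std() + eps)
        for i, adv in zip(indices, normalized):
            advantages[i] = float(adv)
    return advantages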

Software architecture diagram showing the relationship between NeMo Gym, NeMo RL, and downstream scientific tasks. NeMo Gym provides agents, resources, and model interfaces that generate training signals from scientific and synthetic data. NeMo RL supplies the models, RL algorithms, and training toolkit used to optimize agent behavior. Outputs are shown flowing to downstream evaluation and inspection, though this represents a conceptual sketch rather than a fully deployed end-to-end system.
Figure 1. High-level architecture of NeMo Gym and NeMo RL for training reinforcement learning agents for scientific tasks. This sketch illustrates how NeMo Gym-managed training environments connect to NeMo RL training infrastructure to train agents that will support downstream scientific workflows.

Building agentic environments in NeMo Gym for training or downstream use is straightforward, requiring just a few steps.

Step 1: Install NeMo Gym

Clone the NeMo Gym repo, install the uv Python package manager, and create a virtual environment:

# Clone the repository
git clone git@github.com:NVIDIA-NeMo/Gym.git
cd Gym

# Install UV (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Create virtual environment
uv venv --python 3.12
source .venv/bin/activate

# Install NeMo Gym
uv sync --extra dev --group docs

Step 2: Configure the model

You can use a hosted model, such as one from OpenAI, or deploy a model locally, for example with NVIDIA NIM or vLLM. In this example, we use nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 from Hugging Face and deploy it with vLLM with tool calling enabled. For more detailed information on how to use the model with vLLM, please see this cookbook.

pip install -U "vllm>=0.12.0"

wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16/resolve/main/nano_v3_reasoning_parser.py

vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --port 10240 \
  --trust-remote-code \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3

Then, create an env.yaml file in the NeMo Gym root directory:

policy_base_url: http://localhost:10240/v1
policy_api_key: EMPTY
policy_model_name: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16

Step 3: Test an Aviary environment with a simple agent in NeMo Gym

Now let’s run an agent through the GSM8K environment, a math problem set where the agent can use a calculator tool.

In NeMo Gym, the `ng_run` command is used to launch servers, which are configured through config files. Here, we provide two: `gsm8k_aviary.yaml` configures the resources server and agent server, and `vllm_model.yaml` defines the model server.

ng_run "+config_paths=[resources_servers/aviary/configs/gsm8k_aviary.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]"

Once all servers are running, you should see logs similar to the following:

All 3 / 3 servers ready! Polling every 60s

####################################################################################################
#
# Server Instances
#
####################################################################################################

[1] gsm8k_aviary_resources_server (resources_servers/aviary)
{
    'process_name': 'gsm8k_aviary_resources_server',
    'server_type': 'resources_servers',
    'name': 'aviary',
    'dir_path': (
        '/home/ubuntu/Gym/resources_servers/aviary'
    ),
    'entrypoint': 'gsm8k_app.py',
    'host': '127.0.0.1',
    'port': 18575,
    'pid': 1582343,
    'config_path': 'gsm8k_aviary_resources_server',
    'url': 'http://127.0.0.1:18575',
}
[2] gsm8k_aviary_agent (responses_api_agents/aviary_agent)
{
    'process_name': 'gsm8k_aviary_agent',
    'server_type': 'responses_api_agents',
    'name': 'aviary_agent',
    'dir_path': (
        '/home/ubuntu/Gym/responses_api_agents/aviary_agent'
    ),
    'entrypoint': 'app.py',
    'host': '127.0.0.1',
    'port': 63115,
    'pid': 1582344,
    'config_path': 'gsm8k_aviary_agent',
    'url': 'http://127.0.0.1:63115',
}
[3] policy_model (responses_api_models/vllm_model)
{
    'process_name': 'policy_model',
    'server_type': 'responses_api_models',
    'name': 'vllm_model',
    'dir_path': (
        '/home/ubuntu/Gym/responses_api_models/vllm_model'
    ),
    'entrypoint': 'app.py',
    'host': '127.0.0.1',
    'port': 55951,
    'pid': 1582347,
    'config_path': 'policy_model',
    'url': 'http://127.0.0.1:55951',
}
####################################################################################################

Next, run the agent in the GSM8K environment. The following command runs the simple agent on the five example problems in the input file and writes the agent trajectories to the output file.

ng_collect_rollouts \
    +agent_name=gsm8k_aviary_agent \
    +input_jsonl_fpath=resources_servers/aviary/data/gsm8k_example.jsonl \
    +output_jsonl_fpath=results/gsm8k_aviary_rollouts.jsonl


You should see output showing the average reward of the trajectories:

Collecting rollouts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:18<00:00,  3.71s/it]
{
    "reward": 1.0,
}

To view the trajectories, NeMo Gym provides a simple UI:

ng_viewer +jsonl_fpath=results/gsm8k_aviary_rollouts.jsonl

Step 4: Build a new environment

To create a new environment in NeMo Gym, you can first build the environment in Aviary, then easily create a new resources server through the Aviary integration. Alternatively, you can create a custom environment from scratch directly in NeMo Gym. In this example, let's add the Aviary HotPotQA environment to NeMo Gym.

First, create `resources_servers/aviary/hotpotqa_app.py`, which extends the base Aviary resources server:

from pydantic import Field
from aviary.envs.hotpotqa import HotPotQADataset, HotPotQAEnv
from resources_servers.aviary.app import AviaryResourcesServer


class HotPotQAResourcesServer(AviaryResourcesServer[HotPotQAEnv, HotPotQADataset]):
    # Serve the Aviary HotPotQA environment; default to its train split.
    dataset: HotPotQADataset = Field(default_factory=lambda: HotPotQADataset(split="train"))


if __name__ == "__main__":
    HotPotQAResourcesServer.run_webserver()

Next, create a configuration file in `resources_servers/aviary/configs/hotpotqa_aviary.yaml`:

hotpotqa_aviary_resources_server:
  resources_servers:
    aviary:
      entrypoint: hotpotqa_app.py
hotpotqa_aviary_agent:
  responses_api_agents:
    aviary_agent:
      entrypoint: app.py
      resources_server:
        type: resources_servers
        name: hotpotqa_aviary_resources_server
      model_server:
        type: responses_api_models
        name: policy_model
      datasets:
      - name: train
        type: train
        jsonl_fpath: resources_servers/aviary/data/hotpotqa_train.jsonl
        gitlab_identifier:
          dataset_name: hotpotqa_train
          version: 0.0.1
          artifact_fpath: hotpotqa_train.jsonl
        license: Apache 2.0
      - name: validation
        type: validation
        jsonl_fpath: resources_servers/aviary/data/hotpotqa_validation.jsonl
        gitlab_identifier:
          dataset_name: hotpotqa_validation
          version: 0.0.1
          artifact_fpath: hotpotqa_validation.jsonl
        license: Apache 2.0
      - name: hotpotqa_example
        type: example
        jsonl_fpath: resources_servers/aviary/data/hotpotqa_example.jsonl
        gitlab_identifier:
          dataset_name: hotpotqa_example
          version: 0.0.1
          artifact_fpath: hotpotqa_example.jsonl
        license: Apache 2.0

Then create an example dataset in `resources_servers/aviary/data/hotpotqa_example.jsonl`, which provides task indices to retrieve samples from the underlying Aviary environment dataset:

{"task_idx":0,"responses_create_params":{"input":[]}}
{"task_idx":1,"responses_create_params":{"input":[]}}
{"task_idx":2,"responses_create_params":{"input":[]}}
{"task_idx":3,"responses_create_params":{"input":[]}}
{"task_idx":4,"responses_create_params":{"input":[]}}

Lastly, update `requirements.txt` to include the `hotpotqa` extra of the `fhaviary` package:

-e nemo-gym[dev] @ ../../
fhaviary[gsm8k,hotpotqa,notebook,llm]>=0.24.1
tqdm
datasets
huggingface-hub

With these four changes, we can now run the Aviary HotPotQA environment in NeMo Gym.
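
Launching and testing the new environment follows the same pattern as the GSM8K example above. The agent name and dataset path below come from the config and example files created in this step; the output path is arbitrary.

ng_run "+config_paths=[resources_servers/aviary/configs/hotpotqa_aviary.yaml,responses_api_models/vllm_model/configs/vllm_model.yaml]"

ng_collect_rollouts \
    +agent_name=hotpotqa_aviary_agent \
    +input_jsonl_fpath=resources_servers/aviary/data/hotpotqa_example.jsonl \
    +output_jsonl_fpath=results/hotpotqa_aviary_rollouts.jsonl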

Visit the NeMo Gym repository for more ready-to-use training environments. The product documentation also provides a more comprehensive overview of key concepts, how to create resources servers, and how to perform RL training. Check out our latest NeMo RL release, which supports on-policy distillation, async RL, advanced RL algorithms, and end-to-end FP8 RL training.

Best practices for building scientific agents

Building scientific agents is challenging, but the following practices can help teams make steady progress toward more capable systems.

  • Start simple. Begin with a basic agent rather than a multi-agent system with many tools. Use outcome-based rewards before introducing complex reward structures, which can lead to reward hacking.
  • Reward profiling. Training with GRPO-style algorithms works well when the model can produce a diverse set of solutions to a task, some of which are correct. Measuring the mean and standard deviation of reward for each task over multiple attempts can help create a more efficient training environment for a model (see the sketch after this list).
  • Monitor training metrics. Various metrics describing training stability, model behavior, and learning progress are automatically logged to Weights & Biases. For example, sampling issues, model collapse, or truncated trajectories can be detected by analyzing these metrics.
  • Train longer. Training with RLVR-based methods can show little learning in early stages, followed by a steeper learning curve later in training. This can happen when the model initially struggles to find correct solutions for the tasks, but later discovers a strategy that works.
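
As a sketch of the reward-profiling step above (hypothetical code, assuming rollouts are recorded as task ID and reward pairs), per-task statistics can flag tasks that are always solved or never solved and therefore provide little group-relative learning signal:

from collections import defaultdict
import statistics

def profile_rewards(rollouts: list[tuple[str, float]]) -> dict[str, dict[str, float]]:
    """Compute mean and standard deviation of reward per task over repeated attempts.

    Tasks with zero variance (always solved or never solved) yield no
    group-relative advantage and can be down-weighted or removed.
    """
    by_task = defaultdict(list)
    for task_id, reward in rollouts:
        by_task[task_id].append(reward)
    return {
        task_id: {
            "mean": statistics.mean(rewards),
            "std": statistics.pstdev(rewards),
            "attempts": float(len(rewards)),
        }
        for task_id, rewards in by_task.items()
    }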

These steps provide a practical path to building and training scientific agents at scale with NeMo Gym, NeMo RL, and Aviary. Get started building your new scientific agent today.

Also check out our new NVIDIA Nemotron 3 model family, with Nano available now.

Contributors to this work included Brian Yu, Chris Wing, Elliot Eshelman, and Sylendran Arunagiri.
