Reinforcement learning (RL) is central to aligning language models, from reinforcement learning from human feedback (RLHF) in AI assistants to newer reinforcement learning with verifiable rewards (RLVR) workflows for reasoning and agent tasks.
RL is now becoming a practical technique for specialized AI where enterprises need more accurate agents for domain-specific workflows. Open models provide more control over data, IP, and deployment, while RL turns domain success criteria into training signals.
Frontier labs have shown RL can improve general model capabilities. OpenAI trained their o-series models with large-scale RL, and DeepSeek-R1 showed how group relative policy optimization (GRPO) and verifiable rewards improve math, code, and reasoning behavior.
NVIDIA Nemotron 3 Super was post-trained using multi-environment RL across 21 NVIDIA NeMo Gym verifiers and 37 datasets, generating about 1.2 million environment rollouts.
This guide helps model-builders, research teams, and agent developers decide when to use RL and how to run a first verifiable RL training loop for long-running agents.
Why RL matters for agents
Organizations need specialized agents for workflows such as security triage, scientific discovery, CLI automation, customer support, data analysis, and internal tool use. Customizing open models like Nemotron makes this practical. Teams can specialize for accuracy and speed while keeping control over data, IP, and deployment.
Prompting, RAG, and tools can get you far. Think of the model as the agent’s brain, the agent harness as its body, and tools as the workspace it can act in. Improving the harness or adding tools can help, but it doesn’t always change model behavior. If the agent repeats tool-call mistakes, fails in long workflows, formats outputs incorrectly, or chooses the wrong strategy, you need a training signal. That is where RL fits.
RL lets you define success, generate attempts, score them, and update model weights so successful behavior becomes more likely. In agentic systems, that reward can come from a verifier: code that scores outputs or trajectories using tests, tool execution, schema validation, simulators, reward models, LLM-as-judge review, human preference labels, or other task-specific feedback.

Nemotron, NVIDIA NeMo RL, and NVIDIA NeMo Gym provide open models, post-training workflows, and environment infrastructure that work with ecosystem tools such as OpenRLHF, PrimeIntellect, SGLang, Unsloth, veRL, and vLLM.
RAG, prompting, SFT, and RL: When to use what
Avoid starting with: “Which algorithm should I use?”
Start with: “What behavior do I want to increase, and how will I measure it?”
Here is an example decision matrix:
| Problem | First technique to try |
| --- | --- |
| The model lacks domain facts | RAG or data injection |
| The model does not follow a format | Prompting, then SFT |
| The model needs to imitate examples | SFT (with LoRA or QLoRA for efficient adaptation) |
| You have preferred vs rejected outputs | DPO |
| You can verify success algorithmically | RLVR with GRPO |
| You need nuanced human preference alignment | RLHF or reward modeling |
| The agent fails across long-horizon workflows with multiple tool calls, state updates, or conversational turns | Environment-based RL with trajectory-level rewards |
SFT vs DPO vs RLVR vs RLHF
Use SFT when you have demonstrations of desired behavior, such as instruction following, multi-turn conversations, output schemas, tool-call formats, or domain workflows.
Use DPO when you have preference pairs, where one answer is better than another.
Use RLHF when nuanced human preferences cannot be captured by rules and you can support preference data, reward models, and careful training infrastructure.
Use RLVR when correctness can be checked algorithmically, such as valid JSON, correct CLI commands, passing tests, exact math answers, successful tool calls, or simulator outcomes.
The best method depends on the signal you have. For verifiable tool-use and agent workflows, a common starting path is: SFT if needed → GRPO with verifiable rewards → evaluate → inspect failures → repeat.
GRPO is a good default for RLVR or verifiable tasks
For RLVR workflows, GRPO is often a practical starting point. It generates multiple completions per prompt, scores them with a verifier, and updates the model based on how each completion performed relative to the rest of its group. Compared with PPO-style RLHF, GRPO has fewer moving parts (notably, it needs no separate value model) and works naturally with rule-based rewards, which has made it a default in many agentic RL examples.
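The group-relative scoring described above can be sketched in a few lines. This is a minimal illustration and not the NeMo RL implementation: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions, and real implementations differ in how they normalize.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each completion against the others sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions for one prompt, scored by a binary verifier.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Every completion failed, so none stands out from the group.
all_failed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

print(mixed)
print(all_failed)
```

Passing completions get a positive advantage and failing ones a negative advantage. When every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason to check that the model sometimes succeeds before training.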
Newer variants continue to emerge as RL training systems mature. For example, decoupled clip and dynamic sampling policy optimization (DAPO) builds on GRPO with dynamic sampling and asymmetric clipping to preserve useful learning signal and exploration diversity, while group sequence policy optimization (GSPO) optimizes at the sequence level instead of the token level to improve training stability, especially for Mixture-of-Experts (MoE) models.
The rest of this guide focuses on a practical RLVR workflow using GRPO and environment-based evaluation.
The minimal RL loop
An RL training run for LLMs or agents has seven parts:
- Policy model: the model you are training
- Task: the input the model receives
- Action: the model output, tool call, code patch, command, or multi-step trajectory
- Environment: the system that executes the action and provides feedback
- Verifier: the signal that scores success or produces rewards
- Rollouts: sampled attempts from the current model
- Policy update: the training step that increases the probability of better outputs
This RL 101 Glossary provides a detailed understanding of each component and how they work together.
Start with evaluation before training. Run the current model on a held-out task set, inspect failures, and profile the verifier or reward function before updating weights. RL works best when the model can sometimes produce the right behavior but doesn’t do so reliably. If the reward is wrong, RL will optimize the wrong behavior.
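To make the seven parts concrete, here is a toy loop in plain Python. It is a sketch under strong simplifying assumptions: the "policy model" is a softmax over three canned outputs rather than an LLM, the environment does nothing beyond passing the action to the verifier, and the update is a basic policy-gradient step with a group-mean baseline. All names (`CANDIDATES`, `TASK`, `verifier`, `train`) are hypothetical.

```python
import math
import random

# Action space: three canned outputs the toy policy can choose from.
CANDIDATES = ['{"tool": "calendar.create_event"}', '{"tool": "email.send"}', "not json"]
# Task: the input plus the answer the verifier checks against.
TASK = {"prompt": "Create a calendar event", "expected": CANDIDATES[0]}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verifier(action, task):
    """Binary reward: 1.0 if the action matches the expected output."""
    return 1.0 if action == task["expected"] else 0.0


def train(steps=200, group_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # policy model: one logit per candidate output
    for _ in range(steps):
        probs = softmax(logits)
        # Rollouts: sample a group of attempts from the current policy.
        picks = rng.choices(range(len(CANDIDATES)), weights=probs, k=group_size)
        rewards = [verifier(CANDIDATES[i], TASK) for i in picks]
        baseline = sum(rewards) / group_size
        # Policy update: raise the log-probability of above-average attempts.
        grad = [0.0] * len(logits)
        for i, r in zip(picks, rewards):
            advantage = r - baseline
            for k in range(len(logits)):
                grad[k] += advantage * ((1.0 if k == i else 0.0) - probs[k])
        logits = [l + lr * g / group_size for l, g in zip(logits, grad)]
    return softmax(logits)


final_probs = train()
print(final_probs)
```

The probability of the verified-correct output starts at one third and grows as training proceeds. If the verifier rewarded the wrong output, the same loop would just as readily make that output more likely.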
The most common challenges developers face with RL involve data, environment design, reward design, and compute decisions.
Task and training data
For SFT, you need input-output examples that teach the model the desired behavior. For RLVR, you need tasks plus environment logic, including verifiers and tools, that can score outputs. Synthetic data generation helps expand coverage when real examples are sparse: generate task variants, edge cases, tool-call scenarios, and expected outputs, then filter them with validators, reward models, or LLM-as-judge review.
NVIDIA NeMo Data Designer can help generate structured task datasets from scratch or from seed data, control relationships between fields, run generation in batches, and validate outputs against specifications. NeMo Gym can then run those tasks through environments to generate scored trajectories from models or teacher agents, which can be used for SFT demonstrations, preference pairs, or RL rollouts.
Synthetic data is not ground truth. Keep a small human-quality seed set, deduplicate aggressively, hold out eval tasks, and inspect failure cases before using the data for training.
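The deduplication and hold-out steps can be sketched as follows. This is an illustrative baseline, not a NeMo Data Designer feature: `normalize`, `dedupe`, and `split` are hypothetical helpers, and exact matching after normalization only catches trivial duplicates, so near-duplicate detection would still be needed for truly aggressive deduplication.

```python
import hashlib


def normalize(prompt):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(prompt.lower().split())


def dedupe(tasks):
    """Keep the first task for each normalized prompt."""
    seen, unique = set(), []
    for task in tasks:
        key = normalize(task["prompt"])
        if key not in seen:
            seen.add(key)
            unique.append(task)
    return unique


def split(tasks, eval_percent=10):
    """Assign tasks to train or held-out by hashing the normalized prompt.

    Hashing keeps the assignment stable across runs, so a task never
    drifts from the eval set into the training set when data is added.
    """
    train, held_out = [], []
    for task in tasks:
        digest = hashlib.sha256(normalize(task["prompt"]).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        (held_out if bucket < eval_percent else train).append(task)
    return train, held_out


tasks = [
    {"prompt": "Create a calendar event for Alex"},
    {"prompt": "create a  calendar event for ALEX"},
    {"prompt": "List my meetings for Friday"},
]
unique_tasks = dedupe(tasks)
train_tasks, eval_tasks = split(unique_tasks)
print(len(unique_tasks), len(train_tasks), len(eval_tasks))
```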
Agentic RL needs environments, not just datasets
For simpler single-turn tasks, a static dataset may be enough:
{"prompt": "Return valid JSON for this command...", "answer": "..."}
For agentic RL across single-step, multi-step, and multi-turn workflows, you need an environment that defines the:
- Dataset
- Agent Harness
- Verifier
- State
For example, a long-running coding agent may need many tool calls before tests pass; a data analysis agent may need to inspect files, run queries, generate charts, and validate results; a scientific agent may need to search literature, call simulators, and revise hypotheses.
The environment can be simple or complex: a parser plus answer checker for math, unit tests for code generation, or a sandbox with tools, files, turn limits, and task-specific success criteria for long-running agents.
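As a rough illustration of the "sandbox with tools, files, turn limits, and task-specific success criteria" case, here is a toy environment. It is a sketch only and does not use the NeMo Gym API: the class `ToyFileEnv`, its `reset` and `step` methods, and the tool names `write_file` and `read_file` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileEnv:
    """Hypothetical sandbox: the agent must write 'done' into report.txt."""

    max_turns: int = 4
    files: dict = field(default_factory=dict)  # state: an in-memory file system
    turns: int = 0

    def reset(self):
        """Clear the state and return the task description."""
        self.files = {}
        self.turns = 0
        return "Task: write 'done' into report.txt"

    def step(self, action):
        """Execute one tool call and return (observation, reward, finished)."""
        self.turns += 1
        tool = action.get("tool")
        path = action.get("path")
        if tool == "write_file":
            self.files[path] = action.get("content", "")
            observation = f"wrote {path}"
        elif tool == "read_file":
            observation = self.files.get(path, "file not found")
        else:
            observation = "unknown tool"
        # Verifier: success depends on the final state, not on the wording of the output.
        success = self.files.get("report.txt") == "done"
        finished = success or self.turns >= self.max_turns
        return observation, (1.0 if success else 0.0), finished


env = ToyFileEnv()
print(env.reset())
print(env.step({"tool": "read_file", "path": "report.txt"}))
print(env.step({"tool": "write_file", "path": "report.txt", "content": "done"}))
```

The reward here is computed from the environment state after the action runs, which is what distinguishes an environment from a static dataset of prompts and answers.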
Evals and environments are two sides of the same system. A good eval conveys whether the model succeeded, and a good RL environment turns that signal into training data.
NeMo Gym provides a scalable, reproducible way to build environments that connect agents, models, external systems, tools, and verifiers, with tutorials for single-step, multi-step, stateful, real-world, and LLM-as-judge environments.

Reward and verifier design: Start simple
Reward design is where many RL projects get overcomplicated.
Start with the simplest reward that proves the loop works. For RLVR, this can be binary: +1 if the output passes the verifier, 0 otherwise.
Add intermediate signals only when they measure real progress. For a coding agent, these could be:
+0.1 selected the right tool
+0.2 produced valid intermediate artifact
+0.3 passed partial test
+1.0 completed task
-1.0 unsafe action
Too much shaping can teach the model to optimize the checklist instead of the task. Good reward functions have three properties:
- They measure the real task.
- They are hard to game/hack.
- They fail visibly when wrong.
Before training, run your reward function against 50-100 model outputs and inspect the scores manually. If the reward disagrees with your judgment, fix the reward.
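One way to run this manual check is to label a handful of outputs by hand and list every case where the reward disagrees with the label. The sketch below assumes a hypothetical sample format with `output` and `human_ok` fields, and `naive_reward` is a deliberately weak reward included only to show what a disagreement looks like.

```python
def audit_reward(samples, reward_fn, pass_threshold=1.0):
    """Return the samples where the reward verdict differs from the human label."""
    disagreements = []
    for sample in samples:
        score = reward_fn(sample["output"])
        if (score >= pass_threshold) != sample["human_ok"]:
            disagreements.append({**sample, "score": score})
    return disagreements


def naive_reward(output):
    """A gameable reward: it only checks that the tool name appears somewhere."""
    return 1.0 if "calendar.create_event" in output else 0.0


samples = [
    {"output": '{"tool": "calendar.create_event"}', "human_ok": True},
    {"output": "I would call calendar.create_event here", "human_ok": False},
    {"output": '{"tool": "email.send"}', "human_ok": False},
]

flagged = audit_reward(samples, naive_reward)
for item in flagged:
    print(item["score"], item["output"])
```

Here the second sample is flagged: the substring check rewards prose that merely mentions the tool, which is exactly the kind of loophole RL would learn to exploit.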
Compute: Budget for training and rollouts
RL cost comes from two workloads: rollout and training. Both are affected by batch size, model size, and sequence length. Rollout cost also scales with the number of tool calls, conversation turns, and environment steps. Inference software such as vLLM improves rollout latency, while NeMo Gym improves tool-call orchestration.
Training cost, in turn, scales with dataset size and the number of policy updates. Training software such as Megatron and NeMo Automodel improves throughput. RL frameworks such as NeMo RL build on top of this inference and training software to run the full loop efficiently.
GPU needs vary by workload. A small adapter-based ~1B-8B experiment can start on a single modern GPU or small multi-GPU node. Larger models, full fine-tuning, long-context tasks, high rollout counts, or multi-step agent environments need multiple GPUs. When compute is limited, reduce model size, max tokens, generations per prompt, and parallel environments first.
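A back-of-the-envelope estimate shows why generations per prompt and max tokens are the first knobs to turn. The function and all numbers below are illustrative assumptions, not measurements: the result is an upper bound on generated tokens only, and it ignores prompt tokens and the growing context in multi-turn rollouts.

```python
def rollout_token_budget(prompts_per_step, generations_per_prompt,
                         max_tokens_per_turn, turns_per_rollout, training_steps):
    """Return (total rollouts, upper bound on generated tokens) for a run."""
    rollouts = prompts_per_step * generations_per_prompt * training_steps
    max_generated_tokens = rollouts * max_tokens_per_turn * turns_per_rollout
    return rollouts, max_generated_tokens


# Hypothetical single-turn run: 32 prompts per step, 8 generations each, 100 steps.
rollouts, tokens = rollout_token_budget(32, 8, 512, 1, 100)
print(rollouts, tokens)

# Halving the generations per prompt halves both numbers.
small_rollouts, small_tokens = rollout_token_budget(32, 4, 512, 1, 100)
print(small_rollouts, small_tokens)
```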
For early experiments, start small. Smaller models are well suited to debugging data, verifiers, environments, and training loops. Complex tasks may need a more capable model before the reward curve or held-out evals show meaningful improvement.
A practical first RL training run is small, verifiable and inspectable
Let’s build a workplace assistant-style agent that generates correct JSON tool calls from natural-language requests:
First, pipe-clean the setup with the NeMo RL getting-started example.
Then move to a multi-step example such as the NeMo RL and NeMo Gym Workplace Assistant tutorial, which trains Nemotron Nano 9B v2 with GRPO for tool calling across project-management workflows. The steps below use a simple one-step tool-use task, which can be extended to multi-step environments.
Step 1: Pick one behavior
Choose one behavior for automatic evaluation.
Example: Given a natural-language request, produce the correct JSON tool call for an internal CLI or API.
{
  "prompt": "Create a calendar event for Alex next Tuesday at 2 PM.",
  "expected_tool": "calendar.create_event",
  "expected_args": {
    "attendee": "Alex",
    "day": "Tuesday",
    "time": "14:00"
  }
}
Step 2: Run a baseline eval and classify failures
Prepare separate training and validation task files before training. Run the baseline model on the validation set and measure valid JSON rate, correct tool or command rate, correct argument rate, execution success rate, and unsafe action rate.
Then inspect outputs and group failures by type: format errors, wrong tool or command, inconsistent success, unsafe actions, or long-horizon failures. Confirm the model has a measurable failure pattern before training.
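A baseline eval for this task can be a short script. The sketch below computes three of the metrics named above, assuming model outputs are JSON objects with hypothetical `tool` and `args` fields; execution success and unsafe action rates are omitted because they need an environment to run the call.

```python
import json


def evaluate(outputs, tasks):
    """Return valid-JSON, correct-tool, and correct-argument rates."""
    counts = {"valid_json": 0, "correct_tool": 0, "correct_args": 0}
    for output, task in zip(outputs, tasks):
        try:
            parsed = json.loads(output)
        except (TypeError, ValueError):
            continue  # format error
        if not isinstance(parsed, dict):
            continue  # valid JSON, but not a tool-call object
        counts["valid_json"] += 1
        if parsed.get("tool") == task["expected_tool"]:
            counts["correct_tool"] += 1
            # Arguments only count when the tool is also right.
            if parsed.get("args") == task["expected_args"]:
                counts["correct_args"] += 1
    total = max(len(tasks), 1)
    return {name: count / total for name, count in counts.items()}


task = {
    "expected_tool": "calendar.create_event",
    "expected_args": {"attendee": "Alex", "day": "Tuesday", "time": "14:00"},
}
outputs = [
    '{"tool": "calendar.create_event", "args": {"attendee": "Alex", "day": "Tuesday", "time": "14:00"}}',
    '{"tool": "calendar.create_event", "args": {"attendee": "Alex", "day": "Tuesday", "time": "2 PM"}}',
    '{"tool": "email.send", "args": {}}',
    "Sure! Here is the call: calendar.create_event",
]
metrics = evaluate(outputs, [task] * len(outputs))
print(metrics)
```

Each gap between adjacent rates points at a failure bucket: outputs that are not valid JSON are format errors, valid outputs with the wrong tool are tool-selection errors, and so on.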
This tells you which training method to use.
Step 3: Decide whether SFT is needed
If the model rarely follows the expected format or tool family, start with SFT. If it can sometimes succeed but is inconsistent, move to RLVR with GRPO.
A practical path is:
- SFT for format and task understanding
- RLVR with GRPO for reliability improvement
- Held-out evals before deployment
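One way to ground the SFT-or-RL decision is to sample the baseline model several times per task and bucket tasks by how often it succeeds. The helper below is a hypothetical sketch; `success_counts` is assumed to come from your own baseline eval.

```python
def triage(success_counts, k):
    """Bucket tasks by successes out of k samples from the baseline model.

    success_counts maps a task id to the number of successful samples.
    """
    buckets = {"never": [], "sometimes": [], "always": []}
    for task_id, wins in success_counts.items():
        if wins == 0:
            buckets["never"].append(task_id)
        elif wins == k:
            buckets["always"].append(task_id)
        else:
            buckets["sometimes"].append(task_id)
    return buckets


buckets = triage({"task-1": 0, "task-2": 3, "task-3": 8}, k=8)
print(buckets)
```

Tasks in the "sometimes" bucket are where RLVR has signal to work with. Tasks the model never solves give group-relative methods nothing to reinforce, so they are candidates for SFT or better prompting first, and tasks it always solves add little to training.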
Step 4: Build the verifier or reward function
Once RL is the right next step, turn your eval logic into a reward function. Start simple and deterministic.
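import json

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except (TypeError, ValueError):
        return False

def reward(output, expected):
    # argument_match and executes_safely are task-specific helpers you supply.
    if not is_valid_json(output):
        return -1.0
    parsed = json.loads(output)
    if not isinstance(parsed, dict):
        return -1.0  # valid JSON, but not a tool-call object
    score = 0.0
    if parsed.get("tool") == expected["expected_tool"]:
        score += 0.4
    score += 0.4 * argument_match(
        parsed.get("args", {}),
        expected["expected_args"]
    )
    if executes_safely(parsed):
        score += 0.2
    return score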
Run this verifier against sample outputs before training. If the reward disagrees with what you would consider correct, fix the verifier first.
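The reward function above calls `argument_match` and `executes_safely` without defining them. One possible shape for these helpers is sketched below; the scoring rule (fraction of expected arguments matched exactly) and the `ALLOWED_TOOLS` allowlist are assumptions for illustration, and your own task may need stricter checks.

```python
# Hypothetical allowlist of tools the agent is permitted to call.
ALLOWED_TOOLS = {"calendar.create_event", "calendar.list_events"}


def argument_match(actual, expected):
    """Fraction of expected arguments present with exactly the expected value."""
    if not expected:
        return 1.0
    if not isinstance(actual, dict):
        return 0.0
    hits = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return hits / len(expected)


def executes_safely(call):
    """Static check only: the tool must be on the allowlist. Nothing is executed."""
    return isinstance(call, dict) and call.get("tool") in ALLOWED_TOOLS


expected_args = {"attendee": "Alex", "day": "Tuesday", "time": "14:00"}
partial = argument_match(
    {"attendee": "Alex", "day": "Tuesday", "time": "2 PM"}, expected_args
)
print(partial)
print(executes_safely({"tool": "calendar.create_event"}))
print(executes_safely({"tool": "shell.rm"}))
```

Note that this version of `argument_match` does not penalize extra, unexpected arguments. Whether that matters depends on the tool, and it is the kind of gap a manual review of scored samples tends to surface.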
Step 5: Run a small GRPO job
Start with a small model or adapter-based run. Keep the first run intentionally narrow.
model: nvidia/Nemotron-Nano-9B-v2
algorithm: grpo
adapter: lora
num_generations_per_prompt: 8
reward: tool_call_verifier
eval_interval: 10
held_out_eval: tool_call_eval
Step 6: Track the right metrics
Do not track training reward alone. Track validation reward, success rate, invalid outputs, unsafe actions, latency, and cost. Check whether validation reward or accuracy improves over the baseline on tasks the model did not train on.
Step 7: Inspect failures and promote carefully
Sample outputs at every checkpoint. Look for reward hacking, formatting regressions, unsafe actions, or cases where reward improves but real quality gets worse. Ship only after the tuned model beats the baseline on held-out evals without regressing on safety, latency, or general capability checks.
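The shipping rule above can be written down as an explicit gate so that promotion decisions are repeatable. This is a sketch under assumptions: the metric names, the `min_gain` threshold, and the strict no-regression rule are all hypothetical choices, and a real gate would likely allow some tolerance on latency and add general capability checks.

```python
def should_promote(baseline, tuned, min_gain=0.02):
    """Return (decision, reasons) from held-out metrics for two models."""
    reasons = []
    if tuned["success_rate"] < baseline["success_rate"] + min_gain:
        reasons.append("success rate did not improve enough")
    # Lower is better for each of these metrics.
    for metric in ("unsafe_rate", "invalid_rate", "latency_s"):
        if tuned[metric] > baseline[metric]:
            reasons.append(f"{metric} regressed")
    return (not reasons), reasons


baseline = {"success_rate": 0.60, "unsafe_rate": 0.01, "invalid_rate": 0.10, "latency_s": 1.2}
tuned = {"success_rate": 0.72, "unsafe_rate": 0.01, "invalid_rate": 0.03, "latency_s": 1.2}
risky = {"success_rate": 0.75, "unsafe_rate": 0.02, "invalid_rate": 0.03, "latency_s": 1.2}

print(should_promote(baseline, tuned))
print(should_promote(baseline, risky))
```

The second candidate has the higher success rate but is rejected because unsafe actions increased, which mirrors the case where reward improves while real quality gets worse.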
Continuous improvement for long-running agents
A long-running agent should improve like a software system. Use RL as a practical loop for continuously improving a production-grade agentic workflow:
- Log real trajectories: Capture prompts, tool calls, observations, outputs, failures, human interventions, and final outcomes.
- Convert failures into evals: Every production failure should become a regression test or environment task.
- Bucket failure modes: Separate format errors, wrong tool choice, bad planning, unsafe actions, retrieval failures, hallucinated APIs, and incomplete execution.
- Choose the lightest fix: Prompt or tool change first. SFT for repeated format or domain behavior. DPO for preference quality. RLVR/GRPO when you can verify success.
- Train, then evaluate on held-out tasks: Keep a fixed eval set that the model never trains on. Add fresh tasks continuously.
- Promote only if behavior improves: Compare baseline vs tuned model on success rate, safety failures, cost, latency, and regression tests.
- Keep tuned models versioned: For agent specialization, version each tuned checkpoint or adapter so you can compare behavior, roll back safely, and avoid overwriting the base model.
The output of this loop is an agent flywheel: production failures become evals, evals become environments, environments generate rewards, rewards improve the model, and the next model is tested before deployment.
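The first link in that flywheel, turning a logged production failure into an eval task, might look like the sketch below. The trajectory fields, the bucket names, and the helper `failure_to_eval_task` are hypothetical, and the expected tool and arguments are assumed to come from a human reviewer, not from the failed trajectory itself.

```python
import json

BUCKETS = (
    "format_error", "wrong_tool", "bad_planning", "unsafe_action",
    "retrieval_failure", "hallucinated_api", "incomplete_execution",
)


def failure_to_eval_task(trajectory, bucket, expected_tool, expected_args):
    """Build a regression task from a logged failure and a reviewed answer."""
    if bucket not in BUCKETS:
        raise ValueError(f"unknown failure bucket: {bucket}")
    return {
        "prompt": trajectory["prompt"],
        "expected_tool": expected_tool,
        "expected_args": expected_args,
        "failure_bucket": bucket,
        "source": "production",
    }


logged = {
    "prompt": "Create a calendar event for Alex next Tuesday at 2 PM.",
    "output": '{"tool": "email.send", "args": {}}',
}
eval_task = failure_to_eval_task(
    logged,
    bucket="wrong_tool",
    expected_tool="calendar.create_event",
    expected_args={"attendee": "Alex", "day": "Tuesday", "time": "14:00"},
)
# One JSON object per line keeps the eval file easy to append to and diff.
print(json.dumps(eval_task))
```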
Get started with reinforcement learning for LLMs and agents
The shortest path to a useful RL run is a clear task, a trustworthy verifier, a small baseline model, and a held-out eval that tells you whether the model actually got better.
Ready to apply RL to your own models and agents? Accelerate development with NVIDIA NeMo and Nemotron:
- Open models, recipes, and agentic AI examples with Nemotron
- Scalable post-training with NeMo RL
- Verifiable reward environments, rollouts, and evaluation with NeMo Gym
- Synthetic data generation with NeMo Data Designer
- A detailed guide on different techniques for customizing agents
These tools help developers move from task definition and data generation to training, evaluation, and optimization without rebuilding the full RL stack from scratch.