What if your computer-use agent could learn a new Command Line Interface (CLI)—and operate it safely without ever writing files or free-typing shell commands?
In Part 1 of our series on building a computer use agent, we built a custom Bash computer-use agent using NVIDIA Nemotron in just one hour. In this sequel, we’ll take it further by teaching the same reasoning model, which has no prior knowledge of the tool, to safely operate the LangGraph Platform CLI. This shows how easily a large reasoning model can be specialized to perform new, agentic tasks.
Instead of simple file operations, our new agent will learn to start local servers, build containers, and generate Dockerfiles—entirely through a verifiable, human-in-the-loop command interface.
We’ll combine synthetic data generation (SDG) and Reinforcement Learning with Verifiable Rewards (RLVR), optimized via Group Relative Policy Optimization (GRPO), to make training both efficient and safe.
What you’ll build: a specialized agent to run a new CLI tool
You’ll fine-tune an AI agent that can:
- Propose valid LangGraph CLI commands (e.g., langgraph dev --port 8123 --no-browser)
- Ask for explicit human confirmation before executing
- Learn new subcommands from synthetic seed data
- Train efficiently on a single GPU using RLVR
Here’s what a typical interaction looks like once the model is trained:
[🙂] Bring the LangGraph server online.
[🤖] I can execute:
[COMMAND]
["langgraph", "up", "--wait"]
[CONFIRM]
Run this command now? (yes/no)
▶️ Execute `langgraph up --wait`? [y/N]: y
[🤖] Result:
Server started successfully on port 8000.
This pattern generalizes: The same workflow can be extended to support new CLI tools and environments.
Why use synthetic data generation and reinforcement learning to teach a new CLI?
Teaching an AI agent to operate a specialized CLI tool presents unique challenges that traditional approaches struggle with:
The data scarcity problem: Most specialized CLI tools lack the massive usage logs needed for conventional training. Unlike common shell commands, tools like LangGraph have specific syntax, flags, and workflows that aren’t well-represented in general training data. Waiting to collect real-world usage examples could take months or years.
The safety-accuracy tradeoff: You want your agent to be creative in understanding user intent, but absolutely precise when generating commands. A single typo or wrong flag could cause system errors or worse. Traditional fine-tuning often produces models that are either too conservative (refusing valid requests) or too permissive (hallucinating dangerous commands).
How SDG + RL solves this:
- Synthetic data generation lets you bootstrap high-quality training examples from just a handful of seed commands, ensuring complete coverage of the CLI’s capabilities.
- Reinforcement learning with verifiable rewards teaches the model to consistently produce syntactically correct commands by rewarding valid outputs and penalizing errors.
- Together, they create a virtuous cycle: SDG provides diverse training scenarios, while RLVR ensures the model learns to handle them correctly.
This approach is particularly powerful for enterprise environments where you might need to quickly adapt agents to proprietary internal tools without waiting for organic data collection.
Prerequisites
For this setup, you’ll need:
Hardware requirements:
- Access to an NVIDIA GPU with at least 80 GB of memory (e.g., A100 80 GB)
- Minimum 32 GB system RAM
- 100 GB free disk space for model weights and datasets
Software requirements:
- Python 3.10 or newer
- CUDA 12.0+ and appropriate NVIDIA drivers
Core components:
- LangGraph – The target CLI tool our agent will learn to operate
- NeMo Gym – For building the RL training environment with tools and verifiable rewards
- Unsloth – For efficient GRPO-based reinforcement learning with reduced VRAM requirements
- NeMo Data Designer – For generating synthetic training data
Base model:
- Nemotron-Nano-9B-V2 – Available on Hugging Face
- Installation and usage instructions are provided in the linked documentation
Check out a video version of this tutorial:
Video 1. Use SDG and RL to produce a LangGraph CLI BASH Agent.
Step 1: Design a synthetic dataset with NeMo Data Designer
Before training, we need data: pairs of natural-language requests mapped to LangGraph CLI invocations.
We’ll use the NVIDIA NeMo Data Designer to programmatically generate this dataset, starting from a handful of seed examples and expanding into hundreds of verified command pairs.
Why use synthetic data generation?
Think of synthetic data generation like teaching someone a new language by showing them a pattern, then having them create variations. Instead of collecting thousands of real examples (which might not exist yet), we:
- Provide a few high-quality “seed” examples
- Use an AI model to generate diverse variations
- Validate each generated example against strict rules
- Build a comprehensive dataset in hours instead of months
The dataset structure
| User request | CLI command | Confirmation |
|---|---|---|
| “Start a local dev server on port 8123.” | langgraph dev --port 8123 --no-browser | “Proceed with this command? (yes/no)” |
| “Build the project image for both amd64 and arm64.” | langgraph build -t my-graph:multi --platform linux/amd64,linux/arm64 | “Run build now?” |
Each generated record includes:
- User request: Natural language that a human might actually type
- CLI command: The exact, syntactically correct command to execute
- Confirmation prompt: A safety check before execution
The validation process
In Data Designer, we steer diversity with sampling parameters and reject any record that fails validation. For example, we might use a regex pattern like:
^langgraph\s+(dev|build|up|dockerfile)\b
This ensures that:
- Every command starts with langgraph
- Only approved subcommands are used
- The syntax is always valid
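To make that concrete, here’s a minimal sketch of the rejection check as a standalone Python function. The pattern mirrors the regex above; the helper name is purely illustrative, since Data Designer applies this kind of check during generation rather than as a separate script.
import re

# Hypothetical standalone version of the validation check described above
ALLOWED_COMMAND = re.compile(r"^langgraph\s+(dev|build|up|dockerfile)\b")

def is_valid_record(cli_command: str) -> bool:
    """Reject any generated record whose command fails the allowlist pattern."""
    return ALLOWED_COMMAND.match(cli_command) is not None

assert is_valid_record("langgraph dev --port 8123 --no-browser")
assert not is_valid_record("docker compose up")        # wrong tool
assert not is_valid_record("langgraph destroy --all")  # unapproved subcommand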
Finally, we export the dataset in OpenAI-style messages format—ideal for RLVR fine-tuning with the open-source NVIDIA NeMo framework.
This validation process matters: It guarantees that the reward verifier (introduced later) will be consistent with the structure and syntax of the training data.
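For reference, a single exported record might look roughly like the following. The exact field layout depends on your export configuration, so treat this structure as illustrative rather than the canonical schema.
record = {
    "messages": [
        {"role": "user", "content": "Start a local dev server on port 8123."},
        {
            "role": "assistant",
            "content": (
                "[COMMAND]\n"
                '["langgraph", "dev", "--port", "8123", "--no-browser"]\n'
                "[CONFIRM]\n"
                "Proceed with this command? (yes/no)"
            ),
        },
    ]
}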
Let’s look at the implementation in NeMo Data Designer.
# Define seed distributions
command = Sampler(["new", "dev", "up", "build", "dockerfile"])
port = Sampler(range(3000, 9000))
template = Sampler(["react-agent", "memory-agent", "retrieval-agent"])

# Generate natural language input
user_request = LLM(
    prompt=f"Write a request to {command} with {template} on port {port}",
    model="nemotron-3-nano-30b-a3b",
)

# Generate structured output
tool_call = LLM(
    prompt=f"Convert '{user_request}' to CLI JSON.",
    schema=CLIToolCall,
    model="nemotron-3-nano-30b-a3b",
)
Step 2: Fine-tune with RLVR (using GRPO)
With clean, verified data in hand, we move to fine-tuning using Unsloth, an open source framework for efficient reinforcement learning that integrates with NeMo Gym training environments.
Reinforcement Learning with Verifiable Rewards (RLVR)
Traditional reinforcement learning from human feedback (RLHF) is like having a panel of judges score each output—subjective, expensive, and inconsistent. RLVR replaces human judges with deterministic code-based verification.
Instead of asking humans, “Does this command look good?” we ask code, “Does this command pass our validation rules?”
For a CLI agent, the verifier enforces rules such as:
- Output must start with langgraph
- Only approved subcommands and flags allowed
- No commentary, punctuation, or unsafe tokens
The reward system:
✅ Valid command → +1 reward (encourages this behavior)
❌ Invalid command → −1 reward (discourages this behavior)
⚪ Ambiguous output → 0 reward (neutral, no reinforcement)
This consistency is crucial: The same output always yields the same reward, making training stable and predictable. And because the verifier is just code, you can adjust constraints anytime without retraining a separate reward model.
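A minimal verifier in this spirit might look like the sketch below. The allowlists here are illustrative placeholders for your own rules.
import shlex

APPROVED_SUBCOMMANDS = {"new", "dev", "up", "build", "dockerfile"}  # illustrative allowlist
UNSAFE_TOKENS = {"&&", ";", "|", ">", "<", "`"}                     # never allowed in output

def verify_command(output: str) -> float:
    """Return +1 for a valid command, -1 for an invalid one, 0 for ambiguous output."""
    text = output.strip()
    if not text:
        return 0.0  # nothing to score
    try:
        tokens = shlex.split(text)
    except ValueError:
        return -1.0  # unbalanced quotes or other broken syntax
    if any(token in UNSAFE_TOKENS for token in tokens):
        return -1.0  # shell metacharacters and command chaining are rejected outright
    if len(tokens) < 2 or tokens[0] != "langgraph" or tokens[1] not in APPROVED_SUBCOMMANDS:
        return -1.0  # must start with langgraph plus an approved subcommand
    return 1.0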
Building the training environment with NeMo Gym
NeMo Gym is an open source library for building reinforcement learning training environments for LLMs. It provides the infrastructure to define tools, execute agent actions, and compute verifiable rewards—exactly what we need for training a CLI agent.
The CLI agent environment is implemented as a NeMo Gym resource server, which encapsulates:
- Tool definitions – The CLI commands the agent can propose
- Verification logic – Rules that check command validity and correctness
- Reward computation – Scores (0.0 to 1.0) returned to the RL training loop
When the agent proposes commands, the resource server evaluates correctness and returns reward signals for GRPO training. This clean separation between environment logic and training framework means you can iterate on your CLI tools and validation rules without touching the RL code.
To learn more about creating custom environments, see the NeMo Gym documentation and the guide on creating resource servers.
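The exact interfaces are covered in those docs. Conceptually, though, the three responsibilities fit together roughly as in this framework-agnostic sketch; it is plain Python, and the class and method names are illustrative rather than the NeMo Gym API.
from dataclasses import dataclass

@dataclass
class ToolDefinition:
    name: str
    description: str

class CLIAgentEnvironment:
    """Conceptual stand-in for the resource server: tools, verification, reward."""

    def __init__(self):
        # 1. Tool definitions: the CLI commands the agent can propose
        self.tools = [ToolDefinition("langgraph", "Propose a LangGraph CLI command")]

    def verify(self, proposed_command: str) -> bool:
        # 2. Verification logic: allowlists, regex checks, flag validation
        return proposed_command.startswith("langgraph ")

    def compute_reward(self, proposed_command: str) -> float:
        # 3. Reward computation: a score between 0.0 and 1.0 for the RL loop
        return 1.0 if self.verify(proposed_command) else 0.0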
Optimization via Group Relative Policy Optimization (GRPO)
GRPO is a simpler, more memory-efficient alternative to Proximal Policy Optimization (PPO). Instead of training a separate “critic” model to estimate how good each action is, GRPO samples multiple outputs for the same prompt and uses their average reward as the baseline. This cuts the model count in half (no critic needed) and reduces variance by comparing outputs against each other rather than against a learned estimate.
Here’s how it works in practice:
Traditional RL might struggle when most attempts fail. Imagine the model generates 10 command variations for the same prompt:
- Nine are invalid (reward = 0)
- One is valid (reward = 1)
Standard optimization might get lost in the noise of failures. GRPO instead:
- Groups all responses to the same prompt together
- Computes relative advantages within each group
- Strongly reinforces that one success, making it stand out from the failures
This approach dramatically improves learning efficiency and convergence speed, helping the model quickly learn what makes a command valid.
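Under the hood, the “relative advantage” step is just a normalization of each reward against its group. A simplified sketch, covering only the advantage computation rather than the full GRPO update, looks like this:
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Center each reward on the group mean and scale by the group's std deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-identical rewards
    return [(r - mean) / std for r in rewards]

# Ten samples for one prompt: nine invalid (reward 0) and one valid (reward 1)
rewards = [0.0] * 9 + [1.0]
print(group_relative_advantages(rewards))
# The lone success gets a strongly positive advantage (+3.0), while each failure
# gets a mildly negative one (about -0.33), so the valid command stands out.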
Let’s see how we’d implement this with Unsloth and NeMo Gym:
# The "Verifiable Reward" Function
def compute_reward(agent_output, expected):
try:
cmd = json.loads(agent_output)
# Hard Rule: Command must match expectation
if cmd.name != expected.name:
return -1.0 # Penalize hallucinations
# Soft Rule: Flags must be accurate
accuracy = calculate_flag_accuracy(cmd.flags, expected.flags)
return accuracy
except JSONDecodeError:
return -1.0 # Penalize broken syntax
# Start GRPO Training
grpo.train(
model="nemotron-nano-9B-v2",
algorithm="GRPO",
env=compute_reward,
dataset=synthetic_data
)
Step 3: Human-in-the-loop execution
Once fine-tuned, we embed the model into a runtime loop that always requests human confirmation before execution. This maintains the safety architecture introduced in Part 1, ensuring no command runs without explicit approval.
The safety architecture
subprocess.run(argv, shell=False)
This simple line embodies a crucial security principle. By setting shell=False, we ensure:
- Commands execute as discrete argument lists (e.g., ["langgraph", "up", "--wait"])
- Shell metacharacters like &&, ;, or | are treated as literal text, not operators
- Command injection attacks become impossible
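Putting these pieces together, the runtime loop from Part 1 reduces to something like this minimal sketch; the prompt wording mirrors the example in the introduction, and command validation is omitted for brevity.
import subprocess

def confirm_and_run(argv: list[str]) -> None:
    """Ask the human for approval, then execute without shell interpretation."""
    answer = input(f"▶️ Execute `{' '.join(argv)}`? [y/N]: ").strip().lower()
    if answer != "y":
        print("Command skipped.")
        return
    # shell=False passes argv as a discrete argument list, so metacharacters
    # such as && or ; remain literal text and cannot chain extra commands.
    result = subprocess.run(argv, shell=False, capture_output=True, text=True)
    print(result.stdout or result.stderr)

# Example: the command proposed by the agent in the interaction above
# confirm_and_run(["langgraph", "up", "--wait"])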
The complete safety chain
Our multi-layered approach ensures safety at every step:
- Training-time safety: RLVR ensures the model learns to generate valid commands.
- Runtime verification: A validator checks every proposed command against allowlists.
- Human confirmation: Users must explicitly approve each command before execution.
- Execution isolation: Commands run without shell interpretation, preventing injection.
Even if the model occasionally produces an invalid command despite training, the runtime policy prevents it from being executed.
Why RLVR + synthetic data work for customizing Agentic AI
This combination creates a powerful synergy:
| Component | Role | Why it matters |
|---|---|---|
| NeMo Data Designer | Generates realistic, diverse, and structured AI training data with built-in validation | Solves the cold-start problem—you can train without waiting for real usage data |
| NeMo Gym | Provides the training environment with CLI tools and verifiable reward logic | Defines what actions are valid and how success is measured |
| Unsloth for RLVR + GRPO | Executes efficient GRPO training with 80% less VRAM | Makes RL training accessible on a single GPU while maintaining quality |
| Human approval loop | Serves as the final safety gate, keeping users in control | Maintains trust—users always have the final say before any action occurs |
The result: We can teach Nemotron-Nano-9B-V2 to precisely and safely operate a new CLI tool—all without full retraining or compromising on safety.
Closing thoughts
By extending our Bash operator into a LangGraph-aware computer-use agent, we’ve demonstrated how synthetic data generation and RLVR (with GRPO) form a powerful recipe for rapidly specializing large reasoning models to new toolchains.
The workflow generalizes cleanly to any CLI tool:
- Use NeMo Data Designer to define structured, verifiable examples
- Build a NeMo Gym environment with your CLI tools and verification logic
- Fine-tune efficiently with Unsloth’s GRPO
- Maintain human-in-the-loop execution for safety
This pattern lets you turn any capable large language model (LLM) into a domain-specific, verifiably safe computer-use agent—from LangGraph today to your proprietary internal tools tomorrow.
The implications are significant: Instead of waiting months to collect training data or accepting the risks of uncontrolled command generation, you can deploy specialized, safe CLI agents in days. Whether you’re automating DevOps workflows, creating customer support tools, or building internal productivity agents, this approach provides a fast, safe path from idea to production.
Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
- Visit our Nemotron developer page for all the essentials you need to get started with the most open, smartest-per-compute reasoning model.
- Explore new open Nemotron models and datasets on Hugging Face and NIM microservices and Blueprints on build.nvidia.com.
- Tune into upcoming Nemotron livestreams and connect with the NVIDIA Developer community through the Nemotron developer forum and the Nemotron channel on Discord.
- Browse video tutorials and livestreams to get the most out of NVIDIA Nemotron.