What if your computer-use agent could learn a new Command Line Interface (CLI)—and operate it safely without ever writing files or free-typing shell commands?
In Part 1 of our series on building a computer use agent, we built a custom Bash computer-use agent using NVIDIA Nemotron in just one hour. In this sequel, we’ll take it further by teaching the same reasoning model, which has no prior knowledge of the tool, to safely operate the LangGraph Platform CLI. This shows how easily a large reasoning model can be specialized to perform new, agentic tasks.
Instead of simple file operations, our new agent will learn to start local servers, build containers, and generate Dockerfiles—entirely through a verifiable, human-in-the-loop command interface.
We’ll combine synthetic data generation (SDG) and Reinforcement Learning with Verifiable Rewards (RLVR), optimized via Group Relative Policy Optimization (GRPO), to make training both efficient and safe.
What you’ll build: a specialized agent to run a new CLI tool
You’ll fine-tune an AI agent that can:
- Propose valid LangGraph CLI commands (e.g., langgraph dev --port 8123 --no-browser)
- Ask for explicit human confirmation before executing
- Learn new subcommands from synthetic seed data
- Train efficiently on a single GPU using RLVR
Here’s what a typical interaction looks like once the model is trained:
[🙂] Bring the LangGraph server online.
[🤖] I can execute:
[COMMAND]
["langgraph", "up", "--wait"]
[CONFIRM]
Run this command now? (yes/no)
▶️ Execute `langgraph up --wait`? [y/N]: y
[🤖] Result:
Server started successfully on port 8000.
This pattern generalizes: The same workflow can be extended to support new CLI tools and environments.
Why use synthetic data generation and reinforcement learning to teach a new CLI?
Teaching an AI agent to operate a specialized CLI tool presents unique challenges that traditional approaches struggle with:
The data scarcity problem: Most specialized CLI tools lack the massive usage logs needed for conventional training. Unlike common shell commands, tools like LangGraph have specific syntax, flags, and workflows that aren’t well-represented in general training data. Waiting to collect real-world usage examples could take months or years.
The safety-accuracy tradeoff: You want your agent to be creative in understanding user intent, but absolutely precise when generating commands. A single typo or wrong flag could cause system errors or worse. Traditional fine-tuning often produces models that are either too conservative (refusing valid requests) or too permissive (hallucinating dangerous commands).
How SDG + RL solves this:
- Synthetic data generation lets you bootstrap high-quality training examples from just a handful of seed commands, ensuring complete coverage of the CLI’s capabilities.
- Reinforcement learning with verifiable rewards teaches the model to consistently produce syntactically correct commands by rewarding valid outputs and penalizing errors.
- Together, they create a virtuous cycle: SDG provides diverse training scenarios, while RLVR ensures the model learns to handle them correctly.
This approach is particularly powerful for enterprise environments where you might need to quickly adapt agents to proprietary internal tools without waiting for organic data collection.
Prerequisites
For this setup, you’ll need:
Hardware requirements:
- Access to an NVIDIA GPU with at least 80 GB of memory (e.g., A100 80 GB)
- Minimum 32 GB system RAM
- 100 GB free disk space for model weights and datasets
Software requirements:
- Python 3.10 or newer
- CUDA 12.0+ and appropriate NVIDIA drivers
Core components:
- LangGraph – The target CLI tool our agent will learn to operate
- NeMo Gym – For building the RL training environment with tools and verifiable rewards
- Unsloth – For efficient GRPO-based reinforcement learning with reduced VRAM requirements
- NeMo Data Designer – For generating synthetic training data
Base model:
- Nemotron-Nano-9B-V2 – Available on Hugging Face
- Installation and usage instructions are provided in the linked documentation
Check out a video version of this tutorial:
Video 1. Use SDG and RL to produce a LangGraph CLI BASH Agent.
Step 1: Design a synthetic dataset with NeMo Data Designer
Before training, we need data: pairs of natural-language requests mapped to LangGraph CLI invocations.
We’ll use the NVIDIA NeMo Data Designer to programmatically generate this dataset, starting from a handful of seed examples and expanding into hundreds of verified command pairs.
Why use synthetic data generation?
Think of synthetic data generation like teaching someone a new language by showing them a pattern, then having them create variations. Instead of collecting thousands of real examples (which might not exist yet), we:
- Provide a few high-quality “seed” examples
- Use an AI model to generate diverse variations
- Validate each generated example against strict rules
- Build a comprehensive dataset in hours instead of months
The dataset structure
| User request | CLI command | Confirmation |
|---|---|---|
| “Start a local dev server on port 8123.” | langgraph dev --port 8123 --no-browser | “Proceed with this command? (yes/no)” |
| “Build the project image for both amd64 and arm64.” | langgraph build -t my-graph:multi --platform linux/amd64,linux/arm64 | “Run build now?” |
Each generated record includes:
- User request: Natural language that a human might actually type
- CLI command: The exact, syntactically correct command to execute
- Confirmation prompt: A safety check before execution
The validation process
In Data Designer, we steer diversity with sampling parameters and reject any record that fails validation. For example, we might use a regex pattern like:
^langgraph\s+(dev|build|up|dockerfile)\b
This ensures that:
- Every command starts with langgraph
- Only approved subcommands are used
- The syntax is always valid
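To make that concrete, here’s a minimal sketch of the rejection check as a standalone Python function. The pattern mirrors the regex above; the helper name is purely illustrative, since Data Designer applies this kind of check during generation rather than as a separate script.
import re

# Hypothetical standalone version of the validation check described above
ALLOWED_COMMAND = re.compile(r"^langgraph\s+(dev|build|up|dockerfile)\b")

def is_valid_record(cli_command: str) -> bool:
    """Reject any generated record whose command fails the allowlist pattern."""
    return ALLOWED_COMMAND.match(cli_command) is not None

assert is_valid_record("langgraph dev --port 8123 --no-browser")
assert not is_valid_record("docker compose up")        # wrong tool
assert not is_valid_record("langgraph destroy --all")  # unapproved subcommand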
Finally, we export the dataset in OpenAI-style messages format—ideal for RLVR fine-tuning with the open-source NVIDIA NeMo framework.
This validation process matters: It guarantees that the reward verifier (introduced later) will be consistent with the structure and syntax of the training data.
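For reference, a single exported record might look roughly like the following. The exact field layout depends on your export configuration, so treat this structure as illustrative rather than the canonical schema.
record = {
    "messages": [
        {"role": "user", "content": "Start a local dev server on port 8123."},
        {
            "role": "assistant",
            "content": (
                "[COMMAND]\n"
                '["langgraph", "dev", "--port", "8123", "--no-browser"]\n'
                "[CONFIRM]\n"
                "Proceed with this command? (yes/no)"
            ),
        },
    ]
}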
Let’s look at the implementation in NeMo Data Designer.
# Define seed distributions
command = Sampler(["new", "dev", "up", "build", "dockerfile"])
port = Sampler(range(3000, 9000))
template = Sampler(["react-agent", "memory-agent", "retrieval-agent"])

# Generate natural language input
user_request = LLM(
    prompt=f"Write a request to {command} with {template} on port {port}",
    model="nemotron-3-nano-30b-a3b",
)

# Generate structured output
tool_call = LLM(
    prompt=f"Convert '{user_request}' to CLI JSON.",
    schema=CLIToolCall,
    model="nemotron-3-nano-30b-a3b",
)
Step 2: Fine-tune with RLVR (using GRPO)
With clean, verified data in hand, we move to fine-tuning using Unsloth, an open source framework for efficient reinforcement learning that integrates with NeMo Gym training environments.
Reinforcement Learning with Verifiable Rewards (RLVR)
Traditional reinforcement learning from human feedback (RLHF) is like having a panel of judges score each output—subjective, expensive, and inconsistent. RLVR replaces human judges with deterministic code-based verification.
Instead of asking humans, “Does this command look good?” we ask code, “Does this command pass our validation rules?”
For a CLI agent, the verifier enforces rules such as:
- Output must start with langgraph
- Only approved subcommands and flags allowed
- No commentary, punctuation, or unsafe tokens
The reward system:
✅ Valid command → +1 reward (encourages this behavior)
❌ Invalid command → −1 reward (discourages this behavior)
⚪ Ambiguous output → 0 reward (neutral, no reinforcement)
This consistency is crucial: The same output always yields the same reward, making training stable and predictable. And because the verifier is just code, you can adjust constraints anytime without retraining a separate reward model.
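A minimal verifier in this spirit might look like the sketch below. The allowlists here are illustrative placeholders for your own rules.
import shlex

APPROVED_SUBCOMMANDS = {"new", "dev", "up", "build", "dockerfile"}  # illustrative allowlist
UNSAFE_TOKENS = {"&&", ";", "|", ">", "<", "`"}                     # never allowed in output

def verify_command(output: str) -> float:
    """Return +1 for a valid command, -1 for an invalid one, 0 for ambiguous output."""
    text = output.strip()
    if not text:
        return 0.0  # nothing to score
    try:
        tokens = shlex.split(text)
    except ValueError:
        return -1.0  # unbalanced quotes or other broken syntax
    if any(token in UNSAFE_TOKENS for token in tokens):
        return -1.0  # shell metacharacters and command chaining are rejected outright
    if len(tokens) < 2 or tokens[0] != "langgraph" or tokens[1] not in APPROVED_SUBCOMMANDS:
        return -1.0  # must start with langgraph plus an approved subcommand
    return 1.0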
Building the training environment with NeMo Gym
NeMo Gym is an open source library for building reinforcement learning training environments for LLMs. It provides the infrastructure to define tools, execute agent actions, and compute verifiable rewards—exactly what we need for training a CLI agent.
The CLI agent environment is implemented as a NeMo Gym resource server, which encapsulates:
- Tool definitions – The CLI commands the agent can propose
- Verification logic – Rules that check command validity and correctness
- Reward computation – Scores (0.0 to 1.0) returned to the RL training loop
When the agent proposes commands, the resource server evaluates correctness and returns reward signals for GRPO training. This clean separation between environment logic and training framework means you can iterate on your CLI tools and validation rules without touching the RL code.
To learn more about creating custom environments, see the NeMo Gym documentation and the guide on creating resource servers.
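The exact interfaces are covered in those docs. Conceptually, though, the three responsibilities fit together roughly as in this framework-agnostic sketch; it is plain Python, and the class and method names are illustrative rather than the NeMo Gym API.
from dataclasses import dataclass

@dataclass
class ToolDefinition:
    name: str
    description: str

class CLIAgentEnvironment:
    """Conceptual stand-in for the resource server: tools, verification, reward."""

    def __init__(self):
        # 1. Tool definitions: the CLI commands the agent can propose
        self.tools = [ToolDefinition("langgraph", "Propose a LangGraph CLI command")]

    def verify(self, proposed_command: str) -> bool:
        # 2. Verification logic: allowlists, regex checks, flag validation
        return proposed_command.startswith("langgraph ")

    def compute_reward(self, proposed_command: str) -> float:
        # 3. Reward computation: a score between 0.0 and 1.0 for the RL loop
        return 1.0 if self.verify(proposed_command) else 0.0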
Optimization via Group Relative Policy Optimization (GRPO)
GRPO is a simpler, more memory-efficient alternative to Proximal Policy Optimization (PPO). Instead of training a separate “critic” model to estimate how good each action is, GRPO samples multiple outputs for the same prompt and uses their average reward as the baseline. This cuts the model count in half (no critic needed) and reduces variance by comparing outputs against each other rather than against a learned estimate.
Here’s how it works in practice:
Traditional RL might struggle when most attempts fail. Imagine the model generates 10 command variations for the same prompt:
- Nine are invalid (reward = 0)
- One is valid (reward = 1)
Standard optimization might get lost in the noise of failures. GRPO instead:
- Groups all responses to the same prompt together
- Computes relative advantages within each group
- Strongly reinforces that one success, making it stand out from the failures
This approach dramatically improves learning efficiency and convergence speed, helping the model quickly learn what makes a command valid.
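Under the hood, the “relative advantage” step is just a normalization of each reward against its group. A simplified sketch, covering only the advantage computation rather than the full GRPO update, looks like this:
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Center each reward on the group mean and scale by the group's std deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-identical rewards
    return [(r - mean) / std for r in rewards]

# Ten samples for one prompt: nine invalid (reward 0) and one valid (reward 1)
rewards = [0.0] * 9 + [1.0]
print(group_relative_advantages(rewards))
# The lone success gets a strongly positive advantage (+3.0), while each failure
# gets a mildly negative one (about -0.33), so the valid command stands out.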
Let’s see how we’d implement this with Unsloth and NeMo Gym:
# The "Verifiable Reward" Function
def compute_reward(agent_output, expected):
try:
cmd = json.loads(agent_output)
# Hard Rule: Command must match expectation
if cmd.name != expected.name:
return -1.0 # Penalize hallucinations
# Soft Rule: Flags must be accurate
accuracy = calculate_flag_accuracy(cmd.flags, expected.flags)
return accuracy
except JSONDecodeError:
return -1.0 # Penalize broken syntax
# Start GRPO Training
grpo.train(
model="nemotron-nano-9B-v2",
algorithm="GRPO",
env=compute_reward,
dataset=synthetic_data
)
Step 3: Human-in-the-loop execution
Once fine-tuned, we embed the model into a runtime loop that always requests human confirmation before execution. This maintains the safety architecture introduced in Part 1, ensuring no command runs without explicit approval.
The safety architecture
subprocess.run(argv, shell=False)
This simple line embodies a crucial security principle. By setting shell=False, we ensure:
- Commands execute as discrete argument lists (e.g., ["langgraph", "up", "--wait"])
- Shell metacharacters like &&, ;, or | are treated as literal text, not operators
- Command injection attacks become impossible
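Putting these pieces together, the runtime loop from Part 1 reduces to something like this minimal sketch; the prompt wording mirrors the example in the introduction, and command validation is omitted for brevity.
import subprocess

def confirm_and_run(argv: list[str]) -> None:
    """Ask the human for approval, then execute without shell interpretation."""
    answer = input(f"▶️ Execute `{' '.join(argv)}`? [y/N]: ").strip().lower()
    if answer != "y":
        print("Command skipped.")
        return
    # shell=False passes argv as a discrete argument list, so metacharacters
    # such as && or ; remain literal text and cannot chain extra commands.
    result = subprocess.run(argv, shell=False, capture_output=True, text=True)
    print(result.stdout or result.stderr)

# Example: the command proposed by the agent in the interaction above
# confirm_and_run(["langgraph", "up", "--wait"])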
The complete safety chain
Our multi-layered approach ensures safety at every step:
- Training-time safety: RLVR ensures the model learns to generate valid commands.
- Runtime verification: A validator checks every proposed command against allowlists.
- Human confirmation: Users must explicitly approve each command before execution.
- Execution isolation: Commands run without shell interpretation, preventing injection.
Even if the model occasionally produces an invalid command despite training, the runtime policy prevents it from being executed.
Why RLVR + synthetic data work for customizing Agentic AI
This combination creates a powerful synergy:
| Component | Role | Why it matters |
|---|---|---|
| NeMo Data Designer | Generates realistic, diverse, and structured AI training data with built-in validation | Solves the cold-start problem—you can train without waiting for real usage data |
| NeMo Gym | Provides the training environment with CLI tools and verifiable reward logic | Defines what actions are valid and how success is measured |
| Unsloth for RLVR + GRPO | Executes efficient GRPO training with 80% less VRAM | Makes RL training accessible on a single GPU while maintaining quality |
| Human approval loop | Serves as the final safety gate, keeping users in control | Maintains trust—users always have the final say before any action occurs |
The result: We can teach Nemotron-Nano-9B-V2 to precisely and safely operate a new CLI tool—all without full retraining or compromising on safety.
Closing thoughts
By extending our Bash operator into a LangGraph-aware computer-use agent, we’ve demonstrated how synthetic data generation and RLVR (with GRPO) form a powerful recipe for rapidly specializing large reasoning models to new toolchains.
The workflow generalizes cleanly to any CLI tool:
- Use NeMo Data Designer to define structured, verifiable examples
- Build a NeMo Gym environment with your CLI tools and verification logic
- Fine-tune efficiently with Unsloth’s GRPO
- Maintain human-in-the-loop execution for safety
This pattern lets you turn any capable large language model (LLM) into a domain-specific, verifiably safe computer-use agent—from LangGraph today to your proprietary internal tools tomorrow.
The implications are significant: Instead of waiting months to collect training data or accepting the risks of uncontrolled command generation, you can deploy specialized, safe CLI agents in days. Whether you’re automating DevOps workflows, creating customer support tools, or building internal productivity agents, this approach provides a fast, safe path from idea to production.
Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube.
- Visit our Nemotron developer page for all the essentials you need to get started with the most open, smartest-per-compute reasoning model.
- Explore new open Nemotron models and datasets on Hugging Face and NIM microservices and Blueprints on build.nvidia.com.
- Tune into upcoming Nemotron livestreams and connect with the NVIDIA Developer community through the Nemotron developer forum and the Nemotron channel on Discord.
- Browse video tutorials and livestreams to get the most out of NVIDIA Nemotron.