Autonomous AI agents are taking on all types of work for businesses: routing logistics fleets, triaging support tickets, generating code, and orchestrating multistep workflows. How do you take a general-purpose model and make it excel at your specific task? Customization provides an agent with the right capabilities.
This post explains nine techniques for customizing AI agents, along with criteria for selecting the right techniques for your use case. To learn about evaluating AI agents, see Mastering Agentic Techniques: AI Agent Evaluation.
Why is it necessary to customize an AI agent?
Foundation models come with broad language and reasoning capabilities across use cases and modalities based on the training datasets used. Models understand language and can follow instructions, but specialized workflows often require context that is restricted, specialized, or proprietary.
Customizing an agent solves this challenge by shaping how the agent reasons under constraints, which tools it selects, how it structures its outputs, and how reliably it executes domain workflows.
What techniques are used for agent customization?
Agent customization techniques span from simple prompt changes to advanced techniques like reinforcement learning (RL), each with tradeoffs in cost, complexity, and capability. The best approach depends on whether you need better information, instructions, or fundamentally more reliable behavior. The following sections cover the main approaches.
Prompt engineering and system prompts
Prompt engineering only requires changing the prompt to the agent at inference time. It’s the most accessible and typically the first technique applied to customize agent behavior. Standard agents may require human tuning of system prompts. Advanced, self-evolving agents like OpenClaw use prompts that get updated by the agent itself as it revises memory and instructions over time, resulting in a self-customizing agent.
How it works
You write a system prompt that defines the agent’s role, available tools, output format, and behavioral constraints. The model follows these instructions using its existing capabilities.
The following is a sample system prompt:
You are an expert CLI assistant. Translate user requests into structured JSON tool
calls. Respond with ONLY a JSON object. Set unused flags to null.
When to use
- Iterating quickly on agent behavior
- Working on a custom task that is described clearly in natural language
- Prototyping or experimenting before investing further
Limitations
- Prompts can become brittle for complex reasoning chains
- Performance degrades as instructions grow longer, more nuanced
- Model may not consistently follow complex formatting requirements
- Doesn’t extend the model’s core capabilities
- Switching the model powering the agent requires retesting prompts
Every agent project requires iterative prompt engineering and refinement. However, getting the agent to reliably produce structured outputs, follow domain-specific logic, or handle edge cases may require refinement. Note that self-evolving agents refine their own prompts using a harness.
Retrieval-augmented generation
Retrieval-augmented generation (RAG) solves the knowledge limitation of foundation models by dynamically retrieving relevant, up-to-date information from external knowledge sources (vector databases, for example). This retrieved content grounds the agent at inference time, when it is injected into the model’s context. This significantly reduces hallucinations and enables answering questions about custom, proprietary, or rapidly changing domains without model retraining.
How it works
When a user queries the agent system, the system searches a vector database or document store for data relevant to the query. Retrieved content is then sent alongside the user query to the model, which reasons over both and returns a grounded response.
When to use
- Giving agent access to up-to-date or proprietary knowledge
- Reducing hallucinations by grounding responses in authoritative sources
- Working with a knowledge base that changes frequently and retraining would be impractical
Limitations
- Adds latency due to retrieval
- Doesn’t add new reasoning capabilities, only new information to reason about
- Context window limits constrain how much retrieved information can be used
Standard RAG is increasingly evolving into agentic RAG, where the agent autonomously decides which documents to retrieve, which queries to reformulate, and when it has gathered enough information. For an interactive coding experience within your browser, check out the How to Build an Agentic RAG Application learning module.
Agent tool and skill injection
Tool and skill injection extends an agent’s capabilities by providing the agent with tools or skills:
- Tools: Callable functions that interact with external software
- Skills: Domain-specific instructions for completing tasks
These modular, reusable components make it easy to customize a general-purpose model for specialized domains without modifying its underlying weights.
How it works
Tools such as web search, file I/O, shell execution, and API calls are defined in the agent’s system prompt or context. Skills, which may include instructions, scripts, and resources, are loaded into the agent’s context.
The following example file directory is where a skill might be located for incident triage:
skills/
incident-triage/
SKILL.md
README.md
scripts/
collect_logs.sh
parse_logs.py
summarize_findings.py
templates/
triage_report.md
examples/
sample_incident.json
The SKILL.md might look like the following:
# Skill: Incident Triage (Log Collection + Summary)
## Purpose
Collect diagnostic logs for a given service, extract key error signals, and produce a short
triage report with:
- suspected root cause(s)
- top error signatures
- timeline highlights
- immediate next steps
## When to Use
Use this skill when the user asks to:
- investigate an outage / regression
- summarize logs for a service between two timestamps
- produce a quick incident report
## Inputs (Required)
- service_name: string (e.g., "payments-api")
- start_time: ISO8601 string (e.g., "2026-03-05T10:00:00Z")
- end_time: ISO8601 string (e.g., "2026-03-05T11:00:00Z")
## Inputs (Optional)
- environment: string (default "prod")
- log_source: string (default "journald") # could be "file", "cloud", etc.
- output_dir: string (default "./out")
- redact: boolean (default true)
## Outputs
- {output_dir}/raw_logs.txt
- {output_dir}/events.jsonl
- {output_dir}/summary.md
## Workflow
1) Collect logs:
- Run `scripts/collect_logs.sh` to fetch raw logs for the time window
2) Parse logs into structured events:
- Run `scripts/parse_logs.py` to emit JSONL events (timestamp, level, message, signature)
3) Summarize:
- Run `scripts/summarize_findings.py` to produce a markdown report using `templates/triage_report.md`
## Commands (How to Call)
### Step 1: Collect
```bash
bash scripts/collect_logs.sh \
--service payments-api \
--start "2026-03-05T10:00:00Z" \
--end "2026-03-05T11:00:00Z" \
--env prod \
--out ./out/raw_logs.txt
```
When to use
- Extending what an agent can do, not how it reasons
- Connecting your agent system to external software, APIs, or other third-party components
- Providing agent with modular, composable capabilities
Limitations
- Model requires tool-calling as a base capability
- Complex tool orchestration may require fine-tuning for reliability
- Skill definitions consume context window space
Supervised fine-tuning
Supervised fine-tuning (SFT) is for modifying a pretrained model’s behavior by tuning model weights with labelled datasets. Unlike previous techniques that customize agent behavior at inference time, SFT is performed at training time, modifying the underlying model’s behavior.
How it works
You assemble a dataset of examples—each containing an input (a natural language request) and the ideal output (such as a structured JSON tool call). The model trains on these examples, learning to replicate the demonstrated behavior.
Synthetic data generation (SDG) tools like NVIDIA NeMo Data Designer can accelerate this process, especially in low-resource domains where manually labelled examples are scarce. Instead of hand-authoring every training example, teams can define a data schema and use LLMs to generate diverse, high-quality training pairs. Then, conduct SFT using that generated dataset using an advanced fine-tuning framework like NVIDIA NeMo framework.
When to use
- Working with accessible data for well-defined tasks with output examples
- Customizing a model for a low-resource domain where labelled examples are limited, and high-quality synthetic data can be generated to bootstrap the fine-tuning dataset
- Requiring the model to reliably produce specific output formats (JSON schemas, tool calls, structured data)
Limitations
- Quality depends entirely on training data quality; the model learns to imitate, for better or worse
- May overfit to training distribution if data isn’t diverse enough (catastrophic forgetting)
- Needs compute resources for training
SFT is often the first training-based step in an agent customization pipeline. It establishes a baseline behavior that downstream alignment methods can refine.
Parameter-efficient fine-tuning
Full fine-tuning, such as on a 9-billion-parameter model, requires significant GPU resources to tune all weights. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA), describe a type of update mechanism that can be used with SFT to freeze the majority of model weights while only modifying a tiny fraction of parameters.
This approach maintains most of the benefits of full training while drastically reducing storage overhead for multiple specialized AI models. PEFT is now the standard for practical agent fine-tuning.
How it works
LoRA injects small, trainable matrices into the model’s attention and feed-forward layers. Instead of updating all parameters in a large model, you only train a small fraction. For example, NVIDIA Nemotron 3 Nano has 30 billion total parameters with ~3.5 billion active per forward pass. With LoRA, the large base model stays the same, and you swap in different adapters for different tasks, domains, or customers.
QLoRA extends this by quantizing the base model to 4-bit precision, enabling fine-tuning of models that would otherwise exceed available GPU memory. In practice, choosing SFT using LoRA is a fast path to useful customization without the full cost of fine-tuning.
A model that would require multiple high-end GPUs for full fine-tuning can often be LoRA-tuned on a single GPU. This democratizes customization for teams without massive compute budgets.
When to use
- Working with limited GPU resources
- Maintaining multiple specialized versions of a base model
- Requiring quick iterations and fast training cycles
Limitations
- Retraining a subsection of the model weights limits the degree of change possible (quality ceiling)
Direct Preference Optimization
While SFT imitates good examples, Direct Preference Optimization (DPO) trains the model on pairwise preference comparisons. The preference signal can come from human annotators, an LLM judge, rule-based verifiers, or synthetically generated preference data, since DPO is agnostic to the source of the preference signal. Preference signals eliminate the need for a separate reward model, unlike reinforcement learning from human feedback (RLHF), making DPO effective as a refinement step after an SFT baseline exists.
How it works
You collect or generate pairs of responses for the same input: one preferred and one rejected. These pairs can be produced manually, curated from real user interactions, or generated with synthetic data generation workflows.
For example, in a low-resource domain, an LLM can generate candidate responses and preference labels according to a rubric, schema, or verifier, then humans can review or sample-audit the results for quality. The DPO algorithm assigns higher probability to preferred responses using a pairwise contrastive loss, maximizing the relative log-probability of the preferred response over the rejected one.
When to use
- Using subjective response quality (tone, style, helpfulness, safety)
- Working with multiple valid outputs but some are measurably better than others
- Requiring alignment with preferences without the complexity of full RLHF
- Refining output quality further after performing SFT
Limitations
- Requires high-quality preference pairs, whether human-authored or synthetic
- Synthetic preference data can encode judge bias, weak rubrics, or unrealistic examples if not validated
- Less effective for tasks with strictly verifiable correct answers
Reinforcement learning
Reinforcement learning (RL) techniques comprise a subclass of machine learning. The following techniques covered are variations of RL that can be used specifically to customize agents, and the LLMs that power them.
Reinforcement learning from human feedback
RLHF is one of the most powerful yet resource-intensive techniques for aligning language models with human preferences. It uses a two-stage process: first, training a reward model (a separate neural network) to predict human preference, and then using that model as an automated judge to score outputs during RL training. This helps capture nuanced quality criteria like tone, helpfulness, and safety.
How it works
Human annotators rank model outputs by quality. These rankings train a reward model that predicts human preferences. The agent is then trained using a RL algorithm to maximize the reward model’s scores while staying close to its original behavior.
When to use
- Coordinating complex alignment objectives that can’t be captured by simple metrics
- Working with substantial human annotation resources
- Requiring nuanced behavioral shaping (safety, helpfulness, harm avoidance)
Limitations
- Complex implementation—requires managing multiple models simultaneously (e.g. policy, reference, reward, critic)
- Computationally expensive and prone to training instabilities
- Reward model can be gamed or misspecified (reward hacking)
Reinforcement learning with verifiable rewards
RLHF-style approaches rely on learned reward models, which are expensive to train and can be imprecise or gameable. The process and system of designing reward models is extensive. For tasks with clear right/wrong answers—like valid JSON, correct API calls, or passing tests—reinforcement learning with verifiable rewards (RLVR) can provide auditable, repeatable reward signals from reliable verifiers that reduce some of the ambiguity that derives from these learned reward models.
How it works
Instead of training a reward model from human preferences, RLVR uses deterministic verification functions that can objectively and transparently assess an output’s correctness.
Consider an agent trained to translate natural language into CLI commands. A verification function parses the model’s JSON output, checks whether the command is correct, compares each flag against the expected values, and computes a precise reward score:
- Exact match: Reward = +1.0
- Correct command, partial flags: Reward proportional to flag accuracy
- Wrong command or invalid JSON: Reward = -1.0
This approach is used by NVIDIA NeMo Gym, which provides verification endpoints that score model outputs against ground truth during training.
When to use
- Working with a task that has objectively verifiable correct outputs (structured data, CLI commands, code, mathematical reasoning, tool calls)
- Requiring transparent, auditable reward signals
- Needing to improve reasoning quality, beyond surface-level answering capabilities
Limitations
- Only applicable to tasks with deterministic correctness criteria
- Not suitable for creative, subjective, or open-ended generation
- Requires building verification infrastructure (though frameworks like NeMo Gym simplify this)
RLVR is a key technique behind DeepSeek-R1’s breakthrough reasoning capabilities, demonstrating that verifiable rewards can teach models sophisticated problem-solving strategies—sometimes even without any supervised fine-tuning as a starting point. Open libraries such as NVIDIA NeMo RL and NeMo Gym help developers train at scale.
Group Relative Policy Optimization
Group Relative Policy Optimization (GRPO) is an efficient policy optimization algorithm that pairs naturally with RLVR. It generates multiple completions per prompt and replaces PPO’s critic network with a group-relative baseline to guide improvement. This cuts computational overhead, keeping training stable and effective.
How it works
For each training prompt, GRPO generates multiple completions (typically 4 to 64) from the current policy. Each completion is scored by the reward function. Instead of using a critic network to estimate baselines (as PPO does), GRPO computes the advantage of each completion by normalizing its reward against the group mean and standard deviation. Completions with above-average advantage are reinforced; those below are suppressed.
When to use
- Applying RLVR and need an efficient optimization algorithm
- Working with computational resources that are a constraint
- Needing stable RL training without the complexity of a PPO critic
Limitations
- Requires generating multiple completions per prompt, increasing training compute per step compared to supervised methods
- Group-based baselines can be noisy with small group sizes, requiring additional tuning of the group size hyperparameter
- Effectiveness depends on a well-designed reward function; poorly specified rewards produce poor policy updates
GRPO is the optimization algorithm that powered DeepSeek-R1 training. It is increasingly becoming the default choice for RL-based agent customization, particularly when paired with verifiable rewards.
What is a multistage pipeline for AI agent customization?
In practice, the most effective agent customization combines multiple techniques in sequence. The stages of a representative pipeline are outlined below.
Stage 1: Prompt engineering plus tools and skills plus RAG
Start with system prompts, tool and skill definitions, and retrieval to establish baseline behavior.
Stage 2: SDG
For custom capabilities that prompting, tools, and vector databases alone can’t achieve, generate data to customize the agent through training.
Stage 3: SFT
SFT teaches the model the basic vocabulary, format, and structure of custom tasks.
Stage 4: RLVR/GRPO or DPO
Refine the SFT model using preference-based or RL to improve quality beyond what imitation learning can achieve. The choice and ordering depends on the task:
- DPO is typically cheaper and more stable, and works well when there are preference pairs (from humans, an LLM judge, or rule-based verifiers) but no reliable scalar reward.
- RLVR with GRPO is the right tool when outputs are objectively verifiable and there is a need to push reasoning quality further than preference learning alone can reach.
These aren’t strict alternatives. A common pattern is SFT → DPO → RLVR. DPO is used first to affordably align format and style on top of the SFT policy, then RLVR drives harder reasoning gains where verifiable rewards exist. The order is a design choice, not a fixed recipe.
Stage 5: Evaluation plus iteration
Measure task success rate, tool call accuracy, and any other desired metrics. Use results to iterate on customization stages until you achieve desired performance.
This pipeline reflects a principle that the field is converging on: start lightweight, measure rigorously, and add complexity only where the data shows it’s needed.
How to choose the right agent customization approach
Three factors impact customization methods: task characteristics, available resources, and project maturity.
Task characteristics
If your agent’s outputs can be objectively verified (correct JSON, passing tests, valid API calls), RLVR with GRPO is likely your highest-leverage technique. If quality is subjective, DPO is more appropriate. If the task is well-defined but the model just needs examples to imitate, SFT may be sufficient.
Available resources
Full RLHF requires substantial compute and human annotation budgets. LoRA-based SFT can run on a single GPU. Prompt engineering requires no compute. Match your technique to your infrastructure.
Project maturity
Early-stage projects should invest in prompt engineering, evaluation infrastructure, and tool definitions. Training-based customization delivers the most value once you have clear metrics, identified failure modes, and sufficient data to address them.

Get started with AI agent customization
Agent customization ecompasses a spectrum of approaches that compound in effectiveness when applied thoughtfully. The most successful teams start with lightweight methods, invest early in evaluation, and layer in training-based techniques where measurement shows they’re needed.
Customization and evaluation work together to drive better outcomes. You can’t improve what you can’t measure. Every customization decision—from a prompt tweak to a GRPO training run—should be driven by clear metrics and validated against real-world performance.
Ready to customize your agents? Accelerate development with NVIDIA NeMo, which provides an integrated toolkit spanning:
- Synthetic data generation with NeMo Data Designer
- Model customization with NeMo Automodel, NeMo Megatron-Bridge, and NeMo RL
- Verifiable reward infrastructure with NeMo Gym
- Agent orchestration and evaluation with NeMo Agent Toolkit
These tools are designed to integrate with existing agent frameworks—adding customization, evaluation, and optimization capabilities without requiring you to rebuild from scratch.