Agentic AI / Generative AI

Mastering Agentic Techniques: AI Agent Customization

May 20, 2026

By Edward Li, Vanessa Bellotti and Rebecca Kao

Discuss (0)

AI-Generated Summary

Dislike

Customizing AI agents enhances their performance on specialized tasks by refining reasoning, tool selection, output structure, and workflow reliability beyond general-purpose foundation models.
Techniques for customization range from prompt engineering and retrieval-augmented generation (RAG) for quick iteration and grounding in external knowledge, to supervised fine-tuning (SFT) and parameter-efficient fine-tuning (PEFT) for training-based behavior modification.
Advanced methods like Direct Preference Optimization (DPO) and reinforcement learning with verifiable rewards (RLVR), often paired with Group Relative Policy Optimization (GRPO), provide nuanced alignment and improved reasoning by leveraging preference signals or objective correctness criteria.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Autonomous AI agents are taking on all types of work for businesses: routing logistics fleets, triaging support tickets, generating code, and orchestrating multistep workflows. How do you take a general-purpose model and make it excel at your specific task? Customization provides an agent with the right capabilities.

This post explains nine techniques for customizing AI agents, along with criteria for selecting the right techniques for your use case. To learn about evaluating AI agents, see Mastering Agentic Techniques: AI Agent Evaluation.

Why is it necessary to customize an AI agent?

Foundation models come with broad language and reasoning capabilities across use cases and modalities based on the training datasets used. Models understand language and can follow instructions, but specialized workflows often require context that is restricted, specialized, or proprietary.

Customizing an agent solves this challenge by shaping how the agent reasons under constraints, which tools it selects, how it structures its outputs, and how reliably it executes domain workflows.

What techniques are used for agent customization?

Agent customization techniques span from simple prompt changes to advanced techniques like reinforcement learning (RL), each with tradeoffs in cost, complexity, and capability. The best approach depends on whether you need better information, instructions, or fundamentally more reliable behavior. The following sections cover the main approaches.

Prompt engineering and system prompts

Prompt engineering only requires changing the prompt to the agent at inference time. It’s the most accessible and typically the first technique applied to customize agent behavior. Standard agents may require human tuning of system prompts. Advanced, self-evolving agents like OpenClaw use prompts that get updated by the agent itself as it revises memory and instructions over time, resulting in a self-customizing agent.

How it works

You write a system prompt that defines the agent’s role, available tools, output format, and behavioral constraints. The model follows these instructions using its existing capabilities.

The following is a sample system prompt:

You are an expert CLI assistant. Translate user requests into structured JSON tool 
calls. Respond with ONLY a JSON object. Set unused flags to null.

When to use

Iterating quickly on agent behavior
Working on a custom task that is described clearly in natural language
Prototyping or experimenting before investing further

Limitations

Prompts can become brittle for complex reasoning chains
Performance degrades as instructions grow longer, more nuanced
Model may not consistently follow complex formatting requirements
Doesn’t extend the model’s core capabilities
Switching the model powering the agent requires retesting prompts

Every agent project requires iterative prompt engineering and refinement. However, getting the agent to reliably produce structured outputs, follow domain-specific logic, or handle edge cases may require refinement. Note that self-evolving agents refine their own prompts using a harness.

Retrieval-augmented generation

Retrieval-augmented generation (RAG) solves the knowledge limitation of foundation models by dynamically retrieving relevant, up-to-date information from external knowledge sources (vector databases, for example). This retrieved content grounds the agent at inference time, when it is injected into the model’s context. This significantly reduces hallucinations and enables answering questions about custom, proprietary, or rapidly changing domains without model retraining.

How it works

When a user queries the agent system, the system searches a vector database or document store for data relevant to the query. Retrieved content is then sent alongside the user query to the model, which reasons over both and returns a grounded response.

When to use

Giving agent access to up-to-date or proprietary knowledge
Reducing hallucinations by grounding responses in authoritative sources
Working with a knowledge base that changes frequently and retraining would be impractical

Limitations

Adds latency due to retrieval
Doesn’t add new reasoning capabilities, only new information to reason about
Context window limits constrain how much retrieved information can be used

Standard RAG is increasingly evolving into agentic RAG, where the agent autonomously decides which documents to retrieve, which queries to reformulate, and when it has gathered enough information. For an interactive coding experience within your browser, check out the How to Build an Agentic RAG Application learning module.

Agent tool and skill injection

Tool and skill injection extends an agent’s capabilities by providing the agent with tools or skills:

Tools: Callable functions that interact with external software
Skills: Domain-specific instructions for completing tasks

These modular, reusable components make it easy to customize a general-purpose model for specialized domains without modifying its underlying weights.

How it works

Tools such as web search, file I/O, shell execution, and API calls are defined in the agent’s system prompt or context. Skills, which may include instructions, scripts, and resources, are loaded into the agent’s context.

The following example file directory is where a skill might be located for incident triage:

skills/
  incident-triage/
    SKILL.md
    README.md
    scripts/
      collect_logs.sh
      parse_logs.py
      summarize_findings.py
    templates/
      triage_report.md
    examples/
      sample_incident.json

The SKILL.md might look like the following:

# Skill: Incident Triage (Log Collection + Summary)

## Purpose
Collect diagnostic logs for a given service, extract key error signals, and produce a short
triage report with:
- suspected root cause(s)
- top error signatures
- timeline highlights
- immediate next steps

## When to Use
Use this skill when the user asks to:
- investigate an outage / regression
- summarize logs for a service between two timestamps
- produce a quick incident report

## Inputs (Required)
- service_name: string (e.g., "payments-api")
- start_time: ISO8601 string (e.g., "2026-03-05T10:00:00Z")
- end_time: ISO8601 string (e.g., "2026-03-05T11:00:00Z")

## Inputs (Optional)
- environment: string (default "prod")
- log_source: string (default "journald")  # could be "file", "cloud", etc.
- output_dir: string (default "./out")
- redact: boolean (default true)

## Outputs
- {output_dir}/raw_logs.txt
- {output_dir}/events.jsonl
- {output_dir}/summary.md

## Workflow
1) Collect logs:
   - Run `scripts/collect_logs.sh` to fetch raw logs for the time window
2) Parse logs into structured events:
   - Run `scripts/parse_logs.py` to emit JSONL events (timestamp, level, message, signature)
3) Summarize:
   - Run `scripts/summarize_findings.py` to produce a markdown report using `templates/triage_report.md`

## Commands (How to Call)
### Step 1: Collect
```bash
bash scripts/collect_logs.sh \
  --service payments-api \
  --start "2026-03-05T10:00:00Z" \
  --end "2026-03-05T11:00:00Z" \
  --env prod \
  --out ./out/raw_logs.txt
```

When to use

Extending what an agent can do, not how it reasons
Connecting your agent system to external software, APIs, or other third-party components
Providing agent with modular, composable capabilities

Limitations

Model requires tool-calling as a base capability
Complex tool orchestration may require fine-tuning for reliability
Skill definitions consume context window space

Supervised fine-tuning

Supervised fine-tuning (SFT) is for modifying a pretrained model’s behavior by tuning model weights with labelled datasets. Unlike previous techniques that customize agent behavior at inference time, SFT is performed at training time, modifying the underlying model’s behavior.

How it works

You assemble a dataset of examples—each containing an input (a natural language request) and the ideal output (such as a structured JSON tool call). The model trains on these examples, learning to replicate the demonstrated behavior.

Synthetic data generation (SDG) tools like NVIDIA NeMo Data Designer can accelerate this process, especially in low-resource domains where manually labelled examples are scarce. Instead of hand-authoring every training example, teams can define a data schema and use LLMs to generate diverse, high-quality training pairs. Then, conduct SFT using that generated dataset using an advanced fine-tuning framework like NVIDIA NeMo framework.

When to use

Working with accessible data for well-defined tasks with output examples
Customizing a model for a low-resource domain where labelled examples are limited, and high-quality synthetic data can be generated to bootstrap the fine-tuning dataset
Requiring the model to reliably produce specific output formats (JSON schemas, tool calls, structured data)

Limitations

Quality depends entirely on training data quality; the model learns to imitate, for better or worse
May overfit to training distribution if data isn’t diverse enough (catastrophic forgetting)
Needs compute resources for training

SFT is often the first training-based step in an agent customization pipeline. It establishes a baseline behavior that downstream alignment methods can refine.

Parameter-efficient fine-tuning

Full fine-tuning, such as on a 9-billion-parameter model, requires significant GPU resources to tune all weights. Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA), describe a type of update mechanism that can be used with SFT to freeze the majority of model weights while only modifying a tiny fraction of parameters.

This approach maintains most of the benefits of full training while drastically reducing storage overhead for multiple specialized AI models. PEFT is now the standard for practical agent fine-tuning.

How it works

LoRA injects small, trainable matrices into the model’s attention and feed-forward layers. Instead of updating all parameters in a large model, you only train a small fraction. For example, NVIDIA Nemotron 3 Nano has 30 billion total parameters with ~3.5 billion active per forward pass. With LoRA, the large base model stays the same, and you swap in different adapters for different tasks, domains, or customers.

QLoRA extends this by quantizing the base model to 4-bit precision, enabling fine-tuning of models that would otherwise exceed available GPU memory. In practice, choosing SFT using LoRA is a fast path to useful customization without the full cost of fine-tuning.

A model that would require multiple high-end GPUs for full fine-tuning can often be LoRA-tuned on a single GPU. This democratizes customization for teams without massive compute budgets.

When to use

Working with limited GPU resources
Maintaining multiple specialized versions of a base model
Requiring quick iterations and fast training cycles

Limitations

Retraining a subsection of the model weights limits the degree of change possible (quality ceiling)

Direct Preference Optimization

While SFT imitates good examples, Direct Preference Optimization (DPO) trains the model on pairwise preference comparisons. The preference signal can come from human annotators, an LLM judge, rule-based verifiers, or synthetically generated preference data, since DPO is agnostic to the source of the preference signal. Preference signals eliminate the need for a separate reward model, unlike reinforcement learning from human feedback (RLHF), making DPO effective as a refinement step after an SFT baseline exists.

How it works

You collect or generate pairs of responses for the same input: one preferred and one rejected. These pairs can be produced manually, curated from real user interactions, or generated with synthetic data generation workflows.

For example, in a low-resource domain, an LLM can generate candidate responses and preference labels according to a rubric, schema, or verifier, then humans can review or sample-audit the results for quality. The DPO algorithm assigns higher probability to preferred responses using a pairwise contrastive loss, maximizing the relative log-probability of the preferred response over the rejected one.

When to use

Using subjective response quality (tone, style, helpfulness, safety)
Working with multiple valid outputs but some are measurably better than others
Requiring alignment with preferences without the complexity of full RLHF
Refining output quality further after performing SFT

Limitations

Requires high-quality preference pairs, whether human-authored or synthetic
Synthetic preference data can encode judge bias, weak rubrics, or unrealistic examples if not validated
Less effective for tasks with strictly verifiable correct answers

Reinforcement learning

Reinforcement learning (RL) techniques comprise a subclass of machine learning. The following techniques covered are variations of RL that can be used specifically to customize agents, and the LLMs that power them.

Reinforcement learning from human feedback

RLHF is one of the most powerful yet resource-intensive techniques for aligning language models with human preferences. It uses a two-stage process: first, training a reward model (a separate neural network) to predict human preference, and then using that model as an automated judge to score outputs during RL training. This helps capture nuanced quality criteria like tone, helpfulness, and safety.

How it works

Human annotators rank model outputs by quality. These rankings train a reward model that predicts human preferences. The agent is then trained using a RL algorithm to maximize the reward model’s scores while staying close to its original behavior.

When to use

Coordinating complex alignment objectives that can’t be captured by simple metrics
Working with substantial human annotation resources
Requiring nuanced behavioral shaping (safety, helpfulness, harm avoidance)

Limitations

Complex implementation—requires managing multiple models simultaneously (e.g. policy, reference, reward, critic)
Computationally expensive and prone to training instabilities
Reward model can be gamed or misspecified (reward hacking)

Reinforcement learning with verifiable rewards

RLHF-style approaches rely on learned reward models, which are expensive to train and can be imprecise or gameable. The process and system of designing reward models is extensive. For tasks with clear right/wrong answers—like valid JSON, correct API calls, or passing tests—reinforcement learning with verifiable rewards (RLVR) can provide auditable, repeatable reward signals from reliable verifiers that reduce some of the ambiguity that derives from these learned reward models.

How it works

Instead of training a reward model from human preferences, RLVR uses deterministic verification functions that can objectively and transparently assess an output’s correctness.

Consider an agent trained to translate natural language into CLI commands. A verification function parses the model’s JSON output, checks whether the command is correct, compares each flag against the expected values, and computes a precise reward score:

Exact match: Reward = +1.0
Correct command, partial flags: Reward proportional to flag accuracy
Wrong command or invalid JSON: Reward = -1.0

This approach is used by NVIDIA NeMo Gym, which provides verification endpoints that score model outputs against ground truth during training.

When to use

Working with a task that has objectively verifiable correct outputs (structured data, CLI commands, code, mathematical reasoning, tool calls)
Requiring transparent, auditable reward signals
Needing to improve reasoning quality, beyond surface-level answering capabilities

Limitations

Only applicable to tasks with deterministic correctness criteria
Not suitable for creative, subjective, or open-ended generation
Requires building verification infrastructure (though frameworks like NeMo Gym simplify this)

RLVR is a key technique behind DeepSeek-R1’s breakthrough reasoning capabilities, demonstrating that verifiable rewards can teach models sophisticated problem-solving strategies—sometimes even without any supervised fine-tuning as a starting point. Open libraries such as NVIDIA NeMo RL and NeMo Gym help developers train at scale.

Group Relative Policy Optimization

Group Relative Policy Optimization (GRPO) is an efficient policy optimization algorithm that pairs naturally with RLVR. It generates multiple completions per prompt and replaces PPO’s critic network with a group-relative baseline to guide improvement. This cuts computational overhead, keeping training stable and effective.

How it works

For each training prompt, GRPO generates multiple completions (typically 4 to 64) from the current policy. Each completion is scored by the reward function. Instead of using a critic network to estimate baselines (as PPO does), GRPO computes the advantage of each completion by normalizing its reward against the group mean and standard deviation. Completions with above-average advantage are reinforced; those below are suppressed.

When to use

Applying RLVR and need an efficient optimization algorithm
Working with computational resources that are a constraint
Needing stable RL training without the complexity of a PPO critic

Limitations

Requires generating multiple completions per prompt, increasing training compute per step compared to supervised methods
Group-based baselines can be noisy with small group sizes, requiring additional tuning of the group size hyperparameter
Effectiveness depends on a well-designed reward function; poorly specified rewards produce poor policy updates

GRPO is the optimization algorithm that powered DeepSeek-R1 training. It is increasingly becoming the default choice for RL-based agent customization, particularly when paired with verifiable rewards.

What is a multistage pipeline for AI agent customization?

In practice, the most effective agent customization combines multiple techniques in sequence. The stages of a representative pipeline are outlined below.

Stage 1: Prompt engineering plus tools and skills plus RAG

Start with system prompts, tool and skill definitions, and retrieval to establish baseline behavior.

Stage 2: SDG

For custom capabilities that prompting, tools, and vector databases alone can’t achieve, generate data to customize the agent through training.

Stage 3: SFT

SFT teaches the model the basic vocabulary, format, and structure of custom tasks.

Stage 4: RLVR/GRPO or DPO

Refine the SFT model using preference-based or RL to improve quality beyond what imitation learning can achieve. The choice and ordering depends on the task:

DPO is typically cheaper and more stable, and works well when there are preference pairs (from humans, an LLM judge, or rule-based verifiers) but no reliable scalar reward.
RLVR with GRPO is the right tool when outputs are objectively verifiable and there is a need to push reasoning quality further than preference learning alone can reach.

These aren’t strict alternatives. A common pattern is SFT → DPO → RLVR. DPO is used first to affordably align format and style on top of the SFT policy, then RLVR drives harder reasoning gains where verifiable rewards exist. The order is a design choice, not a fixed recipe.

Stage 5: Evaluation plus iteration

Measure task success rate, tool call accuracy, and any other desired metrics. Use results to iterate on customization stages until you achieve desired performance.

This pipeline reflects a principle that the field is converging on: start lightweight, measure rigorously, and add complexity only where the data shows it’s needed.

How to choose the right agent customization approach

Three factors impact customization methods: task characteristics, available resources, and project maturity.

Task characteristics

If your agent’s outputs can be objectively verified (correct JSON, passing tests, valid API calls), RLVR with GRPO is likely your highest-leverage technique. If quality is subjective, DPO is more appropriate. If the task is well-defined but the model just needs examples to imitate, SFT may be sufficient.

Available resources

Full RLHF requires substantial compute and human annotation budgets. LoRA-based SFT can run on a single GPU. Prompt engineering requires no compute. Match your technique to your infrastructure.

Project maturity

Early-stage projects should invest in prompt engineering, evaluation infrastructure, and tool definitions. Training-based customization delivers the most value once you have clear metrics, identified failure modes, and sufficient data to address them.

Get started with AI agent customization

Agent customization ecompasses a spectrum of approaches that compound in effectiveness when applied thoughtfully. The most successful teams start with lightweight methods, invest early in evaluation, and layer in training-based techniques where measurement shows they’re needed.

Customization and evaluation work together to drive better outcomes. You can’t improve what you can’t measure. Every customization decision—from a prompt tweak to a GRPO training run—should be driven by clear metrics and validated against real-world performance.

Ready to customize your agents? Accelerate development with NVIDIA NeMo, which provides an integrated toolkit spanning:

Synthetic data generation with NeMo Data Designer
Model customization with NeMo Automodel, NeMo Megatron-Bridge, and NeMo RL
Verifiable reward infrastructure with NeMo Gym
Agent orchestration and evaluation with NeMo Agent Toolkit

These tools are designed to integrate with existing agent frameworks—adding customization, evaluation, and optimization capabilities without requiring you to rebuild from scratch.

Discuss (0)

About the Authors

About Edward Li
Edward Li is a technical marketing engineer with NVIDIA Enterprise Computing. He is a recent graduate of the University of Pennsylvania School of Engineering and Applied Science. He holds a bachelor’s degree and a master’s degree in Computer Science with a concentration in Data Science. At NVIDIA, Edward is passionate about data science, AI, and ML and is working on solutions to bring generative AI to enterprises.

View all posts by Edward Li

About Vanessa Bellotti
Vanessa Bellotti is a technical marketing engineer in the NVIDIA Enterprise Products Group. She is a recent graduate of the Tufts University School of Engineering. She holds a bachelor’s degree in Computer Science, a minor in Mathematics, and is working towards a Master’s in Artificial Intelligence from Johns Hopkins Whiting School of Engineering. At NVIDIA, Vanessa is working on solutions to bring generative AI to enterprises and is passionate about ML, AI, and data science.

View all posts by Vanessa Bellotti

About Rebecca Kao
Rebecca Kao is a product marketing director of AI software at NVIDIA, focused on bringing agentic AI products to market. She joined from Gretel, where she was the VP of marketing, and led a team promoting synthetic data generation for AI model training. Prior to this role, she served as the head of marketing at HEAVY.ai, a GPU-accelerated analytics platform, and director of marketing Analytics at Ogilvy & Mather Singapore.

View all posts by Rebecca Kao

Comments are closed.

Mastering Agentic Techniques: AI Agent Customization

Why is it necessary to customize an AI agent?

What techniques are used for agent customization?

Prompt engineering and system prompts

How it works

When to use

Limitations

Retrieval-augmented generation

How it works

When to use

Limitations

Agent tool and skill injection

How it works

When to use

Limitations

Supervised fine-tuning

How it works

When to use

Limitations

Parameter-efficient fine-tuning

How it works

When to use

Limitations

Direct Preference Optimization

How it works

When to use

Limitations

Reinforcement learning

Reinforcement learning from human feedback

How it works

When to use

Limitations

Reinforcement learning with verifiable rewards

How it works

When to use

Limitations

Group Relative Policy Optimization

How it works

When to use

Limitations

What is a multistage pipeline for AI agent customization?

Stage 1: Prompt engineering plus tools and skills plus RAG

Stage 2: SDG

Stage 3: SFT

Stage 4: RLVR/GRPO or DPO

Stage 5: Evaluation plus iteration

How to choose the right agent customization approach

Task characteristics

Available resources

Project maturity

Get started with AI agent customization

Tags

About the Authors

Comments