Agents have been the primary drivers of applying large language models (LLMs) to solve complex problems. Since AutoGPT in 2023, various techniques have been developed to build reliable agents across industries. The discourse around agentic reasoning and AI reasoning models adds a further layer of nuance to designing these applications. The rapid pace of development also makes it hard for developers to jump in and build agents, as doing so involves choosing among a multitude of design and technical options.
To help simplify these decisions, this post covers the following broad topics:
- What is an LLM agent and what are the different structural patterns to consider?
- How do LLM reasoning and test-time scaling work?
- What are the different types of reasoning that should be considered?
What is an LLM agent?
LLM agents are systems that solve complex problems by using an LLM to reason through the problem, create a plan, and use tools or APIs to complete the task. This makes them well suited to generative AI use cases like smart chatbots, automated code generation, and workflow automation. LLM agents are just one slice of the broader AI agent landscape: the term agentic AI also covers agents powered by computer-vision models, speech models, and reinforcement learning, working in everything from customer-service chatbots to complex enterprise process orchestration to self-driving cars.
Based on the nature of execution, the application spaces of LLM agents can broadly be divided into workflows and chatbots. If you are new to agents, this article will help you learn the conceptual pieces by building your first agent!
Workflows
Robotic process automation (RPA) pipelines have traditionally been used to automate mechanical tasks, such as data entry, filing claims, and customer relationship management (CRM). These pipelines are usually designed as offline batch jobs that run in the background to handle repetitive, rule-bound tasks.
These pipelines have traditionally been designed around strict rules and heuristic processes. This limits the application space for RPA pipelines and often causes issues with scaling them out.
By using LLMs, these pipelines can be made flexible, injecting the ability to make complex decisions and execute the appropriate tooling to solve the problem.
A prime use case where LLM agents can help revolutionize RPA pipelines is claims processing in the insurance and healthcare industries. While traditional RPA pipelines tend to be rigid about data structure, LLM agents can process unstructured claims data from diverse document formats, such as customer uploads, without explicit programming.
The agents can also adapt the workflow dynamically based on the claim, help identify potential fraud, adjust the decision-making process as regulations change, and help analyze complex claim scenarios to recommend appropriate actions based on the policy and historical data.
In a workflow, agents operate in a predefined pipeline created by breaking down a complex task into definite constrained paths primarily dictated by business logic. In these cases, LLMs are used to address the ambiguity within each subtask, but the larger flow of tasks is predetermined.
Figure 1 shows an example of a CVE analysis workflow that helps detect vulnerabilities in shipped containers. This pipeline is well defined and made up of definite, specific subtasks.
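As a concrete illustration, here is a minimal Python sketch of such a pipeline. The `llm(prompt)` helper, the subtask functions, and the prompts are all hypothetical stand-ins rather than code from the pipeline in Figure 1; the point is that the sequence of subtasks is fixed by business logic while the LLM resolves the ambiguity inside each one.

```python
# Minimal sketch of a predefined workflow: the sequence of subtasks is fixed
# by business logic, and the LLM only resolves the ambiguity inside each step.
# `llm(prompt)` is a hypothetical helper wrapping your chat-completion endpoint.

def llm(prompt: str) -> str:
    raise NotImplementedError("wrap your model endpoint here")

def extract_packages(manifest: str) -> str:
    # Subtask 1: turn an unstructured container manifest into a package list.
    return llm(f"List every package and version in this manifest:\n{manifest}")

def match_cves(package_list: str) -> str:
    # Subtask 2: map the packages to known CVE identifiers.
    return llm(f"Which known CVEs affect these packages?\n{package_list}")

def summarize_risk(findings: str) -> str:
    # Subtask 3: produce a human-readable risk report.
    return llm(f"Summarize the severity and suggested fixes:\n{findings}")

def cve_workflow(manifest: str) -> str:
    # The flow is predetermined; the model never chooses the next step.
    return summarize_risk(match_cves(extract_packages(manifest)))
```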
Chatbots
Another use case of agents is AI chatbots. Based on response latency and the nature of the task they solve, these chatbots fall into the following categories:
- Exploratory agents
- Assistive agents
Exploratory agents are typically built to solve complex, multistep tasks that are hard to solve and take time for the agent to execute. This category can be considered autonomous agents: the user hands over a task and expects a complete solution.
A great example is OpenAI’s and Perplexity’s Deep Research (Figure 2). These agents reason through a complex multistep problem and try to come up with a final solution. In these cases, users don’t expect iterative interaction; they hand off a task to be completed independently and are typically okay with higher latencies in exchange for a complete solution to a complex task.
Assistive agents inherently require a collaborative, human-in-the-loop experience, where users are part of the decision-making process and validate intermediate results. They are typically designed around a narrow set of cohesive tools.
For example, these applications can be document authoring assistants, personal AI assistants, tax filing assistants, and more. These agents are built for lower latency but solve smaller, boilerplate-style problems so that users can focus on architecting the broader solution.
What all these agents have in common is the need to reason and create a plan to solve a task with the help of some tools (Figure 3).
A natural next question is how LLM reasoning works.
What is LLM reasoning and how does it apply to AI agents?
The Oxford Dictionary defines reasoning as, “the action of thinking about something in a logical, sensible way.” This is quite apt in the case of considering the paradigm of reasoning with LLMs.
In the past couple of years, many reasoning frameworks have been developed, such as Plan and Execute, LLM Compiler, and Language Agent Tree Search, alongside reasoning models such as DeepSeek-R1. The question now becomes how to contextualize these developments to get a holistic view.
To that end, reasoning can be categorized broadly into the following categories:
- Long thinking
- Searching for the best solution
- Think-critique-improve
All three techniques work by scaling test-time compute: generating more tokens to improve the quality of responses and enable more complex problems to be solved.
While the techniques are complementary and can be applied to all the different problem spaces, the difference in how they are designed enables them to address various challenges.
Prompting AI models to think longer
Chain of thought is the most straightforward representation of this type of reasoning. We prompt the model to think step by step before generating a final answer.
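For instance, here is a minimal, hypothetical prompt showing the idea; the only change from a plain prompt is the instruction to reason step by step.

```python
# A minimal chain-of-thought prompt (illustrative): the only change from a
# plain prompt is the instruction to reason step by step before answering.
cot_prompt = (
    "A warehouse ships 240 units per day, and on average 5% are returned. "
    "How many units are kept by customers over a 30-day month?\n"
    "Think step by step, then state the final answer on its own line."
)
# answer = llm(cot_prompt)  # expected reasoning: 240 * 30 * 0.95 = 6,840
```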
An iteration on the chain of thought is the ReAct agentic framework. ReAct combines reasoning and action to perform multi-step decision-making. Generating reasoning traces helps the agent develop a strategic plan by breaking the complex problem into smaller manageable tasks. The action step helps execute the plan by interfacing with external tools.
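The following is a minimal sketch of a ReAct-style loop. The tool registry, parsing, and prompt format are simplified illustrations, not a production implementation, and `llm` is the same hypothetical helper used above.

```python
# Sketch of a ReAct-style loop: the model alternates reasoning ("Thought")
# with tool calls ("Action"), and each tool result is appended back into the
# context as an "Observation".

def llm(prompt: str) -> str:
    raise NotImplementedError("wrap your model endpoint here")

TOOLS = {
    "search": lambda query: f"(search results for {query!r})",  # stub tool
}

def react_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # The model emits either "Action: <tool> <input>" or "Final Answer: ...".
        step = llm(transcript + "Thought:")
        transcript += f"Thought:{step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            tool, _, tool_input = step.split("Action:")[-1].strip().partition(" ")
            observation = TOOLS.get(tool, lambda x: "unknown tool")(tool_input)
            transcript += f"Observation: {observation}\n"
    return "No answer within the step budget."
```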
Another technique that attempted to imbue deeper thoughts was self-reflection, which introduced a critique loop. This forces the agent to analyze and re-assess the reasoning, enabling it to correct itself and generate a more reliable answer.
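A self-reflection loop can be sketched in a few lines; again, the prompts are illustrative assumptions, and `llm` is the hypothetical helper from the earlier sketches.

```python
# Sketch of a self-reflection loop: draft, critique, and revise until the
# critique finds no issues or the round budget runs out.

def reflect(question: str, rounds: int = 3) -> str:
    answer = llm(f"Answer this question:\n{question}")
    for _ in range(rounds):
        critique = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "List any factual or logical errors, or reply NONE."
        )
        if critique.strip() == "NONE":
            break  # the critique found nothing left to fix
        answer = llm(
            f"Question: {question}\nDraft answer: {answer}\n"
            f"Critique: {critique}\nRewrite the answer to fix these issues."
        )
    return answer
```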
This concept has been supercharged by DeepSeek-R1. DeepSeek-R1 was tuned to improve the consistency and depth of the chain of thought. This model adopted a novel reinforcement learning (RL) paradigm, enabling the model to autonomously explore and refine its reasoning strategies. This makes it one of the most interesting implementations of long-chain, multi-step reasoning so far.
This type of reasoning is best suited for working through a complex problem, such as answering a multi-hop question based on a financial report or solving a logical reasoning problem.
These techniques ultimately enable models to have a deeper understanding of the problem.
Helping AI models search for the best solution
While thinking deeper addresses the complexity of tasks, it may not be the best approach for tasks that have multiple valid solutions. Techniques such as Tree-of-thought and Graph-of-thought introduced the idea of having an LLM explore multiple reasoning directions in parallel.
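One way to picture this is a simple beam search over partial reasoning chains. The sketch below is an illustrative simplification of tree-of-thought, with `llm`-based proposing and scoring as assumed stand-ins rather than the published algorithm.

```python
# Illustrative simplification of tree-of-thought as a beam search over
# partial reasoning chains: propose several next steps per candidate,
# score each chain, and keep only the most promising few.

def propose(partial: str, k: int = 3) -> list[str]:
    # Ask the model for k alternative next reasoning steps.
    return [llm(f"{partial}\nPropose next reasoning step, variant {i}:") for i in range(k)]

def score(partial: str) -> float:
    # Ask the model to rate how promising this line of reasoning is.
    reply = llm(f"Rate from 0 to 10 how promising this reasoning is:\n{partial}")
    try:
        return float(reply.strip().split()[0])
    except ValueError:
        return 0.0

def tree_of_thought(question: str, depth: int = 3, beam: int = 2) -> str:
    frontier = [question]
    for _ in range(depth):
        expanded = [f"{p}\n{step}" for p in frontier for step in propose(p)]
        frontier = sorted(expanded, key=score, reverse=True)[:beam]
    return frontier[0]  # the highest-scoring reasoning chain
```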
Techniques such as Best-of-N, covered in detail in Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters, rely on a simple principle: given enough attempts, the model will likely generate a correct response. In essence, we ask the model the same question over and over again, increasing the odds that at least one response is correct.
We can set N to be arbitrarily large, with some research using extremely high values of N for problems such as code generation. Generating a large volume of responses, however, is only a small part of the solution, as we need a way for the system to select the best of those N solutions.
This is where the problem of verification comes in! For some cases, this is more immediately obvious: Does the code run and pass tests? For others, it can be more complex and may rely on a reward model or some other more complex verification process.
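Here is a minimal sketch of Best-of-N with a pluggable verifier; `verify` is a hypothetical placeholder, and a real verifier might run unit tests (for code) or query a reward model (for open-ended text).

```python
# Sketch of Best-of-N with a pluggable verifier: sample N candidates and
# return one that passes a domain-specific check.

def verify(candidate: str) -> bool:
    # For code: run the tests. For open text: query a reward model.
    raise NotImplementedError("plug in a domain-specific verifier")

def best_of_n(problem: str, n: int = 16) -> str | None:
    candidates = [llm(f"Solve:\n{problem}") for _ in range(n)]
    for candidate in candidates:
        if verify(candidate):
            return candidate
    return None  # no candidate passed verification
```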
Interacting with Think-Critique-Improve
Instead of approaching the problem through the lens of “spending more time thinking” without feedback, approaches such as Think-Critique-Improve take advantage of a more interactive process to generate robust responses. In simple terms, the pipeline is as follows (a minimal sketch follows the list):
- Think: Generate N samples, similar to Best-of-N approaches.
- Generate feedback: For each of those samples, generate X feedback responses using a specialized model, then filter out non-useful responses and select the Top-k based on some heuristic.
- Edit: For each of the N samples, along with their Top-k feedback responses, a specialized editor model incorporates the feedback by editing the base model’s response.
- Select: Finally, a select model chooses the final response from the N edited candidates produced by the pipeline.
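Under those assumptions, the pipeline could look like the following sketch; the feedback, editor, and select models are collapsed into the same hypothetical `llm` helper for brevity, and the usefulness filter and prompts are illustrative heuristics, not the method's actual implementation.

```python
# Sketch of the Think-Critique-Improve pipeline described above.

def think_critique_improve(problem: str, n: int = 4, x: int = 3, top_k: int = 2) -> str:
    # Think: generate N samples, as in Best-of-N.
    samples = [llm(f"Solve:\n{problem}") for _ in range(n)]
    edited = []
    for sample in samples:
        # Generate feedback: X critiques per sample; keep the Top-k useful ones.
        feedback = [llm(f"Critique this answer:\n{sample}") for _ in range(x)]
        useful = [f for f in feedback if "no issues" not in f.lower()][:top_k]
        # Edit: fold the selected feedback back into the sample.
        notes = "\n".join(useful)
        edited.append(llm(f"Answer:\n{sample}\nFeedback:\n{notes}\nRevise accordingly."))
    # Select: choose the final response from the edited candidates.
    menu = "\n\n".join(f"[{i}] {e}" for i, e in enumerate(edited))
    pick = llm(f"Problem:\n{problem}\nCandidates:\n{menu}\nReply with the best index.")
    try:
        return edited[int(pick.strip())]
    except (ValueError, IndexError):
        return edited[0]
```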
This approach is more similar to a group working on a problem together, as opposed to a single person thinking through a problem for a long time.
Whereas the other methods rely on verifiable problems (code, math, and logical reasoning) during their training or implementation, this method excels at open-ended problems that aren’t only about getting the right answer.
Next steps
With the rapid pace of advancement in models and techniques for creating business value, enterprises need to focus on time to market and on polishing their features and techniques.
In this environment, solutions like NVIDIA Blueprints fast-track enterprises in building applications that enable their users. Your enterprise can ensure it has efficient, secure, and reliable infrastructure by using easy-to-use NVIDIA NIM.
Developers can get started today by downloading the latest NVIDIA Llama Nemotron models from Hugging Face or trying out the Build an AI Agent for Research and Reporting NVIDIA AI Blueprint.
To read more about LLM agents, see other blogs in this series: