
Build an LLM-Powered API Agent for Task Execution

Developers have long built interfaces, such as web apps, that enable users to leverage the core products they create. To learn how to work with data in your large language model (LLM) application, see my previous post, Build an LLM-Powered Data Agent for Data Analysis. In this post, I discuss a method for adding free-form conversation as another interface to APIs, working toward a solution that enables nuanced conversational interaction with any API.

For a basic understanding of LLM agents and how they can be built, see Introduction to LLM Agents and Building Your First LLM Agent Application.

What is an API agent?

An API agent, also called an execution agent, is designed around an execution goal. These agents carry out a task or set of tasks requested by a user by calling a set of predefined functions. Answering users’ questions based on data sources is one important part of this; another is executing whatever a user (human) or another agent (machine) requires.

Traditionally, this is done through APIs and some form of application logic and interaction layer such as a web application or page. The user must decide on an execution flow, access the APIs with buttons, or write code. 

Now, it’s possible to offload part of that nuts-and-bolts reasoning and to use conversation as the medium for “talking to” an API, SDK, or piece of software, with the agent figuring out the details of the interaction.

Building an API agent

To explore this topic, I’ll build a “creative copilot” that can help a marketing organization make use of an API agent to start brainstorming ideas for a marketing campaign.

Choose an LLM

Begin by identifying which LLM to use. This example uses the Mixtral 8x7B LLM available in the NVIDIA NGC catalog, which accelerates various models and makes them available as APIs. The first API calls for each model are free for experimentation.

Note that if you’re working with a model that isn’t tuned to handle agent workflows, you can reformulate the following prompts as a series of multiple-choice questions (MCQs). This should work, as most models are instruction-tuned to handle MCQs. Learn more about fine-tuning.
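For instance, a minimal sketch of such an MCQ-style reformulation might look like the following. The wording and option set are illustrative assumptions, not a prescribed format:

# Hypothetical MCQ-style tool-selection prompt for models that aren't
# tuned for agent workflows but do follow instruction-style MCQs.
mcq_prompt = """Which ONE of the following functions best handles the user's request?

A) ImageGenerator: generates an image from a text description
B) CodeGenerator: generates Python code for a described problem
C) TextGenerator: generates well-reasoned text for a question

User request: {question}

Answer with a single letter (A, B, or C)."""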

Select a use case

Next, select a use case. You can explore a variety of NVIDIA AI Foundation Models in the NGC catalog (Figure 1). 

For the purposes of this discussion, I’ll chain NVIDIA AI Foundation Model APIs to create a “brainstorming copilot.” You can do the same with any other API for building an execution-oriented agent. I’ll use three models for this use case, one each for image, code, and text generation, as reflected in the function schema later in this post.

Screenshot of NVIDIA AI Foundation Models in the NGC catalog
Figure 1. NVIDIA AI Foundation Models available in the NGC catalog

Build the agent

An AI agent is composed of four components: tools, memory module, planning module, and agent core.
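Before looking at each component, here is a minimal sketch of how the pieces might fit together. The class and method names are illustrative assumptions rather than a prescribed interface:

# Minimal, illustrative agent skeleton (names are assumptions).
class APIAgent:
    def __init__(self, llm, tools):
        self.llm = llm      # agent core: the LLM that reasons and plans
        self.tools = tools  # tools: callables that wrap the model APIs
        self.memory = []    # memory module: unused in the single-turn case below

    def run(self, question: str) -> list:
        plan = self.plan(question)  # planning module: build the full plan up front
        return [self.execute(step) for step in plan]

    def plan(self, question: str) -> list[dict]:
        raise NotImplementedError   # implemented later with the planning prompt

    def execute(self, step: dict):
        tool = self.tools[step["function"]]
        return tool(**step["arguments"])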

Tools

For this use case, the tools are the individual function calls to the models. To simplify this discussion, I’ve made classes for each of the API calls for the models (Figures 2 and 3).

Screenshot of Python script to ping the Mixtral 8x7B model.
Figure 2. Python script for the Mixtral 8x7B model
Screenshot of API structure used for the agent.
Figure 3. General structure of the API used for the agent
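As a rough sketch of what one of these tool classes might look like: the endpoint URL, payload shape, and environment variable below are placeholders, not the actual NGC API schema.

import os
import requests

class TextGenerator:
    """Illustrative wrapper around a hosted text-generation endpoint
    (for example, Mixtral 8x7B). Endpoint and payload are placeholder
    assumptions; adapt them to the actual API you call."""

    ENDPOINT = "https://api.example.com/v1/mixtral-8x7b"  # placeholder URL

    def __init__(self, api_key: str | None = None):
        self.api_key = api_key or os.environ["API_KEY"]  # placeholder variable name

    def __call__(self, prompt: str) -> str:
        response = requests.post(
            self.ENDPOINT,
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"messages": [{"role": "user", "content": prompt}]},
            timeout=60,
        )
        response.raise_for_status()
        # Response shape is also an assumption; adjust to the real schema.
        return response.json()["choices"][0]["message"]["content"]

The image and code generation classes follow the same pattern, each exposing a __call__ whose parameters match the argument schema advertised to the planner.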

Planning module and agent core

In previous posts, I’ve showcased different planning algorithms. For an example of an explicit Question Decomposition module with a recursive solver, see Introduction to LLM Agents. For an example of a greedy iterative execution plan, see Building Your First LLM Agent Application.

Here, I’ll present the Plan-and-Execute approach that fuses the planning module and the agent core. This is an advanced implementation and essentially compiles a plan before execution. For more details, see An LLM Compiler for Parallel Function Calling.

Because Plan-and-Execute generates a full plan at the start, there is no need to keep track of every step, and a memory module is not necessary for the single-turn conversation case. Note that you can still choose iterative plan generation; I’m featuring static generation here to simplify the explanation.

When to use an LLM compiler-style approach plus fused planning and core

APIs are deterministic, so preplanning is possible: you can be fairly confident about the behavior and results of the individual tools. You also save the additional tokens that an iterative or dynamically flexible planning module would need to generate. Furthermore, for problems that require several steps to solve, the plan step helps maintain a more concise context for the LLM. Finally, fusing the two modules simplifies the overall architecture.

When not to use an LLM compiler-style approach plus fused planning and core

Plan-and-Execute is a brittle technique. If the plan fails due to a tool issue, or if the generated plan was incorrect, there’s no path to recovery. In addition, the LLM that powers the fused module must be tuned well enough to handle the complex logic of generating a plan that incorporates tool use.

The prompt is shown below. Because this example uses the Mixtral 8x7B model, the function-calling schema that the model was trained on can also be used. It will generate a plan that can then be executed sequentially for the final result.

prompt = """Your task is to generate a plan for the problem user gave using FUNCTIONS. Just generate the plan. Do not solve.

<FUNCTIONS>[
    {
        "function": "ImageGenerator",
        "description": "Generates an Image based on a prompt description",
        "arguments": [
            {
                "name": "prompt",
                "type": "string",
                "description": "Describe what is the key subject of the image, followed by the background."
            },
            {
                "name": "negative_prompt",
                "type": "string",
                "description": "what shouldn't be in the image. Fill none if not specified."
            }
        ]
    },
    {
        "function": "CodeGenerator",
        "description": "Generates python code for a described problem",
        "arguments": [
            {
                "name": "prompt",
                "type": "string",
                "description": "description of the problem for which the code needs to be generate"
            }
        ]
    },
    {
        "function": "TextGenerator",
        "description": "Generates well reasoned text for questions. Requires the full complete context.",
        "arguments": [
            {
                "name": "prompt",
                "type": "string",
                "description": "Describe in detail about the question that requires an answer"
            }
        ]
    }
]
</FUNCTIONS>

User: <User’s question>.

Assistant: ```json
"""

Guidance on function calling

While Mixtral 8x7B was tuned for function calling, it can still generate verbose outputs that don’t adhere to a syntactical format. I suggest using an output token-constraining technique, which guarantees the syntactic correctness of the output rather than relying only on fine-tuning the LLM for semantic correctness. Libraries that support this include local-LLM-function-calling and lm-format-enforcer.
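If you want a dependency-free fallback, a validate-and-retry loop around the generate_plan sketch above gives a weaker version of the same guarantee; token-constraining libraries instead enforce the format during decoding. This is a sketch under the same naming assumptions as before:

import json

KNOWN_FUNCTIONS = {"ImageGenerator", "CodeGenerator", "TextGenerator"}

def plan_with_retries(llm, prompt_template: str, question: str,
                      max_attempts: int = 3) -> list[dict]:
    """Retry until the model emits a syntactically valid plan."""
    for _ in range(max_attempts):
        try:
            steps = generate_plan(llm, prompt_template, question)
        except json.JSONDecodeError:
            continue  # malformed JSON: ask the model again
        # Check that every step names a known function and supplies arguments.
        if all({"function", "arguments"} <= step.keys()
               and step["function"] in KNOWN_FUNCTIONS for step in steps):
            return steps
    raise RuntimeError("Model did not produce a valid plan")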

A complex example use case

To provide a complex example, I’ll try to “ramble” requests in the way I would talk to a human.

Question: Write a social media post for my ad campaign around selling more detergent. The name of the product is WishyWash, now with a new UltraClean formula, priced at $4.99. Also generate an image to go with it. Actually, while you’re at it, also add that the new product has a softener in the social media post. And brainstorm some other ideas for marketing apart from the social media post.

Generated plan:

{
    "function": "ImageGenerator",
    "arguments": {
        "prompt": "A bright and clean laundry room with a large bottle of WishyWash detergent, featuring the new UltraClean formula and softener, placed prominently on a shelf.",
        "negative_prompt": "No clutter, no other brands, only WishyWash."
    }
},
{
    "function": "TextGenerator",
    "arguments": {
        "prompt": "Compose a tweet to promote the new WishyWash detergent with the UltraClean formula and softener at $4.99. Make sure to highlight its benefits and competitive pricing."
    }
},
{
    "function": "TextGenerator",
    "arguments": {
        "prompt": "Generate ideas for marketing campaigns to increase WishyWash detergent sales, specifically focusing on the new UltraClean formula and softener."
    }
}

The plan formulated involves three steps:

  • Generate an image
  • Write a social media post
  • Generate more ideas for a marketing campaign

Next, trigger the individual tools for the final output. 
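A dispatch loop for this step might look like the following sketch, reusing the plan returned by generate_plan above and assuming the image and code tools implement __call__ in the same way as the TextGenerator sketch:

# Map function names from the plan to tool instances (names are illustrative).
tools = {
    "ImageGenerator": ImageGenerator(),
    "CodeGenerator": CodeGenerator(),
    "TextGenerator": TextGenerator(),
}

results = []
for step in plan:
    tool = tools[step["function"]]
    # Each step's arguments match the schema the planner was shown.
    results.append(tool(**step["arguments"]))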

Image generation with Stable Diffusion XL 

Although the image generated by Stable Diffusion XL isn’t the best (Figure 4), it is an excellent starting point for brainstorming with an expert editor. That said, if your use case depends on rendering legible text in the image, you can fine-tune the model further to address this issue.

 Image of laundry detergent and a pile of folded towels generated by Stable Diffusion XL.
Figure 4. Image generated by Stable Diffusion XL 

Text generation with Mixtral 8x7B

The social media post and additional marketing ideas generated by Mixtral 8x7B are shown in Figures 5 and 6, respectively. In this case, the agent was able to break down a complex problem and provide solutions from a rambling set of instructions. 

Image of social media post generated by Mixtral 8x7B
Figure 5. Social media post generated by Mixtral 8x7B
Screenshot of generated ideas to build a marketing plan.
Figure 6. Marketing plan ideas generated by Mixtral 8x7B

Note that this specific example is meant to illustrate a copilot, where the output is a starting point for an expert human. For an API with a more deterministic output (such as an SDK for interacting with the stock market or a weather app), the function calls can be executed directly. The core value is the ability to reason through a request and use execution-oriented tools to fulfill it.

Key considerations when building API agent applications

Keep in mind the following key considerations when building your API agent application.

Scaling the APIs 

Three APIs were used in this example. It isn’t possible to keep adding every executable API to the prompt, so scaling this approach requires building a retrieval-augmented generation (RAG) system that surfaces the top five most relevant tools for a given user question.
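A minimal sketch of that retrieval step scores each function description against the user’s question with cosine similarity. The embed function is a placeholder for whatever embedding model you choose, and in production you would precompute tool embeddings in a vector store rather than embedding on every call:

import numpy as np

def top_k_tools(question: str, tool_specs: list[dict], embed, k: int = 5) -> list[dict]:
    """Return the k tools whose descriptions best match the question.
    `embed` maps text to a vector (placeholder for a real embedding model)."""
    q = embed(question)

    def score(spec: dict) -> float:
        v = embed(spec["description"])
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))

    return sorted(tool_specs, key=score, reverse=True)[:k]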

Better planning

This example used the LLM compiler-style Plan-and-Execute solution, but a better planner such as ADaPT can be used for chaining different APIs. A better planning algorithm helps tackle more complex cases and handle failures in the plan.

Get started

This post has covered the basics of building an LLM-powered API execution agent. The discussion is deliberately agnostic to any popular open-source framework, to keep the focus on the concepts behind building agents. I highly recommend exploring the open-source ecosystem to select the best agent framework for your application.

For more information about building reliable, scalable pieces for the API agent for production, check out the AI Chatbot with Retrieval-Augmented Generation Workflow. If you’re looking to experiment with a production-grade RAG pipeline for your LLM application, visit NVIDIA/GenerativeAIExamples on GitHub.
