NVIDIA ACE is a suite of technologies for building AI agents for gaming. ACE provides ready-to-integrate cloud and on-device AI models for every part of in-game characters, from speech to intelligence to animation.
To run these models alongside the game engine efficiently, the NVIDIA In-Game Inferencing (NVIGI) SDK includes a set of performant libraries that developers can integrate into C++ games and applications.
NVIDIA In-Game Inferencing SDK 1.5 introduces a new code agent sample in which an AI agent works with the player to defeat monsters in a 2D dungeon. AI agents driven by local small language models (SLMs) can make excessive calls to the GPU that compete with graphics. This post examines how to minimize the number of inference calls and maximize what each call accomplishes, reducing contention on the GPU between graphics and compute.
Code agents: Trapping the ghost
Andrej Karpathy, a founding member of OpenAI, likens working with large language models (LLMs) to summoning ghosts, an apt metaphor for LLM agents, especially ones that write code. Many custom agents limit themselves to tool-calling: a function is defined, the LLM decides when to call it, and a result is returned. There is a more ambitious possibility: instead of just calling a function, an AI agent can write the function and the code to support it, making the agent more capable while spending less on inference.
There is, however, a trade-off. An unconstrained LLM with code execution capabilities is a security issue. It can exhaust memory, hang the game process, or, as one unfortunate user discovered, wipe a hard drive while trying to “clear a cache.”
Handled carefully, though, code execution brings real benefits: complex multi-step reasoning, dynamic adaptation, and fewer calls to the SLM. The following sections dive into how a potential coding-agent ghoul was turned into a friendly ghost eager to help.
Why code agents outshine tool-calling
When discussing AI agents, the most common approach is tool-calling: the model outputs structured JSON, the game or application parses it, and then executes the corresponding function. While the ability to call functions is powerful, each call only comes after a full round of inference, and inference is expensive, especially when it fights for resources on the user’s GPU.
Once the model sends the JSON, it waits for a response, thinks again, and returns an answer—potentially repeating the cycle. This can consume valuable fractions of a second that could be spent rendering the game.
Moreover, if complex logic is required around the function call, the system must rely on weaker model capabilities. The model doesn’t inherently handle looping; it simply produces tokens. It can try to track state variables, but there is no rigor. If multiple items need addressing, the model must remember each one without missing, duplicating, or hallucinating entries. And every item processed pays an inference cost.
Numeric analysis introduces another challenge. With tool-calling, accuracy depends on the model’s mathematical ability or on writing yet another function to ensure correctness.
Tool-calling also struggles to scale: every function call requires another inference hit that competes for GPU resources.
Code agents work by using something computers are already good at—running code. Programming is one of the emerging superpowers of language models. Instead of generating one function call at a time, a single inference can generate all the function calls at once. There’s no performance hit after the initial generation, just standard code that runs until the task is complete.
They’re also flexible. While language models can’t reliably loop on their own, code agents can simply write code with loops, counters, and filters. The following is a hypothetical example of how tool-calling might be used to target an enemy.
Tool-calling schema:
[
  {
    "name": "get_enemies_list",
    "parameters": {
      "properties": {
        "position": {"type": "string", "description": "Position to search from"},
        "radius": {"type": "number", "description": "Search radius"}
      }
    }
  },
  {
    "name": "target_enemy",
    "parameters": {
      "properties": {
        "enemy_name": {"type": "string", "description": "Name of the enemy to target"}
      },
      "required": ["enemy_name"]
    }
  }
]
When the user says “target the nearest enemy”:
- Inference call 1: The SLM decides to call get_enemies_list
- Tool response: Returns ["goblin_01", "skeleton_archer_01", "orc_chief"] (just strings; otherwise, full entity schemas would blow out the context window)
- Inference call 2: The SLM sees the list, picks one, and calls target_enemy("goblin_01")
- Tool response: Success
- Inference call 3: The SLM generates feedback to the user about the status of the function call
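This host-side loop can be sketched in Lua to make the per-step cost explicit. This is a minimal, hypothetical sketch, not the NVIGI API: run_tool_loop, run_inference, and the stub tool implementations are all stand-ins.

```lua
-- Hypothetical host-side tool-calling loop (not the NVIGI API).
-- Stub tool implementations with invented data:
local tools = {
  get_enemies_list = function(args)
    return { "goblin_01", "skeleton_archer_01", "orc_chief" }
  end,
  target_enemy = function(args)
    return "success"
  end,
}

-- run_inference is a stand-in for whatever SLM call the host uses.
function run_tool_loop(run_inference, user_prompt, max_steps)
  local messages = { user_prompt }
  local calls = 0
  for _ = 1, max_steps do
    calls = calls + 1                      -- every iteration hits the GPU
    local reply = run_inference(messages)  -- blocks on the SLM
    if not reply.tool then
      return reply.text, calls             -- final answer for the player
    end
    local result = tools[reply.tool](reply.args)
    messages[#messages + 1] = result       -- feed the tool output back in
  end
  return nil, calls                        -- step budget exhausted
end
```

For the "target the nearest enemy" flow, the loop runs three times: two tool calls plus the final user-facing reply, each paying a full inference cost.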
Three inference calls for one decision. Consider the same “target enemy” action with a code agent.
Code agent API definition:
get_enemies(position, radius)
--[[
Find enemies near a position.
Parameters:
  position (table): Center point as {row, col}
  radius (number): Search radius
Returns:
  table: Array of enemy entities (with .name, .position, .health, etc.)
Example:
  local nearby = get_enemies(ally.position, 10)
]]

set_target(ally, enemy)
--[[
Set an ally's attack target.
Parameters:
  ally (entity): The ally to command
  enemy (entity): The enemy to target
Example:
  set_target(warrior, nearby[1])
]]
SLM-generated code for “target the nearest enemy”:
local enemies = get_enemies(ally.position, 10)
local closest = nil
local min_dist = math.huge
for _, enemy in ipairs(enemies) do
  local dx = enemy.position[1] - ally.position[1]
  local dy = enemy.position[2] - ally.position[2]
  local dist = math.abs(dx) + math.abs(dy)
  if dist < min_dist then
    min_dist = dist
    closest = enemy
  end
end
if closest then
  set_target(ally, closest)
end
With one inference call, the SLM loops over enemies, accesses their positions, calculates distances, and picks the closest. The code agent gets rich entity objects, not just strings, and composes logic that the tool designer never anticipated.
Notice the flexibility. That same get_enemies function works for enemies near the player, near an ally, or near a point. Once the SLM has the enemy list, it can write any selection logic, such as targeting enemies weak to arrows, targeting the closest one, or targeting the one with the lowest health. With tool-calling, adapting to new requirements means more tools, more inference calls, and more complexity. With code agents, the SLM composes new strategies at runtime from the same simple primitives.
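For instance, lowest-health targeting needs no new tools. The following is a minimal sketch assuming the same hypothetical get_enemies/set_target API described earlier; the stub implementations and enemy data are invented so the snippet runs standalone.

```lua
-- Stand-ins for the sample's get_enemies/set_target primitives, with
-- invented stub data so the sketch runs on its own.
ally = { position = { 1, 1 }, target = nil }
local function get_enemies(position, radius)
  return {
    { name = "goblin_01",          position = { 2, 3 }, health = 14 },
    { name = "orc_chief",          position = { 5, 5 }, health = 40 },
    { name = "skeleton_archer_01", position = { 1, 2 }, health = 6 },
  }
end
local function set_target(a, enemy) a.target = enemy end

-- What the SLM might generate for "attack the weakest enemy":
-- same primitives, different selection logic, still one inference call.
local enemies = get_enemies(ally.position, 10)
local weakest = nil
for _, enemy in ipairs(enemies) do
  if not weakest or enemy.health < weakest.health then
    weakest = enemy
  end
end
if weakest then
  set_target(ally, weakest)
end
```

Only the selection logic in the loop changes between strategies; the tool surface the host must expose and secure stays fixed.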
Code agent sample dungeon
Keeping with the ghoulish theme, the NVIGI SDK includes an ASCII dungeon crawler to demonstrate the code agent. The dungeon contains all the pieces of a large game, but in one of gaming’s simplest forms. Players move around, collect items, and fight monsters, but they also have a powerful ally on their adventure: an AI agent that can materialize on demand to help them fight, go on dangerous missions, or provide information about the dangers that await.

Once an instruction is given, the code is written, and the program doesn’t touch the SLM again until a new instruction is given. A tool call chain may produce the same results, but at the cost of repeated inference calls eating into the allocated frame time slice.
The threat model of a code agent
Using an SLM to generate code that runs on the host introduces obvious security and safety risks, including:
- Dangerous function access. The SLM generates os.execute("rm -rf /") or require("socket"), and suddenly the code agent is deleting files or opening network connections.
- Unauthorized file access. The SLM locates critical files or API keys to exfiltrate or delete.
- Resource exhaustion. The SLM writes a loop that allocates memory forever.
- Stack overflow. The SLM writes a recursive function without proper termination.
- Infinite loops. The SLM writes while true do end and never returns.
- Escaping the sandbox. The SLM might manipulate internal structures to break out of its containment.
- State corruption. The SLM might corrupt the game or application’s state.
Choosing a target language
When choosing a target language, consider: time to execution, general performance, complexity of integration and debugging, and the quality and safety of code produced.
While running a game, inference calls must take a fraction of the total frame time. Large hits that stall the rendering pipeline are unacceptable. While it’s possible to generate a few tokens at a time each frame to smooth out inference, compilation does not offer that flexibility. This rules out compiled languages such as C++ or C#. Instead, an interpreted language is required.
Two languages stand out as examples: Python and Lua.
Python is the obvious first choice. SLMs generate Python fluently. The ecosystem is massive. But Python wasn’t designed for embedding or sandboxing. The Global Interpreter Lock (GIL) complicates multi-threaded hosts. Isolation requires subprocesses or subinterpreters, both adding complexity. Further, there’s no built-in way to limit memory or execution time. Python can run in a sandbox, but it’s a fight against the language the whole way.
Lua was designed from the ground up for embedding in hostile environments. The entire runtime is about 200 kB and starts in sub-millisecond time. Plus, every identified threat has a documented mitigation, including:
- Dangerous functions: Selective library loading. Don’t load io or os, and they don’t exist.
- Memory exhaustion: Custom allocator hook. Track every allocation, enforce a cap.
- Stack overflow: Debug hooks on function calls. Count depth, error on overflow.
- Infinite loops: Debug hooks on instruction count. Error after N instructions.
- Metatable manipulation: Remove getmetatable/setmetatable from globals.
- State corruption: Custom __newindex metamethods that reject writes to protected fields.
Thus, Lua met all the requirements for this NVIGI sample, but it still required hardening. Dangerous or unwanted functions are set to nil (the lua_pushnil(L); lua_setglobal(L, "funcname") pattern). Memory growth is limited by wrapping the default allocator and tracking allocations. Hooks (lua_sethook) ensure programs don’t blow out the call stack or hang indefinitely. Similarly, metatable access is restricted, with custom metamethods locked down to protect the game state.
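The same ideas can be sketched from the Lua side. This is an illustrative sandbox, not the sample’s actual hardening code (which works through the C API): generated code runs in a restricted environment with no os, io, require, or metatable accessors, under an instruction-count hook (Lua 5.2+ load signature assumed).

```lua
-- Illustrative Lua-side sandbox; a minimal sketch of the mitigations above.
function run_untrusted(source, max_instructions)
  -- Whitelist environment: os, io, require, getmetatable, and
  -- setmetatable simply don't exist for the generated code.
  local env = { ipairs = ipairs, pairs = pairs, math = math, string = string }
  local chunk, err = load(source, "agent", "t", env)
  if not chunk then return false, err end
  -- Infinite-loop guard: raise an error after max_instructions VM instructions.
  debug.sethook(function()
    error("instruction budget exceeded", 2)
  end, "", max_instructions)
  local ok, result = pcall(chunk)
  debug.sethook()  -- clear the hook before returning to host code
  return ok, result
end

-- A well-behaved chunk completes; "while true do end" is cut off by
-- the instruction-count hook instead of hanging the game process.
print(run_untrusted("local s = 0 for i = 1, 10 do s = s + i end return s", 100000))
print(run_untrusted("while true do end", 100000))
```

A production sandbox would layer on the allocator cap and call-depth hooks described above; this sketch only covers the environment whitelist and the instruction budget.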
These are just some of the steps taken to lock down this sample. More may be required (depending on each particular game or use case), but these tips should help guide the reader while looking through the code.
For added security, Lua can be embedded in a WebAssembly runtime. See the blog posts Sandboxing Agentic AI Workflows with WebAssembly and Practical Security Guidance for Sandboxing Agentic Workflows and Managing Execution Risk for more information about ways to secure agentic behavior.
Security is a core concern, not an afterthought. Language choice is a security decision, not a convenience decision. Start with this premise and understand the different attack vectors to guard against, and the ghost stays a friend in the machine.
Get started with NVIDIA In-Game Inferencing SDK
Try the sample with the NVIDIA In-Game Inferencing SDK. Build it, experiment, and think of ways to employ it in games, apps, and other projects.
Join us at GDC
Explore how NVIDIA RTX neural rendering and AI are shaping the next era of gaming. Get a glimpse into the future of game development with John Spitzer, vice president of Developer and Performance Technology at NVIDIA, as he unveils the latest innovations in path tracing and generative AI workflows.
Join Bryan Catanzaro, vice president of Applied Deep Learning Research at NVIDIA, for an interactive “Ask Me Anything” session covering the latest trends in AI.