You’ve built a powerful AI agent and are ready to share it with your colleagues, but have one big fear: Will the agent work if 10, 100, or even 1,000 coworkers try to use it at the same time? Answering this critical question is a key part of bringing an AI agent to production. We recently faced this question as part of our internal deployment of a deep-research agent using the AI-Q NVIDIA Blueprint, an agentic application built using LangGraph.
This post will cover the tools and techniques from the NVIDIA NeMo Agent Toolkit we used to deploy and scale our agentic AI application into production.
How to build a secure, scalable deep-researcher
Deep-research applications are now ubiquitous, with many individuals regularly using tools like Perplexity, ChatGPT, or Gemini. However, as is the case at many organizations, using these public deep-research tools with NVIDIA confidential information is tricky. For this reason, NVIDIA released an open source blueprint earlier this year for building a deep-research application that can be deployed on-premises. This blueprint was the starting point for our internal production deployment of a deep-research assistant.
Architecture
The AI-Q research agent allows users to upload documents and extract their metadata, access internal data sources, and search the web to create research reports. The blueprint is implemented using the NeMo Agent Toolkit and relies on NVIDIA NeMo Retriever models for document ingestion and retrieval, as well as large language model (LLM) invocations.
Our production deployment uses an internal OpenShift cluster following our AI factory reference architecture, with access to locally deployed NVIDIA NIM microservices and third-party observability tools. Our challenge was identifying what parts of the system needed to scale to support a rollout to hundreds of users across different NVIDIA teams.

To address this challenge, we followed a three-step process, using tools from the NeMo Agent Toolkit at each step:
- Profile the application as a single user to identify bottlenecks.
- Run a load test to collect data and estimate the architecture needed for hundreds of users.
- Monitor the application during a phased rollout.
Step 1: How do you profile and optimize a single agentic application?
One challenge with bringing an agentic application to production is that every agentic application is different. It is difficult to create generic guidelines like “an AI application will need one GPU per 100 users.” Instead, the first step to scaling out an application is to deeply understand how the application works for one user. The NeMo Agent Toolkit offers an evaluation and profiling system to make it easy to gather data and come to a quantitative understanding of the application’s behavior.
Using the NeMo Agent Toolkit profiler
To use the evaluation and profiling tool, simply add an evaluation section to your application’s config file. The eval config includes a dataset of sample user inputs for the application. Agentic applications are not deterministic, so it is useful to profile a range of sample inputs to understand how the application performs across the variety of requests users might submit.
eval:
  general:
    output_dir: single_run_result
    dataset:
      _type: json
      file_path: example_inputs.json
    profiler:
      # Compute inter query token uniqueness
      token_uniqueness_forecast: true
      # Compute expected workflow runtime
      workflow_runtime_forecast: true
      # Compute inference optimization metrics
      compute_llm_metrics: true
      # Compute bottleneck metrics
      bottleneck_analysis:
        enable_nested_stack: true
      concurrency_spike_analysis:
        enable: true
        spike_threshold: 7
The AI-Q research agent is a LangGraph application that uses the NeMo Agent Toolkit function wrappers. These wrappers allow the profiler to automatically capture timing and token usage for different parts of the application. We can also track sub-steps within the application by adding simple decorators to the functions we care about.
from aiq.profiler.decorators.function_tracking import track_function

@track_function(metadata={"source": "custom_function"})
def my_custom_function(a, b):
    return a + b
The eval command runs the workflow against the input dataset and collects and calculates a variety of useful metrics.
aiq eval --config_file configs/eval_config.yml
One example of the available output is a Gantt (or waterfall) chart, which shows which functions are executing during each part of a user session. This information allowed us to identify which parts of our application were likely to become bottlenecks. For the AI-Q research agent, the main bottleneck was the calls to the NVIDIA Llama Nemotron Super 49B reasoning LLM. Knowing the bottleneck allowed us to focus on replicating and scaling out the deployment of the NVIDIA NIM for that LLM.

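To illustrate the kind of analysis the Gantt chart supports, here is a minimal sketch that sums the time spent in each function from a list of profiler spans. The span field names and sample values are hypothetical stand-ins, not the toolkit’s exact output schema; the point is simply that the function with the largest cumulative time is the scaling priority.

from collections import defaultdict

def total_time_per_function(spans):
    """Sum wall-clock time spent in each function across a profiled run."""
    totals = defaultdict(float)
    for span in spans:
        # "function", "start", and "end" are assumed field names (seconds)
        totals[span["function"]] += span["end"] - span["start"]
    # The function with the largest cumulative time is the likely bottleneck
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical spans: the reasoning LLM dominates, so scale that NIM first
spans = [
    {"function": "reasoning_llm", "start": 0.0, "end": 42.0},
    {"function": "web_search", "start": 2.0, "end": 9.5},
    {"function": "report_writer", "start": 43.0, "end": 51.0},
]
print(total_time_per_function(spans))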
Evaluating accuracy
In addition to capturing timing and token usage, the evaluation and profiling tool can compute evaluation metrics. In our case, it wasn’t enough to have an app that was fast and responsive for many users; it also needed to generate useful reports. We created custom metrics relevant to our deep-research use case and used the profiling and evaluation tool to benchmark different versions of the application code, ensuring that any optimizations we made did not reduce report quality. The toolkit reports metrics in a variety of formats, and a particularly useful option is exporting them to a platform like Weights & Biases to track and visualize experiments over time.

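As a simplified illustration of what we mean by a custom metric (not the toolkit’s evaluator API), the sketch below scores a generated report on two properties we cared about: whether it cites its sources and whether it meets a minimum length. The thresholds and citation format are assumptions for the example.

import re

def score_report(report: str, min_words: int = 500) -> dict:
    """Toy report-quality metric: length and citation coverage (illustrative only)."""
    words = report.split()
    citations = re.findall(r"\[\d+\]", report)  # assumes numeric citations like "[1]"
    return {
        "long_enough": len(words) >= min_words,
        "num_citations": len(citations),
        "citations_per_1k_words": 1000 * len(citations) / max(len(words), 1),
    }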
Step 2: Can your architecture handle 200 users? Estimating your needs
After understanding and optimizing the application’s performance for one user, we were ready to take the next step: load testing across multiple users. The goals of the load test were to (a) run the application at higher concurrency, (b) fix anything that broke, and (c) collect data to inform the requirements for our final deployment.
To understand what architecture would support 200 concurrent users, we ran load tests at 10, 20, 30, 40, and 50 concurrent users on our available hardware. The data collected during the load tests was then used to forecast the hardware needs for the full deployment.
To perform the load test we used the NeMo Agent Toolkit sizing calculator.
Capture concurrency data
The toolkit sizing calculator uses the same evaluation and profiling tool to run simulated workflows in parallel at different concurrency levels.
aiq sizing calc \
  --calc_output_dir $CALC_OUTPUT_DIR \
  --concurrencies 1,2,4,8,16,32 \
  --num_passes 2
The calculator captures a variety of metrics during the load test, including p95 latency for each LLM invocation and p95 latency for the workflow as a whole. Note: The output depicted below is from a toolkit example, not the actual data from the internal deep-research agent load test.

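For reference, p95 latency is just the 95th percentile of the observed request latencies at a given concurrency level, which you can reproduce from raw timing data along these lines (the sample numbers are made up):

import statistics

def p95(latencies_s: list[float]) -> float:
    """95th percentile of observed latencies, in seconds."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    return statistics.quantiles(latencies_s, n=20)[-1]

# Hypothetical per-request workflow latencies at one concurrency level
latencies = [41.2, 39.8, 44.5, 52.1, 40.3, 47.9, 43.0, 61.4, 42.2, 45.7]
print(f"p95 workflow latency: {p95(latencies):.1f}s")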
Forecast for scale out
After capturing data at different concurrency levels, we can determine how many users our existing architecture and hardware can support. For example, in the output below, assume we ran our load test on one GPU. The results tell us that one GPU can support 10 concurrent users within our latency threshold, so we can extrapolate that 100 concurrent users would need roughly 10 GPUs.

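The extrapolation itself is simple arithmetic. A minimal sketch, assuming capacity scales roughly linearly as GPU replicas are added (an assumption the load test data should confirm):

import math

def gpus_needed(target_users: int, measured_users_per_gpu: int) -> int:
    """Scale the measured per-GPU capacity up to the target user count."""
    return math.ceil(target_users / measured_users_per_gpu)

# Load test result (hypothetical): one GPU stays within the latency threshold up to 10 users
print(gpus_needed(100, 10))  # -> 10 GPUs
print(gpus_needed(200, 10))  # -> 20 GPUs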
Other learnings
The other benefit of performing a load test is that it helps uncover bottlenecks or bugs in the application that may not be obvious from a single user run. In our initial load test of the AI-Q research agent, for example, we identified and corrected two bugs:
1. We monitored hardware metrics during the load test and found that one of the NVIDIA NIM microservices was using 100% of its allocated CPU. This finding helped us fix the root cause: a misconfiguration in our Helm chart that had deployed the NIM with fewer CPUs than intended.

2. We identified a number of places where the application would fail if an LLM call timed out. We added retries and better error handling so that intermittent failures would not break the entire user experience, allowing for more graceful degradation. The timeout handling below shows the pattern, and a retry sketch follows it.
try:
    # Bound the reasoning LLM call so a hung request cannot stall the session
    async with asyncio.timeout(ASYNC_TIMEOUT):
        async for chunk in chain.astream(input, stream_usage=True):
            answer_agg += chunk.content
            if "</think>" in chunk.content:
                stop = True
            if not stop:
                writer({"generating_questions": chunk.content})
except asyncio.TimeoutError:
    # Degrade gracefully instead of failing the whole workflow
    writer({"generating_questions": "Timeout error from reasoning LLM, please try again"})
    return {"queries": []}
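The timeout handling above covers a single slow call; for intermittent failures we also wanted retries. One way to do it is a small wrapper with exponential backoff, sketched below. The attempt count, delays, and the chain.ainvoke example are illustrative, not the exact code we shipped.

import asyncio
import logging

logger = logging.getLogger(__name__)

async def call_with_retries(make_call, max_attempts: int = 3, base_delay_s: float = 2.0):
    """Retry an async LLM call with exponential backoff.

    `make_call` is a zero-argument callable that returns a fresh coroutine,
    for example: lambda: chain.ainvoke(input). Values here are illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_call()
        except (asyncio.TimeoutError, ConnectionError) as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay_s * 2 ** (attempt - 1)
            logger.warning("LLM call failed (%s); retrying in %.1fs", exc, delay)
            await asyncio.sleep(delay)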
Step 3: How to monitor, trace, and optimize your research agent’s performance as you scale up to production
With all this information in hand, we were able to deploy the AI-Q research agent with the appropriate number of replicas across various system components. As a final step, we scaled out using a phased approach—starting with small teams and gradually adding additional users. During the rollout, it was critical to observe application performance. We used the NeMo Agent Toolkit OpenTelemetry (OTEL) collector along with Datadog to capture logs, performance data, and LLM trace information.
general:
  telemetry:
    tracing:
      otelcollector:
        _type: otelcollector
        # Your otel collector endpoint
        endpoint: http://0.0.0.0:4318/v1/traces
        project: your_project_name
The OTEL collector integration allows us to view specific traces for individual user sessions, helping us understand both application performance and LLM behavior.

We can also aggregate performance data across traces to understand how the application performs overall. The following chart shows average latency and highlights user sessions with outlying performance.

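Behind a chart like that is a straightforward aggregation over per-session trace durations. A hedged sketch of the idea, with hypothetical session IDs, latencies, and threshold:

import statistics

def find_outlier_sessions(session_latencies: dict[str, float], factor: float = 2.0):
    """Flag sessions whose latency is far above the mean (threshold is illustrative)."""
    avg = statistics.mean(session_latencies.values())
    return {sid: lat for sid, lat in session_latencies.items() if lat > factor * avg}

# Hypothetical per-session latencies (seconds) pulled from aggregated traces
sessions = {"sess-01": 48.0, "sess-02": 52.5, "sess-03": 210.0, "sess-04": 47.1}
print(find_outlier_sessions(sessions))  # -> {'sess-03': 210.0}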
Conclusion
By using the NeMo Agent Toolkit in conjunction with a variety of AI factory reference partners, we were able to deploy an internal version of the AI-Q NVIDIA Blueprint and build a research agent with confidence.
Learn more about building with NeMo Agent Toolkit or try out the AI-Q NVIDIA research agent blueprint for yourself.