
How to Scale Your LangGraph Agents in Production From A Single User to 1,000 Coworkers

You’ve built a powerful AI agent and are ready to share it with your colleagues, but have one big fear: Will the agent work if 10, 100, or even 1,000 coworkers try to use it at the same time? Answering this critical question is a key part of bringing an AI agent to production. We recently faced this question as part of our internal deployment of a deep-research agent using the AI-Q NVIDIA Blueprint, an agentic application built using LangGraph.

This post will cover the tools and techniques from the NVIDIA NeMo Agent Toolkit we used to deploy and scale our agentic AI application into production.

How to build a secure, scalable deep-researcher

The use of deep-research applications is ubiquitous, with many individuals regularly using tools like Perplexity, ChatGPT, or Gemini. However, as at many organizations, using these deep-research tools with NVIDIA confidential information can be tricky. For this reason, earlier this year NVIDIA released an open source blueprint for building a deep-research application that can be deployed on premises. This blueprint was the starting point for our internal production deployment of a deep-research assistant.

Architecture

The AI-Q research agent allows users to upload documents and extract their metadata, access internal data sources, and search the web to create research reports. The blueprint is implemented using the NeMo Agent Toolkit and uses a variety of NVIDIA NeMo Retriever models for document ingestion and retrieval, as well as large language model (LLM) invocations.

Our production deployment uses an internal OpenShift cluster following our AI factory reference architecture, with access to locally deployed NVIDIA NIM microservices and third-party observability tools. Our challenge was identifying what parts of the system needed to scale to support a rollout to hundreds of users across different NVIDIA teams.

A diagram of an agentic system showing a user prompt going to an agent, which coordinates reasoning, report generation, web search, and enterprise file retrieval using NVIDIA Llama Nemotron reasoning, NeMo Retriever, and LLM NIM microservices.
Figure 1. AI-Q research agent blueprint architecture diagram

To address this challenge, we followed a three-step process, using tools from the NeMo Agent Toolkit at each phase:

  1. Profile the application as a single user to identify bottlenecks.
  2. Run a load test to collect data and estimate the architecture needed for hundreds of users.
  3. Monitor the application during a phased rollout.

Step 1: How do you profile and optimize a single agentic application?

One challenge with bringing an agentic application to production is that every agentic application is different. It is difficult to create generic guidelines like “an AI application will need one GPU per 100 users.” Instead, the first step to scaling out an application is to deeply understand how the application works for one user. The NeMo Agent Toolkit offers an evaluation and profiling system to make it easy to gather data and come to a quantitative understanding of the application’s behavior.

Using the NeMo Agent Toolkit profiler

To use the evaluation and profiling tool, simply add an evaluation section to your application’s config file. The eval config includes a dataset of sample user inputs for the application. Agentic applications are not deterministic, so it is useful to profile a range of sample inputs to understand how the application will perform across the wide variety of requests users might provide.

eval:
  general:
    output_dir: single_run_result
    dataset:
      _type: json
      file_path: example_inputs.json
    profiler:
      # Compute inter query token uniqueness
      token_uniqueness_forecast: true
      # Compute expected workflow runtime
      workflow_runtime_forecast: true
      # Compute inference optimization metrics
      compute_llm_metrics: true
      # Compute bottleneck metrics
      bottleneck_analysis:
        enable_nested_stack: true
      concurrency_spike_analysis:
        enable: true
        spike_threshold: 7


The AI-Q research agent is a LangGraph application that uses the NeMo Agent Toolkit function wrappers. These wrappers allow the profiler to automatically capture timing and token usage for different parts of the application. We can also track sub-steps within the application by adding simple decorators to the functions we care about.

from aiq.profiler.decorators.function_tracking import track_function

# The decorator records timing and metadata for this function in the profiler output
@track_function(metadata={"source": "custom_function"})
def my_custom_function(a, b):
    return a + b

The eval command runs the workflow across the input dataset, collecting and computing a variety of useful metrics.

aiq eval --config_file configs/eval_config.yml

One example of the available output is a Gantt (or waterfall) chart, which shows which functions are executing during each part of a user session. This information allowed us to identify which parts of our application were likely to become a bottleneck. For the AI-Q research agent, the main bottleneck was calls to the NVIDIA Llama Nemotron Super 49B reasoning LLM. Knowing the bottleneck allowed us to focus on replicating and scaling out the deployment of the NVIDIA NIM for that LLM.

Gantt chart showing the sequence and overlap of steps in a report writing process, including planning, parallel search, section writing, reflections, and final summary. Tasks are color coded and labeled with brief descriptions.
Figure 2. Gantt chart from the NeMo Agent Toolkit showing timing and bottlenecks

Evaluating accuracy

In addition to capturing timing and token usage, the evaluation and profiling tool can compute evaluation metrics. In our case, it wasn’t enough to have an app that was fast and responsive for many users; it also needed to generate useful reports. We created custom metrics relevant to our deep-research use case and used the profiling and evaluation tool to benchmark different versions of the application code. This benchmarking ensured that any optimizations we made did not reduce report quality. The toolkit reports metrics in a variety of formats, but a particularly useful option is exporting them to a platform like Weights & Biases to track and visualize experiments over time.
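
As a minimal sketch of that workflow, the snippet below logs a set of report-quality metrics to Weights & Biases after an evaluation run. The project name, metric names, and values are hypothetical placeholders, not output produced by the toolkit.

import wandb

# Hypothetical report-quality metrics from our custom evaluators; in practice
# these values come from the toolkit's evaluation output for a given run.
eval_metrics = {
    "report_groundedness": 0.91,
    "report_coverage": 0.84,
    "citation_accuracy": 0.88,
    "p95_workflow_runtime_s": 142.0,
}

# One W&B run per feature branch makes branch-to-branch comparison easy.
run = wandb.init(project="aiq-research-agent-eval", name="feature-branch-a")
wandb.log(eval_metrics)
run.finish()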

A dashboard compares two machine learning models using bar and radar charts for metrics like accuracy, recall, and precision. Feature importances are listed below.
Figure 3. Comparison of metrics between two different feature branches

Step 2: Can your architecture handle 200 users? Estimating your needs

After understanding and optimizing the application performance for one user, we were ready to take the next step: load testing across multiple users. The goals of the load test were to (a) run the application at higher concurrency, (b) fix anything that broke, and (c) collect data to inform the requirements for our final deployment.

To understand what architecture would support 200 concurrent users, we ran a load test of 10, 20, 30, 40, and 50 concurrent users with our available hardware. The data collected during the load test was then used to forecast the hardware needs for the full deployment.

To perform the load test we used the NeMo Agent Toolkit sizing calculator.

Capture concurrency data 

The toolkit sizing calculator works by using the same profiling and evaluation tool to run simulated workflows, but in parallel at different concurrency levels.

aiq sizing calc \
  --calc_output_dir $CALC_OUTPUT_DIR \
  --concurrencies 1,2,4,8,16,32 \
  --num_passes 2

The calculator captures a variety of metrics during the load test, including p95 timing for each LLM invocation and p95 timing for the workflow as a whole. Note that the output depicted below is from a toolkit example, not actual data from the internal deep-research agent load test.

Line graph showing the relationship between concurrency and p95 LLM latency and workflow runtime; as concurrency increases, both latency and runtime increase, with workflow runtime rising more steeply.
Figure 4. Timing data captured by the NeMo Agent Toolkit sizing calculator
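
The p95 numbers are simply percentiles over the per-request timings captured at each concurrency level. As a quick illustration (with made-up sample timings, not toolkit internals):

import numpy as np

# Hypothetical per-request workflow runtimes (seconds) captured at one
# concurrency level during the load test.
runtimes_s = [38.2, 41.7, 44.0, 39.5, 52.3, 47.8, 60.1, 43.9]

# p95: 95% of requests completed at or below this latency.
p95_runtime = np.percentile(runtimes_s, 95)
print(f"p95 workflow runtime: {p95_runtime:.1f}s")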

Forecast for scale out

After capturing data at different concurrencies, we can estimate how many users our existing architecture and hardware can support. For example, in the output below, assume we ran our load test on one GPU. The results tell us one GPU can support 10 concurrent users within our latency threshold. With that information, we can extrapolate that 100 concurrent users will require roughly 10 GPUs.
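
A back-of-the-envelope version of that extrapolation, using hypothetical load-test results and a hypothetical latency budget, looks like this:

# Hypothetical p95 workflow runtimes (seconds) measured on ONE GPU at
# increasing concurrency levels during the load test.
p95_runtime_by_concurrency = {1: 35.0, 2: 38.0, 4: 45.0, 8: 58.0, 10: 72.0, 16: 110.0}

LATENCY_BUDGET_S = 75.0  # hypothetical SLA for an acceptable user experience
TARGET_USERS = 100       # concurrent users we want to support

# Highest measured concurrency that still meets the latency budget on one GPU.
users_per_gpu = max(
    c for c, p95 in p95_runtime_by_concurrency.items() if p95 <= LATENCY_BUDGET_S
)

# Linear extrapolation: replicate the LLM NIM deployment to cover the target load.
gpus_needed = -(-TARGET_USERS // users_per_gpu)  # ceiling division
print(f"{users_per_gpu} users per GPU -> ~{gpus_needed} GPUs for {TARGET_USERS} users")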

Two scatter plots comparing "Concurrency vs P95 LLM Latency" (left) and "Concurrency vs P95 Workflow Runtime" (right). Both show increasing trends, with one outlier removed in the left plot. The axes indicate concurrency (x-axes) and latency/runtime in seconds (y-axes). Legends and trend lines are included in both plots.
Figure 5. Forecast of hardware needs from the NeMo Agent Toolkit sizing calculator

Other learnings

The other benefit of performing a load test is that it helps uncover bottlenecks or bugs in the application that may not be obvious from a single user run. In our initial load test of the AI-Q research agent, for example, we identified and corrected two bugs:

1. We monitored hardware metrics during the load test and found that one of the NVIDIA NIM microservices was using 100% of its allocated CPU. This finding helped us fix the root cause, which was a misconfiguration in our helm chart that had deployed the NIM with fewer CPUs than intended.

Line graph showing total CPU percentage utilization over time, with a sharp rise and brief fluctuations before stabilizing at around 100%.
Figure 6. CPU starvation during a stress test

2. We identified a number of places where the application would fail if an LLM call timed out. We added timeouts, retries, and better error handling so that intermittent failures would not break the entire user experience, allowing for more graceful degradation. The snippet below shows the timeout handling, followed by a sketch of the retry pattern.

try:
    async with asyncio.timeout(ASYNC_TIMEOUT):
        async for chunk in chain.astream(input, stream_usage=True):
            answer_agg += chunk.content
            if "</think>" in chunk.content:
                stop = True
            if not stop:
                writer({"generating_questions": chunk.content})

except asyncio.TimeoutError:
    writer({"generating_questions": "Timeout error from reasoning LLM, please try again"})
    return {"queries": []}

Step 3: How to monitor, trace, and optimize your research agent’s performance as you scale up to production 

With all this information in hand, we were able to deploy the AI-Q research agent with the appropriate number of replicas across various system components. As a final step, we scaled out using a phased approach—starting with small teams and gradually adding additional users. During the rollout, it was critical to observe application performance. We used the NeMo Agent Toolkit OpenTelemetry (OTEL) collector along with Datadog to capture logs, performance data, and LLM trace information.

general:
  telemetry:
    tracing:
      otelcollector:
        _type: otelcollector
        # Your OTEL collector endpoint
        endpoint: http://0.0.0.0:4318/v1/traces
        project: your_project_name

The OTEL collector integration allows us to view specific traces for individual user sessions, helping us understand both application performance and LLM behavior.

A performance trace dashboard showing a timeline of tasks, including "generate_summary" and "search_msg," with bars indicating their execution times and overlaps.
Figure 7. Datadog flame graph showing timing for a real user session

We were also able to aggregate performance data across traces to understand how the application was performing overall. The following chart shows average latency and highlights user sessions with outlying performance.

A dashboard screenshot showing high latency span analysis, with graphs and data tables breaking down unusual and typical latency by resource name, span kind, input mime type, and input value.
Figure 8. Datadog latency analysis showing p95 times and outliers for individually tracked functions.

Conclusion

By using the NeMo Agent Toolkit in conjunction with a variety of AI factory reference partners, we were able to deploy an internal version of the AI-Q NVIDIA Blueprint and build a research agent with confidence. 

Learn more about building with NeMo Agent Toolkit or try out the AI-Q NVIDIA research agent blueprint for yourself.
