Enabling Horizontal Autoscaling of Enterprise RAG Components on Kubernetes

Today's best AI agents rely on retrieval-augmented generation (RAG) to produce more accurate results. A RAG system uses a knowledge base to augment the context provided to large language models (LLMs). A typical design pattern includes a RAG server that accepts prompt queries, consults a vector database for the nearest context vectors, and then redirects the query, with the appended context, to an LLM service for an appropriate response.

The NVIDIA RAG Blueprint is a reference example that accelerates RAG deployment in enterprise environments. It provides modular components for ingestion, vectorization, retrieval, and generation, along with configurable options such as metadata filtering, query rewriting, reranking, and multimodal embedding. It provides an easy Docker deployment option as well as a production-ready Kubernetes deployment. 

RAG workloads can be unpredictable and bursty—demand can spike suddenly during peak usage periods (for example, morning news cycles, viral events, or end-of-quarter reporting) and drop to near-zero during off-peak hours. Without autoscaling, organizations face a painful tradeoff: either overprovision compute resources to handle worst-case scenarios (wasting capital on idle GPUs 80% of the time) or underprovision and risk service degradation, timeouts, and lost requests resulting in poor user experience during demand surges. 

This post walks you through how to autoscale key NVIDIA NIM microservices in the RAG Blueprint for one key RAG use case: a customer service chatbot (ISL/OSL 256/256) with stringent performance load and latency requirements. It explains how to use the Kubernetes Horizontal Pod Autoscaler (HPA) with the following NVIDIA NIM microservices: Nemotron LLM, Nemotron Rerank, and Nemotron Embed, using both available and custom metrics. This enables you to autoscale NIM microservices within limits and service level agreements (SLAs) defined by application needs.

How does a RAG system work?

In a Kubernetes production environment, it’s important to understand the compute and memory profile along with the performance profile of the end-to-end RAG system and its individual components. Metrics such as latency and throughput are fundamental for service scaling and cluster resource planning.

Each use case (defined by its ISL/OSL) has different performance load, concurrency, and latency SLA requirements, typically expressed as time to first token (TTFT) and end-to-end request latency (e2e_latency).

Some performance and latency SLA requirements for example RAG use cases are outlined below. Note that for this post, we’ve assumed a few typical ISL/OSL values for three representative use cases, but focus only on the first—a customer service chatbot.

  • A customer service chatbot (ISL/OSL 256/256): Requires scaling concurrency from 100 to 300 concurrent requests (CR) and fast response because it has a direct impact on customer experience. This requires TTFT under 2 seconds (TTFT <2s) and end-to-end response time under 20 seconds (e2e_latency <20s).
  • An email summarization service (ISL/OSL 2048/1024): May require a lower concurrency (CR=100) and a reasonable response SLA (TTFT <10s, e2e_latency <40s) that is not as strict.
  • A research agent (ISL/OSL 512/4096): May require much lower concurrency (CR=25) and have a more forgiving overall SLA (e2e_latency <120s) because it is anticipated that research across data sources and report creation is by nature a more asynchronous task. 

In a RAG system, there are three distinct phases: ingestion, retrieval, and answer generation. Processing a user query broadly consists of retrieving results, reranking them, and using them with an LLM to generate an answer. Observing the latency metrics of the RAG pipeline shows that the LLM NIM is the largest contributor to service latency, followed by the reranking NIM. The LLM NIM needs to be scaled out whenever the load (measured as the number of concurrent requests or queue depth) increases enough that the latency SLA for the use case is exceeded (TTFT >2s, for example).

The reranking NIM needs to be scaled out when the retrieval reranking load is high (hundreds of concurrent requests) and GPU utilization is high (>75%). The embedding NIM needs to be scaled out when the embedding request load is extremely high (thousands of concurrent requests) and GPU utilization is high (>75%).
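These scale-out conditions can be expressed directly as Prometheus queries. As a preview sketch (the metric names, the rag namespace, and the service names match the deployment used later in this post), the LLM latency trigger and the reranking GPU trigger look like the following:

histogram_quantile(0.90, sum(rate(time_to_first_token_seconds_bucket{namespace="rag", service="rag-nim-llm"}[1m])) by (le)) > 2

avg(gpu_utilization{namespace="rag", service="nim-reranking"}) > 0.75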

For the RAG ingestion pipeline, the embedding NIM and vector DB indexing can become latency bottlenecks. To reduce the vector DB latency overhead, the RAG Blueprint uses the Milvus vector DB with GPU-accelerated indexing (GPU CAGRA with NVIDIA cuVS). However, during high ingestion load, when embedding thousands of data chunks at a time, the embedding NIM must be scaled out.

How to autoscale the RAG retrieval pipeline

The following tutorial focuses on autoscaling the RAG retrieval pipeline for the most stringent customer service chatbot (ISL/OSL 256/256) use case. It requires scaling to concurrency CR=300 with first response latency TTFT <2s. We will use the Kubernetes Horizontal Pod Autoscaler (HPA) to scale out the LLM NIM and the GenAI-Perf tool for load generation. 

For autoscaling the LLM NIM, we will use the available metrics for concurrency and KV cache and create a custom latency metric for the 90th percentile of TTFT (TTFT p90). In addition, we will review autoscaling options for the reranking NIM and embedding NIM when the request load is high enough and GPU utilization spikes.

Prerequisites

The prerequisites for this tutorial include: 

Step 1: Deploy the NVIDIA RAG Blueprint

In our testing, we deployed and used the NVIDIA RAG Blueprint. This blueprint is an open source reference for a foundational RAG pipeline.

To deploy the blueprint, follow the Quick Start instructions that include a Helm deploy option.  For more information about what is included in this blueprint and for a high-level architecture description, see the NVIDIA-AI-Blueprints/rag GitHub repo. At a high level, this blueprint consists of two types of containers: 

  • NVIDIA NIM microservices, for model hosting 
  • “Glue code” containers, which provide the logic that integrates these NIM microservices

To perform RAG retrieval, it's necessary to create a Milvus vector DB collection and ingest data. Use the RAG UI portal provided by the NVIDIA RAG Blueprint: run the port-forward command below and point your browser to localhost:3000.

From the UI, create the collection multimodal_data and upload your documents.

kubectl -n <namespace> port-forward svc/<service-name> <local-port>:<service-port>
kubectl port-forward -n rag svc/rag-frontend 3000:3000

Step 2: Enable observability metrics

After deployment, note the service name and namespace of the deployed NIM for LLM microservice, as this will be used in many commands.

The LLM NIM exposes many useful observability metrics at its Prometheus endpoint (service port 8000, path /v1/metrics). To see the metrics endpoint, use the following commands to port-forward the service to local port 8080:

kubectl -n <namespace> port-forward svc/<service-name> <local-port>:<service-port>
kubectl -n rag port-forward svc/rag-nim-llm 8080:8000

To explore the LLM NIM metrics, visit http://localhost:8080/v1/metrics. The following key metrics can be used for autoscaling and are referenced later in the post:

  • num_requests_running: Measures the concurrency or concurrent requests (CRs), the number of active requests that the LLM pod is currently servicing. It is a measure of processing load on the LLM service.
  • time_to_first_token: Measures the time taken (in seconds) to receive the first response token from the LLM after sending the user query. It is a measure of latency of the LLM service. TTFT is reported as a histogram (time_to_first_token_seconds_bucket) with buckets at 1 ms, 5 ms, 10 ms, 20 ms, 40 ms, 60 ms, 80 ms, 100 ms, 250 ms, 500 ms, 750 ms, 1 s, 2.5 s, 5 s, 7.5 s, 10 s, and +Inf.
  • gpu_cache_usage_perc: Measures the percentage of KV cache (GPU inference memory) used by the LLM for processing the current requests. It is a measure of processing load on the LLM service. 
# HELP num_requests_running Number of requests currently running on GPU.
# TYPE num_requests_running gauge
num_requests_running{model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0

# HELP gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE gpu_cache_usage_perc gauge
gpu_cache_usage_perc{model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0

# HELP time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE time_to_first_token_seconds histogram
time_to_first_token_seconds_sum{model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 31385.349472999573
time_to_first_token_seconds_bucket{le="0.001",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0
time_to_first_token_seconds_bucket{le="0.005",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0
time_to_first_token_seconds_bucket{le="0.01",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0
time_to_first_token_seconds_bucket{le="0.02",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0
time_to_first_token_seconds_bucket{le="0.04",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0
time_to_first_token_seconds_bucket{le="0.06",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0
time_to_first_token_seconds_bucket{le="0.08",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0
time_to_first_token_seconds_bucket{le="0.1",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 0.0
time_to_first_token_seconds_bucket{le="0.25",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 5681.0
time_to_first_token_seconds_bucket{le="0.5",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 19593.0
time_to_first_token_seconds_bucket{le="0.75",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 26800.0
time_to_first_token_seconds_bucket{le="1.0",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 30598.0
time_to_first_token_seconds_bucket{le="2.5",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 35405.0
time_to_first_token_seconds_bucket{le="5.0",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 36189.0
time_to_first_token_seconds_bucket{le="7.5",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 36567.0
time_to_first_token_seconds_bucket{le="10.0",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 36772.0
time_to_first_token_seconds_bucket{le="+Inf",model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 37027.0
time_to_first_token_seconds_count{model_name="nvidia/llama-3.3-nemotron-super-49b-v1.5"} 37027.0

Similarly, the Nemotron reranking and embedding NIM microservices expose metrics at the Prometheus metrics endpoint (service port 8000, path /v1/metrics). To view the metrics, port-forward each service to local ports 8081 and 8082.

kubectl -n <namespace> port-forward svc/<service-name> <local-port>:<service-port>
kubectl -n rag port-forward svc/nim-reranking 8081:8000
kubectl -n rag port-forward svc/nim-embedding 8082:8000

For reranking NIM metrics, visit http://localhost:8081/v1/metrics. For embedding NIM metrics, visit http://localhost:8082/v1/metrics.

For autoscaling the Nemotron embedding and reranking NIM microservices, this post uses the GPU resource usage metric (gpu_utilization).
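With the port-forwards above active, you can spot-check the GPU metrics from the command line before wiring up autoscaling; the sample output that follows shows these metrics with their HELP and TYPE lines:

curl -s http://localhost:8081/v1/metrics | grep ^gpu_
curl -s http://localhost:8082/v1/metrics | grep ^gpu_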

# HELP gpu_power_usage_watts GPU instantaneous power, in watts
# TYPE gpu_power_usage_watts gauge
gpu_power_usage_watts{device_id="0"} 97367.0
# HELP gpu_power_limit_watts Maximum GPU power limit, in watts
# TYPE gpu_power_limit_watts gauge
gpu_power_limit_watts{device_id="0"} 600000.0
# HELP gpu_total_energy_consumption_joules GPU total energy consumption, in joules
# TYPE gpu_total_energy_consumption_joules gauge
gpu_total_energy_consumption_joules{device_id="0"} 1.69608380193e+08
# HELP gpu_utilization GPU utilization rate (0.0 - 1.0)
# TYPE gpu_utilization gauge
gpu_utilization{device_id="0"} 0.0
# HELP gpu_memory_total_bytes Total GPU memory, in bytes
# TYPE gpu_memory_total_bytes gauge
gpu_memory_total_bytes{device_id="0"} 1.50754820096e+011
# HELP gpu_memory_used_bytes Used GPU memory, in bytes
# TYPE gpu_memory_used_bytes gauge
gpu_memory_used_bytes{device_id="0"} 4.49445888e+09

Step 3: Create a ServiceMonitor for NIM microservices

Create a ServiceMonitor for each NIM microservice so that Prometheus scrapes its metrics and makes them available for the Kubernetes HPA to use for autoscaling. You need the Prometheus stack release label, the NIM service labels, and the service port names.

First, find the Prometheus service release label:

kubectl get svc kube-prometheus-stack-prometheus -n prometheus -o yaml | grep labels -A 10

# Command output 
  labels:
    app: kube-prometheus-stack-prometheus
    release: kube-prometheus-stack # Prometheus stack ‘release’ label to set in ServiceMonitor labels
    self-monitor: "true"
  name: kube-prometheus-stack-prometheus

kubectl get svc kube-prometheus-stack-prometheus -n prometheus -o jsonpath='{.metadata.labels.release}'

# Command output
kube-prometheus-stack

Next, find the LLM NIM service port and service label:

kubectl get svc -n rag -o yaml rag-nim-llm

# command output
apiVersion: v1
kind: Service
metadata:
  name: rag-nim-llm
  namespace: rag 
  labels:
    app: rag-nim-llm # LLM NIM service label to use in ServiceMonitor selector
spec:
  ports:
  - name: service-port # LLM NIM service port name to use in ServiceMonitor endpoints
    port: 8000
    protocol: TCP
    targetPort: 8000
  selector:
    app: rag-nim-llm
  type: ClusterIP

# alternate command to get just the port name
kubectl get svc rag-nim-llm -n rag -o jsonpath='{.spec.ports[*].name}'
service-port

Then create the Service Monitor using the information in the preceding steps:

kubectl apply -f - <<EOF

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: rag-nim-llm
  namespace: rag
  labels:
    release: kube-prometheus-stack # match the prometheus stack ‘release’ label
spec:
  endpoints:
  - port: service-port # The name of the port defined in your NIM Service
    path: /v1/metrics # The path where metrics are exposed; NIMs use /v1/metrics
    interval: 30s # How often Prometheus should scrape metrics
  namespaceSelector:
    matchNames:
    - rag # the namespace where your LLM NIM service resides
  selector:
    matchLabels:
      app: rag-nim-llm # match the LLM NIM service label(s)
EOF

Repeat these steps to create the Service Monitor for the reranking and embedding NIM microservices:

# service monitor for reranking NIM
kubectl apply -f - <<EOF

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-reranking
  namespace: rag
  labels:
    release: kube-prometheus-stack # match the prometheus stack ‘release’ label
spec:
  endpoints:
  - port: service-port # The name of the port defined in your NIM Service
    path: /v1/metrics # The path where metrics are exposed; NIMs use /v1/metrics
    interval: 30s # How often Prometheus should scrape metrics
  namespaceSelector:
    matchNames:
    - rag # the namespace where your reranking NIM service resides
  selector:
    matchLabels:
      app: nim-reranking # match the reranking NIM  service label(s)
EOF

# service monitor for embedding NIM
kubectl apply -f - <<EOF

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-embedding
  namespace: rag
  labels:
    release: kube-prometheus-stack # match the prometheus stack ‘release’ label
spec:
  endpoints:
  - port: service-port # The name of the port defined in your NIM Service
    path: /v1/metrics # The path where metrics are exposed; NIMs use /v1/metrics
    interval: 30s # How often Prometheus should scrape metrics
  namespaceSelector:
    matchNames:
    - rag # the namespace where your embedding NIM service resides
  selector:
    matchLabels:
      app: nim-embedding # match the embedding NIM  service label(s)
EOF

Step 4: Autoscale the LLM NIM with TTFT p90

For this first use case, we will create a custom metric for the LLM NIM to enable autoscaling based on a TTFT latency threshold. For latency-sensitive RAG use cases, TTFT is the most critical metric: it is the time the RAG service takes to process the user query and all the relevant context provided from the vector DB before it can start responding to the query.

For the customer service chatbot use case, we are going to use a 90th percentile TTFT above 2 seconds (TTFT p90 >2s) as the latency SLA that triggers scale-out of the LLM NIM.

Validate the Prometheus query for the histogram metric

To validate a Prometheus query, first port-forward the Prometheus service endpoint (service port 9090, path /query) to localhost.

kubectl port-forward -n <prometheus-namespace> svc/<prometheus-service-name> 9090:9090
kubectl port-forward -n prometheus svc/kube-prometheus-stack-prometheus 9090:9090

Then direct your browser to http://localhost:9090/query, add the metric to query, and click Execute. To graph the metric, select the Graph tab.

Set the Prometheus query for the TTFT histogram metric time_to_first_token_seconds_bucket to show counts for all buckets, or select a specific bucket, for example <=1s: time_to_first_token_seconds_bucket{le="1.0"}.

Figure 1. Prometheus graph for the query time_to_first_token_seconds_bucket{le="1.0"}

You can also use curl to query the Prometheus metric:

curl -s 'http://localhost:9090/api/v1/query'   --data-urlencode 'query=time_to_first_token_seconds_bucket{namespace="rag",service="rag-nim-llm",le="1.0"}' | jq .
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "time_to_first_token_seconds_bucket",
          "endpoint": "service-port",
          "instance": "172.29.36.236:8000",
          "job": "rag-nim-llm-0",
          "le": "1.0",
          "model_name": "nvidia/llama-3.3-nemotron-super-49b-v1.5",
          "namespace": "rag",
          "pod": "rag-nim-llm-0",
          "service": "rag-nim-llm"
        },
        "value": [
          1764017343.481,
          "39502"
        ]
      },
      {
        "metric": {
          "__name__": "time_to_first_token_seconds_bucket",
          "endpoint": "service-port",
          "instance": "172.29.118.179:8000",
          "job": "rag-nim-llm-1,
          "le": "1.0",
          "model_name": "nvidia/llama-3.3-nemotron-super-49b-v1.5",
          "namespace": "rag",
          "pod": "rag-nim-llm-1”,
          "service": "rag-nim-llm
        },
        "value": [
          1764017343.481,
          "24207"
        ]
      }
    ]
  }
}

Update the query to use histogram_quantile to determine the 90th percentile TTFT latency (in seconds) for all requests over the last 1 minute to the rag-nim-llm service in the rag namespace.

histogram_quantile(0.90,  
sum(rate(time_to_first_token_seconds_bucket{namespace="rag", 
service="rag-nim-llm"}[1m])) by (le))

Use the NVIDIA GenAI-Perf tool (detailed script and steps follow) to generate load and perform a baseline RAG retrieval performance test. Set a concurrency load of 100 concurrent requests (CR=100) and run multiple times, then use the preceding Prometheus query to see the TTFT p90 latency graph shown in Figure 2.

Figure 2. Prometheus graph of the histogram_quantile TTFT 90th percentile for requests to rag-nim-llm over the last minute

Execute the following steps to create the custom metric time_to_first_token_p90, add it to the Prometheus Adapter, and then create an HPA resource for the LLM NIM that uses the metric to scale.

To add the custom metric time_to_first_token_p90, you need to update or append the new metric to any existing prometheus-adapter custom metrics.

Back up the existing prometheus-adapter custom metric rules (stored as a ConfigMap) and save a working copy, prometheus-adapter-new-rules.yaml, that you will update.

kubectl get configmap prometheus-adapter -n prometheus -o yaml > prometheus-adapter-rules-backup.yaml

cp prometheus-adapter-rules-backup.yaml prometheus-adapter-new-rules.yaml

Then edit prometheus-adapter-new-rules.yaml and add the new rule to the existing rules or custom metrics.

Add a new rule for custom metric time_to_first_token_p90:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter
  namespace: prometheus
data:
  config.yaml: |
    rules:    
    # TTFT p90 - 90th percentile TTFT metric
    - seriesQuery: 'time_to_first_token_seconds_bucket{namespace="rag"}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          service: {resource: "service"}
      name:
        matches: "^time_to_first_token_seconds_bucket$"
        as: "time_to_first_token_p90"
      metricsQuery: |
        histogram_quantile(0.90, sum(rate(time_to_first_token_seconds_bucket{<<.LabelMatchers>>}[1m])) by (le, namespace, service))

Next, apply prometheus-adapter-new-rules.yaml with the custom metric, restart the Prometheus Adapter, and wait about 30 seconds for the pods to be ready.

kubectl apply -f prometheus-adapter-new-rules.yaml -n prometheus
kubectl rollout restart deployment prometheus-adapter -n prometheus
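Instead of waiting a fixed 30 seconds, you can optionally watch the rollout until the adapter pods are ready:

kubectl rollout status deployment prometheus-adapter -n prometheus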

Verify that this custom metric time_to_first_token_p90 is available for query from the metrics server:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/rag/services/*/time_to_first_token_p90" | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "rag",
        "name": "rag-nim-llm",
        "apiVersion": "/v1"
      },
      "metricName": "time_to_first_token_p90",
      "timestamp": "2025-11-07T14:33:53Z",
      "value": "2228m",
      "selector": null
    }
  ]
}

The value 2228m in the output above is Kubernetes milli-unit notation, that is, a TTFT p90 of about 2.228 seconds. Next, create the LLM NIM HPA resource nim-llm-hpa.yaml to autoscale on that custom metric time_to_first_token_p90:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa
  namespace: rag
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: rag-nim-llm
  minReplicas: 1
  maxReplicas: 6
  metrics:
  - type: Object
    object:
      metric:
        name: time_to_first_token_p90
      describedObject:
        apiVersion: v1
        kind: Service
        name: rag-nim-llm
      target:
        type: Value
        value: "2" # Scale when TTFT p90 > 2 seconds

Apply the LLM NIM HPA resource nim-llm-hpa.yaml:

kubectl apply -f nim-llm-hpa.yaml -n rag

Verify that LLM NIM HPA has been applied:

kubectl get hpa -n rag 
NAME          REFERENCE               TARGETS  MINPODS  MAXPODS  REPLICAS  AGE
nim-llm-hpa  StatefulSet/rag-nim-llm  0/1     1        6        1         32s

Step 5: Generate traffic load using GenAI-Perf

Generate traffic with GenAI-Perf or AI Perf to simulate load on the RAG ecosystem. You can run a pod on your cluster as described in Horizontal Autoscaling of NVIDIA NIM Microservices on Kubernetes.
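As a minimal sketch (the container image and tag here are assumptions; any image with GenAI-Perf installed works, for example the Triton Inference Server SDK image or a Python base image after pip install genai-perf), you can start a client pod in the cluster, copy in the sweep script from the next step, and run it from inside the pod:

# Start a long-running client pod in the rag namespace (image and tag are an assumption)
kubectl run genai-perf-client -n rag --restart=Never \
  --image=nvcr.io/nvidia/tritonserver:24.10-py3-sdk --command -- sleep infinity

# Copy the sweep script into the pod and run it
kubectl cp rag-sweep-hpa.sh rag/genai-perf-client:/workspace/rag-sweep-hpa.sh
kubectl exec -it genai-perf-client -n rag -- bash /workspace/rag-sweep-hpa.sh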

Run the rag-sweep-hpa.sh script shown below, which sweeps through multiple concurrency levels (CR=50, 100, 150, 200, 250, 300) so that the LLM NIM service autoscales from one to six replicas as TTFT p90 rises above 2s (TTFT p90 >2s).

For this use case, each time the concurrency is increased by 50, the TTFT p90 exceeds 2 s, which triggers the HPA and autoscaling of the LLM NIM. Set the number of requests per concurrency, request_multiplier = 15, to provide sufficient time for the LLM NIM to become ready for service before the next scale out is triggered. 

The time taken for the LLM NIM to scale out and become ready depends on the model size (8B versus 49B), whether NIM caching is enabled, and the type of storage used for the NIM cache. Using block storage (for example, Longhorn or Rook Ceph) instead of remote NFS-based file storage can shave several minutes off this time and bring the new LLM NIM pod into service more quickly.

#!/bin/bash

export RAG_SERVICE="rag-server:8081" #rag-server:port
export NIM_MODEL="nvidia/llama-3.3-nemotron-super-49b-v1.5" 
export NIM_MODEL_NAME="llama-3_3-nemotron-super-49b-v1.5" 
export NIM_MODEL_TOKENIZER="nvidia/Llama-3_3-Nemotron-Super-49B-v1" 

export CONCURRENCY_RANGE="50 100 150 200 250 300" #loop through the concurrency range to autoscale nim-llm
export request_multiplier=15 #number of requests per concurrency

#RAG specific parameters sent to rag-server
export ISL="256" # Input Sequence Length (ISL) inputs to sweep over
export OSL="256" # Output Sequence Length (OSL) inputs to sweep over

export COLLECTION="multimodal_data"
export VDB_TOPK=10
export RERANKER_TOPK=4
export OUTPUT_DIR="../results"

for CR in ${CONCURRENCY_RANGE}; do

  total_requests=$((request_multiplier * CR))
  EXPORT_FILE=RAG_CR-${CR}_ISL-${ISL}_OSL-${OSL}-$(date +"%Y-%m-%d-%H_%M_%S").json

  START_TIME=$(date +%s)
  genai-perf profile \
    -m $NIM_MODEL_NAME \
    --endpoint-type chat \
    --streaming -u $RAG_SERVICE \
    --request-count $total_requests \
    --synthetic-input-tokens-mean $ISL \
    --synthetic-input-tokens-stddev 0 \
    --concurrency $CR \
    --output-tokens-mean $OSL \
    --extra-inputs max_tokens:$OSL \
    --extra-inputs min_tokens:$OSL \
    --extra-inputs ignore_eos:true \
    --extra-inputs collection_name:$COLLECTION \
    --extra-inputs enable_reranker:true \
    --extra-inputs enable_citations:false \
    --extra-inputs enable_query_rewriting:false \
    --extra-inputs vdb_top_k:$VDB_TOPK \
    --extra-inputs reranker_top_k:$RERANKER_TOPK \
    --artifact-dir $OUTPUT_DIR \
    --tokenizer $NIM_MODEL_TOKENIZER \
    --profile-export-file $EXPORT_FILE \
    -- -v --max-threads=$CR
  END_TIME=$(date +%s)
  elapsed_time=$((END_TIME - START_TIME))
  
  echo "[$(date +"%Y-%m-%d %H:%M:%S")] Completed: $EXPORT_FILE in $elapsed_time seconds"
done

Figure 3. GenAI-Perf results for Chat ISL/OSL 256/256 and concurrency=100, including time to first token, time to second token, request latency, and more

Step 6: Verify LLM NIM autoscaling

Using the rag-sweep-hpa.sh script, run through multiple traffic generation runs, varying the concurrency: 50, 100, 150, 200, 250, and 300. As the concurrent requests increase in steps of 50, the HPA latency metric TTFT p90 climbs above 2 seconds and the LLM NIM autoscales. Each new LLM NIM replica takes some time (about 2 to 5 minutes) to become ready for service.

kubectl get hpa -n rag -w
NAME         REFERENCE                TARGETS   MINPODS   MAXPODS   REPLICAS  AGE
nim-llm-hpa  StatefulSet/rag-nim-llm  845m/2    1         6         1         1d
nim-llm-hpa  StatefulSet/rag-nim-llm  1455m/2   1         6         1         1d
nim-llm-hpa  StatefulSet/rag-nim-llm  2317m/2   1         6         2         1d >> LLM autoscale out

nim-llm-hpa  StatefulSet/rag-nim-llm  710m/2    1         6         2         1d
nim-llm-hpa  StatefulSet/rag-nim-llm  2151m/2   1         6         3         1d >> LLM autoscale out

Verify that the LLM NIM pods have autoscaled and the new pods are available:

kubectl get pods -n rag | grep rag-nim-llm
NAME                    READY   STATUS    RESTARTS      AGE
rag-nim-llm-0           1/1     Running   0             10m
rag-nim-llm-1           1/1     Running   0             5m
rag-nim-llm-2           1/1     Running   0             3m

Import the Grafana dashboard for the LLM NIM to watch the TTFT p90 and request load across all LLM NIM pods as they scale out when TTFT p90 >2s.

Figure 4. Grafana dashboard for the LLM NIM showing TTFT p90 and request load for the LLM NIM service

HPA also scales down over time. The waiting period before scaling down is dictated by the --horizontal-pod-autoscaler-downscale-stabilization flag, which defaults to 5 minutes. This means that scale-downs occur gradually, smoothing out the impact of rapidly fluctuating metric values. Wait five minutes and check the scale-down.

Step 7: HPA scale-up and scale-down stabilization 

For the LLM NIM HPA, add a behavior section that defines scale-up (for faster response) and scale-down stabilization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa
  namespace: rag
spec:
  behavior:
    scaleUp:
      # Scale up faster response
      stabilizationWindowSeconds: 60
      policies:
      # Allow up to 100% increase (double) every 30 seconds
      - type: Percent
        value: 100
        periodSeconds: 30
      # Add 1 pod at a time
      - type: Pods
        value: 1
        periodSeconds: 30
      selectPolicy: Max
    scaleDown:
      # Stabilization with conservative scale-down
      stabilizationWindowSeconds: 300
      policies:
      # Max 50% decrease every 120 seconds
      - type: Percent
        value: 50
        periodSeconds: 120

Step 8: Autoscale LLM NIM with concurrent requests

Several of the metrics listed earlier can be used to autoscale the LLM NIM, including concurrency (num_requests_running) and KV cache usage (gpu_cache_usage_perc).

This step shows how to scale on concurrency or queue depth, that is, the number of concurrent requests being processed by the LLM, using the LLM NIM metric num_requests_running.

First, verify that the metric num_requests_running exists and is available to the Kubernetes HPA for the LLM NIM service:

kubectl get --raw '/apis/custom.metrics.k8s.io/v1beta1/namespaces/rag/services/*/num_requests_running' | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "rag",
        "name": "rag-nim-llm",
        "apiVersion": "/v1"
      },
      "metricName": "num_requests_running",
      "timestamp": "2025-11-21T20:23:45Z",
      "value": "0",
      "selector": null
    }
  ]
}

Create or update the LLM NIM HPA resource nim-llm-hpa.yaml to use the concurrency metric num_requests_running, then apply it (the apply command follows the manifest). With the following configuration, HPA autoscales the LLM NIM when the average concurrency per LLM pod exceeds 60 requests (average num_requests_running >60 requests per pod).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa
  namespace: rag
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: rag-nim-llm
  minReplicas: 1
  maxReplicas: 6
  metrics:
  - type: Object
    object:
      metric:
        name: num_requests_running
      describedObject:
        apiVersion: v1
        kind: Service
        name: rag-nim-llm
      target:
        type: AverageValue
        averageValue: "60" # scale when average concurrent requests >60 requests/pod

Use GenAI-Perf to generate load at concurrent requests CR=50, 100, 150, 200, 250, and 300 to autoscale the LLM NIM. Monitor the HPA and the LLM NIM pods.

kubectl get hpa -n rag -w
NAME         REFERENCE             TARGETS      MINPODS MAXPODS  REPLICAS AGE
nim-llm-hpa  StatefulSet/nim-llm  48/60 (avg)   1       6        1        1d
nim-llm-hpa  StatefulSet/nim-llm  77/60 (avg)   1       6        2        1d >> LLM autoscale out

Use the Prometheus UI at http://localhost:9090/query to graph num_requests_running and see the autoscaling behavior (Figure 5).

At CR=150, the requests are split between two LLM pods (for example, 85 and 65, for an average num_requests_running of 75 requests/pod). Because this exceeds the HPA target of 60 requests per pod, a third LLM pod is scaled out. The 150 concurrent requests are then split across three pods (for example, 59, 47, and 42), bringing the average num_requests_running to about 50, below the HPA target of 60.
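This behavior follows the standard Kubernetes HPA scaling formula, desiredReplicas = ceil(currentReplicas * currentMetricValue / targetValue). As a quick check of the numbers above (using integer ceiling arithmetic in the shell):

# 2 pods, average 75 requests/pod, target 60 -> scale to 3 replicas
echo $(( (2 * 75 + 59) / 60 ))   # prints 3
# 3 pods, average ~50 requests/pod, target 60 -> stays at 3 replicas
echo $(( (3 * 50 + 59) / 60 ))   # prints 3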

Figure 5. Prometheus graph for the query num_requests_running with autoscaling at concurrent requests CR=150

Step 9: Autoscale reranking and embedding using GPU utilization

For the RAG retrieval pipeline, autoscale the Nemotron reranking and embedding NIM microservices based on the GPU usage metric gpu_utilization, which measures the fraction of the GPU used to process reranking and embedding requests.

First, verify that the metric gpu_utilization exists and is available to the Kubernetes HPA for both the embedding and reranking NIM services:

kubectl get --raw '/apis/custom.metrics.k8s.io/v1beta1/namespaces/rag/services/*/gpu_utilization' | jq .
{
  "kind": "MetricValueList",
  "apiVersion": "custom.metrics.k8s.io/v1beta1",
  "metadata": {},
  "items": [
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "rag",
        "name": "nim-embedding",
        "apiVersion": "/v1"
      },
      "metricName": "gpu_utilization",
      "timestamp": "2025-11-24T03:01:21Z",
      "value": "0",
      "selector": null
    },
    {
      "describedObject": {
        "kind": "Service",
        "namespace": "rag",
        "name": "nim-reranking",
        "apiVersion": "/v1"
      },
      "metricName": "gpu_utilization",
      "timestamp": "2025-11-24T03:01:21Z",
      "value": "0",
      "selector": null
    }
  ]
}

Create an HPA resource nim-reranking-embedding-hpa.yaml for the reranking and embedding NIM microservices using the GPU usage metric gpu_utilization, then apply it (the apply command follows the manifest). With the following configuration, HPA autoscales the reranking and embedding NIM microservices when the average GPU usage per pod exceeds 75%.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-reranking-hpa
  namespace: rag
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nim-reranking
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: gpu_utilization
      describedObject:
        apiVersion: v1
        kind: Service
        name: nim-reranking
      target:
        type: AverageValue
        averageValue: "0.75" # scale when average GPU usage >75% per pod
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-embedding-hpa
  namespace: rag
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nim-embedding
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: gpu_utilization
      describedObject:
        apiVersion: v1
        kind: Service
        name: nim-embedding
      target:
        type: AverageValue
        averageValue: "0.75" # scale when average GPU usage >75% per pod

Use GenAI-Perf to generate load at concurrent requests CR=100, 150, 200, 250, and 300 to autoscale the reranking NIM. When the concurrency load becomes high at CR=200, the GPU utilization on the reranking NIM exceeds 75%. This scales out a second reranking NIM pod along with a fourth LLM NIM pod.

kubectl get hpa -n rag -w
NAME        REFERENCE            TARGETS        MINPODS MAXPODS REPLICAS AGE
nim-embed   Deployment/nim-embed 30m/750m (avg) 1       3       1        1d
nim-llm-hpa StatefulSet/nim-llm  69/60 (avg)    1       6       4        1d >> LLM NIM autoscale out
nim-rerank  Deployment/nim-rerank 800m/750m (avg) 1     3       2        1d >> reranking NIM autoscale out

kubectl get pods -n rag 
NAME                              READY STATUS            RESTARTS     AGE
nim-embedding-6dd94bbcdb-p8blw    1/1   Running            0           1d
rag-nim-llm-3                     1/1   Running            0           2m10s           
rag-nim-llm-0                     1/1   Running            0           1d     
rag-nim-llm-2                     1/1   Running            0           6m36s  
rag-nim-llm-1                     1/1   Running            0           4m27s     
nim-reranking-d45d9997f-v2rsm     1/1   Running            0           1d    
nim-reranking-d45d9997f-xhd5f     0/1   ContainerCreating  0           33s 

Use the Prometheus UI at http://localhost:9090/query to graph gpu_utilization and see the autoscaling of the reranking NIM (Figure 6). At CR=200, the GPU usage metric gpu_utilization exceeds 75%, which triggers autoscaling of the second reranking NIM pod.

Figure 6. Prometheus graph showing autoscaling of the reranking NIM with GPU usage metric gpu_utilization >75%

Figure 7 shows a Grafana dashboard of GPU utilization per pod as the concurrency load increases over time (CR=100, 150, 200, 250, and 300), causing the LLM NIM to autoscale from one to four pods and the reranking NIM from one to two pods.

Figure 7. Grafana dashboard for GPU utilization per pod as concurrency is increased

Get started with horizontal autoscaling on Kubernetes

This post has walked you through how to autoscale key NVIDIA NIM microservices in the RAG pipeline for one key RAG use case, the customer service chatbot (ISL/OSL 256/256) with stringent performance load and latency requirements. 

We used the Kubernetes HPA to autoscale the main bottleneck, the LLM NIM, using available metrics and custom metrics, including the following: 

  • Concurrency: Queue depth of concurrent requests (num_requests_running) 
  • KV Cache: Request processing cache memory (gpu_cache_usage_perc) 
  • TTFT custom metric: 90th percentile of TTFT (time_to_first_token_p90) 

We also demonstrated how to use HPA to autoscale the reranking NIM and embedding NIM along with the LLM NIM using the GPU resource usage metric (gpu_utilization).

Using autoscaling with these metrics, we were able to scale the RAG customer service chatbot to higher concurrencies (CR=300) while maintaining the latency SLAs (TTFT p90 ≤ 2s).

To get started deploying RAG and learn more about solutions based on NVIDIA blueprints for RAG, check out these related resources:
