NVIDIA NIM microservices are model inference containers that can be deployed on Kubernetes. In a production environment, it’s important to understand the compute and memory profile of these microservices to set up a successful autoscaling plan.
In this post, we describe how to set up and use Kubernetes Horizontal Pod Autoscaling (HPA) with an NVIDIA NIM for LLMs model to automatically scale up and down microservices based on specific custom metrics.
Prerequisites
To follow along with this tutorial, you need the following list of prerequisites:
- An NVIDIA AI Enterprise license
- NVIDIA NIM for LLMs is available for self-hosting under the NVIDIA AI Enterprise license. Deploying NIM for LLMs in your cluster requires generating an NGC API key so that the Kubernetes cluster can download the container image.
- A Kubernetes cluster version 1.29 or later (we used DGX Cloud Clusters)
- Admin access to the Kubernetes cluster
- Kubernetes CLI tool kubectl installed
- HELM CLI installed
Setting up a Kubernetes cluster
The first step in this tutorial is to set up your Kubernetes cluster with the appropriate components to enable metric scraping and make those metrics available to the Kubernetes HPA service. This requires the following components:
- Kubernetes Metrics Server
- Prometheus
- Prometheus Adapter
- Grafana
Kubernetes Metrics Server
Metrics Server is responsible for scraping resource metrics from kubelets and exposing them in the Kubernetes API server through the Metrics API, where they are used by both the Horizontal Pod Autoscaler and the kubectl top command.
To install the Kubernetes Metrics Server, use Helm:
helm repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/
helm upgrade --install metrics-server metrics-server/metrics-server
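Once the metrics-server pod is ready, a quick sanity check is to query node and pod resource usage from the Metrics API (output varies by cluster):
kubectl top nodes
kubectl top pods -A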
Prometheus and Grafana
Prometheus and Grafana are well-known tools for scraping metrics from pods and creating dashboards. To install Prometheus and Grafana, use the kube-prometheus-stack Helm chart, which bundles many different components.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install [RELEASE_NAME] prometheus-community/kube-prometheus-stack
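The chart deploys Prometheus, Grafana, and several exporters and operators. To confirm that everything came up, you can list the pods for the release; most of the stack's pods carry the release label (substitute your release name and namespace):
kubectl get pods -n <namespace> -l "release=[RELEASE_NAME]"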
Prometheus Adapter
The Prometheus adapter exposes the metrics scraped by Prometheus in the Kubernetes API server through the custom Metrics API. This enables HPA to use custom metrics from pods when making scaling decisions.
To install the Prometheus adapter in the same namespace as Prometheus and Grafana, use the following Helm commands:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install <name> prometheus-community/prometheus-adapter -n <namespace>
Make sure that the Prometheus adapter is pointing to the correct Prometheus service endpoint. In this case, I had to edit the deployment and correct the URL.
kubectl edit deployment prom-adapter-prometheus-adapter -n prometheus
spec:
  affinity: {}
  containers:
  - args:
    - /adapter
    - --secure-port=6443
    - --cert-dir=/tmp/cert
    - --prometheus-url=http://prometheus-prometheus.prometheus.svc:9090
    - --metrics-relist-interval=1m
    - --v=4
    - --config=/etc/adapter/config.yaml
    image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.12.0
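Depending on the chart version, the adapter's default rules may already pick up this gauge. If not, a rule along the following lines can be added to the configmap mounted at /etc/adapter/config.yaml so that gpu_cache_usage_perc is served as a per-pod custom metric. This is only a sketch; the exact configmap name and rule set depend on your Helm release:
rules:
# expose the NIM KV cache gauge as a per-pod custom metric
- seriesQuery: 'gpu_cache_usage_perc{namespace!="",pod!=""}'
  resources:
    # map the namespace and pod labels to Kubernetes resources
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "gpu_cache_usage_perc"
    as: "gpu_cache_usage_perc"   # keep the original metric name
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'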
If everything is set up correctly, you should be able to query the custom metrics coming from Prometheus with the following command:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/*/gpu_cache_usage_perc?selector=app%3Dmeta-llama3-8b"
{"kind":"MetricValueList","apiVersion":"custom.metrics.k8s.io/v1beta1","metadata":{},"items":[{"describedObject":{"kind":"Pod","namespace":"<namespace>","name":"meta-llama3-70b-5db5f7dd89-tvcwl","apiVersion":"/v1"},"metricName":"gpu_cache_usage_perc","timestamp":"2025-01-02T20:13:15Z","value":"1m","selector":null},{"describedObject":{"kind":"Pod","namespace":"<namespace>","name":"meta-llama3-8b-5c6ddbbfb5-dp2mv","apiVersion":"/v1"},"metricName":"gpu_cache_usage_perc","timestamp":"2025-01-02T20:13:15Z","value":"14m","selector":null}]}
Deploying a NIM microservice
In this tutorial, you use NIM for LLMs as the microservice to scale, specifically the meta/llama-3.1-8b-instruct model. There are multiple options for deploying a NIM microservice, both of which need the NGC credentials sketched after this list:
- Using Helm
- Using the NIM Operator
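Whichever option you choose, the cluster needs your NGC API key to pull the NIM container image from nvcr.io. A minimal sketch of creating the required secrets follows; the secret names (ngc-secret, ngc-api) are illustrative and must match whatever your Helm values or NIM Operator resources reference:
export NGC_API_KEY=<ngc-api-key>
# Image pull secret for nvcr.io (username is the literal string $oauthtoken)
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password=$NGC_API_KEY \
  -n <namespace>
# Generic secret so the NIM pod can download model artifacts from NGC
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=$NGC_API_KEY -n <namespace>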
After deployment, you should note the service name and namespace of your NIM for LLMs microservice, as this will be used in many commands.
NIM for LLMs already exposes a Prometheus endpoint with many interesting metrics. To see the endpoint, use the following commands:
kubectl -n <namespace> port-forward svc/<service-name> 8080
From a browser, go to localhost:8080/metrics and look for the metric named gpu_cache_usage_perc. In this post, you use this metric as the basis for autoscaling. It shows the percent utilization of the KV cache and is reported by the vLLM stack.
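While the port-forward from the previous command is still running, you can also spot the metric from the terminal (assuming curl is available in your environment):
curl -s localhost:8080/metrics | grep gpu_cache_usage_perc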
You will use the NIM for LLMs Grafana dashboard to observe these metrics. Download the JSON dashboard and upload it to your Grafana instance. To log in to the dashboard, see the Grafana access instructions.
After loading the NIM for LLMs dashboard, you should see a dashboard similar to Figure 1. (I had both the 70B and 8B models deployed for that dashboard, hence the two sets of KV cache numbers.)

Now that you have your observability stack and microservice deployed, you can start generating traffic and observing metrics in the Grafana dashboard. The tool to use for generating traffic is genai-perf.
To run this tool from a pod on your cluster, follow these steps, making sure to create the pod in the same namespace as your NIM for LLMs microservice.
Create a pod from the NVIDIA Triton Inference Server SDK container, which includes genai-perf:
kubectl run <pod-name> --image=nvcr.io/nvidia/tritonserver:24.10-py3-sdk -n <namespace> --command -- sleep 100000
Log in to the pod, where you can run the genai-perf CLI:
kubectl exec -n <namespace> --stdin --tty <pod-name> -- /bin/bash
genai-perf --help
Sending traffic to the meta/llama-3.1-8b-instruct model requires genai-perf to download the appropriate tokenizer from Hugging Face. Get an API token from Hugging Face and log in:
pip install --upgrade huggingface_hub[cli]
export HF_TOKEN=<hf-token>
huggingface-cli login --token $HF_TOKEN
Set up the correct environment variables and generate traffic. For more information about the different parameters, see the genai-perf documentation. The model name and service name must be accurate and reflect your setup.
export INPUT_SEQUENCE_LENGTH=200
export INPUT_SEQUENCE_STD=10
export OUTPUT_SEQUENCE_LENGTH=200
export CONCURRENCY=10
export MODEL=meta/llama-3.1-8b-instruct
genai-perf profile \
-m $MODEL \
--endpoint-type chat \
--service-kind openai \
--streaming -u meta-llama3-8b:8080 \
--synthetic-input-tokens-mean $INPUT_SEQUENCE_LENGTH \
--synthetic-input-tokens-stddev $INPUT_SEQUENCE_STD \
--concurrency $CONCURRENCY \
--output-tokens-mean $OUTPUT_SEQUENCE_LENGTH \
--extra-inputs max_tokens:$OUTPUT_SEQUENCE_LENGTH \
--extra-inputs min_tokens:$OUTPUT_SEQUENCE_LENGTH \
--extra-inputs ignore_eos:true \
--tokenizer meta-llama/Meta-Llama-3-8B-Instruct \
-- -v --max-threads=256
For this post, I created multiple traffic generation runs by varying the concurrency number: 100, 200, 300, and 400. From the Grafana dashboard, you can see the KV cache utilization percentage (Figure 2). The KV cache percent utilization is increasing with each concurrency trial, from 9.40% at 100 concurrency all the way to 40.9% at 400 concurrency. You can also change other relevant parameters, such as input and output sequence length, and observe the impact on KV cache utilization.

Now that you’ve observed the impact of concurrency on KV cache utilization, you can create the HPA resource to scale on the gpu_cache_usage_perc metric. Save the following definition as hpa-gpu-cache.yaml; the target averageValue of 100m is 0.1 in Kubernetes quantity notation, which corresponds to roughly 10% KV cache utilization:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gpu-hpa-cache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: meta-llama3-8b
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_cache_usage_perc
      target:
        type: AverageValue
        averageValue: 100m
kubectl create -f hpa-gpu-cache.yaml -n <namespace>
kubectl get hpa -n <namespace> -w
NAME            REFERENCE                   TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-cache   Deployment/meta-llama3-8b   9m/100m     1         10        1          3m37s
Run genai-perf at different concurrencies (10, 100, 200) and watch the HPA metric increase:
kubectl get hpa -n <namespace> -w
NAME            REFERENCE                   TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
gpu-hpa-cache   Deployment/meta-llama3-8b   9m/100m     1         10        1          3m37s
gpu-hpa-cache   Deployment/meta-llama3-8b   8m/100m     1         10        1          4m16s
gpu-hpa-cache   Deployment/meta-llama3-8b   1m/100m     1         10        1          4m46s
gpu-hpa-cache   Deployment/meta-llama3-8b   33m/100m    1         10        1          5m16s
gpu-hpa-cache   Deployment/meta-llama3-8b   56m/100m    1         10        1          5m46s
gpu-hpa-cache   Deployment/meta-llama3-8b   39m/100m    1         10        1          6m16s
gpu-hpa-cache   Deployment/meta-llama3-8b   208m/100m   1         10        1          6m46s
gpu-hpa-cache   Deployment/meta-llama3-8b   208m/100m   1         10        3          7m1s
gpu-hpa-cache   Deployment/meta-llama3-8b   293m/100m   1         10        3          7m16s
gpu-hpa-cache   Deployment/meta-llama3-8b   7m/100m     1         10        3          7m46s
Check the number of pods and you should see that autoscaling added two new pods:
kubectl get pods -n <namespace>
NAME                              READY   STATUS    RESTARTS   AGE
meta-llama3-8b-5c6ddbbfb5-85p6c   1/1     Running   0          25s
meta-llama3-8b-5c6ddbbfb5-dp2mv   1/1     Running   0          146m
meta-llama3-8b-5c6ddbbfb5-sf85v   1/1     Running   0          26s
HPA also scales down. The period to wait before scaling down is dictated by the --horizontal-pod-autoscaler-downscale-stabilization flag, which defaults to 5 minutes. This means that scale-downs occur gradually, smoothing out the impact of rapidly fluctuating metric values. Wait 5 minutes and check the scale-down:
kubectl get pods -n <namespace>
NAME                              READY   STATUS    RESTARTS   AGE
meta-llama3-8b-5c6ddbbfb5-dp2mv   1/1     Running   0          154m
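The stabilization flag mentioned above is set on the kube-controller-manager, which you may not be able to modify on a managed cluster. As an alternative, the autoscaling/v2 API lets you tune scale-down behavior per HPA resource through the behavior field. A minimal sketch that keeps the default 5-minute window and removes at most one pod per minute (the values are illustrative):
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 minutes of consistently low metrics
      policies:
      - type: Pods
        value: 1            # remove at most one pod
        periodSeconds: 60   # per 60-second window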
Conclusion
In this post, I described how to set up your Kubernetes cluster to scale on custom metrics and showed how to scale a NIM for LLMs microservice based on KV cache utilization.
There are many advanced areas to explore further in this topic. For example, many other metrics could be considered for scaling, such as request latency, request throughput, and GPU compute utilization. You can also combine multiple metrics in a single HPA resource, as sketched below.
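When an HPA lists several pod metrics, it computes a desired replica count for each metric and uses the largest. A sketch of a metrics section combining KV cache utilization with a second NIM metric; num_requests_running is used here only as an illustration, so verify the exact metric names on your /metrics endpoint:
metrics:
- type: Pods
  pods:
    metric:
      name: gpu_cache_usage_perc
    target:
      type: AverageValue
      averageValue: 100m
- type: Pods
  pods:
    metric:
      name: num_requests_running   # illustrative second metric
    target:
      type: AverageValue
      averageValue: "10"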
Another area of interest is creating new metrics with Prometheus Query Language (PromQL) and adding them to the Prometheus adapter configmap so that HPA can scale on them.