Data Center / Cloud

Smart Multi-Node Scheduling for Fast and Efficient LLM Inference with NVIDIA Run:ai and NVIDIA Dynamo

The exponential growth in large language model complexity has created challenges, such as models too large for single GPUs, workloads that demand high throughput and low latency, and infrastructure that must coordinate thousands of interconnected components seamlessly. The NVIDIA Run:ai v2.23 release addresses these challenges through an integration with NVIDIA Dynamo—a high-throughput, low-latency inference framework designed for serving generative AI models across distributed environments.

In this blog, we’ll cover:

  • The scaling problem: why today’s workloads require multi-node inference with multiple components, and the coordination challenges that come with it.
  • How Dynamo accelerates inference, why scheduling matters, and the role of orchestration in making workloads efficient at scale.
  • How the NVIDIA Run:ai v2.23 Dynamo integration provides gang scheduling and topology-aware placement for predictable, low-latency deployments.
  • How to get started, with a step-by-step guide for setting up network topology and deploying Dynamo on NVIDIA Run:ai with these capabilities enabled.

The scaling problem

As model parameters and the number of distributed components (e.g., prefill and decode workers, router, etc.) increase, their memory requirements and computational demands grow significantly. This forces us to distribute model layers and the KV cache across multiple GPUs, and increasingly, across multiple nodes. While techniques like tensor parallelism solve the capacity challenge, they introduce a coordination challenge: how do you make dozens of distributed components work together as seamlessly as a single accelerator? The answer lies in advanced inference frameworks that can manage this complexity transparently.

How Dynamo accelerates inference

NVIDIA Dynamo was purpose-built to tackle distributed inference challenges through features including:

  • Disaggregated prefill and decode inference that maximizes GPU throughput and enables trade-offs between latency and throughput.
  • Dynamic GPU scheduling that adapts to fluctuating demand.
  • LLM-aware request routing to prevent unnecessary KV cache re-computation.
  • Accelerated data transfer that uses NVIDIA Inference Xfer Library (NIXL) to reduce inference response times.
  • KV cache offloading that uses multiple memory hierarchies for higher throughput.

These capabilities ensure that even the largest models can run efficiently across distributed GPU clusters, but only if the underlying orchestration doesn’t get in the way.

Why scheduling matters: running Dynamo workloads efficiently at scale

Running multi-node inference in a cluster brings its own challenges. Dynamo workloads involve tightly coupled components, such as routers and prefill and decode workers. Scheduling these components independently can lead to partial deployments, for example, decode pods running while prefill pods remain pending, leaving GPUs idle.

Even with all components active, poor placement hurts performance. Leaders and workers spread across distant nodes add latency and reduce throughput due to cross-rack communication and bandwidth bottlenecks. Addressing these orchestration issues is essential to complement Dynamo’s runtime efficiency within the cluster. This is where NVIDIA Run:ai’s advanced scheduling capabilities come in.

NVIDIA Run:ai meets Dynamo

Addressing orchestration challenges requires more than just starting pods. It requires starting the right pods together and placing them in the right locations. This is exactly what NVIDIA Run:ai brings to Dynamo with two key capabilities: gang scheduling to launch components atomically, and topology-aware placement to co-locate them for low-latency communication.

Gang scheduling: all-or-nothing deployment

Dynamo workloads now use NVIDIA Run:ai’s gang scheduling capabilities, treating different groups of interdependent pods as a single deployment unit. This atomic scheduling approach ensures that either all required components (prefill workers and leaders, decode workers and leaders) can be placed simultaneously, or the deployment waits until sufficient resources are available.

Eliminating partial deployments removes resource fragmentation, so cluster utilization rises naturally: partially deployed workloads no longer consume cluster resources while waiting indefinitely for missing components. Cold-start lag is also reduced, because entire workloads launch atomically when resources become available rather than spinning up incrementally, shortening time-to-service.

The result is predictable, efficient placement for multi-node inference workloads with no additional configuration requirements; the scheduler manages this coordination automatically.
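
Although no extra configuration is needed, you can optionally observe the gang as a single scheduling unit if your installation exposes the scheduler’s PodGroup resources. This is an illustrative check only; the exact resource name and API group depend on your NVIDIA Run:ai/KAI Scheduler version, and <podgroup-name> is a placeholder:

# Optional: list the scheduler's pod groups in the project namespace
# (resource name and API group may vary by NVIDIA Run:ai / KAI Scheduler version)
kubectl -n runai-project-a get podgroups

# Inspect one pod group to see how many pods must be schedulable together
kubectl -n runai-project-a get podgroup <podgroup-name> -o yaml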

Topology-aware scheduling: reducing latency

The integration includes topology-aware scheduling that is particularly valuable for multi-node deployments. Administrators can define a cluster’s topology, enabling the scheduler to make strategic component placement decisions. Interdependent components (such as prefill and decode roles) are positioned to minimize cross-node latency while maximizing the utilization of high-speed interconnects.

This topology awareness becomes critical at scale for multi-node deployments, where network communication can easily become the bottleneck. The result is improved communication throughput between components and reduced network overhead, for lower latency and enhanced performance for large-scale distributed workloads.

How to get started with NVIDIA Run:ai v2.23 together with Dynamo 

Ensure you have the following before continuing:

  • A Kubernetes cluster with NVIDIA Run:ai v2.23 installed and a project named runai-project-a initialized (see the documentation).
  • Access to the cluster’s kubeconfig file.
  • Helm installed.
  • A Hugging Face access token for pulling models, stored as a Kubernetes secret:
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN='<huggingface_token>' \
  -n runai-project-a

Note: Replace <huggingface_token> with your actual Hugging Face token. Keep this token secure and never commit it to source control.
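
If you want to confirm the secret was created in the right namespace before moving on, a quick optional check is:

# Verify the secret exists in the project namespace (the token value itself is not printed)
kubectl -n runai-project-a get secret hf-token-secret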

Setting up network topology

To co-locate tightly coupled Dynamo components and cut cross-node latency, configure a network topology in NVIDIA Run:ai that represents your cluster’s physical layout. Start by ensuring your Kubernetes nodes are labeled with proximity indicators such as topology.kubernetes.io/region: us-west and topology.kubernetes.io/zone: us-west-1a, as in the example below.
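
Many managed Kubernetes offerings apply these well-known labels automatically; on clusters where they are missing, you can add them yourself. The node names and label values below are placeholders, so substitute your own:

# Label nodes with their physical location (example node names and values)
kubectl label node node-a topology.kubernetes.io/region=us-west topology.kubernetes.io/zone=us-west-1a
kubectl label node node-b topology.kubernetes.io/region=us-west topology.kubernetes.io/zone=us-west-1b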

Next, specify in NVIDIA Run:ai which label keys define proximity. In the NVIDIA Run:ai user interface, open the cluster’s settings and add the label keys you use (for example, topology.kubernetes.io/zone, topology.kubernetes.io/region). 

Create a topology by ordering these keys from closest to farthest. Make sure the label values you use in the network topology setup (e.g., us-west-1a) match what you applied on the nodes exactly: 

Figure 1. Setting up network topology on NVIDIA Run:ai 
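
To double-check that the values entered in the UI match what is actually on the nodes, list the labels with kubectl:

# Show each node's zone and region labels so UI values can be matched exactly
kubectl get nodes -L topology.kubernetes.io/zone -L topology.kubernetes.io/region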

Then, attach this network topology to the relevant node pool(s) from the node pools view. Different pools can carry different topologies if your hardware or network fabrics vary by pool. 

From this point on, scheduling is automatic. NVIDIA Run:ai applies a “preferred” soft constraint at the closest tier first and only relaxes to broader tiers if the cluster can’t place the entire gang together at the initial level. Combined with gang scheduling, this ensures your Dynamo pods either land together on the best-available nodes (for example, nodes in the same rack) or wait until they can, eliminating partial, inefficient deployments. For more information, refer to the official documentation.

Figure 2. Attaching a network topology to a node pool on NVIDIA Run:ai

Dynamo in action 

Once the network topology is configured in the NVIDIA Run:ai user interface, Dynamo workloads automatically use gang scheduling and topology-aware scheduling. This ensures tightly coupled components (e.g., decode, router) launch together or wait as a group, while the scheduler co-locates them on the nearest tier (e.g., the same zone or rack) to reduce latency. Users can specify preferred or required placement strategies by annotating their workloads.

Step 1. Set environment variables

# Define the required environment variables

export DYNAMO_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.1
export NAMESPACE=dynamo-cloud
export RELEASE_VERSION=0.4.1

Step 2. Create a Kubernetes namespace

# Create a dedicated namespace for the deployment

kubectl create namespace $NAMESPACE

Step 3. Install the custom resource definitions (CRDs) and platform components

# CRDs

helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds-$RELEASE_VERSION.tgz

helm install dynamo-crds dynamo-crds-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE}

# Platform Components

helm fetch https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-platform-$RELEASE_VERSION.tgz

helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz --namespace ${NAMESPACE} --set dynamo-operator.namespaceRestriction.enabled=false
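
Optionally, confirm that both Helm releases deployed successfully and that the Dynamo custom resource definitions are registered (the grep assumes the CRD names include “dynamo”):

# Both releases should report a "deployed" status
helm list --namespace ${NAMESPACE}

# The Dynamo custom resource definitions should appear in the cluster
kubectl get crd | grep -i dynamo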

Step 4. Verify pod status

# Ensure that all components are running

kubectl -n $NAMESPACE get pods
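
If you prefer to block until the platform is fully up rather than polling, you can wait for the pods to become Ready (the timeout value here is just an example):

# Block until all platform pods report Ready, or fail after 5 minutes
kubectl -n $NAMESPACE wait --for=condition=Ready pods --all --timeout=300s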

Step 5. Deploy the disaggregated vLLM example

Download the example YAML (disagg.yaml) from the Dynamo repository. Set metadata.namespace to runai-project-a and add the following annotations:

metadata:
  namespace: runai-project-a
  annotations:
    kai.scheduler/topology-preferred-placement: "topology.kubernetes.io/zone"
    kai.scheduler/topology: "topology-1"
    # If the pods must share a zone, use the required-placement annotation
    # instead of the preferred one:
    # kai.scheduler/topology-required-placement: "topology.kubernetes.io/zone"

Apply the YAML:

kubectl apply -f disagg.yaml

As pods start, you’ll see the operator, control plane, and all components running, with decode and prefill pods scheduled in the same zone based on topology.

NAME                                             READY   STATUS    RESTARTS   AGE
vllm-disagg-frontend-79f459c95-57fm6             1/1     Running   0          30m
vllm-disagg-vllmdecodeworker-6c8d64f569-56phf    1/1     Running   0          30m
vllm-disagg-vllmprefillworker-755cb88fcf-pflb5   1/1     Running   0          30m
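
To confirm the topology-aware placement, check which nodes the prefill and decode pods landed on, then compare the zone labels of those nodes:

# Show which node each pod was scheduled on
kubectl -n runai-project-a get pods -o wide

# List the zone label of every node to confirm the workers share a zone
kubectl get nodes -L topology.kubernetes.io/zone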

Step 6. Send a request to the deployed model

To test the deployment locally, port-forward the frontend:

kubectl -n runai-project-a port-forward pod/vllm-disagg-frontend-79f459c95-57fm6 8000:8000

Send a sample request using curl:

curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
    ],
    "stream": false,
    "max_tokens": 30
  }'

A successful response returns a JSON with a generated completion:

{"id":"chatcmpl-559682f7-8845-4014-b670-47a5f32f07c6","choices":[{"index":0,"message":{"content":"<think>\nOkay, I need to develop a detailed character background for the explorer in Eldoria. Let me start by understanding the user's query.","role":"assistant","reasoning_content":null},"finish_reason":"stop"}],"created":1758043876,"model":"Qwen/Qwen3-0.6B","object":"chat.completion","usage":{"prompt_tokens":196,"completion_tokens":29,"total_tokens":225}}%

The deployment uses NVIDIA Run:ai’s gang scheduling and topology-aware placement to start pods together, minimize latency, and maximize GPU utilization by avoiding idle resources.

Wrapping up

Large-scale LLM inference succeeds when a high-performance inference framework is paired with a scheduler that knows how to place and start it. NVIDIA Dynamo delivers the former with disaggregated prefill and decode, LLM-aware routing, and efficient KV cache management. NVIDIA Run:ai version 2.23 contributes the latter with gang scheduling and topology-aware placement.

Together, they make multi-node inference predictable and performant: pods launch atomically, components stay close on fast links, and GPUs remain busy. The result is higher throughput, lower latency, and better utilization across Kubernetes clusters that scale reliably and maximize the return on infrastructure investment.

Looking for effective ways to overcome the challenges of scaling AI workloads? Join our upcoming webinar for expert insights and practical solutions.

Get started with NVIDIA Run:ai and Dynamo using the following resources:
