
Enabling Multi-Node NVLink on Kubernetes for NVIDIA GB200 NVL72 and Beyond

The NVIDIA GB200 NVL72 pushes AI infrastructure to new limits, enabling breakthroughs in training large language models and running scalable, low-latency inference workloads. Increasingly, Kubernetes plays a central role in deploying and scaling these workloads efficiently, whether on-premises or in the cloud. However, rapidly evolving AI workloads, infrastructure requirements, and new hardware architectures pose new challenges for Kubernetes orchestration and resource management.

In this post, we introduce a new Kubernetes abstraction called ComputeDomains to hide the complexity involved in ensuring that each worker of a multi-node workload can perform secure GPU-to-GPU memory operations across node boundaries over a multi-node NVLink fabric.

Made available as part of the NVIDIA DRA driver for GPUs, ComputeDomains bridge low‑level GPU constructs (NVIDIA NVLink and NVIDIA IMEX) with modern Kubernetes‑native scheduling concepts (dynamic resource allocation, DRA for short) to provide the foundational support required for running distributed, multi-node workloads on modern GPU hardware. Without ComputeDomains, multi‑node NVLink setups would have to be manually defined and fixed in place, limiting the flexibility Kubernetes is designed to provide and coming at the cost of security isolation, fault isolation, and cost efficiency. 

While this work has been validated on NVIDIA DGX GB200, NVIDIA’s blueprint for GB200 NVL72 systems, ComputeDomains are designed to generalize across any current or future architecture that supports multi‑node NVLink, including future NVL576 systems.

In this post, we focus on the fundamentals: what ComputeDomains are, why they are important, and how you can use them to run your own distributed, multi-node workloads on Kubernetes.

From single-node to multi-node GPU computing

To understand why ComputeDomains are important, it helps to look briefly at how GPU system design has evolved over time.

Earlier generations of NVIDIA DGX systems maximized performance by packing as many GPUs as possible into a single server connected with high-bandwidth NVLink. This design delivered strong intra-node scaling but was limited to workloads that fit within a single system. With the introduction of NVIDIA Multi-Node NVLink (MNNVL), that limitation disappears. GPUs in different servers can now communicate at full NVLink bandwidth through NVIDIA NVLink Switches, transforming an entire rack into a single, unified GPU fabric. This enables seamless performance scaling across nodes and forms the basis for ultra-fast distributed training and inference. 

GPU communication libraries such as NVIDIA NCCL and NVIDIA NVSHMEM have been extended to exploit this fabric, while frameworks such as PyTorch build on top of them for fast cross-node, cross-GPU communication. These libraries automatically detect and use the fastest available fabric (e.g. NVLink, RDMA, InfiniBand, or Ethernet) so distributed applications achieve optimal performance without code changes, regardless of topology.

With ComputeDomains, we provide the recommended way to support multi-node NVLink on Kubernetes. As such, they already serve as the common layer on which several higher-level components of the overall NVIDIA Kubernetes stack are built, including the KAI Scheduler, NVIDIA Dynamo, and NVIDIA DGX Cloud Lepton.

The following figure depicts the NVIDIA GB200 NVL72 rack topology used by DGX GB200 systems. This is just one example of the type of system that ComputeDomains unlock on Kubernetes.

Figure 1. A DGX GB200 system with 10 compute trays on top and eight on the bottom, connected through nine NVLink switches in the middle, creating a fully connected mesh of 72 GPUs via multi-node NVLink (1.8 TB/s chip-to-chip; over 130 TB/s cumulative bandwidth).

So, what goes into supporting multi-node NVLink on Kubernetes, and how do ComputeDomains help? The key ingredient is the NVIDIA Internode Memory Exchange Service (IMEX), software at the GPU-driver level that lets GPUs communicate across nodes. With IMEX, every individual GPU memory export/import operation is subject to fine-grained access control. IMEX operates across a group of nodes known as an IMEX domain.

Refer to the figure below to better understand the relationship between NVLink domains, IMEX domains, and the other levels of GPU partitioning that are possible in a multi-node NVLink environment.

Figure 2. Layers of partitioning available in a multi-node NVLink environment.

ComputeDomains can be thought of as a generalization of IMEX domains. While IMEX domains exist at the driver layer and define which nodes can communicate via NVLink, ComputeDomains generalize this concept and extend it into Kubernetes. They represent a higher‑level concept of connectivity (or reachability) between the distributed workers of a multi‑node workload. The fact that IMEX is used underneath to enable that connectivity is an implementation detail.

In essence, ComputeDomains dynamically create, manage, and tear down IMEX domains as multi‑node workloads are scheduled to nodes and run to completion.

Instead of requiring static, pre‑configured IMEX setups, ComputeDomains respond to scheduling events in real time, automatically forming IMEX domains around the set of nodes where a distributed job lands.

IMEX essentially provides reconfigurable isolation boundaries, and ComputeDomains manage those in a fluid, transparent way. With ComputeDomains, each workload gets its own isolated IMEX domain and shared IMEX channel, ensuring GPU-to-GPU communication between all workers of a job while keeping the job securely isolated from other jobs. A ComputeDomain follows the workload and dynamically adjusts its topology as the workload grows or shrinks. When the workload finishes, its corresponding IMEX domain and channels are automatically cleaned up, freeing up resources for future jobs.

Isolation without compromising on utilization

As indicated above, IMEX primitives are meant to be an implementation detail hidden underneath the ComputeDomain abstraction. With that said, we argue that a robust, battle-tested solution for dynamically forming IMEX domains around a workload is fundamentally needed for three reasons:

  1. Security isolation: In a zero-trust environment, there is a clear need for neighboring GPU jobs to be securely isolated despite being physically NVLink-connected.
  2. Fault isolation: Neighboring jobs, even if trusted, must not step on each other’s toes.
  3. Cost efficiency: Resource utilization must be kept high even with (1) and (2) in place, which is especially relevant in multi-tenant environments.

Security isolation could arguably be achieved with static NVLink partitions, but that would drastically inhibit resource utilization.

In a trusted environment, security isolation may not always be the strongest concern. However, job reliability always is, and as a result, so is fault isolation. An IMEX domain is a stateful distributed system. It is naturally subject to failure scenarios and transient conditions that may lead to a degraded or inconsistent state. Especially at scale, this will happen at a tangible rate. In those situations, the blast radius should be contained to just a single job.

Conceptually, the safest way to maximize fault isolation is to both temporally and spatially tie an individual IMEX domain to just one specific workload—which is what the ComputeDomain implementation ensures under the hood.

Without ComputeDomains, one would have to statically set up long-lived IMEX domains and hence compromise on both (1) and (2). Any home-grown solution for dynamically orchestrating IMEX domains would eventually evolve into something like ComputeDomains and would turn out to be difficult to build. By providing a generic solution, we can save our users from having to go through that effort themselves and centralize lessons learned.

Using ComputeDomains in Kubernetes

ComputeDomains are provided by the NVIDIA DRA driver for GPUs. In the near term, the DRA driver will be shipped with the NVIDIA GPU Operator. For now, it needs to be installed manually, with a Helm chart.

Detailed installation instructions and prerequisites can be found here. Generally, Kubernetes 1.32 or later is required, with the DRA APIs and CDI enabled. Be sure to keep ComputeDomain support enabled when installing the DRA driver (it is the default), and run it in an environment that has NVLink partitions spanning multiple nodes (for example, within a GB200 NVL72 rack or across racks).

The driver is under heavy development. We recommend staying up to date by following our GitHub project; you can read about the latest release (v25.8.0) here.
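As a quick sketch, installing the driver with Helm might look like the following. The repository URL, chart name, namespace, and version shown here are illustrative assumptions; follow the official installation instructions for the exact, current values and configuration options.

# Add NVIDIA's Helm repository and install the DRA driver for GPUs.
# Chart name, namespace, and version are assumptions for illustration;
# consult the installation docs for the authoritative commands and flags.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version 25.8.0 \
  --create-namespace \
  --namespace nvidia-dra-driver-gpu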

Deploying workloads

Let’s walk through an example of creating and using a ComputeDomain. The following Kubernetes specification declares a ComputeDomain with the name compute-domain-0:

apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: compute-domain-0
spec:
  numNodes: 0  # This field is deprecated and should always be set to 0
  channel:
    resourceClaimTemplate:
      name: compute-domain-0-rct

No workload refers to this ComputeDomain yet; at this point, it is merely an API object. A ComputeDomain follows the workload: it forms just in time around workload pods when they are actually scheduled onto nodes.
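As a quick check, assuming the manifest above is saved as compute-domain-0.yaml, the object can be created and listed with standard kubectl commands (the resource name passed to kubectl get is an assumption derived from the CRD kind and API group shown above):

kubectl apply -f compute-domain-0.yaml
kubectl get computedomains.resource.nvidia.com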

Next, let’s specify a workload and put compute-domain-0 to use by referencing it in the workload.

Say we want to run a job distributed across 18 nodes. The goal is to use four GPUs per node and to establish all-to-all NVLink reachability between all 72 GPUs involved.

To that end, in this case, we’re going to run one Kubernetes pod per node. Each pod requests:

  • Four GPUs.
  • To land in the same NVLink partition as all the other pods of this workload (for physical reachability).
  • To join the previously specified ComputeDomain (for logical reachability).

The following Kubernetes deployment specification example achieves all that, with key concepts explained inline:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-mnnvl-workload
spec:
  # Ask for this deployment to consist of 18 pods.
  replicas: 18
  selector:
    matchLabels:
      job: ex1
  template:
    metadata:
      labels:
        job: ex1
    spec:
      # Associate all pods in this deployment with the specific ComputeDomain
      # that was previously created. To that end, refer to the so-called
      # resource claim template associated with that domain. The name of that
      # template in this case is defined as `compute-domain-0-rct` in the
      # ComputeDomain API object. Here we also define a new name `cd-0` that
      # is consumed by the container spec below.    
      resourceClaims:
      - name: cd-0
        resourceClaimTemplateName: compute-domain-0-rct  
      # Define a `podAffinity` rule to ensure that all pods will land on nodes
      # in the same NVLink partition. Specifically, require all pods to land on
      # nodes that have the _same_ value set for the `nvidia.com/gpu.clique`
      # node label. This label is set by the NVIDIA GPU Operator (based on
      # static NVLink configuration state).
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: job
                operator: In
                values:
                - ex1
            topologyKey: nvidia.com/gpu.clique    
      containers:
      - name: ctr
        image: ubuntu:22.04
        command: ["mnnvl-workload"]
        resources:
          claims:
          - name: cd-0  # See `resourceClaims` above.
          limits:
            nvidia.com/gpu: 4  # Request four GPUs.

For clarity, the example above connects to the previously created ComputeDomain by explicitly declaring resourceClaimTemplateName: compute-domain-0-rct. The concept of a resource claim template may make more sense now: under the hood, one unique resource claim is generated per pod in this deployment.
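To observe this, once the deployment's pods have been scheduled, you can list the pods and the generated resource claims; exact object names and output depend on your cluster and driver version:

kubectl get pods -l job=ex1
kubectl get resourceclaims

One claim per pod should appear, each instantiated from the compute-domain-0-rct template.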

The example above also shows a typical method to ensure that a set of pods gets placed onto nodes that are all part of the same NVLink partition (by aligning on the nvidia.com/gpu.clique node label value). When a ComputeDomain is supposed to expand beyond an individual NVLink partition, this constraint needs to be removed or changed.
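To see which NVLink partition each node belongs to, for example before deciding whether to keep, relax, or remove that affinity rule, you can display the clique label across all nodes:

kubectl get nodes -L nvidia.com/gpu.clique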

Complete and comprehensive examples (including a set of acceptance tests that can be run to verify that ComputeDomains are set up and working correctly) can be found in the DRA driver documentation.

Known limitations and future work

Version 25.8.0 of the NVIDIA DRA driver for GPUs includes significant improvements for ComputeDomains. Beyond that, more enhancements toward more flexible scheduling and ease of use are on the roadmap.

Here are two of the currently known limitations and the work planned to alleviate them:

  • Currently, only one pod per node can be part of any given ComputeDomain. Users have to be aware of how many GPUs are available in a node, and then typically grab them all from within a single workload pod. The application in that pod then needs to subdivide its work across those GPUs. We are planning to remove this constraint to make the notion of individual nodes less relevant. It will then be possible for the application to be composed of many single-GPU pods that may or may not be placed next to each other on the same node. In that mode, the unit of interest is the individual GPU, and not the individual node—node boundaries become almost transparent.
  • Currently, at most one ComputeDomain is supported per node. This constraint is based on the choice of providing each workload with its own dedicated IMEX domain (and the fact that at most one IMEX daemon can run per node). If a ComputeDomain occupies only a fraction of a node’s GPUs, the remaining GPUs in that node cannot be part of any other ComputeDomain. For example, a six-GPU ComputeDomain in a GB200 rack would always render a number of GPUs unavailable for other ComputeDomains (two in the best case, 18 in the worst case). Lifting that constraint would increase resource utilization but may weaken fault isolation between workloads. No universal remedy exists, so we plan to let users pick their sweet spot in the trade-off spectrum between cost efficiency and isolation strength. This work is planned and tracked here.

Additional initiatives are in progress, for example to further enhance robustness at scale and to improve overall debuggability. Follow the issue tracker on GitHub and browse the milestone view for an up-to-date peek into the roadmap. We also encourage you to submit questions, bug reports, and requests for enhancements to the issue tracker.

Summary

As advanced multi‑node GPU architectures like NVIDIA GB200 NVL72 begin to push the limits of what’s possible in high‑performance AI infrastructure, Kubernetes needs abstractions that can understand and manage the topology of these modern GPU systems. ComputeDomains address this challenge by bridging low‑level constructs such as NVLink and IMEX domains with Kubernetes‑native scheduling and DRA.

ComputeDomains dynamically form, manage, and tear down IMEX domains as workloads move across the cluster, enabling secure, high‑bandwidth GPU‑to‑GPU connectivity without manual setup. The latest v25.8.0 release of the NVIDIA DRA driver for GPUs extends this model with elasticity and fault tolerance, allowing ComputeDomains to expand or contract with workloads, recover automatically from node loss, and accelerate startup times for distributed jobs.

For infrastructure teams, these changes mean multi-node training and inference on GB200 NVL72 or DGX GB200 systems happen with minimal setup. For developers, it means that running distributed training or inference across complex, NVLink‑connected GPU fabrics now feels as simple as deploying a standard Kubernetes workload. Together, these innovations make ComputeDomains a cornerstone for scalable, topology‑aware AI orchestration on NVIDIA GB200 NVL72 and future platforms.

See the NVIDIA DRA driver for GPUs and its latest v25.8.0 release to get started.
