If you’re an application developer or a cluster administrator, you’ve likely seen how non-uniform memory access (NUMA) can impact system performance. When an application is not fully NUMA-aware, performance can be inconsistent and unpredictable.
Because of these challenges, NVIDIA released the Coherent Driver-based Memory Management (CDMM) mode for the NVIDIA driver on platforms that are hardware-coherent, such as GH200, GB200, and GB300. CDMM allows the NVIDIA driver, instead of the OS, to control and manage GPU memory. This gives applications much finer-grained control over placing data in the appropriate memory space and, in turn, extracting maximum performance.
In this blog we’re going to describe the differences between NUMA and CDMM and how they can impact application performance. We also published a whitepaper on this topic that you can check out for even more information.
What is NUMA?
NUMA mode is the current default for the NVIDIA driver on hardware-coherent platforms. NUMA mode exposes the entire CPU (host) memory and GPU (device) memory to the OS. This means that standard Linux APIs such as malloc and mmap, as well as CUDA APIs, can allocate memory on both the CPU and the GPU. It also facilitates dynamic memory migration between CPU and GPU via user-space APIs, or automatically by the kernel to optimize resource utilization.
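To make this concrete, here is a minimal sketch (not taken from the original post) of what this looks like on a hardware-coherent system in NUMA mode: a buffer obtained with plain malloc is read and written directly by a GPU kernel, with no cudaMalloc or explicit copies. The kernel name and sizes are illustrative.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simple kernel that doubles every element of a buffer.
__global__ void scale(double *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    const size_t n = 1 << 20;

    // Ordinary system allocation: on a hardware-coherent platform the GPU
    // can address these pages directly, and in NUMA mode the OS/driver may
    // also migrate them into GPU memory.
    double *data = static_cast<double *>(malloc(n * sizeof(double)));
    for (size_t i = 0; i < n; ++i) data[i] = 1.0;

    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // 2.0 if the kernel ran
    free(data);
    return 0;
}
```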
An important side effect to consider, though, is that NUMA mode causes GPU memory to be treated as a generic memory pool, meaning that the ability to strictly isolate GPU memory from general OS system functions is limited. In typical NUMA behavior, memory may spill onto the GPU, which may not be desirable for application performance.
That’s why NVIDIA provides an alternative: Coherent Driver-based Memory Management (CDMM) mode.
What are hardware-coherent platforms?
Several NVIDIA systems—including the GH200, GB200, and GB300—contain direct NVLink chip-to-chip (C2C) connections between the CPU and the GPU. That introduces a powerful capability not present on PCIe-connected systems: hardware coherent memory. It allows both CPU and GPU memory to be directly addressed from either processor.
This can have some unintended consequences for applications that rely on specific behaviors of NUMA. In particular, the operating system may select GPU memory for unexpected or surprising uses, such as caching files or avoiding out-of-memory (OOM) conditions from an allocation request. For some applications and workflows, especially those that have been optimized for a particular layout of CPU and GPU memory (like Kubernetes), these differences may be undesirable.
The new CDMM mode addresses these challenges and will be particularly useful for applications like Kubernetes.
How NUMA impacts Kubernetes
Because Kubernetes is such a ubiquitous way to operate large GPU clusters, there are some specific and unexpected behaviors that can be encountered when running Kubernetes in NUMA mode. These behaviors may hurt performance and even application functionality.
- Memory over-reporting: Kubernetes incorrectly includes GPU memory in its system memory count, leading to pods requesting more memory than available and causing OOM failures.
- Pod memory limits apply to GPU memory, not just system memory: Kubernetes pod memory limits, designed for system memory, incorrectly apply to both system and GPU memory when system-allocated memory is used, as each GPU is exposed as a NUMA node. This breaks the intended Pod spec API contract.
- Lack of GPU memory isolation among pods: Kubernetes pods, by default, can access all memory across NUMA nodes, including GPU memory. This allows containers to allocate memory on GPUs they don’t have access to, breaking isolation.
For these reasons, we recommend using CDMM mode when using Kubernetes.
What is CDMM?
CDMM is an alternative operating mode for NVIDIA drivers that prevents GPU memory from being exposed to the operating system as a software NUMA node. Instead, the NVIDIA device driver directly manages GPU memory, separating it from the CPU’s system memory. This approach is inspired by the PCIe-attached GPU model, where GPU memory is distinct from system memory.
In CDMM mode, the CPU memory is managed by the Linux kernel and the GPU memory is managed by the NVIDIA driver. This means the NVIDIA driver, not the OS, is responsible for managing the GPU memory and has full control over how the GPU memory is used, thereby offering greater control and often better application performance.
How CDMM affects CUDA developers
The primary impact of CDMM is on the migration of system-allocated memory. In the current implementation of CDMM, system-allocated memory will not be migrated to the GPU. The GPU can still access system-allocated memory across the C2C link, but memory pages will not be migrated.
For example, when an application uses hints to encourage migration, such as cudaMemPrefetchAsync(), cudaMemPrefetchBatchAsync(), cudaMemDiscardAndPrefetchBatchAsync(), and cudaMemAdvise(SetPreferredLocation), the pages will not migrate.
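As an illustration, the following sketch (not from the original post; buffer size and device index are assumptions) issues two of these hints on a malloc’d buffer. In NUMA mode the pages may be migrated to GPU memory; in CDMM mode the calls effectively do nothing for system-allocated memory, and the GPU continues to access the pages over NVLink-C2C.

```cpp
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64UL << 20;  // 64 MiB, illustrative size
    char *buf = static_cast<char *>(malloc(bytes));

    int device = 0;
    cudaSetDevice(device);

    // Hint that the preferred location of these pages is GPU memory ...
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, device);

    // ... and request that they be prefetched there on the default stream.
    // In CDMM mode, system-allocated pages stay in CPU memory regardless.
    cudaMemPrefetchAsync(buf, bytes, device, 0);
    cudaDeviceSynchronize();

    free(buf);
    return 0;
}
```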
How CDMM affects system administration
When the system is in CDMM mode, there will still be NUMA nodes corresponding to the GPUs, but they will not present any memory to the OS. Tools such as numactl or mbind will therefore have no effect when applied to GPU memory, and we recommend NOT using them for any GPU memory management in CDMM mode. They can still be used to manage system memory.
CDMM is currently the default mode for Kubernetes-based GPU Operator deployments with Linux driver 580.65.06 and later. To enable CDMM, you pass a kernel module parameter and value when the driver is loaded; for the exact command and syntax, see the CDMM whitepaper.
Guidelines for CDMM and NUMA usage
The following highlights the main differences between CDMM and NUMA modes, and when to consider using one mode or the other.
Application-specific memory management
- NUMA mode: Best for applications using OS NUMA APIs and relying on OS management of total system memory (CPU memory + GPU memory).
- CDMM mode: Ideal for applications needing direct GPU memory control, bypassing the OS.
Memory pooling
- NUMA mode: Allows GPU and CPU memory to form a larger single pool. Workloads benefit from aggregated memory and bandwidth management.
- CDMM mode: GPU memory is driver-managed, preventing the OS from using it as part of a larger pool; GPU memory remains dedicated to GPU-specific data.
GPU memory usage: visibility and measurement
- NUMA mode: Standard tools report GPU memory use within the integrated pool, filterable by NUMA node, providing an overall view of system memory.
- CDMM mode: Offers fine-grained control and visibility into GPU memory. Driver-managed GPU memory gives administrators and developers a clear understanding of consumption for performance diagnosis and optimization.
Summary
The following table highlights the major differences in how memory is handled between NUMA and CDMM modes.
| | NUMA | CDMM |
| --- | --- | --- |
| Memory management | OS manages both CPU and GPU memory | OS manages CPU memory; NVIDIA driver manages GPU memory |
| GPU memory exposure | Exposed to OS as a generic pool | Not exposed to OS for use |
| Memory migration | Dynamic migration of system-allocated memory between CPU and GPU | System-allocated memory NOT migrated to GPU |
By understanding and strategically implementing CDMM, developers and administrators can unlock the full potential of NVIDIA hardware-coherent memory architectures, ensuring optimal performance and control for their GPU-accelerated workloads.
If you’re using a hardware-coherent platform such as GH200, GB200 or GB300, take a look at the whitepaper. And consider enabling CDMM mode to allow for fine-grained application control of GPU memory, especially if you’re using Kubernetes.