Boost GPU Memory Performance with No Code Changes Using NVIDIA CUDA MPS 

NVIDIA CUDA developers have access to a wide range of tools and libraries that simplify development and deployment, enabling users to focus on the “what” and the “how” of their applications.

An example of this is Multi-Process Service (MPS), where users can get better GPU utilization by sharing GPU resources across processes. Importantly, this can be done transparently as applications don’t need to be aware of MPS, and no code modifications are needed.

Introducing MLOPart

NVIDIA Blackwell GPUs deliver high bandwidth that is well-suited to training today’s large language models. However, there are cases where applications don’t benefit from the full bandwidth of Blackwell and are more latency sensitive.

Memory Locality Optimized Partition (MLOPart) devices are NVIDIA CUDA devices derived from a GPU and optimized for lower latency. MLOPart is a CUDA MPS feature that enables multi-GPU aware applications to see MLOPart devices.

In the real world, it’s not always easy to determine whether an application is latency-bound or bandwidth-bound. MLOPart is designed to be enabled and disabled using the MPS controller and doesn’t require an application to be rewritten. Developers can do simple A/B testing to see if an application benefits from MLOPart.

MLOPart device enumeration

The defining aspect of MLOPart is that when it is enabled, MLOPart-capable devices appear as multiple distinct CUDA devices, with their own compute and memory resources. In this sense, it is similar to an NVIDIA Multi-Instance GPU (MIG). We’ll compare MLOPart with MIG later in this post.

MLOPart creates CUDA devices based on the underlying architecture of the GPU. Where possible, the GPU is split along boundaries that would otherwise negatively affect memory latency, with the memory and compute resources on each side of the boundary forming an MLOPart device. For Blackwell, the split is along the die boundaries.

If a GPU doesn’t have such boundaries, no MLOPart devices are created, and the GPU is presented to CUDA applications normally. NVIDIA DGX B200 and NVIDIA B300 are capable of two MLOPart devices per GPU. This number may change with future architectures, so it’s recommended that developers don’t hardcode assumptions about the number of MLOPart devices that a GPU will support.
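
For example, rather than assuming a fixed number of MLOPart devices, an application can simply query the device count at runtime. A minimal sketch, with error checking omitted:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int numDevices = 0;
    // Under an MLOPart-enabled MPS server, this returns the number of MLOPart
    // devices visible to the client; otherwise it returns the number of GPUs.
    cudaGetDeviceCount(&numDevices);
    printf("CUDA devices visible to this process: %d\n", numDevices);
    return 0;
}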

MLOPart device capabilities and characteristics

An MLOPart device shares many similarities with the underlying device, with a few notable exceptions. While developers don’t, in principle, need to rewrite applications to use MLOPart devices, they should keep in mind that MLOPart devices don’t share all of the capabilities and characteristics of the underlying devices.

Capabilities and characteristics shared with the underlying device include:

Compute capability

An MLOPart device has the same compute capability and can execute the same GPU binaries as the underlying device. For example, a device that supports MLOPart with compute capability 10.0 will have MLOPart devices that also have compute capability 10.0.
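
As a quick check, the compute capability of every visible device can be queried at runtime. A minimal sketch, with error checking omitted:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int dev = 0; dev < numDevices; ++dev) {
        int major = 0, minor = 0;
        cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev);
        cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, dev);
        // An MLOPart device reports the same value as its underlying GPU.
        printf("Device %d: compute capability %d.%d\n", dev, major, minor);
    }
    return 0;
}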

Peer-to-peer ability

An MLOPart device will be capable of the same peer-to-peer communication as the underlying device. For example, if two physical devices are connected by NVIDIA NVLink, any MLOPart devices derived from these two underlying devices will also be connected by NVLink.

The exception to this rule is between MLOPart devices belonging to the same underlying device. In this case, they’re still capable of peer-to-peer communication, but don’t require peer-to-peer communication methods such as NVLink or PCIe.

When peer devices are MLOPart devices belonging to the same underlying device, they’re expected to have lower latency and higher peer-to-peer bandwidth than peer devices connected through other means.
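
The following is a minimal sketch of how an application might check for and enable peer access between two visible devices; the same generic code applies whether the peers are full GPUs or MLOPart devices (error checking omitted):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int canAccess = 0;
    // Check whether device 0 can access device 1's memory directly.
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (canAccess) {
        cudaSetDevice(0);
        // Enable peer access from device 0 to device 1. For MLOPart devices on
        // the same underlying GPU, this path doesn't go over NVLink or PCIe.
        cudaDeviceEnablePeerAccess(1, 0);
        printf("Peer access 0 -> 1 enabled\n");
    }
    return 0;
}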

PCI IDs

MLOPart devices share the same PCI ID (domain:bus:device.function) as the underlying device.
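
A short sketch that prints the PCI bus ID of each visible device; MLOPart devices derived from the same underlying GPU print identical values (error checking omitted):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int dev = 0; dev < numDevices; ++dev) {
        char pciBusId[32] = {};
        // Returns an identifier of the form domain:bus:device.function.
        cudaDeviceGetPCIBusId(pciBusId, sizeof(pciBusId), dev);
        printf("Device %d: PCI %s\n", dev, pciBusId);
    }
    return 0;
}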

Capabilities and characteristics differing from the underlying device include the following.

Streaming multiprocessor count

Each MLOPart device will have fewer streaming multiprocessors (SMs) than the underlying device. Furthermore, the total number of SMs across all MLOPart devices sharing an underlying device may be less than the number of SMs in that underlying device.

MLOPart devices belonging to the same underlying device have the same number of SMs as each other, and this number is consistent across identical NVIDIA GPUs.

For example, an NVIDIA HGX B200 system with 8 Blackwell GPUs that normally have 148 SMs will result in 16 MLOPart devices with 70 SMs each when MLOPart is enabled.
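
A short sketch that queries the SM count at runtime rather than assuming the full-GPU value (error checking omitted):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int dev = 0; dev < numDevices; ++dev) {
        int smCount = 0;
        cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);
        // With MLOPart enabled, this is smaller than the full GPU's SM count.
        printf("Device %d: %d SMs\n", dev, smCount);
    }
    return 0;
}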

Available memory

Each MLOPart device is assigned a partition of the underlying device’s total memory and allocates only from that partition, except in the case of CUDA managed memory allocations. Each MLOPart device will have less memory than the underlying device, and MLOPart devices belonging to the same underlying device have the same total memory.

In the current version of MLOPart, it’s possible for memory allocated on one MLOPart device to affect the available memory reported by cuMemGetInfo and cudaMemGetInfo on another MLOPart device from the same underlying device, even though they have separate partitions. Future drivers will enable more rigid memory partitions between MLOPart devices.
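
A minimal sketch that reports per-device memory using cudaMemGetInfo (error checking omitted):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int dev = 0; dev < numDevices; ++dev) {
        size_t freeBytes = 0, totalBytes = 0;
        cudaSetDevice(dev);
        // With MLOPart enabled, totalBytes reflects the device's memory
        // partition rather than the full memory of the underlying GPU.
        cudaMemGetInfo(&freeBytes, &totalBytes);
        printf("Device %d: %zu MiB free of %zu MiB\n",
               dev, freeBytes >> 20, totalBytes >> 20);
    }
    return 0;
}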

Virtual address space

MLOPart devices on the same underlying device share a virtual address space. This means that it’s possible for a buffer overrun of memory allocated on one MLOPart device to corrupt memory allocated on another MLOPart device within the same process.

Universally unique identifier

Each MLOPart device will have its own universally unique identifier (UUID) that can be queried through CUDA APIs. This can be used to uniquely identify MLOPart devices and to filter available CUDA devices using CUDA_VISIBLE_DEVICES.
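
A minimal sketch that prints each device’s UUID in the GPU-xxxxxxxx-... form used by CUDA_VISIBLE_DEVICES and device_query (error checking omitted):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);
    for (int dev = 0; dev < numDevices; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // Format the 16-byte UUID with the standard 8-4-4-4-12 grouping.
        const unsigned char *u = (const unsigned char *)prop.uuid.bytes;
        printf("Device %d: GPU-", dev);
        for (int i = 0; i < 16; ++i) {
            printf("%02x", u[i]);
            if (i == 3 || i == 5 || i == 7 || i == 9) printf("-");
        }
        printf("\n");
    }
    return 0;
}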

Deploying with MLOPart

As with other CUDA MPS features, users can control behavior through MPS controller commands.

The start_server command starts an MPS server. In CUDA 13.1, we introduced the -mlopart option to this command. This enables users to start an MPS server that creates MLOPart-enabled MPS clients. As this is done on a per-server basis, multiple users may have different MLOPart configurations, depending on their needs.

In CUDA 13.0, we introduced the device_query MPS controller command to provide information about the CUDA devices enumerated by MPS. After a server has been created, device_query can be used to determine information about the devices that’ll be exposed to clients of that server, such as the device name, device ordinals, and UUIDs.

 $ echo device_query | nvidia-cuda-mps-control
Default
Device Ordinal  PCI IDs        UUID                                      Name                              Attributes
0               0000:1b.00.00  GPU-ebebf640-14d4-de34-f16e-a5e7da272ac4  NVIDIA B200
1               0000:43.00.00  GPU-6d3a75da-dd2e-173e-e797-c0b8ed47a100  NVIDIA B200
2               0000:52.00.00  GPU-a517c26e-0f2f-945a-1672-ea75149f54d6  NVIDIA B200
3               0000:61.00.00  GPU-999b1bd5-82d8-3db2-e2ec-fdae5d1103b1  NVIDIA B200
4               0000:9d.00.00  GPU-b5830513-614b-38ac-b177-5cc2f850ea3d  NVIDIA B200
5               0000:c3.00.00  GPU-05f3779e-bfa6-f9c8-256f-6cee98b8871d  NVIDIA B200
6               0000:d1.00.00  GPU-2facdb95-1af2-26e3-2c9d-e02f4651675d  NVIDIA B200
7               0000:df.00.00  GPU-7e555b40-ffe0-e066-4db3-4ddd96344f0d  NVIDIA B200

Server 14056
Device Ordinal  PCI IDs        UUID                                      Name                              Attributes
N/A             0000:1b.00.00  GPU-ebebf640-14d4-de34-f16e-a5e7da272ac4  NVIDIA B200                       M
0               0000:1b.00.00  GPU-1bd9c0d8-c86a-5a37-acee-411ebcef5fd0  NVIDIA B200 MLOPart 0             MD
1               0000:1b.00.00  GPU-58e7f54c-f60f-56b7-a4c4-b3fb418fde3e  NVIDIA B200 MLOPart 1             MD
N/A             0000:43.00.00  GPU-6d3a75da-dd2e-173e-e797-c0b8ed47a100  NVIDIA B200                       M
2               0000:43.00.00  GPU-68fb01e9-499c-56d4-b768-8fca70a5ddff  NVIDIA B200 MLOPart 0             MD
3               0000:43.00.00  GPU-6cf0c4ea-3a05-52b1-aec6-63acf60df19b  NVIDIA B200 MLOPart 1             MD
N/A             0000:52.00.00  GPU-a517c26e-0f2f-945a-1672-ea75149f54d6  NVIDIA B200                       M
4               0000:52.00.00  GPU-dd670b14-ca31-5dfd-a49b-7220701f4fc6  NVIDIA B200 MLOPart 0             MD
5               0000:52.00.00  GPU-d7433996-1714-5baa-9812-22cecdc792d3  NVIDIA B200 MLOPart 1             MD
N/A             0000:61.00.00  GPU-999b1bd5-82d8-3db2-e2ec-fdae5d1103b1  NVIDIA B200                       M
6               0000:61.00.00  GPU-cff5ab0b-a509-54c8-a9c0-c5ebe3fbd3a0  NVIDIA B200 MLOPart 0             MD
7               0000:61.00.00  GPU-7933cfe7-5139-50d8-ad90-0f7f1ddba559  NVIDIA B200 MLOPart 1             MD
N/A             0000:9d.00.00  GPU-b5830513-614b-38ac-b177-5cc2f850ea3d  NVIDIA B200                       M
8               0000:9d.00.00  GPU-f973284b-7385-576b-80d7-3ea083bcea94  NVIDIA B200 MLOPart 0             MD
9               0000:9d.00.00  GPU-668e4145-b221-5495-a3fe-a5cdc0e6f6eb  NVIDIA B200 MLOPart 1             MD
N/A             0000:c3.00.00  GPU-05f3779e-bfa6-f9c8-256f-6cee98b8871d  NVIDIA B200                       M
10              0000:c3.00.00  GPU-53858feb-87eb-5963-8d47-6fbf4b24cd4a  NVIDIA B200 MLOPart 0             MD
11              0000:c3.00.00  GPU-700b029a-be98-5d13-9a4e-5e8e21386e34  NVIDIA B200 MLOPart 1             MD
N/A             0000:d1.00.00  GPU-2facdb95-1af2-26e3-2c9d-e02f4651675d  NVIDIA B200                       M
12              0000:d1.00.00  GPU-563db4f2-f70a-564d-aa4a-dbd52d6dfc0b  NVIDIA B200 MLOPart 0             MD
13              0000:d1.00.00  GPU-b643e07a-6eda-5cd8-bdde-1788590d0b4b  NVIDIA B200 MLOPart 1             MD
N/A             0000:df.00.00  GPU-7e555b40-ffe0-e066-4db3-4ddd96344f0d  NVIDIA B200                       M
14              0000:df.00.00  GPU-f8f5b46d-7774-57a1-97d2-88f23c3457f0  NVIDIA B200 MLOPart 0             MD
15              0000:df.00.00  GPU-46d7f9b7-0303-5432-b50a-16381f37e365  NVIDIA B200 MLOPart 1             MD

When MLOPart is enabled, device_query lists the MLOPart devices below the device from which they are derived. This is the recommended way to determine the UUID values to use with CUDA_VISIBLE_DEVICES when launching an application: because CUDA enumerates more devices than physically exist on the system, device ordinals alone are ambiguous.
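
For example, using the listing above, a client could be restricted to the two MLOPart devices of the first GPU (here, ./my_app stands in for your application):

$ CUDA_VISIBLE_DEVICES=GPU-1bd9c0d8-c86a-5a37-acee-411ebcef5fd0,GPU-58e7f54c-f60f-56b7-a4c4-b3fb418fde3e ./my_app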

Note that MLOPart devices only exist in the context of MPS and CUDA. nvidia-smi doesn’t provide information about MLOPart devices.

Lastly, the ps MPS controller command has been extended to display whether a process is using an MLOPart device.

$ while1 -a &
[1] 52845
$ echo ps | nvidia-cuda-mps-control
PID       ID    SERVER    DEVICE             NAMESPACE      COMMAND  ATTRIBUTES
52845     1     52837     GPU-b13add01-c28c  4026531836     while1      MD

MLOPart in use

Now let’s look at how MLOPart can affect memory latency and bandwidth.

Latency

As an example, let’s look at how MLOPart affects memory latency using a simple kernel that does some atomic operations in a loop.

First, we define the kernel and a helper:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Helper macro to check for CUDA errors
#define CUDA_CHECK_FAILURE(x) \
do {\
    /* Evaluate the expression once so the CUDA call isn't re-run on error */\
    cudaError_t err_ = (cudaError_t)(x);\
    if (cudaSuccess != err_)\
    {\
        const char* errName = cudaGetErrorName(err_);\
        const char* errStr = cudaGetErrorString(err_);\
        printf("%s:%d - %s: %s\n", __FILE__, __LINE__, errName, errStr);\
        exit(EXIT_FAILURE);\
    }\
} while (0)

// Device memory variable to use to prevent the compiler from optimizing away the memory access
__device__ volatile int dummy;

// Trivial kernel to touch the memory so we can measure latency
__global__ void accessMemoryHighLatency(int *startAddress, size_t memorySizeInBytes) {
    for (size_t i = 0; i < memorySizeInBytes / sizeof(int); ++i) {
        dummy = atomicAdd(&startAddress[i], 1);
    }
}

Atomic operations are latency-sensitive, which makes it easy to measure the difference between using and not using MLOPart. The following function uses CUDA events to measure the runtime of the accessMemoryHighLatency kernel:

// Function to launch the kernel and measure the runtime using CUDA events
float measureKernelRuntime(int *memoryDevPtr, size_t memorySizeInBytes, int numBlocks, int numThreads) {
    cudaEvent_t start = NULL, stop = NULL;
    float time = 0;

    CUDA_CHECK_FAILURE(cudaEventCreate(&start));
    CUDA_CHECK_FAILURE(cudaEventCreate(&stop));

    CUDA_CHECK_FAILURE(cudaEventRecord(start, 0));

    accessMemoryHighLatency<<<numBlocks, numThreads>>>(memoryDevPtr, memorySizeInBytes);
    CUDA_CHECK_FAILURE(cudaPeekAtLastError());

    CUDA_CHECK_FAILURE(cudaEventRecord(stop, 0));
    CUDA_CHECK_FAILURE(cudaEventSynchronize(stop));

    CUDA_CHECK_FAILURE(cudaEventElapsedTime(&time, start, stop));

    CUDA_CHECK_FAILURE(cudaEventDestroy(start));
    CUDA_CHECK_FAILURE(cudaEventDestroy(stop));

    return time;
}

Finally, we can put this all together by creating a simple multi-GPU-aware program.

int main(int argc, char *argv[]) {
    size_t memorySizeInBytes = 32 * 1024 * 1024; // 32 MB
    int numBlocks = 32;
    int numThreads = 1;
    int numDevices = 0;
    float totalTime = 0;

    CUDA_CHECK_FAILURE(cudaGetDeviceCount(&numDevices));

    // Measure the runtime for each device
    for (int i = 0; i < numDevices; i++) {
        // Set the current device
        CUDA_CHECK_FAILURE(cudaSetDevice(i));
        
        // Allocate memory on the device
        int *memoryDevPtr;
        CUDA_CHECK_FAILURE(cudaMalloc(&memoryDevPtr, memorySizeInBytes));
        
        // Measure the runtime
        float time = measureKernelRuntime(memoryDevPtr, memorySizeInBytes, numBlocks, numThreads);
        totalTime += time;
        printf("Device %d - Total time: %f milliseconds\n", i, time);
        
        // Free the memory
        CUDA_CHECK_FAILURE(cudaFree(memoryDevPtr));
    }

    printf("Average time: %f milliseconds\n", totalTime / numDevices);

    return EXIT_SUCCESS;
}

We’ll name this file atomic_memory_access.cu and compile it using nvcc atomic_memory_access.cu -arch=sm_100 -o atomic_memory_access.

To establish a baseline, let’s run the example using MPS, but without MLOPart.

$ nvidia-cuda-mps-control -d
# Optional step of explicitly creating an MPS server. This is also done implicitly when we launch a CUDA application while MPS is active.
$ echo start_server -uid $UID | nvidia-cuda-mps-control
$ ./atomic_memory_access
Device 0 - Total time: 2320.550537 milliseconds
Device 1 - Total time: 2323.710938 milliseconds
Device 2 - Total time: 2334.533447 milliseconds
Device 3 - Total time: 2304.551025 milliseconds
Device 4 - Total time: 2304.328125 milliseconds
Device 5 - Total time: 2316.102295 milliseconds
Device 6 - Total time: 2306.165283 milliseconds
Device 7 - Total time: 2306.362061 milliseconds
Average time: 2314.537842 milliseconds

Here we see an average time of around 2,300 milliseconds for each device. Now let’s enable MLOPart and run it again.

# Quit the MPS controller to cleanup the previous server.
$ echo quit | nvidia-cuda-mps-control
# Now repeat the above steps, with MLOPart enabled.
$ nvidia-cuda-mps-control -d
# Note that we must explicitly start the server with "-mlopart".
$ echo start_server -uid $UID -mlopart | nvidia-cuda-mps-control
$ ./atomic_memory_access
Device 0 - Total time: 1500.194946 milliseconds
Device 1 - Total time: 1475.914062 milliseconds
Device 2 - Total time: 1479.729492 milliseconds
Device 3 - Total time: 1480.196045 milliseconds
Device 4 - Total time: 1478.959106 milliseconds
Device 5 - Total time: 1490.808716 milliseconds
Device 6 - Total time: 1468.943237 milliseconds
Device 7 - Total time: 1479.297241 milliseconds
Device 8 - Total time: 1467.947632 milliseconds
Device 9 - Total time: 1476.900757 milliseconds
Device 10 - Total time: 1477.081421 milliseconds
Device 11 - Total time: 1490.295044 milliseconds
Device 12 - Total time: 1484.558594 milliseconds
Device 13 - Total time: 1481.660156 milliseconds
Device 14 - Total time: 1476.067383 milliseconds
Device 15 - Total time: 1484.143921 milliseconds
Average time: 1480.793457 milliseconds

In this example, we see a significant improvement in execution time per device when using MLOPart. While this was a contrived example, it’s important to compare running with and without MLOPart when deciding how to deploy a specific application.

Bandwidth

Because each MLOPart device is backed by only a portion of the underlying GPU’s memory, it also has lower DRAM bandwidth than a device not using MLOPart.

However, peer-to-peer bandwidth between MLOPart devices on the same underlying GPU is higher than between devices that must communicate over NVLink or PCIe.

Let’s look at the (partial) results of a peer-to-peer bandwidth test (nvbandwidth’s device_to_device_memcpy_read_ce) between MLOPart devices on the same underlying device and MLOPart devices on different underlying devices:

$ ./nvbandwidth -t device_to_device_memcpy_read_ce
...
Running device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3         4
 0       N/A   2352.76    766.82    743.46    767.51
 1   2402.78       N/A    765.86    744.04    767.03
 2    767.23    744.30       N/A   2349.54    766.00
 3    767.37    743.91   2372.91       N/A    767.30
 4    766.75    743.52    766.89    743.97       N/A

In the above example, devices 0 and 1 are on the same underlying GPU, and devices 2 and 3 are on the same underlying GPU.

In the case of B200, peers normally use NVLink when initiating an operation such as cuMemcpyAsync. If these B200 peers are MLOPart devices on the same B200 chip, they can instead use the much faster NV-HBI.

Considerations when using MLOPart

As mentioned previously, using MLOPart implies choosing lower latency over higher bandwidth. This isn’t the only tradeoff that must be evaluated when using MLOPart.

Device filtering through CUDA_VISIBLE_DEVICES

The devices available to MPS servers and clients can be filtered and/or remapped using the CUDA_VISIBLE_DEVICES environment variable. Often, this is done using device ordinals. With MPS, reusing the same CUDA_VISIBLE_DEVICES value for both the controller and the server/clients can cause errors if the resulting remapping isn’t taken into account.

For example, on a system with 8 CUDA devices, the MPS controller can be started so that the odd-numbered devices are filtered out (CUDA_VISIBLE_DEVICES=0,2,4,6). In this scenario, the MPS server and clients see at most four CUDA devices, even without setting CUDA_VISIBLE_DEVICES themselves. Reusing the same value for a client will then fail, since the client can only see devices 0-3. For this reason, it’s recommended to use UUIDs, which are unambiguous.
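
A sketch of the pitfall and the recommended alternative (./my_app and the UUID are placeholders):

# Controller sees only the even-numbered physical devices.
$ CUDA_VISIBLE_DEVICES=0,2,4,6 nvidia-cuda-mps-control -d
# Clients of this controller see at most 4 devices (ordinals 0-3), so reusing
# the same ordinal list for a client fails. Filtering by UUID is unambiguous:
$ CUDA_VISIBLE_DEVICES=GPU-<uuid-from-device_query> ./my_app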

When MLOPart is enabled, there’s an additional inconsistency to be aware of: the UUIDs of the devices visible to the MPS controller differ from the UUIDs visible to an MPS server or client with MLOPart enabled. When using CUDA_VISIBLE_DEVICES, it’s recommended to run the device_query command after the MLOPart-enabled MPS server has been started to determine the UUIDs that will be available to MPS clients.

Fewer compute resources

When MLOPart is enabled, the MLOPart devices may have some SMs disabled. There’s a tradeoff between performance gains from reduced memory latency and performance losses from fewer compute resources. These should be weighed on a per-application basis.

Managed memory

Managed memory doesn’t benefit from MLOPart. MLOPart achieves lower latency by confining allocations to an MLOPart device’s memory partition, and managed memory allocations aren’t confined in this way. Managed memory continues to work as it normally does, and allocations can still be created using the managed memory APIs, but they aren’t expected to see a performance benefit.

Access modifiers

The cuMemSetAccess API enables programmers to specify access properties for CUDA allocations. When using this API with respect to MLOPart devices, the least restrictive property set on all MLOPart devices belonging to the same underlying GPU is applied. For example, setting a buffer as read-only for one MLOPart device and read-write (default) for another MLOPart device results in both MLOPart devices having read-write access, until both are updated to a more restrictive access type.
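
The following is a sketch of this behavior using the driver API. It assumes ptr and size describe a virtual memory mapping created earlier with cuMemCreate, cuMemAddressReserve, and cuMemMap, and that device ordinals 0 and 1 are MLOPart devices on the same underlying GPU; error checking is omitted:

#include <cuda.h>

// Assumes: ptr and size describe an existing mapping created with
// cuMemCreate/cuMemAddressReserve/cuMemMap.
void setMixedAccess(CUdeviceptr ptr, size_t size) {
    CUmemAccessDesc desc[2] = {};
    desc[0].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc[0].location.id = 0;                        // first MLOPart device
    desc[0].flags = CU_MEM_ACCESS_FLAGS_PROT_READ;  // request read-only
    desc[1].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc[1].location.id = 1;                        // second MLOPart device, same underlying GPU
    desc[1].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    // Because both MLOPart devices share the same underlying GPU, the least
    // restrictive setting wins: both devices end up with read-write access.
    cuMemSetAccess(ptr, size, desc, 2);
}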

x86 requirement

MLOPart is currently only supported on x86 platforms. Support for ARM platforms will be available in a future release.

Comparison to MIG

MIG can be used to create multiple CUDA devices from a single GPU, as is done with MLOPart. Certain MIG configurations can also reduce latency at the cost of bandwidth, while requiring no code changes.

Topic                  MIG                                                                   MLOPart / MPS
Privilege required     Requires superuser privilege to configure                             Doesn't require superuser privilege
Scope                  System-wide setting                                                   Per-user / per-server setting
Memory isolation       Enforces strict memory isolation between MIG GPU instances            Memory from one MLOPart device may corrupt another on the same GPU
Performance isolation  Enforces strict performance isolation between MIG compute instances   Performance interference may occur between MLOPart devices

Table 1. Comparing MIG to MLOPart / MPS

To learn more about MLOPart, CUDA MPS, and how to maximize GPU utilization, check out the MPS documentation.

Acknowledgements: Thanks to the following NVIDIA contributors: Alfred Barnat, Ehren Bendler, Alicia Hu, Balint Joo, Ze Long, Yashwant Marathe, Vance Miller, Kyrylo Perelygin, Will Pierce, Yifan Yang
