
Improving CUDA Initialization Times Using cgroups in Certain Scenarios


Many CUDA applications running on multi-GPU platforms only use a single GPU for their compute needs. In such scenarios, the application pays a performance penalty because CUDA must enumerate and initialize all the GPUs on the system. If a CUDA application does not require the other GPUs to be visible and accessible, you can launch it with the unwanted GPUs isolated from the CUDA process, eliminating unnecessary initialization steps.

This post discusses the various methods to accomplish this and their performance benefits.

GPU isolation

GPU isolation can be achieved on Linux systems by using Linux tools such as cgroups. In this section, we first discuss a lower-level approach and then a more convenient, higher-level one.

Another method that CUDA exposes for isolating devices is the CUDA_VISIBLE_DEVICES environment variable. Although functionally similar, this approach provides limited initialization performance gains compared to the cgroups approach.
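
For reference, the environment variable approach looks like the following (test_cuda_app is the same placeholder application name used in the examples later in this post):

# Expose only GPU 0 to the CUDA process; the other GPUs are hidden from it
$> CUDA_VISIBLE_DEVICES=0 ./test_cuda_app <args>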

Isolating GPUs using cgroups V1

Control groups provide a mechanism for aggregating or partitioning sets of tasks and all their future children into hierarchical groups with specialized behavior. You can use cgroups to control which GPUs are visible to a CUDA process. This ensures that only the GPUs that are needed by the CUDA process are made available to it.

The following code provides a low-level example of how to employ cgroups and fully isolate a GPU to a single process. Be aware that you will likely have to run these commands in a root shell for them to work properly. We show a more convenient, higher-level utility later in this post.

# Create a mountpoint for the cgroup hierarchy as root
$> cd /mnt
$> mkdir cgroupV1Device

# Use the mount command to mount the hierarchy and attach the devices subsystem to it
$> mount -t cgroup -o devices devices cgroupV1Device
$> cd cgroupV1Device
# Now create a gpu subgroup directory to restrict/allow GPU access
$> mkdir gpugroup
$> cd gpugroup
# In the gpugroup folder, you will see many cgroupfs files; the ones that interest us are tasks, devices.deny, and devices.allow
$> ls
tasks      devices.deny     devices.allow

# Launch a second shell from which the CUDA process will be executed and get that shell's PID
$> echo $$

# Write this PID into the tasks file in the gpugroup folder
$> echo <PID> > tasks

# List the device numbers of the NVIDIA devices with the ls command
$> ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jul 11 14:28 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Jul 11 14:28 /dev/nvidia1

# Assuming that you only want to allow the CUDA process to access GPU0, deny it access to GPU1 by writing the following entry to devices.deny
$> echo 'c 195:1 rmw' > devices.deny

# Now GPU1 will not be visible to the CUDA process that you launch from the second shell.
# To restore the CUDA process's access to GPU1, write the following to devices.allow:

$> echo 'c 195:1 rmw' > devices.allow

When you are done with the tasks, unmount the /mnt/cgroupV1Device mount point with the umount command:

$> umount /mnt/cgroupV1Device

To allow or deny a CUDA process access to any other GPUs on the system, write their device numbers to the appropriate file. Here’s an example of denying access to only GPU5 and GPU6 on a multi-GPU system.

In the gpugroup folder created earlier, write the PID of the shell from which the CUDA process is to be launched into the tasks file:

$> echo <PID> > tasks

Now add GPU5 and GPU6 to the denied list:

$> echo 'c 195:5 rmw' > devices.deny
$> echo 'c 195:6 rmw' > devices.deny

At this point, the CUDA process can’t see or access the two GPUs. To expose only specific GPUs to a CUDA process, add those GPUs to the devices.allow file and add the rest of the GPUs to the devices.deny file.
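
For example, on a hypothetical four-GPU system (NVIDIA device minor numbers 0 through 3), exposing only GPU0 to the CUDA process would look like the following:

# Allow GPU0 and deny GPU1 through GPU3
$> echo 'c 195:0 rmw' > devices.allow
$> echo 'c 195:1 rmw' > devices.deny
$> echo 'c 195:2 rmw' > devices.deny
$> echo 'c 195:3 rmw' > devices.deny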

The access controls apply per process. Multiple processes can be added to the tasks file to propagate the same controls to more than one process.
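
For instance, writing two PIDs (the values here are illustrative) to the tasks file applies the same device controls to CUDA processes launched from either shell:

# Apply the same GPU access controls to two shells
$> echo 4242 > tasks
$> echo 4243 > tasks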

Isolating GPUs using the bubblewrap utility

The bubblewrap utility (bwrap) is a higher-level utility for sandboxing and access control on Linux operating systems, and it can be used to achieve the same effect as the solution presented earlier. With bubblewrap, you can conveniently restrict or allow access to specific GPUs from a CUDA process:

# Install the bubblewrap utility on Debian-like systems
$> sudo apt-get install -y bubblewrap

# Create a simple shell script that uses bubblewrap to bind only the required GPU to the launched process

#!/bin/sh
# bwrap.sh
GPU=$1;shift   # 0, 1, 2, 3, ..
if [ "$GPU" = "" ]; then echo "missing arg: gpu id"; exit 1; fi
bwrap \
        --bind / / \
        --dev /dev --dev-bind /dev/nvidiactl /dev/nvidiactl --dev-bind /dev/nvidia-uvm /dev/nvidia-uvm  \
        --dev-bind /dev/nvidia$GPU /dev/nvidia$GPU \
        "$@"


# Launch the CUDA process with the bubblewrap utility to only allow access to a specific GPU while running
$> ./bwrap.sh 0 ./test_cuda_app <args>

More than one GPU can be exposed to a CUDA process by adding a --dev-bind option for each additional GPU, as shown in the following sketch.
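
The following variant of the earlier script is a sketch under that assumption; the script name bwrap2.sh and its argument handling are illustrative:

#!/bin/sh
# bwrap2.sh -- expose two GPUs to the launched process
GPU_A=$1; GPU_B=$2
if [ -z "$GPU_A" ] || [ -z "$GPU_B" ]; then echo "usage: $0 <gpu id> <gpu id> <cmd> [args..]"; exit 1; fi
shift 2
bwrap \
        --bind / / \
        --dev /dev --dev-bind /dev/nvidiactl /dev/nvidiactl --dev-bind /dev/nvidia-uvm /dev/nvidia-uvm \
        --dev-bind /dev/nvidia$GPU_A /dev/nvidia$GPU_A \
        --dev-bind /dev/nvidia$GPU_B /dev/nvidia$GPU_B \
        "$@"

# Example: expose GPU0 and GPU1 to the CUDA process
$> ./bwrap2.sh 0 1 ./test_cuda_app <args>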

Performance benefits of GPU isolation 

In this section, we compare the performance of the CUDA driver initialization API (cuInit) with and without GPU isolation, measured over 256 iterations. The measurements were taken on an x86-based machine with four A100-class GPUs.
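
One simple way to approximate this comparison yourself (a sketch; cuinit_test is a placeholder for any minimal program that just calls cuInit and exits) is to time repeated launches once from the cgroup-restricted shell and once from an unrestricted shell:

# Time 256 launches of a minimal cuInit-only program
$> time sh -c 'for i in $(seq 1 256); do ./cuinit_test > /dev/null; done'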

Figure 1. CUDA initialization performance comparison between a cgroup-constrained process and the default scenario on a four-GPU test system. With only a single GPU exposed to the calling CUDA process through cgroups, cuInit takes about 65 ms; with all four GPUs on the system available to the process, it takes about 225 ms.

Summary

GPU isolation using cgroups offers you the option of improving CUDA initialization times in those use cases where a given CUDA process does not need to use all the GPUs on the system.
