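Improving CUDA Initialization Times Using cgroups in Certain Scenarios

Many CUDA applications running on multi-GPU platforms use a single GPU for their compute needs. In such scenarios, the application pays a performance penalty because CUDA has to enumerate and initialize all the GPUs on the system. If a CUDA application does not require other GPUs to be visible and accessible, you can make it launch faster by isolating the unwanted GPUs from the CUDA process and eliminating unnecessary initialization steps.

This post discusses the various methods to accomplish this and their performance benefits.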

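GPU isolation

On Linux systems, GPU isolation can be achieved with Linux tools such as cgroups. In this section, we first discuss a lower-level approach and then a possible higher-level approach.

Another method that CUDA provides for isolating devices is the CUDA_VISIBLE_DEVICES environment variable. Although functionally similar, this approach yields more limited initialization-time gains than the cgroups approach.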
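For comparison with the cgroups steps that follow, here is a minimal sketch of the CUDA_VISIBLE_DEVICES approach. The helper name run_on_gpu is hypothetical and used only for illustration; CUDA_VISIBLE_DEVICES itself is the environment variable named above, and it accepts a comma-separated list of device indices.

```shell
#!/bin/sh
# run_on_gpu is a hypothetical helper name, not part of CUDA.
# Usage: run_on_gpu <gpu-ids> <command> [args...]
# It sets CUDA_VISIBLE_DEVICES only for the launched command,
# leaving the calling shell's environment unchanged.
run_on_gpu() {
    gpu_ids=$1
    shift
    CUDA_VISIBLE_DEVICES=$gpu_ids "$@"
}

# Example (assumes a CUDA binary named ./test_cuda_app exists):
#   run_on_gpu 0 ./test_cuda_app
```

Because the variable is scoped to one command, other processes started from the same shell still see every GPU.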

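Isolating GPUs using cgroups V1

Control groups provide a mechanism for aggregating or partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behavior. You can use cgroups to control which GPUs are visible to a CUDA process. This ensures that only the GPUs the CUDA process needs are made available to it.

The following code is a low-level example of how to use cgroups to fully isolate a GPU to a single process. Be aware that you will likely have to run these commands in a root shell for them to work properly. We show a more convenient, higher-level utility later in this post.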

# Create a mountpoint for the cgroup hierarchy as root
$> cd /mnt
$> mkdir cgroupV1Device

# Use mount command to mount the hierarchy and attach the device subsystem to it
$> mount -t cgroup -o devices devices cgroupV1Device
$> cd cgroupV1Device
# Now create a gpu subgroup directory to restrict/allow GPU access
$> mkdir gpugroup
$> cd gpugroup
# In the gpugroup directory you will see many cgroupfs files; the ones that interest us are tasks, devices.deny and devices.allow
$> ls
tasks      devices.deny     devices.allow

# Launch a second shell from which the CUDA process will be executed, and get that shell's PID
$> echo $$

# Write this PID into the tasks file in the gpugroup folder
$> echo <PID> > tasks

# List the device numbers of nvidia devices with the ls command
$> ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Jul 11 14:28 /dev/nvidia0
crw-rw-rw- 1 root root 195,   1 Jul 11 14:28 /dev/nvidia1

# Assuming that you only want the CUDA process to access GPU0, deny it access to GPU1 by writing the following rule to devices.deny
$> echo 'c 195:1 rmw' > devices.deny

# Now GPU1 will not be visible to the CUDA process that you launch from the second shell.
# To give the CUDA process access to GPU1 again, write the same rule to devices.allow
$> echo 'c 195:1 rmw' > devices.allow
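The rule strings above are built by hand from the major and minor numbers in the ls -l listing. As a minimal sketch, the same rule can be derived from a device node directly. The helper name cgroup_rule_for is hypothetical, and the sketch assumes GNU coreutils stat, whose %t and %T format specifiers print the major and minor numbers in hexadecimal.

```shell
#!/bin/sh
# cgroup_rule_for is a hypothetical helper name.
# It prints a devices-cgroup rule ("c MAJOR:MINOR rmw") for a
# character device node, without writing anything to cgroupfs.
# Assumes GNU stat; it does not check that the node is a char device.
cgroup_rule_for() {
    hex=$(stat -c '%t %T' "$1") || return 1
    major=${hex% *}   # text before the space: major number (hex)
    minor=${hex#* }   # text after the space: minor number (hex)
    printf 'c %d:%d rmw\n' "$((0x$major))" "$((0x$minor))"
}

# Example, run as root inside the gpugroup directory:
#   cgroup_rule_for /dev/nvidia1 > devices.deny
```

Reading the numbers from the node avoids typing a wrong minor number when a system has many GPUs.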

When you are done, unmount the /mnt/cgroupV1Device folder with the umount command.

$> umount /mnt/cgroupV1Device

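To allow or deny access to any other GPU on the system, write that GPU's device number to the appropriate file. Here is an example of denying access to only GPU5 and GPU6 on a multi-GPU system.

In the gpugroup folder created earlier, write the PID of the shell from which the CUDA process will be launched into the tasks file: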

$> echo <PID> > tasks

Now add GPU5 and GPU6 to the denied list:

$> echo 'c 195:5 rmw' > devices.deny
$> echo 'c 195:6 rmw' > devices.deny

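At this point, the CUDA process can neither see nor access these two GPUs. To enable only specific GPUs for a CUDA process, add those GPUs to the devices.allow file and add the rest of the GPUs to the devices.deny file.

Access control applies on a per-process basis. Multiple processes can be added to the tasks file to propagate the same controls to more than one process.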
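The bookkeeping described above (allow a few GPUs, deny the rest) can be sketched as a small helper that only prints the deny rules, so it can be inspected before anything is written to cgroupfs. The function name deny_rules_except is hypothetical. The sketch assumes that every GPU uses major number 195, as in the listing earlier, and that minor numbers run from 0 to TOTAL-1.

```shell
#!/bin/sh
# deny_rules_except is a hypothetical helper name.
# Usage: deny_rules_except TOTAL ALLOWED_ID...
# Prints one deny rule per GPU that is NOT in the allowed list.
# Assumes major number 195 and minors 0..TOTAL-1.
deny_rules_except() {
    total=$1
    shift
    i=0
    while [ "$i" -lt "$total" ]; do
        keep=no
        for allowed in "$@"; do
            if [ "$allowed" -eq "$i" ]; then keep=yes; fi
        done
        if [ "$keep" = no ]; then
            printf 'c 195:%d rmw\n' "$i"
        fi
        i=$((i + 1))
    done
}

# Example, run as root inside the gpugroup directory on a 4-GPU system.
# Each rule is written to devices.deny with its own write:
#   deny_rules_except 4 0 | while read -r rule; do
#       echo "$rule" > devices.deny
#   done
```

Printing the rules first keeps the destructive step, the write to devices.deny, separate and easy to review.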

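Isolating GPUs using the bubblewrap utility

The bubblewrap utility (bwrap) is a higher-level utility for sandboxing and access control on Linux, and it can achieve the same effect as the solution presented earlier. You can use it to conveniently restrict or allow access to specific GPUs from a CUDA process: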

# Install the bubblewrap utility on Debian-like systems
$> sudo apt-get install -y bubblewrap

# Create a simple shell script that uses bubblewrap to bind the required GPU to the launched process

#!/bin/sh
# bwrap.sh
GPU=$1;shift   # 0, 1, 2, 3, ..
if [ "$GPU" = "" ]; then echo "missing arg: gpu id"; exit 1; fi
bwrap \
        --bind / / \
        --dev /dev --dev-bind /dev/nvidiactl /dev/nvidiactl --dev-bind /dev/nvidia-uvm /dev/nvidia-uvm  \
        --dev-bind /dev/nvidia$GPU /dev/nvidia$GPU \
        "$@"


# Launch the CUDA process with the bubblewrap utility to only allow access to a specific GPU while running
$> ./bwrap.sh 0 ./test_cuda_app <args>

More than one GPU can be exposed to a CUDA process by extending the --dev-bind options in the code example.
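One way to extend the script to several GPUs is sketched below. The function name gpu_bind_args and the comma-separated ID format are assumptions made for illustration. The function only prints the extra --dev-bind arguments, one token per line, so it can be checked without bubblewrap installed.

```shell
#!/bin/sh
# gpu_bind_args is a hypothetical helper name.
# Usage: gpu_bind_args "0,2"
# Prints a --dev-bind SRC DEST triple for each GPU id in the
# comma-separated list, one token per line.
gpu_bind_args() {
    old_ifs=$IFS
    IFS=,
    for id in $1; do
        printf '%s\n' --dev-bind "/dev/nvidia$id" "/dev/nvidia$id"
    done
    IFS=$old_ifs
}

# Example replacement for the single --dev-bind line in bwrap.sh.
# The command substitution is left unquoted on purpose so that the
# shell splits it into separate arguments:
#   bwrap --bind / / \
#       --dev /dev --dev-bind /dev/nvidiactl /dev/nvidiactl \
#       --dev-bind /dev/nvidia-uvm /dev/nvidia-uvm \
#       $(gpu_bind_args "$GPU") \
#       "$@"
```

With this change the script could be invoked as ./bwrap.sh 0,2 ./test_cuda_app to expose GPU0 and GPU2 only.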

Performance benefits of GPU isolation

In this section, we compare the performance of the CUDA driver initialization API (cuInit) with and without GPU isolation, measured over 256 iterations. The API was run on an x86-based machine with four A100 class GPUs.

Figure 1. CUDA initialization performance comparison between a cgroup-constrained process and the default scenario on a four-GPU test system. The bar graph shows cuInit taking about 65 ms when only a single GPU is exposed to the calling CUDA process through cgroups, versus 225 ms when all four GPUs on the system are made available to the CUDA process.
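To put the two reported timings in proportion, the arithmetic can be sketched as follows. The function names speedup and saved_ms are hypothetical, and the inputs are the approximate values quoted for Figure 1 (225 ms with all four GPUs, about 65 ms with one), not new measurements.

```shell
#!/bin/sh
# speedup and saved_ms are hypothetical helper names.
# speedup ALL_MS ISOLATED_MS   -> ratio of the two timings
# saved_ms ALL_MS ISOLATED_MS  -> absolute difference in ms
speedup() {
    awk -v all="$1" -v one="$2" 'BEGIN { printf "%.2f\n", all / one }'
}
saved_ms() {
    echo $(( $1 - $2 ))
}

speedup 225 65    # prints 3.46
saved_ms 225 65   # prints 160
```

With these figures, isolating a single GPU makes cuInit roughly 3.5 times faster and saves about 160 ms per process launch.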

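Summary

GPU isolation using cgroups gives you the option of improving CUDA initialization times in the limited use cases where a given CUDA process does not need all the GPUs on the system.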
