高效的 CUDA 调试：将 NVIDIA Compute Sanitizer 与 NVIDIA 工具扩展程序结合使用并创建自定义工具

NVIDIA Compute Sanitizer 是一款功能强大的工具，可以节省时间和精力，同时提高 CUDA 应用程序的可靠性和性能。在 CUDA 环境中调试代码既具有挑战性又耗时，尤其是在处理数千个线程时。Compute Sanitizer 可以提供帮助！

在这一系列的第一篇文章中，高效 CUDA 调试：使用 NVIDIA Compute Sanitizer 追踪错误中，我们将讨论如何开始使用 Compute Sanitizer 工具，以检查代码中的内存泄漏和竞争条件。

在第二篇博文中，高效的 CUDA 调试：借助 NVIDIA Compute Sanitizer 实现内存初始化和线程同步。此外，我们还探讨了用于检查内存初始化和线程同步的工具。

在本文中，我们重点介绍了 Compute Sanitizer 的一些其他功能，即它与 NVIDIA 工具扩展程序 (NVTX) 的集成，用于标记代码，以便更直接地使用 Compute Sanitizer.我们还讨论了用于 Compute Sanitizer 的 API 本身，以创建更多用于调试 CUDA 应用的工具。

NVIDIA 计算 Sanitizer

Compute Sanitizer 是一套工具，可以对代码的功能正确性执行不同类型的检查。主要有四个工具：

memcheck：内存访问错误和泄漏检测。
racecheck：共享内存数据访问的危险检测工具。
initcheck：用于检测未初始化的设备全局内存的工具。
synccheck：线程同步风险检测。

除这些工具外， NVIDIA Compute Sanitizer 还具有更多功能：

用于创建针对 CUDA 应用的清理和追踪工具的 API
与 NVIDIA 工具集成 (NVTX)
Coredump 功能用于 CUDA-GDB。
抑制功能，用于管理工具的输出

将 Compute Sanitizer 与 NVTX 结合使用

NVTX 是一种基于 C 的 API，用于标注程序中的代码范围、事件和资源。此标注支持在应用程序运行时收集更多信息，这些信息可用于在分析和分析代码时改进数据呈现。Compute Sanitizer 和 NVTX 之间的集成使您能够使用 NVTX 标注代码，以协助 Compute Sanitizer 捕获错误。

有关 NVTX 标注的更多信息，请参阅以下文章：

C/C++ 和 NVTX：CUDA Pro 提示：使用 NVTX 生成自定义应用程序配置文件时间轴
Python 和 NVTX：NVIDIA 工具扩展程序 API：用于在 Python 和 C/C++ 中分析代码的标注工具。
Fortran 和 NVTX：使用 NVTX 自定义 CUDA Fortran 分析以提高性能和效率。

我们的 NVTX 显存 API 使 CUDA 程序能够将内存限制（例如内存池管理或权限限制以及内存标记）通知 Compute Sanitizer。

内存池管理

NVTX 与 Compute Sanitizer 的第一个示例是 suballocation，它是 NVTX Memory API 的一部分。

通过 API 可以将内存分配标记为内存池。Compute Sanitizer 了解这些池，并可以检测实际正在使用特定分配的哪些部分。然后，如果在代码执行期间访问了内存池的任何未注册部分，Compute Sanitizer 通过其 `memcheck` 工具来检测这些访问。

这是基本内存池的示例，代码名为 `mempool_example.cu`。

#include 

__global__ void populateMemory(int* chunk) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  chunk[i] = i;
}

int main(int argc, char **argv) {
  int poolSize   = 4096 * sizeof(int);
  int numThreads = 63;
  // int bucketSize = numThreads * sizeof(int); // You need this later ...

  void *pool;
  cudaMallocManaged(&pool, poolSize); // Create your memory pool

  // Assign part of the memory pool to the bucket
  auto bucket = (int *)pool + 16; // Address of bucket is 16 bytes into the pool

  // Set values in bucket
  populateMemory<<<1, numThreads>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 1: bucket 0, 1 ..  62: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads-1]);

  // Set some more values in bucket
  populateMemory<<<1, numThreads + 1>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 2: bucket 0, 1 ..  63: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads]);

  cudaFree(pool);
  exit(0);
}

代码示例中，您创建了一个内存池（称为 pool），大小为 4096 个整数。然后，您可以分配该池的一部分，并用变量来标记。`bucket` 变量指向内存池的 16 字节地址开始。

您的意图是将 bucket 容器中的元素数量为 63。首先，您将元素数量存储在 bucketSize 变量中。然后，使用 GPU 内核对数据桶进行填充。块数为 1，线程数为 numThreads，这意味着 populateMemory 在 `bucket` 中执行 1 至 63 次，从而影响 `bucket` 中元素的数量。

尽管您尝试在 `populateMemory` 内核中填写 `bucket`，但由于您将线程数量设置为 `numThreads+1` (64)，这会导致一个额外的线程去处理 `bucket` 中的 63 个值。虽然您的意图是确保 `bucket` 中有 63 个值，但分配 64 个值不会导致错误，因为实际的内存池 `bucket` 很大，足以容纳额外的元素。

使用 `memcheck` 工具运行以确认潜在错误。我们在 NVIDIA V100 GPU 上运行，因此我们将 GPU 架构设置为 `sm_70`。您可能需要根据所运行的内容进行更改。

$ nvcc -o mempool.exe mempool_example.cu -arch=sm_70
$ ./mempool.exe
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 63

$compute-sanitizer --tool memcheck ./mempool.exe
========= COMPUTE-SANITIZER
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 63
========= ERROR SUMMARY: 0 errors

NVTX API 提供以下功能以帮助管理内存分配：注册任何 `cudaMalloc` 使用 NVTX 的内存堆寄存器功能进行内存分配。此操作将内存注册为表示可进一步细分为区域的内存范围的堆。以下代码展示了如何执行此操作：

首先，完成将 NVTX 与 Compute Sanitizer 结合使用所需的四个步骤。

对于 C 和 C++，NVTX 是一个仅包含报文头的库，不依赖任何包。通常，这些报文头随您首选的 CUDA 下载提供，例如工具包或 HPC SDK。然而，NVTX Memory API 是相对较新的，现在可以从 /NVIDIA/NVTX GitHub 库获取。未来，它将被包含在工具包中。

特别注意，nvToolsExtMem.h中尚未提供其他方法的头文件。因此，克隆 NVTX GitHub 分支后，请检查是否存在 `dev-mem-api`。

$ git clone --branch dev-mem-api https://github.com/NVIDIA/NVTX.git
…
$ ls NVTX/c/include/nvtx3/
nvToolsExtCuda.h    nvToolsExt.h           nvToolsExtMem.h     nvToolsExtSync.h  nvtxDetail
nvToolsExtCudaRt.h  nvToolsExtMemCudaRt.h  nvToolsExtOpenCL.h  nvtx3.hpp         nvtxExtDetail

现在，您可以在源代码开头添加 NVTX 和 NVTX API 头文件：

#include 
#include

Compute Sanitizer 需要在任何 NVTX 调用之前初始化 CUDA 运行时。无论何时在代码中开始使用 NVTX，这都会发生，具体取决于您开始使用 NVTX 的位置。您可以使用例如 `cudaFree` 的方法来实现。

// Forces CUDA runtime initialization.
cudaFree(0);

最后，创建 NVTX 域。这些是调用 API 所必需的。目前，这些域没有特定的功能，但将用于未来的 Compute Sanitizer 版本。

// Create the NVTX domain
auto mynvtxDomain = nvtxDomainCreateA("my-domain");

好的，这是完成的第一步。现在，使用 NVTX 将池分配注册为内存池或堆：

nvtxMemVirtualRangeDesc_t myPoolRangeDesc = {}; // Descriptor for the
                                                // range memory pool
myPoolRangeDesc.size = poolSize; // Size of the range memory pool
myPoolRangeDesc.ptr  = pool;     // Pointer to the pool itself

nvtxMemHeapDesc_t myHeapDesc = {}; // Descriptor for the heap

myHeapDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
myHeapDesc.structSize = sizeof(nvtxMemHeapDesc_t);
myHeapDesc.usage = NVTX_MEM_HEAP_USAGE_TYPE_SUB_ALLOCATOR;
myHeapDesc.type = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
myHeapDesc.typeSpecificDescSize = sizeof(nvtxMemVirtualRangeDesc_t);
myHeapDesc.typeSpecificDesc = &myPoolRangeDesc

auto mynvtxPool = nvtxMemHeapRegister(mynvtxDomain, &myHeapDesc);

这些步骤已将池分配给变量 `mynvtxPool`。为了使用它之前的示例，您现在必须在池中创建二次分配以表示存储桶。该语法与分配池本身的方式相同，但这次使用区域描述器而不是堆描述符：

nvtxMemVirtualRangeDesc_t mySubRangeDesc = {}; // Descriptor for the range
mySubRangeDesc.size = bucketSize; // Size of your suballocation (in bytes)
mySubRangeDesc.ptr  = bucket;     // Pointer to the suballocation

nvtxMemRegionsRegisterBatch_t myRegionsDesc = {};
myRegionsDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
myRegionsDesc.structSize  = sizeof(nvtxMemRegionsRegisterBatch_t);
myRegionsDesc.regionType  = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
myRegionsDesc.heap = mynvtxPool; // The heap you registered earlier
myRegionsDesc.regionCount = 1;
myRegionsDesc.regionDescElementSize = sizeof(nvtxMemVirtualRangeDesc_t);
myRegionsDesc.regionDescElements = &mySubRangeDesc

nvtxMemRegionsRegister(mynvtxDomain, &myRegionsDesc);

这既是您的内存池，也是二次分配存储桶，bucket 已在 NVTX 中注册。这意味着 Compute Sanitizer 可以将其属性纳入其检查。现在，请查看它是否选择了错误的填充尝试 bucket，其值超出预期范围。

这是包含 NVTX 注册的完整代码示例，名为 `mempool_nvtx_example.cu`。

#include 
#include 

#include 

__global__ void populateMemory(int* chunk) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  chunk[i] = i;
}

int main(int argc, char **argv) {
  int poolSize   = 4096 * sizeof(int);
  int numThreads = 63;
  int bucketSize = numThreads * sizeof(int);

  // Forces CUDA runtime initialization.
  cudaFree(0);

  // Create the NVTX domain
  auto mynvtxDomain = nvtxDomainCreateA("my-domain");


  void *pool;
  cudaMallocManaged(&pool, poolSize); // Create your memory pool

  // Register the pool with NVTX
  nvtxMemVirtualRangeDesc_t myPoolRangeDesc = {}; // Descriptor for the
                                                  // range memory pool
  myPoolRangeDesc.size = poolSize; // Size of the range memory pool
  myPoolRangeDesc.ptr  = pool;     // Pointer to the pool itself

  nvtxMemHeapDesc_t myHeapDesc = {}; // Descriptor for the heap

  myHeapDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
  myHeapDesc.structSize = sizeof(nvtxMemHeapDesc_t);
  myHeapDesc.usage = NVTX_MEM_HEAP_USAGE_TYPE_SUB_ALLOCATOR;
  myHeapDesc.type = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
  myHeapDesc.typeSpecificDescSize = sizeof(nvtxMemVirtualRangeDesc_t);
  myHeapDesc.typeSpecificDesc = &myPoolRangeDesc

  auto mynvtxPool = nvtxMemHeapRegister(mynvtxDomain, &myHeapDesc);

  // Assign part of the memory pool to the bucket
  auto bucket = (int *)pool + 16; // Address of bucket is 16 bytes into the pool

  // Register bucket as a suballocated region in NVTX
  nvtxMemVirtualRangeDesc_t mySubRangeDesc = {}; // Descriptor for the range
  mySubRangeDesc.size = bucketSize; // Size of your suballocation (in bytes)
  mySubRangeDesc.ptr  = bucket;     // Pointer to the suballocation

  nvtxMemRegionsRegisterBatch_t myRegionsDesc = {};
  myRegionsDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
  myRegionsDesc.structSize  = sizeof(nvtxMemRegionsRegisterBatch_t);
  myRegionsDesc.regionType  = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
  myRegionsDesc.heap = mynvtxPool; // The heap you registered earlier
  myRegionsDesc.regionCount = 1;
  myRegionsDesc.regionDescElementSize = sizeof(nvtxMemVirtualRangeDesc_t);
  myRegionsDesc.regionDescElements = &mySubRangeDesc

  nvtxMemRegionsRegister(mynvtxDomain, &myRegionsDesc);

  // Set values in bucket
  populateMemory<<<1, numThreads>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 1: bucket 0, 1 ..  62: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads-1]);

  // Set some more values in bucket
  populateMemory<<<1, numThreads + 1>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 2: bucket 0, 1 ..  63: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads]);

  cudaFree(pool);
  exit(0);
}

编译并再次通过 Compute Sanitizer 运行。编译步骤中的 include 语句应指向 NVTX 头文件的安装位置。

$ nvcc -I ./NVTX/c/include -o mempool_nvtx.exe mempool_nvtx_example.cu -arch=sm_70
$ compute-sanitizer --tool memcheck --destroy-on-device-error=kernel ./mempool_nvtx.exe
========= COMPUTE-SANITIZER
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
========= Invalid __global__ write of size 4 bytes
=========     at populateMemory(int *)+0x70
=========     by thread (63,0,0) in block (0,0,0)
=========     Address 0x7f2a9800013c is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f2a98000040 of size 252 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
. . .
=========
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 0
========= ERROR SUMMARY: 1 error

Compute Sanitizer 确实捕获了写入指定数组末端元素的尝试：

Invalid __global__ write of size 4 bytes

现在，如果您有部分内存池要调整大小甚至销毁，该怎么办？NVTX 内存 API 还提供了对其 NVTX 注册执行此操作的类似方法。

如需调整大小，请返回上一个池和存储桶示例。如果您想将存储桶的大小从 63 个元素调整为 64 个元素，请修改之前的代码示例，并使用以下内容调整 NVTX 注册存储桶的大小，以反映这一点：

// Resizing the sub-allocation within the memory pool
 // You reuse mySubRangeDesc from earlier
 mySubRangeDesc.size = bucketSize + 4; // You want one extra int (4B) element
 mySubRangeDesc.ptr  = bucket;

 nvtxMemRegionsResizeBatch_t myNewRegionsDesc = {};
 myNewRegionsDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
 myNewRegionsDesc.structSize = sizeof(mySubRangeDesc);
 myNewRegionsDesc.regionType = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
 myNewRegionsDesc.regionDescCount = 1;
 myNewRegionsDesc.regionDescElementSize = sizeof(mySubRangeDesc);
 myNewRegionsDesc.regionDescElements = &mySubRangeDesc

 nvtxMemRegionsResize(mynvtxDomain, &myNewRegionsDesc);

正如您所见，它与二次分配的初始声明类似，但使用了 `nvtxMemRegionsResize` 函数在最后

仔细检查 Compute Sanitizer 的配置，以确保它对大小存储桶分配值的合理调整。在两次调用之间添加调整大小的注册表代码，例如在示例代码中的 `populateMemory` 函数中进行编译和运行。

$ nvcc -I./NVTX/c/include -o mempool_resize.exe mempool_resize_example.cu -arch=sm_70
$ compute-sanitizer --tool memcheck --destroy-on-device-error=kernel ./mempool_resize.exe
========= COMPUTE-SANITIZER
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 63
========= ERROR SUMMARY: 0 errors

希望您能看到类似的示例。在调整二次分配的注册大小后，现在不会有人对尝试访问添加到最后的新元素提出任何投诉。

使用 NVTX 和 Compute Sanitizer 需要大量的代码行来管理注册池和调整池大小，以及二次分配。对于复杂的代码，这可能会很繁琐。为了简化使用，可以将步骤封装到一个单独的类中。NVTX 和 Compute Sanitizer 库提供了示例代码，展示如何从这些类中获取此类方法，这可以作为您自己代码的起点。

此外还有两个 NVTX API：

命名 API：允许区域或二次分配具有与其关联的 ASCII 名称。然后，它可用于在错误报告中按其名称引用分配，目前支持此报告用于泄漏和未使用的内存报告。
访问权限 API：用于限制分配的访问权限，使其仅为只读或原子访问。

Compute Sanitizer API，用于创建您自己的工具

Compute Sanitizer 随附 API，使您能够针对 CUDA 应用创建自己的清理和追踪工具。这是一组功能，您可以使用这些功能与 Compute Sanitizer 进行交互以进行控制和配置，启用或禁用其功能，以及访问其结果。

该 API 还为您提供了将 Compute Sanitizer 集成到开发工作流程的便捷方式，因为它可以轻松集成到现有的 CUDA 应用中。借助 Compute Sanitizer API，您可以直接利用强大的调试功能，提高 CUDA 应用的可靠性和性能。

它由以下子 API 组成：

回调：支持您在用户代码中注册回调函数，这些回调函数与相关 CUDA 函数或事件相关联，例如 `memcpy` 运算或驱动函数。订阅者可以使用这些回调函数，例如用于事件跟踪。
补丁：支持将补丁函数加载到在 GPU 上执行的设备代码中。然后，它们可以用作仪器点，这意味着每当执行补丁事件时都会执行补丁函数，例如用于设置回调，例如进行内存访问的设备代码。
显存：为标准 CUDA 内存 API 提供替代函数。可以在 Compute Sanitizer 回调函数中安全调用替代函数，例如使用 sanitizerAlloc() 而不是 cudaMalloc()。

这些 API 相结合，使您能够将 Compute Sanitizer 功能整合到自己的工具中。

有关更多信息和一些示例代码，请参阅 NVIDIA Compute Sanitizer API 指南。

结束语

使用 NVIDIA 计算 Sanitizer 立即下载 CUDA 工具包。

希望我们已经为您详细介绍了 Compute Sanitizer 中的一些附加功能。有关更多信息，请参阅 /NVIDIA/compute-sanitizer-samples GitHub 库和 NVIDIA Compute Sanitizer 用户手册。

这些近期的讲座介绍了 Compute Sanitizer 中引入的一些更新功能：

如果您需要支持，NVIDIA 开发者论坛是一个很好的起点。该论坛有专门的 Compute Sanitizer 论坛，专门针对 Compute Sanitizer 工具提供支持。

祝您在寻找错误时好运！

高效的 CUDA 调试：将 NVIDIA Compute Sanitizer 与 NVIDIA 工具扩展程序结合使用并创建自定义工具

NVIDIA 计算 Sanitizer

将 Compute Sanitizer 与 NVTX 结合使用

内存池管理

Compute Sanitizer API，用于创建您自己的工具

结束语

Tags

关于作者

高效的 CUDA 调试：将 NVIDIA Compute Sanitizer 与 NVIDIA 工具扩展程序结合使用并创建自定义工具

NVIDIA 计算 Sanitizer

将 Compute Sanitizer 与 NVTX 结合使用

内存池管理

Compute Sanitizer API，用于创建您自己的工具

结束语

Tags

关于作者

Related posts

借助 Union.ai 和 NVIDIA DGX 云实现 AI 工作流程的大众化

借助 NVIDIA AI 基础模型 VISTA-2D 推进细胞分割和形态分析

使用 GPU 加速的 nvImageCodec 推进医学影像解码

使用 WholeGraph 优化图形神经网络的内存和检索，第 2 部分

借助 NVIDIA NeMo Curator 扩展和整理用于 LLM 训练的高质量数据集