
Efficient CUDA Debugging: Using NVIDIA Compute Sanitizer with NVIDIA Tools Extension and Creating Custom Tools


NVIDIA Compute Sanitizer is a powerful tool that can save you time and effort while improving the reliability and performance of your CUDA applications. Debugging code in the CUDA environment can be both challenging and time-consuming, especially when dealing with thousands of threads. Compute Sanitizer can help!

In the first post in this series, Efficient CUDA Debugging: How to Hunt Bugs with NVIDIA Compute Sanitizer, we discussed how to get started with some of the Compute Sanitizer tools to check for memory leaks and race conditions while debugging code. 

In the second post, Efficient CUDA Debugging: Memory Initialization and Thread Synchronization with NVIDIA Compute Sanitizer, we explored the tools for checking memory initialization and thread synchronization.

In this post, we highlight some of the other capabilities of Compute Sanitizer, namely its integration with the NVIDIA Tools Extension (NVTX) for marking up code to facilitate working with Compute Sanitizer more directly. We also discuss the API for Compute Sanitizer itself, to enable the creation of more tools for debugging CUDA applications.

NVIDIA Compute Sanitizer

Compute Sanitizer is a suite of tools that can perform different types of checks on the functional correctness of your code. There are four main tools:

  • memcheck: Memory access error and leak detection.
  • racecheck: Shared memory data access hazard detection.
  • initcheck: Uninitialized device global memory access detection.
  • synccheck: Thread synchronization hazard detection.
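
You select a tool with the --tool flag when launching your application under compute-sanitizer; memcheck is the default if no tool is specified. For example (where ./app.exe stands in for your own binary):

$ compute-sanitizer --tool memcheck ./app.exe
$ compute-sanitizer --tool racecheck ./app.exe
$ compute-sanitizer --tool initcheck ./app.exe
$ compute-sanitizer --tool synccheck ./app.exe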

As well as these four tools, Compute Sanitizer has further capabilities; in this post, we cover two of them: the NVTX integration and the Compute Sanitizer API.

Using Compute Sanitizer with NVTX

NVTX is a C-based API for annotating code ranges, events, and resources in a program. This annotation enables the collection of more information when the application is run, which can be used to improve data presentation when profiling and analyzing code. The integration between Compute Sanitizer and NVTX enables you to use NVTX to annotate your code to assist Compute Sanitizer in catching bugs.
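
For instance, classic NVTX usage wraps a region of interest in a named range that profiling tools can then display. Here’s a minimal sketch (the "myStep" label is arbitrary):

#include <nvtx3/nvToolsExt.h>

nvtxRangePushA("myStep"); // Begin a named range
// ... work to be profiled and annotated ...
nvtxRangePop();           // End the innermost open range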

For more information about NVTX annotation, see the documentation in the /NVIDIA/NVTX GitHub repo.

The NVTX Memory API enables CUDA programs to inform Compute Sanitizer about memory limitations, such as memory pool management or permission restrictions, as well as memory labeling.

Memory pool management

The first example of NVTX integration with Compute Sanitizer comes through the suballocation part of the NVTX Memory API. 

With the API, you can annotate memory allocations as memory pools. Compute Sanitizer is aware of these pools and can detect which parts of a specific allocation are actually being used. Then, if any unregistered parts of the memory pool are accessed during the execution of the code, these accesses are detected through the Compute Sanitizer memcheck tool. 

Here’s an example of a basic memory pool, mempool_example.cu.

#include <stdio.h>

__global__ void populateMemory(int* chunk) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  chunk[i] = i;
}

int main(int argc, char **argv) {
  int poolSize   = 4096 * sizeof(int);
  int numThreads = 63;
  // int bucketSize = numThreads * sizeof(int); // You need this later ...

  void *pool;
  cudaMallocManaged(&pool, poolSize); // Create your memory pool

  // Assign part of the memory pool to the bucket
  auto bucket = (int *)pool + 16; // Bucket starts 16 ints (64 bytes) into the pool

  // Set values in bucket
  populateMemory<<<1, numThreads>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 1: bucket 0, 1 ..  62: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads-1]);

  // Set some more values in bucket
  populateMemory<<<1, numThreads + 1>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 2: bucket 0, 1 ..  63: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads]);

  cudaFree(pool);
  exit(0);
}

In the code example, you create a memory pool (helpfully called pool!) of size 4096 integers. You then assign a section of that pool, noted by the variable bucket, to start 16 int elements (64 bytes) in from the start of the pool.

You intend bucket to hold numThreads elements, in this case 63, as captured by the bucketSize variable (commented out for now, but needed later). You then fill up your bucket with some values, using the GPU kernel populateMemory. The number of blocks is set to 1 and the number of threads to numThreads, which means populateMemory executes 1×63 times, setting every value in bucket as intended.

However, you then try to fill up bucket again through the populateMemory kernel, this time with the number of threads set to numThreads+1 (64). Your intention was for bucket to hold 63 values, but now you are trying to assign 64. This does not cause an error, though: you are not accessing any out-of-bounds memory, for instance, because the memory pool that bucket is part of is large enough to accommodate the extra element.

Compile it and run it through memcheck to confirm for yourself that this potential bug is not picked up. We ran on an NVIDIA V100 GPU, so we set the GPU architecture to sm_70; you may have to change this depending on the GPU you are running on.

$ nvcc -o mempool.exe mempool_example.cu -arch=sm_70
$ ./mempool.exe
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 63

$ compute-sanitizer --tool memcheck ./mempool.exe
========= COMPUTE-SANITIZER
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 63
========= ERROR SUMMARY: 0 errors

This is where the NVTX API can assist. You can register any CUDA memory allocation (from cudaMalloc or cudaMallocManaged, for example) with NVTX using its memory heap register function, nvtxMemHeapRegister. This registers the memory as a heap representing a range of memory that can be further subdivided into regions. You can do that in this code with the following process.

First, complete the four steps that are required to use NVTX with Compute Sanitizer. 

For C and C++, NVTX is a header-only library with no dependencies, so you only have to get the NVTX headers for inclusion. Ordinarily, these come with your preferred CUDA download, such as the toolkit or the HPC SDK. However, the NVTX Memory API is relatively new, so for now, get it from the /NVIDIA/NVTX GitHub repo. In the future, it will be included as part of the toolkit.

In particular, it is the nvToolsExtMem.h header that is not yet available through the other methods, so check that it is there after you have cloned the dev-mem-api branch of the NVTX GitHub repo:

$ git clone --branch dev-mem-api https://github.com/NVIDIA/NVTX.git
…
$ ls NVTX/c/include/nvtx3/
nvToolsExtCuda.h    nvToolsExt.h           nvToolsExtMem.h     nvToolsExtSync.h  nvtxDetail
nvToolsExtCudaRt.h  nvToolsExtMemCudaRt.h  nvToolsExtOpenCL.h  nvtx3.hpp         nvtxExtDetail

Now you can include the NVTX and NVTX API header files at the start of the source code:

#include <nvtx3/nvToolsExt.h>
#include <nvtx3/nvToolsExtMem.h>

Compute Sanitizer requires that the CUDA runtime be initialized before any NVTX calls. This may happen in your code anyway, depending on where you start to use NVTX, but you can force it by using cudaFree, for example:

// Forces CUDA runtime initialization.
cudaFree(0);

Finally, create an NVTX domain. These are required for calls to the API. Currently, the domains have no particular function, but they will in future Compute Sanitizer versions.

// Create the NVTX domain
auto mynvtxDomain = nvtxDomainCreateA("my-domain");

Okay, that’s the prerequisite steps done. Now, register the pool allocation as a memory pool or heap with NVTX:

nvtxMemVirtualRangeDesc_t myPoolRangeDesc = {}; // Descriptor for the
                                                // range memory pool
myPoolRangeDesc.size = poolSize; // Size of the range memory pool
myPoolRangeDesc.ptr  = pool;     // Pointer to the pool itself

nvtxMemHeapDesc_t myHeapDesc = {}; // Descriptor for the heap

myHeapDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
myHeapDesc.structSize = sizeof(nvtxMemHeapDesc_t);
myHeapDesc.usage = NVTX_MEM_HEAP_USAGE_TYPE_SUB_ALLOCATOR;
myHeapDesc.type = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
myHeapDesc.typeSpecificDescSize = sizeof(nvtxMemVirtualRangeDesc_t);
myHeapDesc.typeSpecificDesc = &myPoolRangeDesc;

auto mynvtxPool = nvtxMemHeapRegister(mynvtxDomain, &myHeapDesc);

Those steps have registered the pool and assigned it to the variable mynvtxPool. To make use of it for the earlier example, you now must create a suballocation within the pool to represent the bucket. The syntax is similar to the way you registered the pool itself, but this time, use a region descriptor rather than a heap descriptor:

nvtxMemVirtualRangeDesc_t mySubRangeDesc = {}; // Descriptor for the range
mySubRangeDesc.size = bucketSize; // Size of your suballocation (in bytes)
mySubRangeDesc.ptr  = bucket;     // Pointer to the suballocation

nvtxMemRegionsRegisterBatch_t myRegionsDesc = {};
myRegionsDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
myRegionsDesc.structSize  = sizeof(nvtxMemRegionsRegisterBatch_t);
myRegionsDesc.regionType  = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
myRegionsDesc.heap = mynvtxPool; // The heap you registered earlier
myRegionsDesc.regionCount = 1;
myRegionsDesc.regionDescElementSize = sizeof(nvtxMemVirtualRangeDesc_t);
myRegionsDesc.regionDescElements = &mySubRangeDesc;

nvtxMemRegionsRegister(mynvtxDomain, &myRegionsDesc);

That’s both your pool of memory and the suballocation bucket now registered with NVTX. This means that Compute Sanitizer can include their properties as part of its checks. Now, see if it picks up an incorrect attempt to populate bucket outside of its intended range. 

Here’s the full code example of the basic memory pool, now with NVTX registration, mempool_nvtx_example.cu.

#include <nvtx3/nvToolsExt.h>
#include <nvtx3/nvToolsExtMem.h>

#include <stdio.h>

__global__ void populateMemory(int* chunk) {
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  chunk[i] = i;
}

int main(int argc, char **argv) {
  int poolSize   = 4096 * sizeof(int);
  int numThreads = 63;
  int bucketSize = numThreads * sizeof(int);

  // Forces CUDA runtime initialization.
  cudaFree(0);

  // Create the NVTX domain
  auto mynvtxDomain = nvtxDomainCreateA("my-domain");


  void *pool;
  cudaMallocManaged(&pool, poolSize); // Create your memory pool

  // Register the pool with NVTX
  nvtxMemVirtualRangeDesc_t myPoolRangeDesc = {}; // Descriptor for the
                                                  // range memory pool
  myPoolRangeDesc.size = poolSize; // Size of the range memory pool
  myPoolRangeDesc.ptr  = pool;     // Pointer to the pool itself

  nvtxMemHeapDesc_t myHeapDesc = {}; // Descriptor for the heap

  myHeapDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
  myHeapDesc.structSize = sizeof(nvtxMemHeapDesc_t);
  myHeapDesc.usage = NVTX_MEM_HEAP_USAGE_TYPE_SUB_ALLOCATOR;
  myHeapDesc.type = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
  myHeapDesc.typeSpecificDescSize = sizeof(nvtxMemVirtualRangeDesc_t);
  myHeapDesc.typeSpecificDesc = &myPoolRangeDesc;

  auto mynvtxPool = nvtxMemHeapRegister(mynvtxDomain, &myHeapDesc);

  // Assign part of the memory pool to the bucket
  auto bucket = (int *)pool + 16; // Bucket starts 16 ints (64 bytes) into the pool

  // Register bucket as a suballocated region in NVTX
  nvtxMemVirtualRangeDesc_t mySubRangeDesc = {}; // Descriptor for the range
  mySubRangeDesc.size = bucketSize; // Size of your suballocation (in bytes)
  mySubRangeDesc.ptr  = bucket;     // Pointer to the suballocation

  nvtxMemRegionsRegisterBatch_t myRegionsDesc = {};
  myRegionsDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
  myRegionsDesc.structSize  = sizeof(nvtxMemRegionsRegisterBatch_t);
  myRegionsDesc.regionType  = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
  myRegionsDesc.heap = mynvtxPool; // The heap you registered earlier
  myRegionsDesc.regionCount = 1;
  myRegionsDesc.regionDescElementSize = sizeof(nvtxMemVirtualRangeDesc_t);
  myRegionsDesc.regionDescElements = &mySubRangeDesc;

  nvtxMemRegionsRegister(mynvtxDomain, &myRegionsDesc);

  // Set values in bucket
  populateMemory<<<1, numThreads>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 1: bucket 0, 1 ..  62: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads-1]);

  // Set some more values in bucket
  populateMemory<<<1, numThreads + 1>>>(bucket);
  cudaDeviceSynchronize();
  printf("After populateMemory 2: bucket 0, 1 ..  63: %d %d .. %d\n", bucket[0], bucket[1], bucket[numThreads]);

  cudaFree(pool);
  exit(0);
}

Compile that and run it through Compute Sanitizer again. The -I include path in the compile step should point to where you installed the NVTX headers.

$ nvcc -I ./NVTX/c/include -o mempool_nvtx.exe mempool_nvtx_example.cu -arch=sm_70
$ compute-sanitizer --tool memcheck --destroy-on-device-error=kernel ./mempool_nvtx.exe
========= COMPUTE-SANITIZER
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
========= Invalid __global__ write of size 4 bytes
=========     at populateMemory(int *)+0x70
=========     by thread (63,0,0) in block (0,0,0)
=========     Address 0x7f2a9800013c is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f2a98000040 of size 252 bytes
=========     Saved host backtrace up to driver entry point at kernel launch time
. . .
=========
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 0
========= ERROR SUMMARY: 1 error

Compute Sanitizer did capture the attempt to write to an element off the end of the designated array: 

Invalid __global__ write of size 4 bytes

Now, what if you want to resize parts of your memory pool, or even destroy them? The NVTX Memory API also provides analogous methods for keeping the NVTX registrations in step.

For resizing, return to the previous pool and bucket example. If you want to resize your bucket from 63 to 64 elements, adding the following to the previous code example also resizes your NVTX-registered bucket to reflect this:

// Resizing the suballocation within the memory pool
// You reuse mySubRangeDesc from earlier
mySubRangeDesc.size = bucketSize + sizeof(int); // You want one extra int element
mySubRangeDesc.ptr  = bucket;

nvtxMemRegionsResizeBatch_t myNewRegionsDesc = {};
myNewRegionsDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
myNewRegionsDesc.structSize = sizeof(nvtxMemRegionsResizeBatch_t);
myNewRegionsDesc.regionType = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
myNewRegionsDesc.regionDescCount = 1;
myNewRegionsDesc.regionDescElementSize = sizeof(mySubRangeDesc);
myNewRegionsDesc.regionDescElements = &mySubRangeDesc;

nvtxMemRegionsResize(mynvtxDomain, &myNewRegionsDesc);

As you can see, it is a similar process to the initial registration of the suballocation, but with the function nvtxMemRegionsResize used at the end.

Double-check that Compute Sanitizer is happy with the now-legitimate attempt to assign values to the resized bucket. Add the resizing registration code between the two calls to populateMemory in the example code and then compile and run. 

$ nvcc -I./NVTX/c/include -o mempool_resize.exe mempool_resize_example.cu -arch=sm_70
$ compute-sanitizer --tool memcheck --destroy-on-device-error=kernel ./mempool_resize.exe
========= COMPUTE-SANITIZER
After populateMemory 1: bucket 0, 1 ..  62: 0 1 .. 62
After populateMemory 2: bucket 0, 1 ..  63: 0 1 .. 63
========= ERROR SUMMARY: 0 errors

Hopefully, you see something like this example. As you’ve resized your registration of the suballocation, there are now no complaints about attempts to access the new element added to the end.
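
Destroying a suballocation works in much the same way. The following is a sketch based on our reading of the dev-mem-api branch header, so treat the struct layout as an assumption rather than a definitive recipe; it unregisters bucket by referring to its pointer:

// Unregister the bucket suballocation (sketch; field names assume the
// dev-mem-api branch of the NVTX headers)
nvtxMemRegionRef_t myRegionRef = {};
myRegionRef.pointer = bucket; // Refer to the region by its pointer

nvtxMemRegionsUnregisterBatch_t myUnregisterDesc = {};
myUnregisterDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
myUnregisterDesc.structSize = sizeof(nvtxMemRegionsUnregisterBatch_t);
myUnregisterDesc.refType = NVTX_MEM_REGION_REF_TYPE_POINTER;
myUnregisterDesc.refCount = 1;
myUnregisterDesc.refElementSize = sizeof(nvtxMemRegionRef_t);
myUnregisterDesc.refElements = &myRegionRef;

nvtxMemRegionsUnregister(mynvtxDomain, &myUnregisterDesc);

Once the region is unregistered, accesses to that part of the pool should once again be reported by memcheck as touching unregistered pool memory.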

There are a fair number of lines of code required to register and resize your pools and suballocations for use with NVTX and Compute Sanitizer. This may become cumbersome for more complicated code, so it might be useful to encapsulate the steps in a separate class. Handily, there’s a nice example of such an approach available in the /NVIDIA/compute-sanitizer-samples GitHub repo, so that’s a great starting point for your own code.
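
To give a flavor of what such a wrapper might look like, here is a minimal hypothetical sketch (the NvtxPool name and its interface are ours, not part of any API) that registers a heap on construction and unregisters it on destruction:

// Hypothetical RAII wrapper around NVTX memory heap registration
class NvtxPool {
public:
  NvtxPool(nvtxDomainHandle_t domain, void *ptr, size_t size) : domain_(domain) {
    nvtxMemVirtualRangeDesc_t rangeDesc = {};
    rangeDesc.size = size;
    rangeDesc.ptr  = ptr;

    nvtxMemHeapDesc_t heapDesc = {};
    heapDesc.extCompatID = NVTX_EXT_COMPATID_MEM;
    heapDesc.structSize = sizeof(nvtxMemHeapDesc_t);
    heapDesc.usage = NVTX_MEM_HEAP_USAGE_TYPE_SUB_ALLOCATOR;
    heapDesc.type = NVTX_MEM_TYPE_VIRTUAL_ADDRESS;
    heapDesc.typeSpecificDescSize = sizeof(nvtxMemVirtualRangeDesc_t);
    heapDesc.typeSpecificDesc = &rangeDesc;

    heap_ = nvtxMemHeapRegister(domain_, &heapDesc);
  }
  ~NvtxPool() { nvtxMemHeapUnregister(domain_, heap_); }

  nvtxMemHeapHandle_t handle() const { return heap_; }

private:
  nvtxDomainHandle_t  domain_;
  nvtxMemHeapHandle_t heap_;
};

With something along these lines, the pool registration in the earlier example collapses to NvtxPool myPool(mynvtxDomain, pool, poolSize); and similar wrappers could cover region registration and resizing.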

There are two more NVTX APIs to mention:

  • Naming API: Enables a region or suballocation to have an ASCII name associated with it. The name can then be used to refer to an allocation in error reports, which is currently supported for leak and unused memory reporting.
  • Permissions API: Enables allocation access permissions to be restricted to, for example, read-only or atomic access.

Compute Sanitizer API for creating your own tools

Compute Sanitizer comes with APIs that enable you to create your own sanitizing and tracing tools to target CUDA applications. It is a set of functions that you can use to interact with Compute Sanitizer to control and configure it, enable or disable its features, and access its results.

The API also provides a convenient way to make Compute Sanitizer part of your development workflow, as it can be incorporated into existing CUDA applications with little effort. With the Compute Sanitizer API, you can take advantage of its powerful debugging capabilities directly and improve the reliability and performance of your CUDA applications.

It consists of the following sub-APIs:

  • Callback: Enables you to register callbacks in user code and associate them with groups of related CUDA functions or events, such as memcpy operations or driver functions. These callbacks can then be consumed by a subscriber, for example, for event tracking (see the sketch after this list).
  • Patching: Enables patch functions to be loaded and inserted into device code executing on the GPU. These act as instrumentation points, meaning the patch function runs whenever the patched event occurs, for example, to trigger a callback when device code makes a memory access.
  • Memory: Provides replacement functions for the standard CUDA memory API. The replacements can be called safely from within Compute Sanitizer callbacks, for example, sanitizerAlloc instead of cudaMalloc.
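
To give a flavor of the callback API, here is a minimal sketch of a tool that prints the name of each kernel as it launches. It follows the patterns in the /NVIDIA/compute-sanitizer-samples repo, but treat the details as assumptions rather than a definitive implementation; such tools are built as shared libraries and loaded into the target application, as described in the samples repo.

#include <sanitizer.h>

#include <cstdio>

// Called by Compute Sanitizer for every event in the domains we enable
static void LaunchCallback(void* userdata, Sanitizer_CallbackDomain domain,
                           Sanitizer_CallbackId cbid, const void* cbdata) {
  if (domain == SANITIZER_CB_DOMAIN_LAUNCH && cbid == SANITIZER_CBID_LAUNCH_BEGIN) {
    auto* launchData = static_cast<const Sanitizer_LaunchData*>(cbdata);
    printf("Kernel launched: %s\n", launchData->functionName);
  }
}

// Runs at library load time: subscribe and enable launch callbacks
__attribute__((constructor)) static void InitializeTool() {
  Sanitizer_SubscriberHandle handle;
  sanitizerSubscribe(&handle, LaunchCallback, nullptr);
  sanitizerEnableDomain(1, handle, SANITIZER_CB_DOMAIN_LAUNCH);
}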

In combination, these APIs give you the ability to incorporate Compute Sanitizer functionality into your own tools.

For more information and some example code, see the NVIDIA Compute Sanitizer API Guide.

Conclusion

Use NVIDIA Compute Sanitizer today by downloading the CUDA Toolkit.

Hopefully, we’ve given you a feel for some of the additional features in Compute Sanitizer. There are more examples in the /NVIDIA/compute-sanitizer-samples GitHub repo. For more information, see the NVIDIA Compute Sanitizer User Manual.

For support, the NVIDIA Developer Forums are a great place to start, including the dedicated Compute Sanitizer forum.

Good luck on your bug hunt!
