Simulation / Modeling / Design

CUDA Context-Independent Module Loading

Most CUDA developers are familiar with the cuModuleLoad API and its counterparts for loading a module containing device code into a CUDA context. In most cases, you want to load identical device code on all devices. This requires loading device code into each CUDA context explicitly. Moreover, libraries and frameworks that do not control context creation and destruction must keep track of them to explicitly load and unload modules. 

This post discusses context-independent loading introduced in CUDA 12.0, which solves these problems.

Context-dependent loading

Traditionally, module loading has always been associated with a CUDA context. The following code example shows the traditional way of loading identical device code into two devices and then launching kernels on them.

// Device 0
cuDeviceGet(&device0, 0);
cuDevicePrimaryCtxRetain(&ctx0, device0);
cuModuleLoad(&module0, “myModule.cubin”);
// Device 1
cuDeviceGet(&device1, 1);
cuDevicePrimaryCtxRetain(&ctx1, device1);
cuModuleLoad(&module1, “myModule.cubin”);

Launching a kernel on each of the devices requires you to retrieve a per-module CUfunction as shown in the following code example:

// Device 0
cuModuleGetFuntion(&function0, module0, “myKernel”);
cuLaunchKernel(function0, …);
// Device 1
cuModuleGetFuntion(&function1, module1, “myKernel”);
cuLaunchKernel(function1, …);

This increases code complexity in the application as you must retrieve and track the per-context and per-module types. You also have to unload each module explicitly by using the cuModuleUnload API.

The problem is exacerbated when libraries or frameworks primarily use CUDA driver APIs for loading their own modules. They may not have complete control over the lifetime of contexts owned by the application.

// Application code

libraryInitialize();
cuDevicePrimaryCtxRetain(&ctx0, device0);
libraryFunc();
cuDevicePrimaryCtxRetain(&ctx0, device1);
libraryFunc();
libraryDeinitialize();

// Library code

libraryInitialize() {
  map<CUcontext, CUmodule> moduleContextMap;
}

libraryFunc() {
  cuCtxGetCurrent(&ctx);
  if (!moduleContextMap.contains(ctx)){
    cuModuleLoad(&module, “myModule.cubin”);
    moduleContextMap[ctx] = module;
  }
  else {
    module = moduleContextMap[ctx];
  }

  cuModuleGetFuntion(&function, module, “myKernel”);
  cuLaunchKernel(function, …);
}

libraryDeinitialize() {
  moduleContextMap.clear();
}

In the code example, the library must check for new contexts and load modules into them explicitly. It also must maintain state to check if the module is already loaded into the context. 

Ideally, the state can be freed after the context is destroyed. However, this is not possible if the library has no control over the lifetime of contexts. 

This means that the freeing of resources must be delayed until library deinitialization. This not only increases code complexity, but it also causes the library to hold on to resources longer than it must, potentially denying another portion of the application from using that memory.

Another alternative is for libraries and frameworks to force additional constraints on the users to ensure that they have sufficient control over resource allocation and cleanup.

Context-independent loading

CUDA 12.0 introduces context-independent loading with the addition of the cuLibrary* and cuKernel* APIs, which solve these problems. With context-independent loading, the loading and unloading of modules into each CUDA context is done automatically by the CUDA driver as contexts are created and destroyed by the application.

// Load library
cuLibraryLoadFromFile(&library,“myModule.cubin”, …);
cuLibraryGetKernel(&kernel, library, “myKernel”);

// Launch kernel on the primary context of device 0
cuDevicePrimaryCtxRetain(&ctx0, device0);
cuLaunchKernel((CUfunction)kernel, …);

// Launch kernel on the primary context of device 1
cuDevicePrimaryCtxRetain(&ctx1, device1);
cuLaunchKernel((CUfunction)kernel, …);

// Unload library
cuLibraryUnload(library);

As shown in the code example, the cuLibraryLoadFromFile API takes care of loading the module when a context is created or initialized. In the example, this is done during cuDevicePrimaryCtxRetain

Note that the API is used to launch context-less kernel CUkernel by querying the handle using cuLibraryGetKernel() and then passing it to the API by casting to CUfunction. Here, the context to launch the kernel will either be taken from the specified stream hStream or the current context in case of NULL stream.

Moreover, you can now launch the kernels using the context-independent handle CUkernel, rather than having to maintain a per-context CUfunctioncuLibraryGetKernel retrieves a context-independent handle to the device function myKernel. The device function can then be launched with cuLaunchKernel by specifying the context-independent handle CUkernel. The CUDA driver takes care of launching the device function in the appropriate context based on the context that is active at that point.

Libraries and frameworks can now simply load and unload modules one time during initialization and deinitialization, respectively.

// Application code

libraryInitialize();
cuDevicePrimaryCtxRetain(&ctx0, device0);
libraryFunc();
cuDevicePrimaryCtxRetain(&ctx0, device1);
libraryFunc();
libraryDeinitialize();

// Library code

libraryInitialize() {
  cuLibraryLoadFromFile(&library,“myModule.cubin”, …);
  cuLibraryGetKernel(&kernel, library, “myKernel”);
}

libraryFunc() {
  cuLaunchKernel((CUfunction)kernel, …);
}

libraryDeinitialize() {
  cuLibraryUnload(library);
}

The library does not have to maintain and track per-context states anymore. The design of context-independent loading enables the CUDA driver to track modules and contexts and carry out the work of loading and unloading modules.

Accessing __managed__ variables

Managed variables can be referenced from both device and host code. For example, the address of a managed variable can be queried or it can be read or written directly from a device or host function. Unlike __device__ variables, which have the lifetime of a CUDA context in which it is created, __managed__ variables belonging to a module point to the same memory across all CUDA contexts or even devices. 

Before CUDA 12.0, there was no way to retrieve a handle through the driver API to a managed variable that would be unique across CUDA contexts. CUDA 12.0 introduces a new driver API cuLibraryGetManaged, which makes it possible to get a unique handle across CUDA contexts.

Get started with context-independent loading

In this post, we introduced new CUDA driver APIs that provide the ability to load device code independent of a CUDA context. We also discussed context-independent handles to launch kernels. Together, they provide a simpler way to load and execute code on the GPU in comparison to the traditional loading mechanisms, reducing code complexity and avoiding the need for maintaining per-context states. 

To start using these APIs, download the CUDA Driver and Toolkit version 12 or higher. For more information about the cuLibrary* and cuKernel* APIs, see the CUDA Driver API documentation.

Discuss (2)

Tags