Simulation / Modeling / Design

CUDA Context-Independent Module Loading

Dec 12, 2022

By Shelton Dsouza, Maja Piechotka and Fred Oh

Discuss (2)

AI-Generated Summary

Dislike

CUDA 12.0 introduces context-independent loading, which allows the CUDA driver to automatically load and unload modules into each CUDA context as they are created and destroyed.
The new cuLibrary* and cuKernel* APIs enable context-independent loading and provide a simpler way to load and execute code on the GPU, reducing code complexity.
The cuLibraryGetKernel API retrieves a context-independent handle to a device function, which can be launched using cuLaunchKernel without needing to maintain a per-context CUfunction.

AI-generated content may summarize information incompletely. Verify important information. Learn more

Most CUDA developers are familiar with the cuModuleLoad API and its counterparts for loading a module containing device code into a CUDA context. In most cases, you want to load identical device code on all devices. This requires loading device code into each CUDA context explicitly. Moreover, libraries and frameworks that do not control context creation and destruction must keep track of them to explicitly load and unload modules.

This post discusses context-independent loading introduced in CUDA 12.0, which solves these problems.

Context-dependent loading

Traditionally, module loading has always been associated with a CUDA context. The following code example shows the traditional way of loading identical device code into two devices and then launching kernels on them.

// Device 0
cuDeviceGet(&device0, 0);
cuDevicePrimaryCtxRetain(&ctx0, device0);
cuModuleLoad(&module0, “myModule.cubin”);
// Device 1
cuDeviceGet(&device1, 1);
cuDevicePrimaryCtxRetain(&ctx1, device1);
cuModuleLoad(&module1, “myModule.cubin”);

Launching a kernel on each of the devices requires you to retrieve a per-module CUfunction as shown in the following code example:

// Device 0
cuModuleGetFuntion(&function0, module0, “myKernel”);
cuLaunchKernel(function0, …);
// Device 1
cuModuleGetFuntion(&function1, module1, “myKernel”);
cuLaunchKernel(function1, …);

This increases code complexity in the application as you must retrieve and track the per-context and per-module types. You also have to unload each module explicitly by using the cuModuleUnload API.

The problem is exacerbated when libraries or frameworks primarily use CUDA driver APIs for loading their own modules. They may not have complete control over the lifetime of contexts owned by the application.

// Application code

libraryInitialize();
cuDevicePrimaryCtxRetain(&ctx0, device0);
libraryFunc();
cuDevicePrimaryCtxRetain(&ctx0, device1);
libraryFunc();
libraryDeinitialize();

// Library code

libraryInitialize() {
  map<CUcontext, CUmodule> moduleContextMap;
}

libraryFunc() {
  cuCtxGetCurrent(&ctx);
  if (!moduleContextMap.contains(ctx)){
    cuModuleLoad(&module, “myModule.cubin”);
    moduleContextMap[ctx] = module;
  }
  else {
    module = moduleContextMap[ctx];
  }

  cuModuleGetFuntion(&function, module, “myKernel”);
  cuLaunchKernel(function, …);
}

libraryDeinitialize() {
  moduleContextMap.clear();
}

In the code example, the library must check for new contexts and load modules into them explicitly. It also must maintain state to check if the module is already loaded into the context.

Ideally, the state can be freed after the context is destroyed. However, this is not possible if the library has no control over the lifetime of contexts.

This means that the freeing of resources must be delayed until library deinitialization. This not only increases code complexity, but it also causes the library to hold on to resources longer than it must, potentially denying another portion of the application from using that memory.

Another alternative is for libraries and frameworks to force additional constraints on the users to ensure that they have sufficient control over resource allocation and cleanup.

Context-independent loading

CUDA 12.0 introduces context-independent loading with the addition of the cuLibrary* and cuKernel* APIs, which solve these problems. With context-independent loading, the loading and unloading of modules into each CUDA context is done automatically by the CUDA driver as contexts are created and destroyed by the application.

// Load library
cuLibraryLoadFromFile(&library,“myModule.cubin”, …);
cuLibraryGetKernel(&kernel, library, “myKernel”);

// Launch kernel on the primary context of device 0
cuDevicePrimaryCtxRetain(&ctx0, device0);
cuLaunchKernel((CUfunction)kernel, …);

// Launch kernel on the primary context of device 1
cuDevicePrimaryCtxRetain(&ctx1, device1);
cuLaunchKernel((CUfunction)kernel, …);

// Unload library
cuLibraryUnload(library);

As shown in the code example, the cuLibraryLoadFromFile API takes care of loading the module when a context is created or initialized. In the example, this is done during cuDevicePrimaryCtxRetain.

Note that the API is used to launch context-less kernel CUkernel by querying the handle using cuLibraryGetKernel() and then passing it to the API by casting to CUfunction. Here, the context to launch the kernel will either be taken from the specified stream hStream or the current context in case of NULL stream.

Moreover, you can now launch the kernels using the context-independent handle CUkernel, rather than having to maintain a per-context CUfunction. cuLibraryGetKernel retrieves a context-independent handle to the device function myKernel. The device function can then be launched with cuLaunchKernel by specifying the context-independent handle CUkernel. The CUDA driver takes care of launching the device function in the appropriate context based on the context that is active at that point.

Libraries and frameworks can now simply load and unload modules one time during initialization and deinitialization, respectively.

// Application code

libraryInitialize();
cuDevicePrimaryCtxRetain(&ctx0, device0);
libraryFunc();
cuDevicePrimaryCtxRetain(&ctx0, device1);
libraryFunc();
libraryDeinitialize();

// Library code

libraryInitialize() {
  cuLibraryLoadFromFile(&library,“myModule.cubin”, …);
  cuLibraryGetKernel(&kernel, library, “myKernel”);
}

libraryFunc() {
  cuLaunchKernel((CUfunction)kernel, …);
}

libraryDeinitialize() {
  cuLibraryUnload(library);
}

The library does not have to maintain and track per-context states anymore. The design of context-independent loading enables the CUDA driver to track modules and contexts and carry out the work of loading and unloading modules.

Accessing managed variables

Managed variables can be referenced from both device and host code. For example, the address of a managed variable can be queried or it can be read or written directly from a device or host function. Unlike __device__ variables, which have the lifetime of a CUDA context in which it is created, __managed__ variables belonging to a module point to the same memory across all CUDA contexts or even devices.

Before CUDA 12.0, there was no way to retrieve a handle through the driver API to a managed variable that would be unique across CUDA contexts. CUDA 12.0 introduces a new driver API cuLibraryGetManaged, which makes it possible to get a unique handle across CUDA contexts.

Get started with context-independent loading

In this post, we introduced new CUDA driver APIs that provide the ability to load device code independent of a CUDA context. We also discussed context-independent handles to launch kernels. Together, they provide a simpler way to load and execute code on the GPU in comparison to the traditional loading mechanisms, reducing code complexity and avoiding the need for maintaining per-context states.

To start using these APIs, download the CUDA Driver and Toolkit version 12 or higher. For more information about the cuLibrary* and cuKernel* APIs, see the CUDA Driver API documentation.

Discuss (2)

About the Authors

About Shelton Dsouza
Shelton is a senior system software engineer at NVIDIA working on CUDA driver and runtime. His primary focus is on designing forward-looking improvements to the CUDA APIs and programming model and improving ease of use. He graduated with a master's degree from Carnegie Mellon University in 2016

View all posts by Shelton Dsouza

About Maja Piechotka
Maja is a senior software engineer working on CUDA Driver and Runtime. She is interested in the intersection of API and language design. Maja graduated with a master’s degree from Imperial College London in 2013.

View all posts by Maja Piechotka

About Fred Oh
Fred is a senior product marketing manager for CUDA, CUDA on WSL, and CUDA Python. Fred has a B.S. in Computer Science and Math from UC Davis. He began his career as a UNIX software engineer porting kernel services and device drivers to x86 architectures. He loves Star Wars, Star Trek and the NBA Warriors.

View all posts by Fred Oh