Constructing CUDA Graphs with Dynamic Parameters

Ever since its introduction in CUDA 10, CUDA Graphs has been used in a variety of applications. A graph groups a set of CUDA kernels and other CUDA operations together and executes them with a specified dependency tree. It speeds up the workflow by combining the driver activities associated with CUDA kernel launches and CUDA API calls. It also enforces the dependencies with hardware accelerations, instead of relying solely on CUDA streams and events, when possible.

There are two main ways to construct a CUDA graph: explicit API calls and stream capture.

Construct a CUDA graph with explicit API calls

With this way of constructing a CUDA graph, nodes of the graph, formed by the CUDA kernel and CUDA memory operations, are added to the graph by calling the cudaGraphAdd*Node APIs, where * is replaced with the node type. Dependencies between the nodes are set explicitly with APIs.

The upside of constructing CUDA graphs with explicit APIs is that the cudaGraphAdd*Node APIs return node handles (cudaGraphNode_t) that can be used as references for future node updates. Kernel launch configurations and kernel function parameters of a kernel node in an instantiated graph, for example, can be updated with minimal cost with cudaGraphExecKernelNodeSetParams.

The downside is that in scenarios where CUDA graph is used to speed up existing code, constructing CUDA graphs with explicit API calls typically requires a significant number of code changes, especially changes regarding the control flow and function calling structure of the code.

Construct a CUDA graph with stream capture

With this way of constructing a CUDA graph, cudaStreamBeginCapture and cudaStreamEndCapture are placed before and after a code block. All device activities launched by the code block are recorded, captured, and grouped into a CUDA graph. The dependencies among the nodes are inferred from the CUDA stream or event API calls within the stream capture region.

The upside of constructing CUDA graphs with stream capture is that for existing code, fewer code changes are needed. The original code structure can be mostly untouched and graph construction is performed in an automatic way.

There are also downsides to this way of constructing CUDA graphs. Within the stream capture region, all kernel launch configurations and kernel function parameters, as well as the CUDA API call parameters are recorded by value. Whenever any of the configurations and parameters change, the captured and then instantiated graph becomes out-of-date.

Two solutions are provided in the Employing CUDA Graphs in a Dynamic Environment post:

The workflow is recaptured. A reinstantiation isn’t needed when the recaptured graph has the same node topology as the instantiated graph, and a whole-graph update can be performed with cudaGraphExecUpdate.
Cache CUDA graphs with the set of configurations and parameters as the key. Each set of configurations and parameters is associated with a distinct CUDA graph within the cache. When running the workflow, the set of configurations and parameters are first abstracted into a key. Then the corresponding graph, if it already exists, is found in the cache and launched.

There are, however, workflows where neither solution works well. The recapture-then-update approach works well on paper, but in some cases the recapture and update themselves are expensive. There are also cases where it is simply not possible to associate each set of parameters with a CUDA graph. For example, cases with floating-point number parameters are difficult to cache as there are huge numbers of possible floating-point numbers.

CUDA Graphs constructed with explicit APIs are easy to update but the approach can be too cumbersome and is less flexible. CUDA Graphs can be constructed flexibly with stream capture but the resulting graphs are difficult and expensive to update.

Combined approach

In this post, I provide an approach of constructing CUDA graphs with both the explicit API and stream capture methods, thus achieving the upsides of both and avoiding the downsides of either.

As an example, in a workflow where three kernels are launched sequentially, the first two kernels have static launch configurations and parameters, but the last kernel has a dynamic launch configuration and parameters.

Use stream capture to record the launches of the first two kernels and call explicit APIs to add the last kernel node to the capturing graph. The node handle returned by the explicit APIs is then used to update the instantiated graph with the dynamic configurations and parameters every time before the graph is launched.

The following code example shows the idea:

cudaStream_t stream;
std::vector<cudaGraphNode_t> _node_list;
cudaGraphExec_t _graph_exec;

if (not using_graph) {
  
  first_static_kernel<<<1, 1, 0, stream>>>(static_parameters);
  second_static_kernel<<<1, 1, 0, stream>>>(static_parameters);
  dynamic_kernel<<<1, 1, 0, stream>>>(dynamic_parameters);

} else {

  if (capturing_graph) {

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

    first_static_kernel<<<1, 1, 0, stream>>>(static_parameters);
    second_static_kernel<<<1, 1, 0, stream>>>(static_parameters);

    // Get the current stream capturing graph
    cudaGraph_t _capturing_graph;
    cudaStreamCaptureStatus _capture_status;
    const cudaGraphNode_t *_deps;
    size_t _dep_count;
    cudaStreamGetCaptureInfo_v2(stream, &_capture_status, nullptr &_capturing_graph, &_deps, &_dep_count);

    // Manually add a new kernel node
    cudaGraphNode_t new_node;
    cudakernelNodeParams _dynamic_params_cuda;
    cudaGraphAddKernelNode(&new_node, _capturing_graph, _deps, _dep_count, &_dynamic_params_cuda);
    // ... and store the new node for future references
    _node_list.push_back(new_node);

    // Update the stream dependencies
    cudaStreamUpdateCaptureDependencies(stream, &new_node, 1, 1); 

    // End the capture and instantiate the graph
    cudaGraph_t _captured_graph;
    cudaStreamEndCapture(stream, &_captured_graph);

    cudaGraphInstantiate(&_graph_exec, _captured_graph, nullptr, nullptr, 0);

  } else if (updating_graph) {
    cudakernelNodeParams _dynamic_params_updated_cuda;
  
    cudaGraphExecKernelNodeSetParams(_graph_exec, _node_list[0], &_dynamic_params_updated_cuda);

  }
}

In this example, cudaStreamGetCaptureInfo_v2 extracts the CUDA graph that is currently being recorded and captured into. A kernel node is added to this graph with the node handle (new_node) returned and stored, before cudaStreamUpdateCaptureDependencies is called to update the dependency tree of the current capturing stream. The last step is necessary to ensure that any other activities captured afterward have their dependencies set on these manually added nodes correctly.

With this approach, the same instantiated graph (cudaGraphExec_t object) can be reused directly with a lightweight cudaGraphExecKernelNodeSetParams call, even though the parameters are dynamic. The first image in this post shows this usage.

Furthermore, the capture and update code paths can be combined into one piece of code that lives next to the original code that launches the last two kernels. This inflicts a minimal number of code changes and does not break the original control flow and function call structure.

The new approach is shown in detail in the hummingtree/cuda-graph-with-dynamic-parameters standalone code example. cudaStreamGetCaptureInfo_v2 and cudaStreamUpdateCaptureDependencies are new CUDA runtime APIs introduced in CUDA 11.3.

Performance results

Using the hummingtree/cuda-graph-with-dynamic-parameters standalone code example, I measured the performance of running the same dynamic workflow that is bound by kernel launch overhead with three different approaches:

Running without CUDA graph acceleration
Running CUDA graph with the recapture-then-update approach
Running CUDA graph with the combined approach introduced in this post

Table 1 shows the results. The speedup from the approaches mentioned in this post strongly depends on the underlying workflow.

Approach	Time	Speedup over no graph
Combined	433 ms	1.63
Recapture-then-update	580 ms	1.22
No CUDA Graph	706 ms	1.00

Table 1. Performance results of running on an A100-40GB GPU and Intel Xeon Silver 4110 CPU at 2.10GHz

Conclusion

In this post, I introduced an approach to constructing CUDA graphs that combines both the explicit API and stream capture methods. It provides a way to reuse instantiated graphs for workflows with dynamic parameters at minimal cost.

In addition to the CUDA technical posts mentioned earlier, the CUDA Graph section of the CUDA Programming Guide provides a comprehensive introduction to CUDA Graphs and its usages. For useful tips on employing CUDA Graphs in various applications, see the Nearly Effortless CUDA Graphs GTC session.