Dynamic Control Flow in CUDA Graphs with Conditional Nodes

CUDA Graphs can provide a significant performance increase, as the driver is able to optimize execution using the complete description of tasks and dependencies. Graphs provide incredible benefits for static workflows where the overhead of graph creation can be amortized over many successive launches.

However, nearly all problems involve some form of decision-making, which can require splitting up graphs and returning control to the CPU to decide which work to launch next. Breaking up the work like this compromises CUDA’s ability to make optimizations, ties up CPU resources, and adds overhead with each graph launch.

Beginning in CUDA 12.4, CUDA Graphs supports conditional nodes, which enable the conditional or repeated execution of portions of a graph without returning control to the CPU. This frees up CPU resources, enabling many more workflows to be represented in a single graph.

Conditional nodes

Conditional nodes come in two flavors:

IF nodes: the body is executed once each time the node is evaluated if the condition value is true.
WHILE nodes: the body is executed repeatedly when the node is evaluated as long as the condition value is true.

Conditional nodes are container nodes, similar to child graph nodes, but the execution of the graph contained within the node is dependent on the value of a condition variable. The condition value associated with a node is accessed by a handle that must be created prior to the node. The condition value can be set in a CUDA kernel by calling cudaGraphSetConditional. An initialization, applied at each start of the graph, can also be specified when the handle is created.

When the conditional node is created, an empty graph is also created, and the handle is returned to the user. This graph is tied to the node and will be executed based on the condition value. This conditional body graph can be populated using either the graph API or by capturing asynchronous CUDA calls using cudaStreamBeginCaptureToGraph.

Conditional nodes can also be nested. For example, you can create a conditional WHILE node with a body graph that contains a conditional IF node.

Conditional node body graphs can contain any of the following:

Kernel nodes (CNP, cooperative not currently supported)
Empty nodes
Child graph nodes
Memset nodes
Memcopy nodes
Conditional nodes

This applies recursively to child graphs and conditional bodies. All kernels, including kernels in nested conditionals or child graphs at any level, must belong to the same CUDA context. Memcopies and memsets must act on memory reachable from the conditional node’s context.

Complete samples are available in the CUDA samples repository. The next section runs through some examples to show what you can do with conditional nodes.

Conditional IF nodes

The body graph of an IF node will be executed once if the condition is non-zero whenever the IF node is evaluated. Figure 1 depicts a graph where the middle node, B, is an IF conditional node containing a four-node graph:

Diagram depicting a decision workflow using three circular nodes with a conditional node (b) set to conditionally run its body graph. — *Figure 1. Conditional IF node*

To show how this graph could be created, the following example uses node A, a kernel upstream of the conditional node, B, to set the value of the conditional based on the results of work done by that kernel. The body of the conditional is populated using the graph API.

First, define the node A kernel. This kernel sets the conditional handle depending on the result of some arbitrary calculations performed by the user.

__global__ void setHandle(cudaGraphConditionalHandle handle)
{
    unsigned int value = 0;
    // We could perform some work here and set value based on the result of that work.
    if (someCondition) { 
        // Set ‘value’ to non-zero if we want the conditional body to execute
        value = 1;
    }
    cudaGraphSetConditional(handle, value);
}

Next, define a function to construct the graph. This function allocates the conditional handle, creates the nodes, and populates the body of the conditional graph. For clarity, the code to launch and execute the graph is omitted.

cudaGraph_t createGraph() {
    cudaGraph_t graph;
    cudaGraphNode_t node;
    void *kernelArgs[1];

    cudaGraphCreate(&graph, 0);

    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph);

    // Use a kernel upstream of the conditional to set the handle value
    cudaGraphNodeParams kParams = { cudaGraphNodeTypeKernel };
    kParams.kernel.func         = (void *)setHandle;
    kParams.kernel.gridDim.x    = kParams.kernel.gridDim.y = kParams.kernel.gridDim.z = 1;
    kParams.kernel.blockDim.x   = kParams.kernel.blockDim.y = kParams.kernel.blockDim.z = 1;
    kParams.kernel.kernelParams = kernelArgs;
    kernelArgs[0] = &handle;
    cudaGraphAddNode(&node, graph, NULL, 0, &kParams);

    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeIf;
    cParams.conditional.size   = 1;
    cudaGraphAddNode(&node, graph, &node, 1, &cParams);

    cudaGraph_t bodyGraph = cParams.conditional.phGraph_out[0];

    // Populate the body of the conditional node
    cudaGraphNode_t bodyNodes[4];
    cudaGraphNodeParams params[4] = { ... }; // Setup kernel parameters as needed.
    cudaGraphAddNode(&bodyNodes[0], bodyGraph, NULL, 0, &params[0]);
    cudaGraphAddNode(&bodyNodes[1], bodyGraph, &bodyNodes[0], 1, &params[1]);
    cudaGraphAddNode(&bodyNodes[2], bodyGraph, &bodyNodes[0], 1, &params[2]);
    cudaGraphAddNode(&bodyNodes[3], bodyGraph, &bodyNodes[1], 2, &params[3]);

    return graph;
}

Conditional WHILE nodes

The body graph of a WHILE node will be executed repeatedly as long as the condition is non-zero. The condition will be evaluated when the node is executed and after each completion of the body graph. The following diagram depicts a three-node graph where the middle node, B, is a WHILE conditional node containing a three-node graph.

Diagram depicting a decision workflow using three circular nodes with a conditional node (b) set to loop over its body graph. — *Figure 2. Conditional WHILE node*

To see how this graph could be created, the following example sets the handle’s default value to a non-zero value so the WHILE loop executes by default. Setting the default value to non-zero and leaving the conditional value in a kernel upstream of the conditional unmodified effectively makes a do-while loop, where the conditional body always executes at least once. Creating a WHILE loop, where the loop body only executes when the condition is true, requires performing some calculations and setting the conditional handle appropriately in node A.

In the previous example, the conditional body is populated with the graph API. In this example, the body of the conditional is populated using stream capture.

The first step is to define a kernel that sets the conditional value during each execution of the conditional body. In this example, the handle is set based on the value of a down counter.

__global__ void loopKernel(cudaGraphConditionalHandle handle)
{
    static int count = 10;
    cudaGraphSetConditional(handle, --count ? 1 : 0);
}

cudaGraph_t createGraph() {
    cudaGraph_t graph;
    cudaGraphNode_t nodes[3];

    cudaGraphCreate(&graph, 0);

    // Insert kernel node A
    cudaGraphNodeParams params = ...;

    cudaGraphAddNode(&nodes[0], graph, NULL, 0, &params);

    cudaGraphConditionalHandle handle;
    cudaGraphConditionalHandleCreate(&handle, graph, 1, cudaGraphCondAssignDefault);

    // Insert conditional node B
    cudaGraphNodeParams cParams = { cudaGraphNodeTypeConditional };
    cParams.conditional.handle = handle;
    cParams.conditional.type   = cudaGraphCondTypeWhile;
    cParams.conditional.size   = 1;
    cudaGraphAddNode(&nodes[1], graph, &nodes[0], 1, &cParams);

    cudaGraph_t bodyGraph = cParams.conditional.phGraph_out[0];
    cudaStream_t captureStream;
    cudaStreamCreate(&captureStream);
    
    // Fill out body graph with stream capture.
    cudaStreamBeginCaptureToGraph(captureStream,
                                  bodyGraph,
                                  nullptr,
                                  nullptr,
                                  0,
                                  cudaStreamCaptureModeRelaxed);
    myKernel1<<<..., captureStream>>>(...);
    myKernel2<<<..., captureStream>>>(...);
    loopKernel<<<1, 1, 0, captureStream>>>(handle);
    cudaStreamEndCapture(captureStream, nullptr);
    cudaStreamDestroy(captureStream);

    // Insert kernel node C.
    params = ...;
    cudaGraphAddNode(&nodes[2], graph, &nodes[1], 1, &params);

    return graph;
}

This example uses cudaStreamBeginCaptureToGraph, a new API added in CUDA 12.3 that enables stream capture to insert nodes into an existing graph. Using this API, multiple separate captures can be composed into a single graph object. This API also enables populating the conditional body graph object, which is created along with the conditional node.

Conclusion

CUDA Graphs provide incredible benefits for static workflows where the overhead of graph creation can be amortized over many successive launches. Eliminating splitting up graphs and returning control to the CPU to decide which launch to prioritize helps reduce CPU overhead and latency. Using CUDA Graphs with conditional nodes enables the conditional or repeated execution of portions of a graph without returning control to the CPU. This frees up CPU resources and enables a single graph to represent substantially more complex workflows.

For more information on conditional nodes, see the CUDA Programming Guide. To explore simple, complete examples, visit NVIDIA/cuda-samples on GitHub. And join the conversation on the NVIDIA Developer CUDA forums.