
Constant Time Launch for Straight-Line CUDA Graphs and Other Performance Enhancements


CUDA Graphs are a way to define and batch GPU operations as a graph rather than as a sequence of individual stream launches. A CUDA Graph groups a set of CUDA kernels and other CUDA operations together and executes them with the specified dependencies. It speeds up the workflow by combining the driver activities associated with CUDA kernel launches and CUDA API calls. Where possible, it also enforces the dependencies with hardware acceleration instead of relying solely on CUDA streams and events.

CUDA Graphs are particularly important for AI frameworks, as they enable you to capture and replay a sequence of CUDA operations, reducing CPU overhead and improving performance. With the latest improvements, you can now take even greater advantage of CUDA Graphs to accelerate AI workloads.
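To ground the discussion, here is a minimal sketch of the capture-instantiate-relaunch pattern these improvements target. The kernel name and iteration counts are placeholders, and the three-argument cudaGraphInstantiate call shown is the CUDA 12.x signature.

#include <cuda_runtime.h>

__global__ void myKernel(float *data) {
    // Placeholder work; a real workload would do something useful here.
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f;
}

int main() {
    float *d_data;
    cudaMalloc(&d_data, 256 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture a short sequence of kernel launches into a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 10; ++i) {
        myKernel<<<1, 256, 0, stream>>>(d_data);
    }
    cudaGraph_t graph;
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then relaunch many times.
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12.x three-argument signature
    for (int iter = 0; iter < 1000; ++iter) {
        cudaGraphLaunch(graphExec, stream);  // repeat launches are where the new constant-time path applies
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}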

NVIDIA has improved the performance of CUDA Graphs in several respects between CUDA Toolkit 11.8 and CUDA Toolkit 12.6 and the accompanying driver builds:

  • Graph construction and instantiation time
  • CPU launch overhead 
  • General performance in several edge cases

In this post, we present some microbenchmark numbers for various graphs and discuss the improvements since the initial release. The benchmark code is included. As always, performance characteristics depend on the CPU, GPU, and clock settings. For our testing, we used an Intel Xeon Silver 4208 Processor @ 2.10GHz and an NVIDIA GeForce RTX 3060 GPU. The OS was Ubuntu 22.04. The results in this post were collected using default clock settings.

This is a snapshot of performance improvements as of CUDA 12.6; future CUDA releases may not demonstrate the same performance as documented in this post. The snapshot assumes familiarity with CUDA Graphs and with how their performance characteristics impact applications. For more information, see Getting Started with CUDA Graphs.

Definitions

Here are some graph topology terms and performance characteristics discussed later in this post.

  • Straight-line kernel graph: A CUDA Graph made entirely of kernel nodes, where every node has a single dependent node except for the last node (see the sketch after this list).
  • Parallel straight line: A CUDA Graph whose width is its number of entry points, where each entry point is followed by its own straight-line chain whose length is its number of nodes. This is also referred to as a parallel chain in the charts and source code.
  • First launch: The first launch of a graph, which also includes uploading the instantiated graph to the device.
  • Repeat launch: Launching a graph that has already been uploaded to the device.
  • CPU overhead for a launch: CPU time from invoking cudaGraphLaunch until the call returns.
  • Device side runtime: The amount of time for the operation to run on the device.
  • Repeat launch device side runtime: The amount of time it takes for a graph that was previously uploaded to run on the device.
  • Time to solution: Time from the launch call to the completion of the graph running on the device.
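
As a concrete illustration of the straight-line topology, the following sketch builds such a graph with the explicit graph API, where each kernel node depends only on its predecessor. The empty kernel and helper name are hypothetical, not the benchmark's code.

#include <cuda_runtime.h>

__global__ void emptyKernel() {}

// Build a straight-line graph of `length` kernel nodes, each depending only on its predecessor.
cudaGraphExec_t buildStraightLineGraph(int length) {
    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    cudaKernelNodeParams params = {};
    params.func = (void *)emptyKernel;
    params.gridDim = dim3(1);
    params.blockDim = dim3(1);
    params.sharedMemBytes = 0;
    params.kernelParams = nullptr;  // emptyKernel takes no arguments
    params.extra = nullptr;

    cudaGraphNode_t prev = nullptr;
    for (int i = 0; i < length; ++i) {
        cudaGraphNode_t node;
        // The first node has no dependencies; every later node depends on the previous one.
        cudaGraphAddKernelNode(&node, graph, prev ? &prev : nullptr, prev ? 1 : 0, &params);
        prev = node;
    }

    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, 0);  // CUDA 12.x signature
    cudaGraphDestroy(graph);  // the executable graph is independent of the template graph
    return graphExec;
}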

Constant time CPU overhead for repeat launch

One notable improvement is the CPU overhead for repeat launch on the NVIDIA Ampere architecture. For straight-line kernel graphs, the time taken during repeat launches decreased significantly: from 2 μs plus an extra 200 ns per node to a nearly constant 2.5 μs plus roughly 1 ns per node, a consistent speedup for graphs with 10 nodes or more.

Launch still takes O(n) time on pre-NVIDIA Ampere architecture hardware.
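
The CPU overhead plotted in the figures below is the host-side time spent inside cudaGraphLaunch for a graph that has already been uploaded. A minimal sketch of that style of measurement follows; it measures an already-instantiated executable graph and is not the benchmark's exact code.

#include <chrono>
#include <cuda_runtime.h>

// Average host-side time spent inside cudaGraphLaunch for repeat launches of an
// already-instantiated graph, in microseconds.
double measureRepeatLaunchOverheadUs(cudaGraphExec_t graphExec, cudaStream_t stream, int iters) {
    // The first launch also uploads the graph to the device; exclude it from the measurement.
    cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        cudaGraphLaunch(graphExec, stream);  // only the host time of the call is measured
    }
    auto stop = std::chrono::steady_clock::now();
    cudaStreamSynchronize(stream);

    // Note: a careful benchmark also has to avoid back-pressure from a full launch
    // queue, which would inflate the measured CPU time.
    return std::chrono::duration<double, std::micro>(stop - start).count() / iters;
}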

Figure 1. CPU overhead of repeat launch of straight-line graphs, small topologies (up to 250 nodes). In CUDA Toolkit 11.8 the overhead grows linearly with graph length, reaching about 25 μs at 250 nodes; in CUDA Toolkit 12.6 it is nearly constant at about 2 μs.
Figure 2. CPU overhead of repeat launch of straight-line graphs, large topologies. The relationship from Figure 1 extends to graphs of 2,000 nodes, where the overhead was about 300 μs in CUDA Toolkit 11.8.

Because the parallel straight-line graphs measured here have four root nodes, a repeat launch with a chain length of one takes 6 μs instead of 2 μs. The impact of chain length on CPU launch overhead is still negligible for this topology. The exact scaling with the number of root nodes is not analyzed here, although the impact is expected to be linear in the number of root nodes. Similarly, graphs with non-kernel nodes would have different performance characteristics.

Figure 3. CPU overhead of repeat launch of a graph with four parallel task chains. A parallel chain topology with four chains of 250 nodes has a launch overhead of more than 60 μs in CUDA Toolkit 11.8 and close to 6 μs in CUDA Toolkit 12.6.

Other measurable performance gains

Several other aspects of performance can matter for applications benefitting from graphs:

  • Instantiation time
  • First launch CPU overhead
  • Repeat launch device runtime
  • End-to-end time for repeat graph launch into an empty stream
| Metric | Topology | Topology length | 11.8 (r520.61.05) | 12.6 (r560.28.03) | Speedup |
| --- | --- | --- | --- | --- | --- |
| Instantiation | Straight line | 10 | 20 μs | 16 μs | 25% |
| Instantiation | Straight line | 100 | 168 μs | 127 μs | 32% |
| Instantiation | Straight line | 1025 | 2143 μs | 1526 μs | 40% |
| Instantiation | 4 parallel chains | 10 | 71 μs | 58 μs | 22% |
| Instantiation | 4 parallel chains | 100 | 695 μs | 552 μs | 26% |
| First launch CPU overhead | Straight line | 10 | 4 μs | 4 μs | 0% |
| First launch CPU overhead | Straight line | 100 | 25 μs | 15 μs | 66% |
| First launch CPU overhead | Straight line | 1025 | 278 μs | 175 μs | 59% |
| First launch CPU overhead | 4 parallel chains | 100 | 73 μs | 67 μs | 9% |
| Repeat launch device runtime | Straight line | 10 | 7 μs | 7 μs | 0% |
| Repeat launch device runtime | Straight line | 100 | 61 μs | 53 μs | 15% |
| Repeat launch device runtime | Straight line | 1025 | 629 μs | 567 μs | 11% |
| End-to-end time for repeat graph launch into an empty stream | Straight line | 10 | 12 μs | 9 μs | 30% |
| End-to-end time for repeat graph launch into an empty stream | Straight line | 100 | 69 μs | 55 μs | 25% |
| End-to-end time for repeat graph launch into an empty stream | Straight line | 1025 | 628 μs | 567 μs | 11% |

Table 1. Metric performance on CUDA Toolkit 11.8 and 12.6

Instantiation time

The amount of time it takes to instantiate a graph dominates the runtime cost of adopting a graph. This is a one-time cost per graph instantiation. Graph creation time often appears on the critical path of application startup, or in workflows that must periodically create new graphs at times when they have no other work running on the GPU.

Improvements here can reduce startup time for applications that prepare many graphs at program start, and minimize the end-to-end latency penalty for applications that fail to hide graph construction latency behind other work execution.

First launch CPU overhead

Because the first launch is responsible for uploading the work descriptions of a CUDA Graph to the GPU, it still has an O(N) CPU cost. This cost is paid only the first time and doesn't need to be repaid, except partially when the graph is updated. Most of this cost can be paid ahead of time by performing a separate upload operation, which is not discussed in depth in this post because it was already available to CUDA Graphs users.
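
The separate upload operation referred to here is cudaGraphUpload; the following minimal sketch shows the pattern of paying the upload cost off the critical path so that the first latency-sensitive launch behaves like a repeat launch.

#include <cuda_runtime.h>

// Pay the O(N) upload cost ahead of time, off the critical path.
void preUpload(cudaGraphExec_t graphExec, cudaStream_t stream) {
    cudaGraphUpload(graphExec, stream);  // pushes the graph's work descriptions to the GPU
    cudaStreamSynchronize(stream);       // ensure the upload has completed
}

// Later, on the latency-sensitive path, the first cudaGraphLaunch behaves like a repeat launch.
void launchLatencySensitive(cudaGraphExec_t graphExec, cudaStream_t stream) {
    cudaGraphLaunch(graphExec, stream);
}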

Repeat launch device runtime

Running even empty kernels takes time. Optimizations to inter-kernel latency for straight-line graphs reduce device-side runtime by up to a noticeable 60 ns per node.

End-to-end time for repeat graph launch into an empty stream

CUDA Graphs contain logic that may allow execution to begin before the launch operation has processed all the nodes in a graph, hiding part of the launch cost. Because of this, the launch overhead improvements aren't fully reflected in the time to work completion. Instead, you see a more modest improvement of about 60 ns per node in time to solution for straight-line graphs.

Scheduling based on node creation order

In CUDA Toolkit 12.0, CUDA started paying attention to node creation order as a heuristic for scheduling decisions, loosely preferring to schedule operations for nodes created earlier over those created later.

Specifically, the order of the root nodes was fixed, and a pass that reordered the nodes by a breadth-first traversal was removed. That pass was originally designed to keep root nodes at the start of the graph's node list and to prioritize making parallel work available for the GPU to run.

In cases where memcpy operations were assigned to the same copy engine and falsely serialized, the old heuristic would often unintentionally order the memcpy operations needed by the first compute tasks after memcpy operations that were only needed by later compute tasks. This created bubbles in the work available to the compute engines. As a result, developers often created graph nodes in a deliberate order: nodes needed early in the graph were created first, and nodes needed later were created later.

By paying attention to node creation order when making some scheduling decisions, CUDA can give you scheduling that better matches your intuitive expectations. 
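
As a hypothetical illustration of that deliberate ordering, the following sketch creates the copy feeding the first compute node before the copy that is only needed later. The buffer and kernel names (d_a, d_b, consumeEarly, consumeLate) are made up for the example; this is not code from the post.

#include <cuda_runtime.h>

__global__ void consumeEarly(const float *a) { /* uses a right after its copy */ }
__global__ void consumeLate(const float *b)  { /* uses b much later in the graph */ }

// Create nodes roughly in the order they should execute: the copy feeding the first
// compute node is created before the copy that is only needed later.
void buildOrderedGraph(cudaGraph_t graph,
                       float *d_a, const float *h_a,
                       float *d_b, const float *h_b, size_t bytes) {
    cudaGraphNode_t copyA, copyB, kEarly, kLate;

    // Copy needed by the first compute node: created first.
    cudaGraphAddMemcpyNode1D(&copyA, graph, nullptr, 0, d_a, h_a, bytes, cudaMemcpyHostToDevice);
    // Copy only needed much later: created afterwards.
    cudaGraphAddMemcpyNode1D(&copyB, graph, nullptr, 0, d_b, h_b, bytes, cudaMemcpyHostToDevice);

    cudaKernelNodeParams p = {};
    p.gridDim = dim3(1);
    p.blockDim = dim3(128);

    void *argsA[] = { &d_a };
    p.func = (void *)consumeEarly;
    p.kernelParams = argsA;
    cudaGraphAddKernelNode(&kEarly, graph, &copyA, 1, &p);

    void *argsB[] = { &d_b };
    p.func = (void *)consumeLate;
    p.kernelParams = argsB;
    cudaGraphNode_t lateDeps[] = { kEarly, copyB };  // runs after the early kernel and its own copy
    cudaGraphAddKernelNode(&kLate, graph, lateDeps, 2, &p);
}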

Microbenchmark source code

The microbenchmark code has been added to the NVIDIA/cuda-samples repository as the cudaGraphsPerfScaling application. Compile the app and run the dataCollection.bash script with your GPU name and CUDA driver version to generate the data.

Summary

The new optimizations to CUDA Graphs make a compelling case for checking whether they translate into real performance improvements in applications that use graphs.

In this post, we discussed CUDA Graphs performance improvements after CUDA 11.8. These improvements include enhancements in graph construction and instantiation time, CPU launch overhead, and overall performance in various scenarios. 
