Advanced API Performance: Async Copy

This post covers best practices for async copy on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

Async copy runs on completely independent hardware, but you have to schedule it onto a separate queue. NVIDIA GPUs have a dedicated async copy engine. As a performance strategy, you can also consider turning a copy into an async compute instead. The possible outcomes are listed here in decreasing order of performance improvement:

  • Full parallelism: Use async copy.
  • Partial parallelism: Turn a synchronous copy into an async compute through a driver performance strategy. The compute workload overlaps with the graphics workload.
  • No parallelism: Perform serial execution of the copy and graphics work.
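  • Negative scaling: Turn a synchronous copy into an async compute, but it takes longer than serial execution because the workloads have conflicting SOLs (speed-of-light throughput limits), that is, they contend for the same hardware units.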
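
One way to tell which of these outcomes you are getting is to compare a measured combined time against the standalone times of the two workloads. The following is a minimal sketch, not an NVIDIA tool: the function name `classify_overlap` and the `tolerance_ms` noise allowance are hypothetical, and it assumes that "full parallelism" means the combined time is about as long as the longer of the two workloads.

```python
def classify_overlap(graphics_ms, copy_ms, measured_ms, tolerance_ms=0.05):
    """Classify a measured time against the four outcomes listed above.

    graphics_ms and copy_ms are the standalone durations of each workload.
    measured_ms is the duration when both are submitted together.
    tolerance_ms is a hypothetical allowance for measurement noise.
    """
    serial_ms = graphics_ms + copy_ms      # back-to-back, no overlap
    ideal_ms = max(graphics_ms, copy_ms)   # the shorter workload is fully hidden

    if measured_ms > serial_ms + tolerance_ms:
        return "negative scaling"
    if measured_ms >= serial_ms - tolerance_ms:
        return "no parallelism"
    if measured_ms <= ideal_ms + tolerance_ms:
        return "full parallelism"
    return "partial parallelism"


# Example: 10 ms of graphics work and a 4 ms copy that together take 12 ms.
print(classify_overlap(10.0, 4.0, 12.0))
```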

Async copy (full parallelism) requires developers to create and manage a separate copy queue, which means fences and possible scheduling complications. If that overhead is too high, it can be more worthwhile to turn the synchronous copy into an async compute rather than an async copy. However, any work that cannot be turned into an async compute also cannot be turned into an async copy, and the reverse.

  • Put copy work onto the async copy queue to use the NVIDIA RTX dedicated Asynchronous Copy Engine to speed up and parallelize copy work.
  • If synchronizing the copy queue would be too technically complex, turn copy work into async compute work instead of async copy.
  • Use updateTileMappings on an async copy queue, with sufficient latency to cover variable update costs. This mitigates costs associated with updating on the critical direct and async compute queues.
  • Don’t forget to use fences/semaphores to schedule your async copy work with the graphics queue. Missing synchronization can create race conditions.
    • Minimize the number of fences used to avoid unnecessary idling.
  • Don’t put work onto the copy queue if its results are needed immediately or soon after, as it not only runs serially with the graphics queue but also incurs the overhead of switching engines.
  • Don’t move local GPU copies onto the async copy queue, as the incurred overhead likely makes it not worthwhile.
  • Don’t use async copies that require saturating GPU bandwidth unless you have sufficient time to cover the cycles. The async copy engine is generally built to saturate PCIe bandwidth, but it can be used as a generic copy engine if sufficient time is given to cover the cost of the transfer at those speeds.
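
To judge whether there is "sufficient time to cover the cost of the transfer", a rough budget check can help. This is a back-of-the-envelope sketch only: the helper names `copy_time_ms` and `fits_in_window` are hypothetical, and the 25 GB/s figure in the example is an assumed sustained rate for illustration, not a measured or published number for any particular system.

```python
def copy_time_ms(size_bytes, bandwidth_gb_per_s):
    """Time in milliseconds to move size_bytes at a sustained rate in GB/s (1e9 bytes/s)."""
    return size_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


def fits_in_window(size_bytes, bandwidth_gb_per_s, window_ms):
    """True if the transfer finishes within the latency window before the data is needed."""
    return copy_time_ms(size_bytes, bandwidth_gb_per_s) <= window_ms


# Example: a 250 MB upload at an assumed 25 GB/s, with one 16.6 ms frame of slack.
upload_bytes = 250_000_000
print(copy_time_ms(upload_bytes, 25.0))
print(fits_in_window(upload_bytes, 25.0, 16.6))
```

If the check fails, the transfer either needs to be issued earlier or split across several frames.
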

Figure 1. Graphics queue with async copy queue. The diagram shows the wait and signal fences used in the graphics and async copy queues for copy work tasks.
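
The signal and wait pattern can be reasoned about with a toy timeline. The sketch below is an arithmetic model, not graphics API code: the copy queue runs one copy and then signals a fence, while the graphics queue runs work that does not need the copied data, waits on the fence, and then runs the work that consumes it. The function name `frame_time_ms` and the `engine_switch_ms` cost are hypothetical values chosen for illustration.

```python
def frame_time_ms(copy_ms, independent_graphics_ms, dependent_graphics_ms,
                  engine_switch_ms=0.0):
    """Toy timeline for one copy guarded by a signal/wait fence pair.

    Returns (total_ms, idle_ms), where idle_ms is how long the graphics
    queue stalls at the wait before the fence is signaled.
    engine_switch_ms is a hypothetical fixed cost for the engine handoff.
    """
    fence_signaled_at = copy_ms + engine_switch_ms   # copy queue finishes, then signals
    wait_reached_at = independent_graphics_ms        # graphics queue reaches its wait
    consumer_starts_at = max(fence_signaled_at, wait_reached_at)
    idle_ms = consumer_starts_at - wait_reached_at
    return consumer_starts_at + dependent_graphics_ms, idle_ms


# Copy result needed immediately: the graphics queue idles for the whole copy.
print(frame_time_ms(2.0, 0.0, 8.0, engine_switch_ms=0.1))

# Copy issued early: 5 ms of independent graphics work hides the copy entirely.
print(frame_time_ms(2.0, 5.0, 3.0, engine_switch_ms=0.1))
```

In this model, the first call behaves like serial execution plus the switching cost, which is the case the recommendations advise against. The second call shows why giving the copy enough lead time matters.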


Thank you to Patrick Neill, Alan Wolfe, and Mike Murphy for your help in advising and reviewing this post.