
Advanced API Performance: Synchronization


Synchronization in graphics programming refers to the coordination and control of concurrent operations to ensure the correct and predictable execution of rendering tasks. Improper synchronization between the CPU and GPU can lead to slow performance, race conditions, and visual artifacts.

Recommended

  • If running workloads asynchronously, make sure that they stress different GPU units. For example, pair bandwidth-heavy tasks with math-heavy tasks. A typical pairing is to overlap a z-prepass with a BVH build or with post-processing.
  • Always verify whether the asynchronous implementation is faster across the different architectures.
  • Asynchronous work can belong to different frames. Using this technique can help find better-paired workloads.
  • Wait on and signal the absolute minimum number of semaphores/fences. Every unnecessary semaphore/fence can introduce a bubble in the pipeline.
  • Use GPU profiling tools (NVIDIA Nsight Graphics in GPU trace mode, PIX, or GPUView) to see how well work overlaps and whether fences stall one queue or another.
  • To avoid extra synchronization and resource barriers, asynchronous copy/transfer work can be done on the compute queue.
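
One way to keep the number of waits minimal can be sketched on the CPU side. The sketch below is only a model, not a graphics API call sequence: it assumes a single producer queue that signals a monotonically increasing fence value after each submission, a single consumer queue that executes its submissions in order, and fence values that start at 1. The `CrossQueueDependency` type and the `prune_redundant_waits` function are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical record: submission `consumer` on the waiting queue needs the
// results of the producer submission that signals fence value `producer`.
struct CrossQueueDependency {
    std::uint64_t producer;  // fence value signaled after the producer submission
    std::uint64_t consumer;  // index of the submission that must wait
};

// Returns the subset of waits that is still required. A wait is redundant when
// the same or an earlier consumer submission already waited on an equal or
// higher fence value, because the consumer queue runs in submission order.
std::vector<CrossQueueDependency> prune_redundant_waits(
    std::vector<CrossQueueDependency> deps) {
    // Visit consumers in submission order; for the same consumer, visit the
    // largest producer value first so the smaller ones are dropped.
    std::sort(deps.begin(), deps.end(),
              [](const CrossQueueDependency& a, const CrossQueueDependency& b) {
                  if (a.consumer != b.consumer) return a.consumer < b.consumer;
                  return a.producer > b.producer;
              });

    std::vector<CrossQueueDependency> required;
    std::uint64_t highest_waited = 0;  // assumption: fence values start at 1
    for (const CrossQueueDependency& dep : deps) {
        assert(dep.producer > 0);
        if (dep.producer > highest_waited) {
            required.push_back(dep);
            highest_waited = dep.producer;
        }
    }
    return required;
}
```

Under these assumptions, only the fence values that survive pruning need to be signaled and waited on. Whether the remaining waits actually avoid bubbles still has to be confirmed in a GPU profiler.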

Not recommended

  • Do not create queues that you don’t use.
    • Each additional queue adds processing overhead.
    • Multiple asynchronous compute queues will not overlap, due to the OS scheduler, unless hardware scheduling is enabled. For more information, see Hardware Accelerated GPU Scheduling.
  • Avoid tiny asynchronous tasks and group them if possible. Asynchronous workloads that take less than 0.2 ms are unlikely to show any benefit, because that is roughly how long it takes to resolve fences when hardware scheduling is not enabled.
  • Avoid using fences to synchronize work within a single queue. Command lists/buffers are guaranteed by specification to be executed in order of submission within a command queue.
  • Semaphores/fences should not be used instead of resource barriers. They are far more expensive and serve a different purpose.
  • Do not implement low-occupancy workloads with the intention to align them with more work on the graphics queue. GPU capabilities may change and low-occupancy work might become a long, trailing tail that stalls another queue.
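
The advice to group tiny asynchronous tasks can be illustrated with a small CPU-side sketch. It assumes that each task comes with a measured or estimated GPU time and that consecutive tasks may be merged into one submission. The `AsyncTask` type and the `group_async_tasks` function are hypothetical, and the 0.2 ms default simply mirrors the approximate figure quoted above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one asynchronous task; not part of any graphics API.
struct AsyncTask {
    std::string name;
    double estimated_gpu_ms;  // measured or estimated GPU time in milliseconds
};

// Greedily merges consecutive tasks until each batch reaches min_batch_ms.
// Leftover tasks that stay below the threshold are folded into the previous
// batch, or returned as a single small batch if no previous batch exists.
std::vector<std::vector<AsyncTask>> group_async_tasks(
    const std::vector<AsyncTask>& tasks, double min_batch_ms = 0.2) {
    assert(min_batch_ms > 0.0);

    std::vector<std::vector<AsyncTask>> batches;
    std::vector<AsyncTask> current;
    double current_ms = 0.0;

    for (const AsyncTask& task : tasks) {
        current.push_back(task);
        current_ms += task.estimated_gpu_ms;
        if (current_ms >= min_batch_ms) {
            batches.push_back(current);
            current.clear();
            current_ms = 0.0;
        }
    }

    if (!current.empty()) {
        if (batches.empty()) {
            batches.push_back(current);
        } else {
            batches.back().insert(batches.back().end(),
                                  current.begin(), current.end());
        }
    }
    return batches;
}
```

Each resulting batch would then be submitted as one unit of asynchronous work behind a single fence or semaphore signal. If the only batch returned is still below the threshold, running it on the graphics queue is probably the safer choice.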