Advanced API Performance: Command Buffers

This post covers best practices for command buffers on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

Command buffers are the main mechanism for sending commands from the CPU to be executed on the GPU. By following the best practices listed in this post, you can achieve performance gains on both the CPU and the GPU by maximizing parallelism, avoiding bottlenecks, and reducing idle times on the GPU.

Recommended

  • You are responsible for achieving and controlling GPU/CPU parallelism.
    • Recording work into command lists doesn’t start any work on the GPU.
    • Work starts on the GPU only when you call ExecuteCommandLists.
  • Record work into multiple command lists in parallel, spread evenly across several threads and cores.
    • Recording commands is a CPU-intensive operation and no driver threads come to the rescue.
    • Command lists are not free-threaded, so recording in parallel means each thread records into its own command list.
  • Be aware that setting up and resetting a command list has a cost.
    • You still need a reasonable number of command lists for efficient parallel work submission.
    • Fences force the splitting of command lists for various reasons (multiple command queues, picking up the results of queries, and so on).
  • Aim for 5-10 ExecuteCommandLists calls per frame, each with enough GPU work to hide the OS scheduling overhead of the call.
    • The OS takes 50-80 microseconds to schedule command lists after the previous ExecuteCommandLists call. If the command lists in a call execute faster than that, there is a bubble in the hardware queue.
    • Check for bubbles using GPUView.
  • You can overlap graphics or compute work on the 3D queue with compute work on a dedicated asynchronous compute queue.
    • Keep in mind that even when compute tasks can in theory run in parallel with other graphics or compute tasks, the way the GPU actually schedules the parallel work may not deliver the hoped-for results.
  • Be conscious of which asynchronous compute and graphics workloads can be scheduled together. Use fences to pair up the right workloads.
  • Use the flexibility of ExecuteIndirect to offload as much CPU work as possible to the GPU and to reduce CPU-GPU synchronization points.
    • Port your scene culling system to the GPU using ExecuteIndirect.
    • Use the ExecuteIndirect count buffer to control the number of commands instead of issuing the maximum number of commands and predicating unused ones individually.
    • For Vulkan, NVIDIA provides ExecuteIndirect-style functionality with additional capabilities through the VK_NV_device_generated_commands extension.
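
The scheduling-bubble guidance above can be made concrete with a little arithmetic. The following is a minimal sketch of a simplified model, not a profiler and not a D3D12 call: the function name EstimateBubbleUs, its parameters, and the example timings are hypothetical, and the model assumes that every submission is followed by another one with a fixed scheduling latency. Real behavior should still be checked in GPUView.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified model of hardware-queue bubbles.
// gpuTimePerSubmitUs: GPU execution time (microseconds) of the command lists
//                     passed to each submission call in a frame.
// schedulingLatencyUs: assumed OS scheduling latency per call; the text
//                      quotes 50-80 microseconds.
// Assumption: every submission is followed by another one (steady state).
double EstimateBubbleUs(const std::vector<double>& gpuTimePerSubmitUs,
                        double schedulingLatencyUs) {
    assert(schedulingLatencyUs >= 0.0);
    double idleUs = 0.0;
    for (double gpuUs : gpuTimePerSubmitUs) {
        // If the GPU finishes this batch before the next batch can be
        // scheduled, the queue sits empty for the remaining time.
        idleUs += std::max(0.0, schedulingLatencyUs - gpuUs);
    }
    return idleUs;
}
```

With an assumed latency of 80 microseconds, submissions that take 200, 30, 80, and 10 microseconds on the GPU leave an estimated 50 + 70 = 120 microseconds of idle time in this model. Merging the two short submissions into their neighbors would remove it.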

Not recommended

  • Don’t exceed 1 million CBV/SRV/UAV descriptors or 2K samplers in your frame’s descriptor heap.
  • Don’t block on ExecuteCommandLists calls.
    • ExecuteCommandLists calls can be expensive. New commands can be recorded in the meantime on other threads.
    • Each command queue can use its own thread to call ExecuteCommandLists.
  • Don’t record everything, or large parts of the scene, into just a few command lists. This limits your ability to use all your CPU cores.
    • Building only a few large command lists can also make it harder to keep the GPU from going idle.
  • Don’t submit only at the end of the frame after you have recorded everything. You may waste the opportunity to keep the GPU working in parallel with the recording of other command lists.
  • Don’t expect much command list reuse.
    • Many things usually change from frame to frame, such as object visibility.
    • Post-processing may be an exception.
  • Don’t frequently mix draw, dispatch and copy commands.
    • Try to group all draw commands together, all dispatch commands together, and so on.
    • A frequent mix of different types of work on the same queue can introduce pipeline drains.
  • Don’t create too many threads or too many command lists.
    • Too many threads oversubscribe your CPU resources, while too many command lists may accumulate too much overhead.
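
The advice on grouping work by type can be illustrated with a small model. This sketch uses a hypothetical Command record and WorkType enum (neither is a D3D12 type) and assumes the commands are independent of each other, so they can be reordered freely. A real renderer has to respect resource dependencies and barriers. The code counts how often consecutive commands change type, before and after grouping.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for recorded work; not part of any graphics API.
enum class WorkType { Draw, Dispatch, Copy };

struct Command {
    WorkType type;
    int id;  // illustrative identifier for the recorded command
};

// Counts how often two consecutive commands differ in type. Each switch is
// a point where the queue may have to drain, per the guidance above.
std::size_t CountTypeSwitches(const std::vector<Command>& cmds) {
    std::size_t switches = 0;
    for (std::size_t i = 1; i < cmds.size(); ++i) {
        if (cmds[i].type != cmds[i - 1].type) ++switches;
    }
    return switches;
}

// Groups commands by type while keeping the original relative order within
// each type. Assumes the commands have no cross-type dependencies.
std::vector<Command> GroupByType(std::vector<Command> cmds) {
    std::stable_sort(cmds.begin(), cmds.end(),
                     [](const Command& a, const Command& b) {
                         return static_cast<int>(a.type) <
                                static_cast<int>(b.type);
                     });
    return cmds;
}
```

For a frame that interleaves draw, dispatch, draw, copy, dispatch, draw, the type changes five times. After grouping, it changes twice.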