Advanced API Performance: CPUs

This post covers CPU best practices when working with NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

To get the best performance from your NVIDIA GPU, pair it with efficient work delegation on the CPU. Frame-rate caps, stutter, and other performance problems can often be traced back to a CPU bottleneck. Use the following tips to understand what you should do and what to avoid.

Multithreading and workload balancing

No amount of GPU work optimization will overcome a CPU bottleneck. Evenly balance work across all threads for best results.

  • Balance command list creation and recording across all threads. The main design philosophy of D3D12 and Vulkan is to enable game engines to distribute graphics workloads across multiple CPU cores.
  • Close and reset command lists on recording worker threads.
  • Do not record CPU-intensive command lists on the thread that calls ExecuteCommandLists. Typically, ExecuteCommandLists is serialized after command list recording for a given frame. Keeping the ExecuteCommandLists call on a thread separate from all command list recording threads lets CPU work for the next frame begin with simpler load balancing.
  • Avoid fine-grained queries, such as timing queries around individual draw calls, because they add CPU overhead.
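
The first two tips can be sketched with standard C++ threads. This is only an illustration of the pattern, not D3D12 code: `CommandListStub` and `recordFrame` are hypothetical names, and "recording a draw" is reduced to pushing an integer. It assumes the draws can be split into contiguous ranges of roughly equal cost.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command list (not a D3D12 type).
struct CommandListStub {
    std::vector<int> draws;  // IDs of the draws recorded into this list
    bool closed = false;     // set by the worker thread that recorded it
};

// Split drawCount draws into workerCount near-equal contiguous ranges,
// record each range on its own thread, and close the list on that thread.
std::vector<CommandListStub> recordFrame(std::size_t drawCount,
                                         std::size_t workerCount) {
    std::vector<CommandListStub> lists(workerCount);
    if (workerCount == 0) return lists;  // nothing to record with

    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (std::size_t w = 0; w < workerCount; ++w) {
        // Even split: worker w records the half-open range [begin, end).
        const std::size_t begin = drawCount * w / workerCount;
        const std::size_t end = drawCount * (w + 1) / workerCount;
        workers.emplace_back([&lists, w, begin, end] {
            CommandListStub& list = lists[w];  // each worker owns one list
            for (std::size_t d = begin; d < end; ++d) {
                list.draws.push_back(static_cast<int>(d));  // "record" a draw
            }
            list.closed = true;  // close on the recording thread
        });
    }
    for (std::thread& t : workers) t.join();
    return lists;  // ready for a separate submission thread to hand off
}
```

In a real engine, the returned lists would be handed to the thread that calls ExecuteCommandLists, so that the recording workers are free to start on the next frame.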

ExecuteCommandLists and multiple command queues

ExecuteCommandLists (ECL) submits an array of command lists to the GPU for execution. NVIDIA hardware supports multiple command queues to parallelize graphics work, enabling graphics-compute or compute-compute work to be performed concurrently.

  • Minimize the number of ExecuteCommandLists calls by submitting command lists in as few batches as possible.
  • Consider the overhead added by synchronization when executing command lists on multiple queues. Fewer fence signal and wait calls mean less CPU overhead. Balance this against the ability to maximize workload overlap on the GPU.
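
One way to keep the number of submissions low is to collect finished command lists and hand them over in a single call. The sketch below only illustrates that idea: `QueueStub` and `SubmissionBatcher` are hypothetical names, command lists are represented by integer IDs, and `execute` stands in for a call that accepts an array of command lists.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a command queue; it only counts submissions.
struct QueueStub {
    std::size_t submitCalls = 0;     // how many times execute() was called
    std::size_t listsSubmitted = 0;  // total command lists received
    void execute(const std::vector<int>& lists) {
        ++submitCalls;
        listsSubmitted += lists.size();
    }
};

// Collects command list IDs and submits them together in one call.
class SubmissionBatcher {
public:
    explicit SubmissionBatcher(QueueStub& queue) : queue_(queue) {}

    // Queue a finished command list for the next batch.
    void add(int listId) { pending_.push_back(listId); }

    // Submit everything collected so far; does nothing if the batch is empty.
    void flush() {
        if (pending_.empty()) return;
        queue_.execute(pending_);
        pending_.clear();
    }

private:
    QueueStub& queue_;
    std::vector<int> pending_;
};
```

Calling `flush` once per frame, or once per point where the GPU must start working, turns many small submissions into one. The trade-off is that the GPU receives the work later than it would with eager submission.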

Resource allocation and destruction

Creating and destroying buffers, textures, and shaders is fundamental to efficient computer graphics.

  • Use a dedicated thread for resource creation. Creating resources can trigger costly OS paging work, and keeping it off the rendering threads prevents these hidden OS costs from blocking frame rendering.
  • Free-threaded resource creation may also be a natural fit for async copy queue uploads, which would enable completely free-threaded data uploads to video memory for newly allocated resources. Structuring uploads this way avoids adding hidden overhead to frame rendering. However, be aware that additional queues and synchronization between queues may also add CPU overhead.
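
A dedicated creation thread can be sketched as a small job queue built from standard C++ primitives. This is an assumption-laden illustration, not an engine implementation: `ResourceStub` and `ResourceCreationThread` are hypothetical names, and the "creation" step simply fills in a struct where a real application would call the graphics API.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical stand-in for a GPU resource.
struct ResourceStub {
    std::size_t sizeInBytes;
};

// Runs all resource creation on one worker thread so that rendering
// threads never block on it.
class ResourceCreationThread {
public:
    ResourceCreationThread() : worker_([this] { run(); }) {}

    ~ResourceCreationThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_one();
        worker_.join();  // pending jobs are drained before the thread exits
    }

    // Request a resource; the caller polls or waits on the future later.
    std::future<ResourceStub> createAsync(std::size_t sizeInBytes) {
        std::packaged_task<ResourceStub()> task(
            [sizeInBytes] { return ResourceStub{sizeInBytes}; });
        std::future<ResourceStub> result = task.get_future();
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(task));
        }
        wake_.notify_one();
        return result;
    }

private:
    void run() {
        for (;;) {
            std::packaged_task<ResourceStub()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // the expensive creation runs outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::packaged_task<ResourceStub()>> jobs_;
    bool stopping_ = false;
    std::thread worker_;  // declared last so it starts after the other members
};
```

A rendering thread would call `createAsync` and keep drawing with a placeholder until the future is ready, rather than waiting on it immediately.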


BuildRaytracingAccelerationStructure

Ray tracing acceleration structures are data structures that organize the geometric information of a scene to optimize the intersection tests between rays and scene objects. BuildRaytracingAccelerationStructure performs the initial construction of the acceleration structure with the scene geometry.

  • Record on a separate thread when using BuildRaytracingAccelerationStructure, preferably scheduling on an async compute queue. This API is CPU-intensive and can dominate command list recording time. 
  • Be wary of CPU overhead directly related to geometric complexity for full builds. Rebuilds should have relatively fixed overhead.
  • Be aware of the extra CPU overhead associated with FAST_TRACE builds.

For more information, see Best Practices: Using NVIDIA RTX Ray Tracing.

CreatePipelineState and CreateStateObject

CreatePipelineState is used to create a rendering pipeline state object that defines the configuration of the graphics pipeline. The pipeline state object encapsulates all of the state required to execute a graphics command, such as the input layout, shader programs, blending state, depth-stencil state, and rasterizer state.

CreateStateObject creates a state object from a set of subobjects, such as shader libraries, hit groups, root signatures, and shader and pipeline configuration. In D3D12, it is the call used to build ray tracing pipeline state objects. Dynamic state such as the viewport and scissor rectangle is not part of either kind of object; it is set on the command list.

  • Use AddToStateObject to incrementally add shader code to an existing ray tracing state object and avoid unnecessary CPU overhead.
  • Avoid needlessly creating pipeline state objects and ray tracing objects. These involve shader creation, which can consume substantial CPU cycles. Shader complexity directly affects creation-call complexity.
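
A common way to avoid redundant creation is to cache pipelines by a key derived from their description. The following is a minimal sketch under stated assumptions: `PipelineStub` and `PipelineCache` are hypothetical names, the key is a plain string that the application is assumed to build from its pipeline description, and the expensive creation call is replaced by constructing a small struct.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a compiled pipeline state object.
struct PipelineStub {
    std::string key;
};

// Creates each distinct pipeline once and reuses it on later requests.
class PipelineCache {
public:
    // Returns the cached pipeline for key, creating it only on first use.
    std::shared_ptr<PipelineStub> getOrCreate(const std::string& key) {
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // hit: no creation cost

        ++creations_;  // miss: this is where the expensive call would go
        auto pipeline = std::make_shared<PipelineStub>(PipelineStub{key});
        cache_.emplace(key, pipeline);
        return pipeline;
    }

    // Number of pipelines actually created, useful for spotting churn.
    std::size_t creations() const { return creations_; }

private:
    std::unordered_map<std::string, std::shared_ptr<PipelineStub>> cache_;
    std::size_t creations_ = 0;
};
```

This sketch is not thread-safe. If pipelines are requested from several recording threads, the lookup would need a lock or a per-thread cache.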