Optimizing Game Development with GPU Performance Events

GPU performance events can be used to instrument your game by labeling regions and marking important occurrences. A performance event represents a logical, hierarchical grouping of work, consisting of a begin/end marker pair. The best practices below describe the conventions that profiling tools, such as NVIDIA Nsight Graphics and NVIDIA Nsight Systems, rely on to help you navigate complex frame rendering.

While all modern graphics APIs (Direct3D 11, Direct3D 12, Vulkan, and OpenGL 4.3) offer a simple solution to set these begin/end performance markers, they do not enforce the conventions that profiling tools follow. When you follow these best practices, your game works better with profiling tools and it is easier for NVIDIA engineers to help you optimize your game.

Do’s for GPU Performance Events

  • Consider enabling GPU performance events in all builds, including final releases, as this results in no significant CPU overhead (at least when issuing fewer than 200 markers per frame). It also incurs no GPU overhead when no GPU profilers are enabled, and can be helpful for performance and stability analysis. There is no need to enable internal profiling at the same time, though. We recommend decoupling the marker generation from any internal profiling code paths.
  • In multi-threaded environments, set the events in the same thread that’s used to record the GPU command lists (command buffers). This provides better visibility during frame profiling, as it also shows CPU utilization for performing various graphics tasks.
  • Because there can be multiple frames in-flight in modern graphics APIs, and frame rendering has become increasingly complex, you can’t always determine the boundaries of a single frame. Use a Frame event that encompasses all other events. Exclude the Present call from this event, as it may trigger additional work at the operating system level and potentially introduce noise.
  • Make sure to provide a reasonable set of events that is just enough to understand each individual workload. For an example, see the Case Study.
    • For small workloads that consume less than 2-3% of your frame time, avoid having dedicated events rooted directly at the top-level frame. Group them into a hierarchy.
    • Consider adding different verbosity levels to your marker generation. Different tools do different types of work per marker, and this allows you to dial in the level of detail that you need per tool. While you may want full verbosity during frame debugging and frame profiling, you may want fewer levels for stutter analysis.
    • Prefer shallow layouts over complex nested schemes. Complex schemes can increase the analysis complexity and introduce more overhead, while not measurably adding more value.
  • If your application supports multiple graphics APIs, try to use the same event names. This allows certain profiling tools to do cross-API comparisons.
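
One way to combine several of these recommendations (balanced events, a single top-level Frame event, verbosity levels) is a small scoped helper. The following is a minimal sketch, not a definitive implementation: MarkerVerbosity, MarkerBackend, RecordingBackend, ScopedGpuEvent, and RecordExampleFrame are hypothetical names introduced for illustration, and the backend only logs calls so that the sketch runs without a GPU. A real backend would forward Begin and End to the graphics API on the command list being recorded.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical verbosity levels; the names are illustrative only.
enum class MarkerVerbosity { Off = 0, Coarse = 1, Full = 2 };

// Hypothetical backend interface. A real implementation would forward to the
// graphics API's begin/end calls on the command list being recorded.
struct MarkerBackend {
    virtual ~MarkerBackend() = default;
    virtual void Begin(const char* name) = 0;
    virtual void End() = 0;
};

// Stand-in backend that logs calls, so the sketch runs without a GPU.
struct RecordingBackend : MarkerBackend {
    std::vector<std::string> log;
    void Begin(const char* name) override { log.push_back(std::string("begin ") + name); }
    void End() override { log.push_back("end"); }
};

// RAII scope: emits Begin in the constructor and the matching End in the
// destructor, but only if the event's level is within the active verbosity.
class ScopedGpuEvent {
public:
    ScopedGpuEvent(MarkerBackend& backend, MarkerVerbosity active,
                   MarkerVerbosity level, const char* name)
        : backend_(backend),
          emitted_(level != MarkerVerbosity::Off && level <= active) {
        if (emitted_) backend_.Begin(name);
    }
    ~ScopedGpuEvent() { if (emitted_) backend_.End(); }
    ScopedGpuEvent(const ScopedGpuEvent&) = delete;
    ScopedGpuEvent& operator=(const ScopedGpuEvent&) = delete;

private:
    MarkerBackend& backend_;
    bool emitted_;
};

// Records a tiny frame and returns the marker calls that were emitted.
std::vector<std::string> RecordExampleFrame(MarkerVerbosity active) {
    RecordingBackend backend;
    {
        ScopedGpuEvent frame(backend, active, MarkerVerbosity::Coarse, "Frame");
        {
            ScopedGpuEvent shadows(backend, active, MarkerVerbosity::Coarse, "Shadows");
            ScopedGpuEvent cascade(backend, active, MarkerVerbosity::Full, "Cascade0");
        }
        // Present would be issued after 'frame' goes out of scope.
    }
    return backend.log;
}
```

Calling RecordExampleFrame(MarkerVerbosity::Coarse) skips the Cascade0 event but keeps Frame and Shadows balanced, while MarkerVerbosity::Full emits all three pairs. Because the end call lives in the destructor, an early return cannot leave an event open.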

Don’ts for GPU Performance Events

  • Avoid incomplete events. When viewed from the linear series of submissions to a single queue, the calls to begin and end the event must be matched and balanced. Incomplete events can cause difficulties for profiling tools.
  • Avoid event names that do not remain consistent between frames or application runs. Do not use dynamically generated names (for example, runtime string construction), because they prevent static analysis tools from correlating the events. For example, use Frame as the top-level event instead of Frame 42, or use SubPass0 instead of SubPass0 [shader: 0xF01E62D6F4]. However, if the name is constant on a per-frame basis, inserting additional information can be helpful. For example, you can use DrawTrees [MaterialOak].
  • Do not use Direct3D 9 events (D3DPERF_BeginEvent and D3DPERF_EndEvent) in Direct3D 12, as this only measures the command list creation and not the execution.
  • In Direct3D 12 and Vulkan, avoid inserting begin/end marker pairs into different command lists (command buffers) if they can be executed in separate queues. You may get incorrect timestamps for workloads this way. One valid reason for doing this, however, is to measure idle time between ExecuteCommandLists calls when synchronization with other queues takes place.

    Command lists from different queue types can be executed in parallel. Their corresponding events shouldn’t have any hierarchical dependencies that rely on the execution order.

  • Do not annotate API-level primitives, such as copy and draw calls, fences, or resource transition barriers, as these can be automatically handled by profiling tools. However, you may still want to annotate single, heavy (full screen) draw calls.
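
The first rule above, that begin and end calls must be matched and balanced in the linear submission order of a single queue, can be checked in a debug build. The following is a minimal sketch under stated assumptions: MarkerCall and MarkersAreBalanced are hypothetical names, and the engine is assumed to keep its own record of marker calls in the order they reach the queue.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record of one marker call, in the order it reaches the queue.
struct MarkerCall {
    bool isBegin;
    std::string name;  // left empty for end calls
};

// Returns true if every end has a preceding unmatched begin and no begin is
// left open at the end of the sequence.
bool MarkersAreBalanced(const std::vector<MarkerCall>& queueOrder) {
    int depth = 0;
    for (const MarkerCall& call : queueOrder) {
        if (call.isBegin) {
            ++depth;
        } else if (--depth < 0) {
            return false;  // end without a matching begin
        }
    }
    return depth == 0;  // false if an event was never closed
}
```

A sequence such as begin Frame, begin Shadows, end, end passes the check, whereas a sequence that omits the last end, or that starts with an end, is reported as unbalanced.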

GPU Performance Event Case Study

The following frame breakdown of Metro Exodus serves as an example of how to set up your GPU performance events.

Figure 1. GPU performance event frame example from Metro Exodus
Render_Frame
    Render_DownsampleHIST
    Render_Z-Fill
    Render_Scene
    Render_Details
        DET-DRAW
    Render_Terrain
        draw
    Render_Depth_2X_4X
    Render_SSAO
    Render_SSR-trace
    Render_CL-IBL
    Render_sun-smap_3
        geom
    Render_sun-smap_2
        geom
    Render_sun-smap_1
        geom
        procedurals
    Render_sun-smap_0
        geom
        procedurals
    Render_Distortion
    Render_L-Sun
    Render_L-GI
        Render_L-GI_accum
    Render_CL-LIGHT
    Render_VIA:compute
    Render_Forward_Prepare
    Render_Shade_Forward
        Render_Forward
    Render_Shade
        Render_Shade_Generic
    Render_Antialias
        prepare+DOF
        TAA
    Render_Bloom
Render_Post 
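
To illustrate how a flat stream of begin/end markers maps onto an outline like the one in Figure 1, the following minimal sketch rebuilds the indentation from nesting depth. Marker, BuildOutline, and ExampleOutline are hypothetical names, and the stream contains only a few of the event names from the figure. How an actual profiling tool reconstructs the hierarchy is not described here, so treat this as an assumption-laden illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical flat marker record; an empty name stands for an end marker.
struct Marker {
    std::string name;
    bool IsEnd() const { return name.empty(); }
};

// Rebuilds an indented outline from a begin/end stream. Each begin marker is
// printed at the current depth and then opens one more level of nesting.
std::string BuildOutline(const std::vector<Marker>& stream) {
    std::string outline;
    std::size_t depth = 0;
    for (const Marker& m : stream) {
        if (m.IsEnd()) {
            if (depth > 0) --depth;  // ignore a stray end instead of underflowing
            continue;
        }
        outline.append(depth * 4, ' ');
        outline += m.name;
        outline += '\n';
        ++depth;
    }
    return outline;
}

// A small excerpt using event names taken from Figure 1.
std::string ExampleOutline() {
    const std::vector<Marker> stream = {
        {"Render_Frame"},
            {"Render_Antialias"},
                {"prepare+DOF"}, {""},
                {"TAA"}, {""},
            {""},
            {"Render_Bloom"}, {""},
        {""},
    };
    return BuildOutline(stream);
}
```

ExampleOutline returns Render_Frame at the top level, Render_Antialias and Render_Bloom one level below it, and prepare+DOF and TAA nested under Render_Antialias, mirroring the layout in the figure.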

API

Across all graphics APIs, the definition of a GPU performance event is consistent and involves defining its range (begin/end marker pair) and assigning it a name.

Direct3D 11

INT ID3DUserDefinedAnnotation::BeginEvent(
    LPCWSTR Name
);
INT ID3DUserDefinedAnnotation::EndEvent(
);

You can open an event by calling ID3DUserDefinedAnnotation::BeginEvent and subsequently close it with ID3DUserDefinedAnnotation::EndEvent.

ID3DUserDefinedAnnotation is the preferred solution for Direct3D 11. However, it requires Microsoft Windows 8 or the Platform Update for Windows 7 to be installed. Call ID3D11DeviceContext::QueryInterface to retrieve the ID3DUserDefinedAnnotation interface from the device context.

Direct3D 12

void PIXBeginEvent(
  ID3D12CommandList* commandList, 
  UINT64 color, 
  char const* formatString, 
  ...
);

void PIXEndEvent(
  ID3D12CommandList* commandList
); 

In Direct3D 12, you can use PIXBeginEvent to open your event and finalize it with PIXEndEvent. We strongly recommend using the Microsoft PIX Events Runtime instead of the API functions (ID3D12GraphicsCommandList::BeginEvent and ID3D12GraphicsCommandList::EndEvent) directly.

We also recommend against using the following forms of PIX instrumentation functions (without command list or command queue arguments), as they do not map to underlying Direct3D functions. As such, they may not show up in all tools.

Warning: Do not use the runtime string formatting feature of these functions.

void PIXBeginEvent(UINT64 color, char const* formatString, ...)
void PIXBeginEvent(UINT64 color, wchar_t const* formatString, ...)
void PIXSetMarker(UINT64 color, char const* formatString, ...)
void PIXSetMarker(UINT64 color, wchar_t const* formatString, ...)
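
One way to discourage runtime string formatting is to route all marker calls through a wrapper that accepts only fixed names. The following is a minimal sketch, assuming a hypothetical wrapper named BeginStaticEvent and a hypothetical EmittedNames log that stands in for the real begin call so that the sketch runs on any platform. In a Direct3D 12 build, the body would instead pass the name to PIXBeginEvent with a command list and no format arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for the real begin call, so the sketch runs anywhere.
// Hypothetical: a Direct3D 12 build would call PIXBeginEvent with a command
// list, a color, and the name, without any format arguments.
std::vector<std::string>& EmittedNames() {
    static std::vector<std::string> names;
    return names;
}

// Accepts only character arrays such as string literals. Passing a
// std::string, a const char* variable, or printf-style arguments does not
// compile. Note that a locally built char array would still be accepted, so
// this is a guard rail rather than a guarantee.
template <std::size_t N>
void BeginStaticEvent(const char (&name)[N]) {
    EmittedNames().push_back(std::string(name));
}
```

With this wrapper, BeginStaticEvent("Frame") and BeginStaticEvent("DrawTrees [MaterialOak]") compile, while a call that passes the result of runtime string construction is rejected by the compiler.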

Vulkan

void vkCmdBeginDebugUtilsLabelEXT(
  VkCommandBuffer             commandBuffer,
  const VkDebugUtilsLabelEXT *pLabelInfo
);

void vkCmdEndDebugUtilsLabelEXT(
  VkCommandBuffer             commandBuffer
);

In Vulkan, events are referred to as labels. They should be defined on the command buffer using the VK_EXT_debug_utils extension. You create a label region with vkCmdBeginDebugUtilsLabelEXT and subsequently close it with vkCmdEndDebugUtilsLabelEXT.

OpenGL

void glPushDebugGroup(
  GLenum     source,
  GLuint     id,
  GLsizei    length,
  const char *message
);

void glPopDebugGroup(void);

OpenGL applications can define events, referred to as debug groups, by calling glPushDebugGroup to set the beginning of the event and finalizing it with glPopDebugGroup. These functions are part of the OpenGL 4.3 standard. For earlier revisions, you can use the KHR_debug extension, which offers similar functionality.

Acknowledgements

I would like to thank the following NVIDIA colleagues for their valuable expertise and feedback on these GPU performance event best practices: Evgeny Makarov, Iain Cantlay, Juha Sjoholm, Louis Bavoil, Daniel Price, Jeffrey Kiel, Daniel Horowitz, Doron Ofek, Leroy Sikkes, and Mathias Schott.