GPU performance events can be used to instrument your game by labeling regions and marking important occurrences. A performance event represents a logical, hierarchical grouping of work, consisting of a begin/end marker pair. Profiling tools such as NVIDIA Nsight Graphics and NVIDIA Nsight Systems universally rely on a set of best practices for GPU performance events to help you navigate complex frame rendering.
While all modern graphics APIs (Direct3D 11, Direct3D 12, Vulkan, and OpenGL 4.3) offer a simple solution to set these begin/end performance markers, they do not enforce the conventions that profiling tools follow. When you follow these best practices, your game works better with profiling tools and it is easier for NVIDIA engineers to help you optimize your game.
Do’s for GPU Performance Events
- Consider enabling GPU performance events in all builds, including final releases, as this results in no significant CPU overhead (at least when issuing fewer than 200 markers per frame). It also incurs no GPU overhead when no GPU profilers are enabled, and can be helpful for performance and stability analysis. There is no need to enable internal profiling at the same time, though. We recommend decoupling the marker generation from any internal profiling code paths.
- In multi-threaded environments, set the events in the same thread that’s used to record the GPU command lists (command buffers). This provides better visibility during frame profiling, as it also shows CPU utilization for performing various graphics tasks.
- Because there can be multiple frames in-flight in modern graphics APIs, and frame rendering has become increasingly complex, you can’t always determine the boundaries of a single frame. Use a Frame event that encompasses all other events. Exclude the Present call from this event, as it may trigger additional work at the operating system level and potentially introduce noise.
- Make sure to provide a reasonable set of events that is just enough to understand each individual workload. For an example, see the Case Study.
- For small workloads that consume less than 2-3% of your frame time, avoid having dedicated events rooted directly at the top-level frame. Group them into a hierarchy.
- Consider adding different verbosity levels to your marker generation. Different tools do different types of work per marker, and this allows you to dial in the level of detail that you need per tool. While you may want full verbosity during frame debugging and frame profiling, you may want fewer levels for stutter analysis.
- Prefer shallow layouts over complex nested schemes. Complex schemes can increase the analysis complexity and introduce more overhead, while not measurably adding more value.
- If your application supports multiple graphics APIs, try to use the same event names. This allows certain profiling tools to do cross-API comparisons.
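Several of the recommendations above (a top-level Frame event, verbosity levels, and marker generation that is decoupled from any internal profiler) can be sketched with a small RAII guard. This is only an illustration under stated assumptions: `MarkerVerbosity`, `MarkerSink`, `ScopedGpuEvent`, and `recordFrame` are hypothetical names, and the sink is a pair of callbacks standing in for whichever graphics API call your engine actually issues.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical verbosity levels; names and ordering are illustrative only.
enum class MarkerVerbosity { Off = 0, Coarse = 1, Full = 2 };

// Hypothetical sink: a pair of callbacks standing in for the real API calls
// that begin and end an event on the command list being recorded.
struct MarkerSink {
    std::function<void(const char*)> begin;
    std::function<void()> end;
};

// RAII guard: the end marker is emitted when the scope exits, so begin/end
// pairs stay balanced even when a function returns early.
class ScopedGpuEvent {
public:
    ScopedGpuEvent(MarkerSink& sink, MarkerVerbosity active,
                   MarkerVerbosity required, const char* staticName)
        : sink_(sink),
          emitted_(required != MarkerVerbosity::Off && active >= required) {
        assert(staticName != nullptr);  // names are expected to be constants
        if (emitted_) sink_.begin(staticName);
    }
    ~ScopedGpuEvent() {
        if (emitted_) sink_.end();
    }
    ScopedGpuEvent(const ScopedGpuEvent&) = delete;
    ScopedGpuEvent& operator=(const ScopedGpuEvent&) = delete;

private:
    MarkerSink& sink_;
    bool emitted_;
};

// Records the markers one frame would emit at the given verbosity level.
std::vector<std::string> recordFrame(MarkerVerbosity active) {
    std::vector<std::string> log;
    MarkerSink sink{
        [&log](const char* name) { log.push_back(std::string("begin ") + name); },
        [&log]() { log.push_back("end"); }};
    {
        // Top-level event that encompasses all other events of the frame.
        ScopedGpuEvent frame(sink, active, MarkerVerbosity::Coarse, "Frame");
        {
            ScopedGpuEvent shadows(sink, active, MarkerVerbosity::Coarse, "Shadows");
            // Finer-grained event that only appears at full verbosity.
            ScopedGpuEvent cascade(sink, active, MarkerVerbosity::Full, "ShadowCascade0");
        }
        {
            ScopedGpuEvent post(sink, active, MarkerVerbosity::Coarse, "PostProcessing");
        }
    }  // Frame ends here; a Present call would be issued after this scope.
    return log;
}
```

Calling `recordFrame(MarkerVerbosity::Coarse)` records the Frame, Shadows, and PostProcessing events only, while `MarkerVerbosity::Full` also records the nested cascade event. The event names are string literals, so they stay identical between frames and runs.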
Don’ts for GPU Performance Events
- Avoid incomplete events. When viewed from the linear series of submissions to a single queue, the calls to begin and end the event must be matched and balanced. Incomplete events can cause difficulties for profiling tools.
- Avoid event names that do not remain consistent between frames or application runs. Do not use dynamically generated names (for example, runtime string construction), as it prevents static analysis tools from correlating the events. For example, use Frame as the top-level event instead of Frame 42 or use SubPass0 instead of SubPass0 [shader: 0xF01E62D6F4]. However, if the name is constant on a per-frame basis, inserting additional information can be helpful. For example, you can use DrawTrees [MaterialOak].
- Do not use Direct3D 9 events (for example, D3DPERF_EndEvent) in Direct3D 12, as this only measures the command list creation and not the execution.
- In Direct3D 12 and Vulkan, avoid inserting begin/end marker pairs into different command lists (command buffers) if they can be executed in separate queues. You may get incorrect timestamps for workloads this way. One good reason for doing this anyway is to measure idle times between ExecuteCommandLists calls when synchronization with other queues takes place.
Command lists from different queue types can be executed in parallel. Their corresponding events shouldn’t have any hierarchical dependencies that rely on the execution order.
- Do not annotate API-level primitives, such as copy and draw calls, fences, or resource transition barriers, as these can be automatically handled by profiling tools. However, you may still want to annotate single, heavy (full screen) draw calls.
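The rule about incomplete events can be checked mechanically in debug builds. The following is a minimal sketch, assuming your engine can log the marker operations it submits to one queue in submission order; `MarkerOp`, `BalanceReport`, and `checkQueueMarkers` are hypothetical names and not part of any graphics API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One marker operation as seen in the linear series of submissions to a
// single queue. The type is hypothetical and only used for this check.
struct MarkerOp {
    enum Kind { Begin, End } kind;
    std::string name;  // only meaningful for Begin operations
};

struct BalanceReport {
    bool balanced;         // true when every Begin has a matching End
    std::size_t maxDepth;  // deepest nesting level that was reached
};

// Walks the operations in submission order and tracks the nesting depth.
BalanceReport checkQueueMarkers(const std::vector<MarkerOp>& ops) {
    std::size_t depth = 0;
    std::size_t maxDepth = 0;
    for (const MarkerOp& op : ops) {
        if (op.kind == MarkerOp::Begin) {
            ++depth;
            if (depth > maxDepth) maxDepth = depth;
        } else {
            if (depth == 0) return {false, maxDepth};  // End without a Begin
            --depth;
        }
    }
    return {depth == 0, maxDepth};  // leftover depth means an unclosed event
}

// Possible debug-build hook: fail fast when a queue's markers are unbalanced.
void assertQueueMarkersBalanced(const std::vector<MarkerOp>& ops) {
    assert(checkQueueMarkers(ops).balanced);
}
```

The reported `maxDepth` can also serve as a rough signal for the earlier advice to prefer shallow layouts, for example by logging a warning when it exceeds a threshold of your choosing.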
GPU Performance Event Case Study
The following frame breakdown of Metro Exodus serves as an example of how to set up your GPU performance events.
Render_Frame Render_DownsampleHIST Render_Z-Fill Render_Scene Render_Details DET-DRAW Render_Terrain draw Render_Depth_2X_4X Render_SSAO Render_SSR-trace Render_CL-IBL Render_sun-smap_3 geom Render_sun-smap_2 geom Render_sun-smap_1 geom procedurals Render_sun-smap_0 geom procedurals Render_Distortion Render_L-Sun Render_L-GI Render_L-GI_accum Render_CL-LIGHT Render_VIA:compute Render_Forward_Prepare Render_Shade_Forward Render_Forward Render_Shade Render_Shade_Generic Render_Antialias prepare+DOF TAA Render_Bloom Render_Post
Across all graphics APIs, the definition of a GPU performance event is consistent and involves defining its range (begin/end marker pair) and assigning it a name.
INT ID3DUserDefinedAnnotation::BeginEvent( LPCWSTR Name );
INT ID3DUserDefinedAnnotation::EndEvent( );
ID3DUserDefinedAnnotation is the preferred solution for Direct3D 11. However, it requires Microsoft Windows 8 or the Platform Update for Windows 7 to be installed. Call ID3D11DeviceContext::QueryInterface to retrieve the ID3DUserDefinedAnnotation interface from the device context.
void PIXBeginEvent( ID3D12CommandList* commandList, UINT64 color, char const* formatString, ... );
void PIXEndEvent( ID3D12CommandList* commandList );
In Direct3D 12, you can use PIXBeginEvent to open your event and finalize it with PIXEndEvent. We strongly recommend using the Microsoft PIX Events Runtime instead of the API functions.
We also recommend against using the following forms of PIX instrumentation functions (without command list or command queue arguments), as they do not map to underlying Direct3D functions. As such, they may not show up in all tools.
Warning: Do not use the runtime string formatting feature of these functions.
void PIXBeginEvent(UINT64 color, char const* formatString, ...);
void PIXBeginEvent(UINT64 color, wchar_t const* formatString, ...);
void PIXSetMarker(UINT64 color, char const* formatString, ...);
void PIXSetMarker(UINT64 color, wchar_t const* formatString, ...);
void vkCmdBeginDebugUtilsLabelEXT( VkCommandBuffer commandBuffer, const VkDebugUtilsLabelEXT *pLabelInfo );
void vkCmdEndDebugUtilsLabelEXT( VkCommandBuffer commandBuffer );
In Vulkan, events are referred to as labels. They should be defined on the command buffer using the VK_EXT_debug_utils extension. You create a label region with vkCmdBeginDebugUtilsLabelEXT and subsequently close it with vkCmdEndDebugUtilsLabelEXT.
void glPushDebugGroup( GLenum source, GLuint id, GLsizei length, const char *message );
void glPopDebugGroup( );
OpenGL applications can define events, referred to as debug groups, by calling glPushDebugGroup to set the beginning of the event and finalizing it with glPopDebugGroup. These functions are part of the OpenGL 4.3 standard. For earlier revisions, you can use the KHR_debug extension, which offers similar functionality.
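Because every API reduces an event to a named begin/end pair, an engine that targets several APIs can route all markers through one thin interface and define the event names once. The sketch below shows one possible shape. `GpuMarkerBackend`, `RecordingBackend`, `EventNames`, `widenAscii`, and `renderFrame` are hypothetical names, and the recording backend only logs calls instead of touching a GPU API; the comments name the real functions a production backend would forward to.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical interface; each graphics backend forwards to its own API.
class GpuMarkerBackend {
public:
    virtual ~GpuMarkerBackend() = default;
    virtual void beginEvent(const char* name) = 0;
    virtual void endEvent() = 0;
};

// Stand-in backend that records calls instead of talking to a GPU API.
// A real backend would forward to, for example:
//   Direct3D 11: ID3DUserDefinedAnnotation::BeginEvent / EndEvent
//   Direct3D 12: PIXBeginEvent / PIXEndEvent with a command list argument
//   Vulkan:      vkCmdBeginDebugUtilsLabelEXT / vkCmdEndDebugUtilsLabelEXT
//   OpenGL:      glPushDebugGroup / glPopDebugGroup
class RecordingBackend : public GpuMarkerBackend {
public:
    void beginEvent(const char* name) override {
        assert(name != nullptr);
        calls.push_back(std::string("begin ") + name);
    }
    void endEvent() override { calls.push_back("end"); }
    std::vector<std::string> calls;
};

// ID3DUserDefinedAnnotation::BeginEvent takes a wide string (LPCWSTR), so a
// Direct3D 11 backend needs a conversion. This helper handles ASCII names only.
std::wstring widenAscii(const char* name) {
    assert(name != nullptr);
    std::wstring wide;
    for (const char* p = name; *p != '\0'; ++p) {
        assert(static_cast<unsigned char>(*p) < 0x80);  // ASCII only
        wide.push_back(static_cast<wchar_t>(*p));
    }
    return wide;
}

// Event names are defined once, so every backend emits identical strings.
namespace EventNames {
constexpr const char* Frame = "Frame";
constexpr const char* Shadows = "Shadows";
}  // namespace EventNames

// Rendering code only sees the interface, never a specific graphics API.
void renderFrame(GpuMarkerBackend& gpu) {
    gpu.beginEvent(EventNames::Frame);
    gpu.beginEvent(EventNames::Shadows);
    gpu.endEvent();  // closes Shadows
    gpu.endEvent();  // closes Frame
}
```

Keeping the names in one place is one way to follow the earlier recommendation to use the same event names across APIs, which lets profiling tools compare captures between them.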
I would like to thank the following NVIDIA colleagues for their valuable expertise and feedback on these GPU performance event best practices: Evgeny Makarov, Iain Cantlay, Juha Sjoholm, Louis Bavoil, Daniel Price, Jeffrey Kiel, Daniel Horowitz, Doron Ofek, Leroy Sikkes, and Mathias Schott.