We are introducing the VK_NVX_device_generated_commands (DGC) Vulkan extension, which allows the GPU to generate the most frequent rendering commands on its own. Vulkan therefore now has functionality similar to DirectX 12’s ExecuteIndirect; in addition, we added the ability to change shaders (pipelines) on the GPU. This means that, for the first time, an open graphics API provides functionality to create compact command buffer streams on the device, avoiding the worst-case state setup of previous indirect drawing methods.

Motivation

With the general advances in programmable shading, the GPU can take over an ever-increasing set of responsibilities for rendering, by computing supplemental data and allowing a greater variety of rendering algorithms to be implemented. However, when it came to setting up the state for draw or compute calls, the decisions mostly had to be made on the CPU. This required either explicit synchronization or working from a past frame’s results. With device-generated commands, this added latency can be removed and existing inefficiencies overcome.

Some usage scenarios that can benefit from this functionality:

  • Occlusion culling: Use custom shader-based culling techniques to allow greater accuracy with easy access to the latest depth-buffer information.
  • Object sorting: In scenarios limited by forward shading, sort the objects front to back to achieve better performance.
  • Level-of-detail: Influence the shaders or geometry used for an object, depending on its screen-space footprint.
  • Work distribution for improved efficiency: Bucket arbitrary workloads for improved coherency in their resource usage, for example by using more specialized shaders that have compile-time optimizations applied or binding an appropriate uniform buffer.

Evolution of GPU Work Creation

The development of this extension follows a trend in recent years to increase the capability of the GPU to generate its own work. The first such capability was Draw Indirect, where just the basic parameters for a draw call (number of primitives, instances, etc.) could be sourced from GPU memory. Next came the very popular Multi Draw Indirect (MDI) feature, which is practical both for saving CPU time and for GPU culling or level-of-detail algorithms. Following this, our own OpenGL GL_NV_command_list extension added the ability to do basic state changes (e.g. vertex/index/uniform buffers) via tokens stored in regular GPU buffers, which could be manipulated through shaders just like MDI. Similar functionality was later exposed in DirectX 12 as ExecuteIndirect, and is now available in Vulkan.

However, there are issues with existing approaches to GPU work creation. For example, command buffers have to be prepared on the CPU for a worst-case scenario, and then rely on empty draw or dispatch indirect commands to skip GPU work. This can cause inefficiencies in the GPU’s command processor, as it may not detect state setup that later turns out to be superfluous. It may also cause over-fetching of command data that is ultimately not required. Therefore, in this extension, we have added further improvements, such as the ability to switch pipelines.

Principal Usage

We introduce two new object types in Vulkan:

  • VkObjectTableNVX: The object table makes the resources/bindings accessible to the device-generated commands. We cannot use the CPU pointers for the various objects as “function arguments” on the device, so instead we use the combination of the table and a uint32_t index into the table. The developer provides the table size for each resource type.
  • VkIndirectCommandsLayoutNVX: The layout encodes the sequence of commands that will be generated. In addition to the command tokens and their functional arguments that need to be known at layout-creation time (binding sets, usage of dynamic offsets, etc.), additional usage flags that affect the generation can also be provided.
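The idea behind the object table can be sketched on the CPU as a set of per-type arrays addressed by a uint32_t index. This is only a conceptual model, not the extension's API: ObjectTableModel, EntryType, Handle, registerObject and resolve are hypothetical names invented for this illustration, and a string stands in for an opaque API handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical resource categories used only in this sketch.
enum class EntryType { Pipeline, DescriptorSet, VertexBuffer, IndexBuffer };
constexpr std::size_t kEntryTypeCount = 4;

// Stand-in for an opaque API handle (a pipeline, a buffer, ...).
using Handle = std::string;

class ObjectTableModel {
public:
    // The developer chooses the capacity of each per-type table up front.
    explicit ObjectTableModel(const std::vector<std::uint32_t>& sizes) {
        assert(sizes.size() == kEntryTypeCount);
        for (std::uint32_t n : sizes) {
            tables_.push_back(std::vector<Handle>(n));
        }
    }

    // Register a resource at an arbitrary, user-managed index.
    void registerObject(EntryType type, std::uint32_t index, Handle handle) {
        std::vector<Handle>& table = tables_[static_cast<std::size_t>(type)];
        assert(index < table.size());
        table[index] = std::move(handle);
    }

    // Conceptually what command generation does with an input value:
    // (type, index) is turned back into the resource to bind.
    const Handle& resolve(EntryType type, std::uint32_t index) const {
        const std::vector<Handle>& table = tables_[static_cast<std::size_t>(type)];
        assert(index < table.size());
        return table[index];
    }

private:
    std::vector<std::vector<Handle>> tables_;
};
```

In this model the input buffers only ever carry small integers, which is why a shader can write them without knowing anything about CPU-side object pointers.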

These steps need to be taken to generate commands on the device:

  • Make resource bindings (VkBuffer, VkPipeline, VkDescriptorSet…) accessible to the device by registering them in a VkObjectTableNVX at an arbitrary, user-managed index.
  • Define the sequence of commands that should be generated, as well as some of their arguments (e.g. binding set, dynamic offset usage, etc.), through the VkIndirectCommandsLayoutNVX object.
  • Fill one or more VkBuffer objects with the appropriate content, which is interpreted according to the VkIndirectCommandsLayoutNVX (for bindings, the object table indices are used, and one can also provide dynamic buffer offsets as uint32_t values).
  • Reserve command space via vkCmdReserveSpaceForCommandsNVX in a secondary VkCommandBuffer where the generated commands will be recorded.
  • Generate the actual commands via vkCmdProcessCommandsNVX, passing in all required data. The command buffer that runs the processing can be different from the target command buffer.
  • Execute the commands by passing the target command buffer to vkCmdExecuteCommands.
  • Use the VK_PIPELINE_STAGE_COMMAND_PROCESS_BIT_NVX barrier to ensure the command buffer space is available for writing or reading.
  • It is recommended that the number of resources registered in the object table be minimized, and that dynamic offsets be static instead, if possible.
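The reserve, generate and execute steps above can be modelled on the CPU as follows. This is a toy sketch under stated assumptions, not the extension's API: Token, SequenceInputs, TargetBuffer, reserveSpace, processCommands and execute are hypothetical names, commands are represented as strings, and synchronization (the barrier step) is left out entirely. The statefulBeforeWork check reflects the article's rule that stateful tokens must be defined before the work-provoking one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical token kinds: one stateful, one work-provoking.
enum class Token { BindPipeline, Draw };

// One input array per token; entry s belongs to sequence s.
struct SequenceInputs {
    std::vector<std::uint32_t> pipelineIndex;  // object table indices
    std::vector<std::uint32_t> vertexCount;    // draw arguments
};

// Stand-in for the secondary command buffer that receives generated commands.
struct TargetBuffer {
    std::size_t reservedCommands = 0;
    std::vector<std::string> commands;
};

// Stateful tokens must come before the work-provoking token.
bool statefulBeforeWork(const std::vector<Token>& layout) {
    bool seenWork = false;
    for (Token t : layout) {
        if (t == Token::Draw) seenWork = true;
        else if (seenWork) return false;
    }
    return true;
}

// Reserve space for the worst case: every sequence emits every token.
void reserveSpace(TargetBuffer& target, const std::vector<Token>& layout,
                  std::size_t maxSequences) {
    target.reservedCommands = layout.size() * maxSequences;
}

// Generate: expand each sequence into concrete commands in the target.
void processCommands(TargetBuffer& target, const std::vector<Token>& layout,
                     const SequenceInputs& in, std::size_t sequenceCount) {
    assert(statefulBeforeWork(layout));
    assert(layout.size() * sequenceCount <= target.reservedCommands);
    assert(sequenceCount <= in.pipelineIndex.size());
    assert(sequenceCount <= in.vertexCount.size());
    target.commands.clear();  // reserved space is reused by a new generation
    for (std::size_t s = 0; s < sequenceCount; ++s) {
        for (Token t : layout) {
            if (t == Token::BindPipeline) {
                target.commands.push_back(
                    "bindPipeline(" + std::to_string(in.pipelineIndex[s]) + ")");
            } else {
                target.commands.push_back(
                    "draw(" + std::to_string(in.vertexCount[s]) + ")");
            }
        }
    }
}

// Execute: replay the generated commands; returns them as a log.
std::vector<std::string> execute(const TargetBuffer& target) {
    return target.commands;
}
```

Note that in this model the reservation is sized from the maximum sequence count, while generation may produce fewer sequences, and that execute can be called repeatedly on the same target without regenerating.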

Key Design Choices

Compared to the approaches in GL_NV_command_list or DirectX 12’s ExecuteIndirect, we made some different design choices:

  • Separated Generation from Execution: This extension separates the generation of commands from their execution. While the generating command can behave similarly to ExecuteIndirect and execute directly as well, it also allows the commands to be stored into the reserved space of a different target command buffer.
    • Explicit Command Space Reservation: The reservation of space is handled explicitly and behaves like a traditional rendering command. This means that the existing rendering state up through the reservation command (render pass, etc.) is inherited.
    • Reusable Generated Command Buffers: A generated command buffer behaves like a CPU-recorded secondary command buffer and can be executed multiple times. Furthermore, its space can be reused for new command generation, as long as certain criteria concerning its allocation are met.
  • Stateless Command Sequences: While the new design also uses flexible command tokens, they define a stateless sequence. The GL_NV_command_list extension encoded a completely serial token stream, which allowed the developer to resolve redundant state setup and create compact streams. However, it is not as portable or as friendly to parallel processing as the traditional stateless design of MDI or ExecuteIndirect. Not all command types need to be generated, although stateful commands must be defined prior to the work-provoking ones.

    VkIndirectCommandsTokenTypeNVX                 Equivalent vkCmd          Property
    VK_INDIRECT_COMMANDS_TOKEN_PIPELINE_NVX        vkCmdBindPipeline         stateful
    VK_INDIRECT_COMMANDS_TOKEN_DESCRIPTOR_SET_NVX  vkCmdBindDescriptorSets   stateful
    VK_INDIRECT_COMMANDS_TOKEN_INDEX_BUFFER_NVX    vkCmdBindIndexBuffer      stateful
    VK_INDIRECT_COMMANDS_TOKEN_VERTEX_BUFFER_NVX   vkCmdBindVertexBuffers    stateful
    VK_INDIRECT_COMMANDS_TOKEN_PUSH_CONSTANT_NVX   vkCmdPushConstants        stateful
    VK_INDIRECT_COMMANDS_TOKEN_DRAW_INDEXED_NVX    vkCmdDrawIndexedIndirect  stateless
    VK_INDIRECT_COMMANDS_TOKEN_DRAW_NVX            vkCmdDrawIndirect         stateless
    VK_INDIRECT_COMMANDS_TOKEN_DISPATCH_NVX        vkCmdDispatchIndirect     stateless

  • Input Data as Structure of Arrays (SoA): Previous approaches used a single input buffer stream. MDI and ExecuteIndirect are designed as an array of structures (AoS), which makes partial updates less cache-efficient and does not allow dynamic content to be separated from static content, or the latter to be reused easily. DGC uses a structure of arrays, so each command token input can be stored in its own buffer as a compact, cache-friendly stream.
    • Variable Input Divisor: One benefit of using SoA is that each input array can be indexed at a different rate: buffer[i / divisor]. This way, several draw calls can reference the same data more easily.

  • Custom Sequence Ordering: By default, all sequences are processed in the order in which they are provided. The developer is given the option to relax the ordering requirement to be non-coherent, i.e. implementation-dependent. Alternatively, the ordering can be provided through another buffer.
    • Sequence Count & Index Buffer: The number of sequences as well as the subset can be provided by additional buffers containing uint32_t values. This allows for easy sorting of draw calls without having to sort the input data.
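The input divisor and the sequence count and index buffers can be illustrated with a small CPU-side model. It is a sketch under assumptions rather than the extension's implementation: InputStream, GeneratedDraw, fetch and generate are hypothetical names, and plain vectors stand in for GPU buffers of uint32_t values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One SoA input stream: its own array plus the rate at which it advances.
struct InputStream {
    std::vector<std::uint32_t> data;
    std::uint32_t divisor;  // sequence i reads data[i / divisor]
};

// Fetch the element a given sequence sees in one stream.
std::uint32_t fetch(const InputStream& stream, std::uint32_t sequence) {
    assert(stream.divisor > 0);
    const std::uint32_t element = sequence / stream.divisor;
    assert(element < stream.data.size());
    return stream.data[element];
}

// Result of expanding one sequence in this model.
struct GeneratedDraw {
    std::uint32_t pipelineIndex;
    std::uint32_t vertexCount;
};

// Expand sequences in the order given by sequenceIndices; only the first
// sequenceCount entries of that index list are used.
std::vector<GeneratedDraw> generate(const InputStream& pipelines,
                                    const InputStream& draws,
                                    const std::vector<std::uint32_t>& sequenceIndices,
                                    std::uint32_t sequenceCount) {
    assert(sequenceCount <= sequenceIndices.size());
    std::vector<GeneratedDraw> out;
    for (std::uint32_t n = 0; n < sequenceCount; ++n) {
        const std::uint32_t i = sequenceIndices[n];
        out.push_back(GeneratedDraw{fetch(pipelines, i), fetch(draws, i)});
    }
    return out;
}
```

For example, with a pipeline stream of two entries and a divisor of 2, four draws share two pipeline indices, and reordering or trimming the draws only requires rewriting the small index list and the count, while the input streams themselves stay untouched.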

What Is the Catch?

No feature is free of trade-offs. The device generation approach means that some driver-side optimizations may not apply. Furthermore, the generation process adds to the frame time, whereas the CPU is able to record commands without affecting GPU time. Last but not least, making the resource bindings available on the GPU, as well as reserving the command buffer space for worst-case execution, requires additional GPU memory.

In summary, the goal of this extension is primarily reducing the amount of actual work done on the GPU, and not off-loading command generation to the GPU in general.

The extension is purposefully labeled as experimental (NVX, not NV) in order to gather early feedback from developers and researchers who want to play with these new features. We hope to eventually follow up with a refined extension that could also be supported by other vendors. Being experimental and very new, there will be TDRagons!

Show Me the Code!

The threaded CAD scene DesignWorks sample was updated to provide a basic look into how the extension works. Stay tuned for further updates on this extension and its availability.


Special thanks to Mathias Schott, Daniel Koch, James Helferty and Markus Tavenrath who contributed to this article.