Tips and Tricks: Vulkan Dos and Don’ts

Note: This post was updated on 1/14/2025.

The increased performance potential of modern graphics APIs comes with a dramatically increased level of developer responsibility. Making optimal use of Vulkan is not trivial, especially in the context of a large engine, and information about how to maximize performance is still somewhat sparse. The following document contains best-practice recommendations for achieving high performance on NVIDIA hardware. It is not exhaustive and is expected to be augmented over time, but it should be a useful stepping stone for developers looking to utilize Vulkan to its full potential.

Engine Architecture

Do

  • Parallelize command buffer recording, image and buffer creation, descriptor set updates, pipeline creation, and memory allocation / binding. A task-graph architecture is a good option: it enables broad parallelism in draw submission while respecting resource and command-queue dependencies.

Don’t

  • Don’t expect the driver to move processing of your Vulkan API commands to a worker thread. While the total cost of recording command buffers on Vulkan should be relatively low, the amount of work measured on the application’s thread may be larger due to the loss of driver threading. The more efficiently the application spreads recording across parallel CPU cores, the greater the draw call submission performance it can expect.

Work Submission

Do

  • Build command buffers in parallel and evenly across several threads/cores (see the sketch after this list). Recording commands is a CPU-intensive operation; multi-threading is a readily available solution that the API was designed for.
  • Be aware of the cost of setting up and resetting a command buffer. A reasonable number of command buffers are required for efficient parallel work submission.
  • Try to minimize the number of queue submissions. Each vkQueueSubmit() has a significant performance cost on CPU, so lower is generally better.
  • If command recording on the CPU is heavy, aggressive queue submission batching may result in additional latency penalty, while performance may remain the same. If latency is important for your application consider submitting GPU workloads earlier.
  • Functions such as vkAllocateCommandBuffers(), vkBeginCommandBuffer(), and vkEndCommandBuffer() should be called from the thread that fills the command buffer. These calls take measurable CPU time and therefore should not be funneled through a single dedicated thread.
  • Check for gaps in execution on GPU using Nsight Systems, GPUView, or NvAPI_GPU_ClientRegisterForUtilizationSampleUpdates.
  • Reuse command buffers when possible. Secondary command buffers can be helpful here, depending on the workload – check carefully to determine if they are actually advantageous. Use VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT for command buffers that will be submitted only once. Use VK_COMMAND_BUFFER_USAGE_SIMULTANEOUS_USE_BIT only when it is really necessary, as it may hurt GPU performance.
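
A minimal sketch of the recording pattern above, assuming one VkCommandPool per recording thread (command pools are not thread-safe) and a hypothetical recordDrawsForSlice() helper that records each thread’s share of the draw calls; device and queue setup is assumed to exist elsewhere:

```cpp
#include <vulkan/vulkan.h>
#include <thread>
#include <vector>

void recordAndSubmit(VkDevice device, VkQueue queue,
                     const std::vector<VkCommandPool>& perThreadPools,
                     uint32_t threadCount)
{
    std::vector<VkCommandBuffer> cmdBufs(threadCount);
    std::vector<std::thread> workers;

    for (uint32_t t = 0; t < threadCount; ++t) {
        workers.emplace_back([&, t] {
            // Allocate/begin/end on the same thread that records the commands.
            VkCommandBufferAllocateInfo ai{VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO};
            ai.commandPool        = perThreadPools[t];
            ai.level              = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
            ai.commandBufferCount = 1;
            vkAllocateCommandBuffers(device, &ai, &cmdBufs[t]);

            VkCommandBufferBeginInfo bi{VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO};
            bi.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT; // submitted once
            vkBeginCommandBuffer(cmdBufs[t], &bi);
            // recordDrawsForSlice(cmdBufs[t], t); // hypothetical per-thread workload
            vkEndCommandBuffer(cmdBufs[t]);
        });
    }
    for (auto& w : workers) w.join();

    // One batched submission: each vkQueueSubmit() has a significant CPU cost.
    VkSubmitInfo si{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    si.commandBufferCount = threadCount;
    si.pCommandBuffers    = cmdBufs.data();
    vkQueueSubmit(queue, 1, &si, VK_NULL_HANDLE);
}
```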

Don’t

  • Don’t wait for a queue submission to finish; continue preparing the next queue submission or start preparing the next frame to avoid serialization (see the frames-in-flight sketch after this list).
  • Don’t submit small amounts of GPU work. If queue submissions are processed on the GPU faster than new ones can be produced on the CPU, the result is wasted, idle GPU cycles.
  • Don’t record tiny command buffers that contain only a few small draw calls or small compute dispatches. Each command buffer contains some additional GPU work inserted by the driver (state reset and driver optimizations). Another reason is that compute work from one command buffer can’t overlap with compute work in subsequent command buffers, even if there are no barriers. It’s also important to take this into account when designing command buffer reuse.
  • Don’t overlap compute work on the graphics queue with compute work on a dedicated asynchronous compute queue on pre-Ampere GPUs. This may lead to gaps in execution of the asynchronous compute queue.
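
A minimal frames-in-flight sketch of the first point above, assuming per-frame fences created with VK_FENCE_CREATE_SIGNALED_BIT; the FrameContext layout and the FRAMES_IN_FLIGHT count are illustrative assumptions:

```cpp
#include <vulkan/vulkan.h>

constexpr uint32_t FRAMES_IN_FLIGHT = 2;

struct FrameContext {
    VkCommandBuffer cmd;
    VkFence         inFlight; // assumed created with VK_FENCE_CREATE_SIGNALED_BIT
};

void drawFrame(VkDevice device, VkQueue queue,
               FrameContext (&frames)[FRAMES_IN_FLIGHT], uint32_t& frameIndex)
{
    FrameContext& f = frames[frameIndex];

    // Wait only for the submission that last used *this* frame's resources;
    // the other in-flight frames keep the GPU fed in the meantime.
    vkWaitForFences(device, 1, &f.inFlight, VK_TRUE, UINT64_MAX);
    vkResetFences(device, 1, &f.inFlight);

    // ... re-record f.cmd for the new frame here ...

    VkSubmitInfo si{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    si.commandBufferCount = 1;
    si.pCommandBuffers    = &f.cmd;
    vkQueueSubmit(queue, 1, &si, f.inFlight);

    frameIndex = (frameIndex + 1) % FRAMES_IN_FLIGHT;
}
```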

Pipeline

Do

  • Create pipelines asynchronously to rendering.
  • Use a pipeline cache.
  • Use specialization constants. This may reduce the number of instructions and registers used by the shader (see the sketch after this list).
  • Specialization constants can also be used instead of offline shader permutations to minimize the amount of bytecode that needs to be shipped with an application.
  • Minimize the number of vkCmdBindPipeline calls; each call has a significant CPU and GPU cost. Consider sorting draw calls and/or using a small number of dynamic states.
  • Start using more general pipelines (with generic shaders that compile quickly) first and generate specializations later. This gets you up and running faster, even if you are not yet running the most optimal pipeline/shader.
  • Group draw calls, taking into account what kinds of shaders they use.
  • Be aware that changing the depth comparison function to its opposite (for example, from less to greater) disables Z-cull.
  • Switching tessellation, geometry, task and mesh shaders on/off is an expensive operation. Avoid frequently switching between pipelines that use different sets of pipeline stages.
  • Use identical, sensible defaults for “don’t care” fields wherever possible. This creates more possibilities for pipeline reuse.
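
A minimal sketch of feeding a value into a shader via a specialization constant, assuming a fragment shader that declares layout(constant_id = 0) const uint LIGHT_COUNT = 4; the shader module and the rest of the pipeline state are assumed to exist elsewhere:

```cpp
#include <vulkan/vulkan.h>

void createSpecializedPipeline(VkDevice device, VkShaderModule fragModule,
                               uint32_t lightCount)
{
    // Map constant_id 0 in the shader to the first 4 bytes of pData.
    VkSpecializationMapEntry entry{};
    entry.constantID = 0;
    entry.offset     = 0;
    entry.size       = sizeof(uint32_t);

    VkSpecializationInfo spec{};
    spec.mapEntryCount = 1;
    spec.pMapEntries   = &entry;
    spec.dataSize      = sizeof(uint32_t);
    spec.pData         = &lightCount; // read at pipeline creation, not at draw time

    VkPipelineShaderStageCreateInfo stage{VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO};
    stage.stage  = VK_SHADER_STAGE_FRAGMENT_BIT;
    stage.module = fragModule;
    stage.pName  = "main";
    stage.pSpecializationInfo = &spec;

    // ... fill the rest of VkGraphicsPipelineCreateInfo with this stage and
    // call vkCreateGraphicsPipelines() while `spec` is still in scope ...
}
```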

Don’t

  • Don’t expect a speedup from pipeline derivatives; rely on a pipeline cache for reuse instead (a sketch of persisting one follows).
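
A minimal sketch of persisting a VkPipelineCache to disk between runs; the file path and I/O style are illustrative, and the driver itself validates and rejects stale or mismatched initial data:

```cpp
#include <vulkan/vulkan.h>
#include <fstream>
#include <vector>

VkPipelineCache loadPipelineCache(VkDevice device, const char* path)
{
    // Missing or unreadable file simply yields an empty cache.
    std::vector<char> blob;
    std::ifstream in(path, std::ios::binary);
    if (in) blob.assign(std::istreambuf_iterator<char>(in), {});

    VkPipelineCacheCreateInfo ci{VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO};
    ci.initialDataSize = blob.size();
    ci.pInitialData    = blob.empty() ? nullptr : blob.data();

    VkPipelineCache cache = VK_NULL_HANDLE;
    vkCreatePipelineCache(device, &ci, nullptr, &cache);
    return cache;
}

void savePipelineCache(VkDevice device, VkPipelineCache cache, const char* path)
{
    // Standard two-call pattern: query size, then fetch the data.
    size_t size = 0;
    vkGetPipelineCacheData(device, cache, &size, nullptr);
    std::vector<char> blob(size);
    vkGetPipelineCacheData(device, cache, &size, blob.data());
    std::ofstream(path, std::ios::binary).write(blob.data(), (std::streamsize)size);
}
```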

Pipeline Layout

Do

  • Use push constants for per-draw-call updates of constants (see the sketch after this list). However, the performance benefit depends on the amount of per-draw-call data as well as the amount of work per draw.
  • Use dynamic uniform/storage buffers for per draw call changes of uniform/storage buffers.
  • Try to keep the number of descriptor sets in pipeline layouts as low as possible.
  • Minimize the number of descriptors in the descriptor sets. Gaps between bindings in a descriptor set layout result in wasted memory in the descriptor set on GPU.
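
A minimal sketch of a pipeline layout with a push-constant range for per-draw data; the PerDrawData contents are an illustrative assumption:

```cpp
#include <vulkan/vulkan.h>

struct PerDrawData { float transform[16]; uint32_t materialIndex; };

VkPipelineLayout makeLayout(VkDevice device, VkDescriptorSetLayout setLayout)
{
    VkPushConstantRange range{};
    range.stageFlags = VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT;
    range.offset     = 0;
    range.size       = sizeof(PerDrawData); // keep small; 128 bytes is the guaranteed minimum

    VkPipelineLayoutCreateInfo ci{VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO};
    ci.setLayoutCount         = 1; // keep the descriptor set count low (see above)
    ci.pSetLayouts            = &setLayout;
    ci.pushConstantRangeCount = 1;
    ci.pPushConstantRanges    = &range;

    VkPipelineLayout layout = VK_NULL_HANDLE;
    vkCreatePipelineLayout(device, &ci, nullptr, &layout);
    return layout;
}

// Per draw call:
//   vkCmdPushConstants(cmd, layout,
//                      VK_SHADER_STAGE_VERTEX_BIT | VK_SHADER_STAGE_FRAGMENT_BIT,
//                      0, sizeof(PerDrawData), &drawData);
```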

Command Pools and Buffers

Do

  • Reuse command pools for similarly sized sequences of draw calls.
  • Command buffer allocations are fast once the pool has been pre-warmed, that is, once similarly sized command buffers have previously been allocated from it.
  • Use L * T + N pools (L = the number of buffered frames, T = the number of threads that record command buffers, N = extra pools for secondary command buffers); see the sketch after this list.
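
A minimal sketch of the L * T + N scheme: one pool per (buffered frame, recording thread) pair, reset wholesale when its frame comes around again. The counts and queueFamilyIndex are illustrative assumptions:

```cpp
#include <vulkan/vulkan.h>
#include <vector>

std::vector<VkCommandPool> createFramePools(VkDevice device, uint32_t queueFamilyIndex,
                                            uint32_t bufferedFrames /* L */,
                                            uint32_t recordThreads  /* T */,
                                            uint32_t extraPools     /* N */)
{
    std::vector<VkCommandPool> pools(bufferedFrames * recordThreads + extraPools);

    VkCommandPoolCreateInfo ci{VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO};
    ci.queueFamilyIndex = queueFamilyIndex;
    ci.flags = 0; // no RESET_COMMAND_BUFFER_BIT: reset whole pools, not single buffers

    for (auto& pool : pools)
        vkCreateCommandPool(device, &ci, nullptr, &pool);
    return pools;
}

// At the start of frame F on thread T, after that frame's fence has signaled:
//   vkResetCommandPool(device, pools[(F % bufferedFrames) * recordThreads + T], 0);
```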

Don’t

  • Don’t create or destroy command pools, reuse them instead. This saves the overhead of allocator creation/destruction and memory allocation/free.
  • Don’t forget that command pools consume memory.

Memory Management

Do

  • Use memory sub-allocation. vkAllocateMemory() is an expensive CPU operation; its cost can be reduced by sub-allocating from a large memory object. Memory is allocated in fixed-size pages, so sub-allocation also helps decrease the memory footprint.
  • Use VK_EXT_memory_budget to query the video memory budget for the process from the OS memory manager (see the sketch after this list). It’s important to keep usage below the budget to avoid stutters caused by demotion of video memory allocations.
  • When memory is over-committed on Windows, the OS memory manager may move allocations from video memory to system memory; it may also temporarily suspend a process from the GPU runlist in order to page out its allocations and make room for another process’ allocations. On Linux, there is no OS memory manager that mitigates over-commitment by automatically performing paging operations on memory objects.
  • Use VK_EXT_pageable_device_local_memory to avoid demotion of critical resources by assigning memory priority. It’s also a good idea to assign low priority to non-critical resources such as vertex and index buffers; the app can verify the performance impact by placing those resources in system memory.
  • Use VK_EXT_pageable_device_local_memory to also disable automatic promotion of allocations from system memory to video memory.
  • Use dedicated memory allocations (VK_KHR_dedicated_allocation, core in VK 1.1) when appropriate.
  • Using dedicated memory may improve performance for color and depth attachments, especially on pre-Turing GPUs.
  • Use VK_KHR_get_memory_requirements2 (core in VK 1.1) to check whether an image/buffer requires dedicated allocation.
  • Use host-visible video memory to write data directly to video memory from the CPU. Such a heap can be detected using DEVICE_LOCAL_BIT | HOST_VISIBLE_BIT. Take into account that CPU writes to such memory may be slower than writes to normal system memory, and CPU reads are significantly slower. Check BAR1 traffic using Nsight Systems for possible issues.
  • Explicitly look for VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT when picking a memory type for resources that should be stored in video memory.
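
A minimal sketch of querying the device-local budget via VK_EXT_memory_budget; the extension is assumed to be enabled on the device:

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>

void printDeviceLocalBudget(VkPhysicalDevice physicalDevice)
{
    // Chain the budget struct into the standard memory-properties query.
    VkPhysicalDeviceMemoryBudgetPropertiesEXT budget{
        VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT};
    VkPhysicalDeviceMemoryProperties2 props{
        VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2};
    props.pNext = &budget;

    vkGetPhysicalDeviceMemoryProperties2(physicalDevice, &props);

    for (uint32_t i = 0; i < props.memoryProperties.memoryHeapCount; ++i) {
        if (props.memoryProperties.memoryHeaps[i].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT) {
            // Keep usage below heapBudget[i] to avoid demotion-induced stutter.
            printf("heap %u: usage %llu / budget %llu bytes\n", i,
                   (unsigned long long)budget.heapUsage[i],
                   (unsigned long long)budget.heapBudget[i]);
        }
    }
}
```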

Don’t

  • Don’t assume a fixed heap configuration; always query the memory properties using vkGetPhysicalDeviceMemoryProperties().
  • Don’t assume memory requirements of an image/buffer, use vkGet*MemoryRequirements(). 
  • Don’t put every resource into a Dedicated Allocation.
  • For memory objects that are intended to be device-local, don’t just pick the first compatible memory type. Pick one that is actually device-local (see the sketch below).
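
A minimal sketch of a memory-type picker that honors both the resource’s requirements and the desired property flags, rather than taking the first compatible type:

```cpp
#include <vulkan/vulkan.h>

uint32_t findMemoryType(VkPhysicalDevice physicalDevice,
                        const VkMemoryRequirements& reqs,
                        VkMemoryPropertyFlags wanted /* e.g. DEVICE_LOCAL_BIT */)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        bool allowed  = (reqs.memoryTypeBits & (1u << i)) != 0;   // type valid for resource
        bool hasFlags = (props.memoryTypes[i].propertyFlags & wanted) == wanted;
        if (allowed && hasFlags)
            return i;
    }
    return UINT32_MAX; // caller falls back to a less restrictive property set
}
```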

Resources

Do

  • Always use VK_IMAGE_TILING_OPTIMAL.
  • Copy both depth and stencil to avoid a slow path for copying.
  • Prefer the D24_UNORM_S8_UINT or D32_SFLOAT depth formats; D32_SFLOAT_S8_UINT is not optimal.
  • VkSharingMode is ignored by the driver, so VK_SHARING_MODE_CONCURRENT incurs no overhead relative to VK_SHARING_MODE_EXCLUSIVE.

Don’t

  • Don’t use VK_IMAGE_TILING_LINEAR; it is not optimal. Use a staging buffer and vkCmdCopyBufferToImage() to update images on the device, as sketched below.
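
A minimal sketch of updating a VK_IMAGE_TILING_OPTIMAL image through a staging buffer; the staging buffer is assumed to be filled and the image already transitioned to TRANSFER_DST_OPTIMAL layout:

```cpp
#include <vulkan/vulkan.h>

void uploadImage(VkCommandBuffer cmd, VkBuffer staging, VkImage image,
                 uint32_t width, uint32_t height)
{
    VkBufferImageCopy region{};
    region.imageSubresource.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    region.imageSubresource.layerCount = 1;
    region.imageExtent = {width, height, 1};
    // bufferRowLength/bufferImageHeight = 0 means tightly packed pixel data.

    vkCmdCopyBufferToImage(cmd, staging, image,
                           VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL, 1, &region);
}
```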

Barriers

Do

  • Minimize the use of barriers. A barrier may cause a GPU pipeline flush. We have seen redundant barriers and associated wait for idle operations as a major performance problem for ports to modern APIs.
  • Prefer a buffer/image barrier rather than a memory barrier to allow the driver to better optimize and schedule the barrier, unless the memory barrier allows merging many buffer/image barriers together.
  • Use VK_KHR_synchronization2, the new functions allow the application to describe barriers more accurately.
  • Group barriers into one call to vkCmdPipelineBarrier2() (see the sketch after this list). This way, the worst case can be picked once instead of sequentially going through all barriers.
  • Use optimal srcStageMask and dstStageMask. The most important cases: if the specified resources are accessed only in compute or fragment shaders, use the compute or fragment stage bits for both masks to make the barrier compute-only or fragment-only.
  • Use VK_IMAGE_LAYOUT_UNDEFINED when the previous content of the image is not needed.
  • Use vkCmdSetEvent2 and vkCmdWaitEvents2 to issue an asynchronous barrier to avoid blocking execution.
  • Make sure to always use the minimum set of resource usage flags. Redundant flags may trigger redundant flushes and stalls in barriers and slow down your app unnecessarily.
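
A minimal sketch using VK_KHR_synchronization2 (core in Vulkan 1.3): an image written as a color attachment and then sampled in a fragment shader, with stage masks narrowed so the barrier does not stall unrelated pipeline stages. Additional barriers for the same point would be appended to the same VkDependencyInfo:

```cpp
#include <vulkan/vulkan.h>

void colorToSampledBarrier(VkCommandBuffer cmd, VkImage image)
{
    VkImageMemoryBarrier2 barrier{VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2};
    barrier.srcStageMask  = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT;
    barrier.srcAccessMask = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstStageMask  = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT; // fragment-only
    barrier.dstAccessMask = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT;
    barrier.oldLayout     = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout     = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image = image;
    barrier.subresourceRange = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

    // Group all barriers for this point into one vkCmdPipelineBarrier2() call.
    VkDependencyInfo dep{VK_STRUCTURE_TYPE_DEPENDENCY_INFO};
    dep.imageMemoryBarrierCount = 1;
    dep.pImageMemoryBarriers    = &barrier;
    vkCmdPipelineBarrier2(cmd, &dep);
}
```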

Don’t

  • Don’t insert redundant barriers; they limit parallelism. In particular, avoid read-to-read barriers.

Debugging

Do

  • Use the validation layer. The validation layer can flag many errors in the command stream, which can help avoid bugs in your application.
  • Use VK_EXT_debug_utils to annotate command buffer regions and assign debug names to resources (see the sketch after this list). Tools like Nsight and RenderDoc use this information.
  • During development, register a debug callback by using VK_EXT_debug_utils. The driver calls this for various non-performance critical validation checks it might perform.
  • NVIDIA profiling tools utilize the debug markers, so it’s recommended to keep region annotations even in release builds. The single per-region check for the extension’s existence is negligible.
  • Lock GPU clocks to make GPU time measurements more stable. Use nvidia-smi or SetStablePowerState() from the D3D12 API; this requires administrator rights, but it works globally.
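
A minimal sketch of VK_EXT_debug_utils annotations, assuming the extension was enabled at instance creation; the object name and label string are illustrative, and the function pointers are loaded because these are extension entry points:

```cpp
#include <vulkan/vulkan.h>

void annotate(VkInstance instance, VkDevice device, VkCommandBuffer cmd, VkImage image)
{
    auto setName = (PFN_vkSetDebugUtilsObjectNameEXT)
        vkGetInstanceProcAddr(instance, "vkSetDebugUtilsObjectNameEXT");
    auto beginLabel = (PFN_vkCmdBeginDebugUtilsLabelEXT)
        vkGetInstanceProcAddr(instance, "vkCmdBeginDebugUtilsLabelEXT");
    auto endLabel = (PFN_vkCmdEndDebugUtilsLabelEXT)
        vkGetInstanceProcAddr(instance, "vkCmdEndDebugUtilsLabelEXT");

    // The name shows up in Nsight, RenderDoc, and validation messages.
    VkDebugUtilsObjectNameInfoEXT name{VK_STRUCTURE_TYPE_DEBUG_UTILS_OBJECT_NAME_INFO_EXT};
    name.objectType   = VK_OBJECT_TYPE_IMAGE;
    name.objectHandle = (uint64_t)image;
    name.pObjectName  = "GBuffer.Albedo"; // illustrative
    setName(device, &name);

    // Bracket a command buffer region so profilers can attribute GPU work to it.
    VkDebugUtilsLabelEXT label{VK_STRUCTURE_TYPE_DEBUG_UTILS_LABEL_EXT};
    label.pLabelName = "GBuffer pass"; // illustrative
    beginLabel(cmd, &label);
    // ... record the pass ...
    endLabel(cmd);
}
```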

Don’t

  • Don’t test performance with validation layers enabled.
  • Don’t test GPU performance when application performance is CPU-limited.

More Information

You can find additional information about using Vulkan with NVIDIA GPUs in Introduction to Real-Time Ray Tracing with Vulkan, Turing Extensions for Vulkan and OpenGL, and Path Tracing for Quake II in Two Months.
