Tuning#

This topic describes how to measure and optimize the performance of Vulkan SC during the Initialization and Normal Operation states of NVIDIA DriveOS. The following sections provide guidance on minimizing GPU execution time, CPU utilization, and memory footprint.

VkDeviceMemory Tuning#

To optimize for low and consistent latency of Initialization, applications should call vkAllocateMemory with VkMemoryAllocateInfo::allocationSize of at least 8 MB. Depending on the automotive use case of each application, initialization latency can be critical.

Applications can then call vkBindImageMemory and vkBindBufferMemory to bind multiple VkImage and VkBuffer objects within each VkDeviceMemory. This applies the concept of sub-allocation. For general guidance on sub-allocation in Vulkan, see the Vulkan Memory Management document [4].
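
The following is a minimal sub-allocation sketch in C, assuming that device, buffers, bufferCount, and a suitable memoryTypeIndex (selected via vkGetPhysicalDeviceMemoryProperties) already exist; error handling is omitted.

   VkMemoryAllocateInfo allocInfo = {
       .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
       .allocationSize  = 8u * 1024u * 1024u,  /* at least 8 MB per the guidance */
       .memoryTypeIndex = memoryTypeIndex,
   };
   VkDeviceMemory blockMemory;
   vkAllocateMemory(device, &allocInfo, NULL, &blockMemory);

   /* Sub-allocation: bind several buffers at aligned offsets within the one
      block, instead of calling vkAllocateMemory once per buffer. */
   VkDeviceSize offset = 0;
   for (uint32_t i = 0; i < bufferCount; ++i) {
       VkMemoryRequirements reqs;
       vkGetBufferMemoryRequirements(device, buffers[i], &reqs);
       offset = (offset + reqs.alignment - 1) & ~(reqs.alignment - 1);
       vkBindBufferMemory(device, buffers[i], blockMemory, offset);
       offset += reqs.size;
   }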

This guidance models the latency of each call to vkAllocateMemory and vkBindImageMemory as a linear equation: a constant overhead plus a term proportional to the memory size. These linear relationships are heuristics for the timing on NVIDIA DRIVE AGX Orin at one specific CCPLEX clock frequency. This guide encourages application developers to measure and profile initialization, and to tune for their use cases.

Developers can estimate latency_us_vkAllocateMemory, the average latency in microseconds for each call of vkAllocateMemory, depending on the requested size in bytes.

latency_us_vkAllocateMemory = 240 + (0.0001 * VkMemoryAllocateInfo::allocationSize)

Developers can estimate latency_us_vkBindImageMemory, the average latency in microseconds for each call of vkBindImageMemory, depending on the required size in bytes for the input VkImage.

latency_us_vkBindImageMemory = 105 + (0.00002 * VkMemoryRequirements::size)
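
As a worked example of these heuristics, the hypothetical helper below compares the estimated vkAllocateMemory cost of sixty-four 1 MB allocations against eight 8 MB allocations for the same 64 MB total; the estimates are only as accurate as the heuristics above.

   /* Heuristic from above: estimated per-call latency in microseconds. */
   static double est_alloc_us(double bytes) { return 240.0 + 0.0001 * bytes; }

   static void compare_strategies(void) {
       double many_small = 64.0 * est_alloc_us(1024.0 * 1024.0);        /* ~22,000 us */
       double few_large  =  8.0 * est_alloc_us(8.0 * 1024.0 * 1024.0);  /* ~8,600 us  */
       /* Fewer, larger allocations pay the 240 us fixed overhead 8 times
          instead of 64, cutting the estimated total by more than half. */
       (void)many_small; (void)few_large;
   }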

To optimize performance on the future DRIVE Thor SoC, for both optimal GPU read and write throughput and efficient utilization of memory space, the preliminary guidance is to allocate VkDeviceMemory in sizes that are multiples of 2 MB. This guidance is consistent with the current Orin guidance to allocate VkDeviceMemory in sizes of at least 8 MB.

Characterize Worst-Case Execution Time#

The following guidance applies when application software uses Vulkan SC and its queue submissions are interleaved, by time slicing, on the same iGPU that processes CUDA compute with safety requirements on its timing.

You should budget the GPU time taken to execute Vulkan SC commands so that they complete within a worst-case execution time (WCET), and you should characterize that WCET during development of your application.

The GPU worst-case workload for your application is the sequence of Vulkan SC commands, with their associated parameters and data, that your application can record in VkCommandBuffer objects and that results in the longest potential execution duration (the WCET) on the GPU when your application submits those command buffers via vkQueueSubmit, measured from the start of the submissions’ execution to their completion. You can initially measure and verify the WCET as the 99.99th-percentile longest duration on an otherwise idle system. You should set the application’s GPU timeslice budget to the WCET plus an additional margin of GPU time. When modeling during development, measure test cases of worst-case workloads whose GPU execution time comes closest to your scheduled, margined WCET.

You should consider whether the GPU time depends on the “virtual camera perspective” in a 3D scene of your application. For example, the values in a transformation matrix within a VkBuffer created with VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT can cause the iGPU to invoke an expensive VkGraphicsPipeline for more fragments in a render pass. You can limit the range of “virtual camera perspective” that your application permits.

When deployed to production, you should limit any dynamic workload in your application that might exceed this timeslice. For example, if the parameters to vkCmdDrawIndexed, vkCmdDraw, and so on depend on input data from sources external to your application, your application should clamp (apply a ceiling to) the maximum numbers of instances, vertices, and indices, and should limit the count of draw commands in each queue submission to a maximum value.
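
A sketch of such clamping follows; MAX_INDICES, MAX_INSTANCES, external_index_count, and external_instance_count are hypothetical names, with the ceilings chosen during WCET characterization.

   #define MAX_INDICES   65536u   /* hypothetical application ceiling */
   #define MAX_INSTANCES 256u     /* hypothetical application ceiling */

   uint32_t indexCount    = external_index_count;    /* from external input */
   uint32_t instanceCount = external_instance_count; /* from external input */
   if (indexCount    > MAX_INDICES)   indexCount    = MAX_INDICES;
   if (instanceCount > MAX_INSTANCES) instanceCount = MAX_INSTANCES;

   vkCmdDrawIndexed(commandBuffer, indexCount, instanceCount,
                    0 /*firstIndex*/, 0 /*vertexOffset*/, 0 /*firstInstance*/);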

You can use vkCmdWriteTimestamp in your applications to query the timing of GPU execution asynchronously, at command granularity and with modest throughput overhead. Also, for a VkSemaphore that imports an NvSciSyncObj, you can read timestamps of the semaphore signal operations at batch granularity via the NvSciSync API, when enabled via NvSciSyncAttrKey_WaiterRequireTimestamps.
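
A minimal timestamp-query sketch follows, assuming device and commandBuffer exist. Per the polling guidance later in this topic, the results are read back without VK_QUERY_RESULT_WAIT_BIT, after the submission’s VkFence has signaled.

   VkQueryPoolCreateInfo poolInfo = {
       .sType      = VK_STRUCTURE_TYPE_QUERY_POOL_CREATE_INFO,
       .queryType  = VK_QUERY_TYPE_TIMESTAMP,
       .queryCount = 2,
   };
   VkQueryPool queryPool;
   vkCreateQueryPool(device, &poolInfo, NULL, &queryPool);

   vkCmdResetQueryPool(commandBuffer, queryPool, 0, 2);
   vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                       queryPool, 0);   /* before the workload */
   /* ... record the commands to be timed ... */
   vkCmdWriteTimestamp(commandBuffer, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
                       queryPool, 1);   /* after the workload */

   /* After the submission's fence signals: */
   uint64_t ticks[2];
   vkGetQueryPoolResults(device, queryPool, 0, 2, sizeof(ticks), ticks,
                         sizeof(uint64_t), VK_QUERY_RESULT_64_BIT);
   /* Elapsed ns = (ticks[1] - ticks[0]) * VkPhysicalDeviceLimits::timestampPeriod */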

The rationale for characterizing and budgeting the WCET is that if Vulkan SC submissions on the iGPU complete later than scheduled, on Orin and QNX Safety they can delay CUDA compute tasks that have safety requirements on their timing. See also the “9.2 Application-Level Scheduling” section in the NVIDIA DriveOS 6.0 Safety Manual. Vulkan SC executes on the iGPU at QM level. You can expect multiple logical devices, which CUDA and Vulkan SC instantiate, to share an iGPU via time-slicing and context-switching. GPU preemption cannot pause graphics processing within hard-real-time latency in the worst case. If your application’s submissions to its VkQueue overrun its allotted GPU time budget too frequently or by too long a delay, your application’s VkDevice can receive VK_ERROR_DEVICE_LOST.

Polling Operations#

To make the most of CPU and GPU time in Vulkan SC, as well as for reasons of CPU power efficiency, users of Vulkan SC should minimize the amount of time the Vulkan SC driver spends polling the status of the GPU.

It is not always obvious which Vulkan SC functions will poll, so this section lists all functions where the driver blocks the CPU thread to wait on the GPU and identifies under which circumstances the driver implements that block by polling in a loop.


  • vkQueueWaitIdle

  • vkDeviceWaitIdle

  • vkCreateDevice

  • vkDestroyDevice

These functions all wait for the Device’s Queue to drain and become idle. For vkCreateDevice, this wait occurs after initialization; for vkDestroyDevice, this occurs before deinitialization. These functions do not use a polling CPU loop in the driver.

  • vkGetQueryPoolResults

Only waits when flags includes VK_QUERY_RESULT_WAIT_BIT. This wait uses a polling CPU loop in the driver. Where possible, it is recommended that users structure synchronization such that the application waits on a VkFence instead of directly waiting on the VkQueryPool.

  • vkWaitForFences

Waits until all or any of the given VkFences have been signaled.

Specifying a wait for any VkFence (waitAll set to VK_FALSE) necessitates a polling CPU loop in the driver over the VkFences. Where possible, it is recommended to structure waits as a wait for all (waitAll set to VK_TRUE), which allows the CPU to idle.

Note

If the user cannot avoid using an operation that will result in a polling CPU loop, the design recommendation is to avoid using a timeout parameter larger than 100 (one hundred) microseconds and avoid calling one of these functions more frequently than once every 100 (one hundred) milliseconds.

  • vkWaitSemaphores

Waits until all or any of the given VkSemaphores have reached their specified values.

This function always necessitates a polling CPU loop in the driver over the given VkSemaphores. Where possible, it is recommended to structure synchronization such that the CPU waits on VkFences instead of VkSemaphores.

Note

If the user cannot avoid using an operation that will result in a polling CPU loop, the design recommendation is to avoid using a timeout parameter larger than 100 (one hundred) microseconds and avoid calling one of these functions more frequently than once every 100 (one hundred) milliseconds.

  • vkQueueSubmit

This function does not wait unless the submission queue is full.

If the submission queue is full, the driver will wait for the GPU to process enough previous submissions so that it has enough room to fit the new request in the submission queue. This wait does not necessitate a polling CPU loop in the driver.

It is nonetheless recommended that users avoid filling the submission queue completely. Any combination of the following strategies may be used to avoid this condition:

  • Reduce the number of command buffers submitted to the VkQueue in close temporal proximity, by recording the same commands into a smaller number of larger command buffers.

  • Delay submission of additional work to the queue until the GPU has a chance to complete work that was submitted earlier. Non-blocking use of synchronization primitives can be used to monitor the queue’s progress; a sketch follows this list. (This is recommended if the application has other work it can perform on the CPU instead of blocking while waiting for space in the submission queue.)

  • Increase the size of the submission queue by using a Calibration Data Set approved by NVIDIA’s Vulkan SC team with a larger submission queue value.
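
The following sketch illustrates the second strategy, assuming submitFence is a VkFence signaled by an earlier vkQueueSubmit and timeoutNs is an application-chosen bound: check progress without blocking, and if a wait is unavoidable, wait for all fences so that the driver can idle the CPU rather than poll.

   if (vkGetFenceStatus(device, submitFence) == VK_NOT_READY) {
       /* Earlier work is still in flight: do other CPU work here, or block
          with waitAll = VK_TRUE, which does not use a polling CPU loop. */
       vkWaitForFences(device, 1, &submitFence, VK_TRUE, timeoutNs);
   }
   vkResetFences(device, 1, &submitFence);
   vkQueueSubmit(queue, 1, &submitInfo, submitFence);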

In addition to the timeout value that some of these functions take directly, Vulkan SC has a global timeout value, after which the function returns an error. This global timeout value is part of the Vulkan SC Calibration Data Set and can be calibrated via VK_NV_private_vendor_info. (See the key value VK_APPLICATION_PARAMETER_KEY_INTERVAL_UNTIL_DEVICE_LOST_DURING_WAITS_NV.)

Profiling and General References#

The graphics performance tuning advice that this guide provides above is not exhaustive for Vulkan SC on Orin. The tools and references below give you further control over, and visibility into, the graphics performance of your application that uses Vulkan SC.

You can use the Nsight Graphics and Nsight Systems tools to profile and visualize your application and its Vulkan SC workload on DriveOS QNX NSR and DriveOS Linux NSR platforms. With the Vulkan SC loader and layers on NSR platforms, you can annotate your application’s workload for these and other tools with the VK_EXT_debug_utils extension and its vkCmdBeginDebugUtilsLabelEXT and vkCmdEndDebugUtilsLabelEXT functions.
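
For example, a labeled region might look like the following sketch, where the pfn* function pointers are assumed to have been loaded with vkGetDeviceProcAddr, because VK_EXT_debug_utils functions are not part of the core API.

   VkDebugUtilsLabelEXT label = {
       .sType      = VK_STRUCTURE_TYPE_DEBUG_UTILS_LABEL_EXT,
       .pLabelName = "shadow-map pass",           /* shown in Nsight tools */
       .color      = { 0.2f, 0.4f, 0.8f, 1.0f },  /* optional display color */
   };
   pfnCmdBeginDebugUtilsLabelEXT(commandBuffer, &label);
   /* ... record the commands belonging to this labeled region ... */
   pfnCmdEndDebugUtilsLabelEXT(commandBuffer);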

Below are additional references. The performance guidance found in these documents is applicable to Vulkan SC, to the extent that it identifies best practices for organizing and optimizing graphics work on the GPU.

You can consider the performance recommendations provided for OpenGL ES in Programming Efficiently, particularly within the Shader Programs, Textures, and Geometry topics, while bearing in mind the API differences between OpenGL ES and Vulkan SC.

You can consider NVIDIA Best Practices in Vulkan Validation. You can validate your application against these checks if you can run your application on Vulkan 1.2 with VK_DEBUG_UTILS_MESSAGE_TYPE_PERFORMANCE_BIT_EXT enabled. These best practices address the performance of Vulkan 1.2 and Vulkan 1.3 on discrete NVIDIA GPU devices and general-purpose operating systems. Vulkan SC on Orin has different characteristics due to the integrated NVIDIA GPU device and the lesser degree to which this Vulkan SC implementation dynamically optimizes your application workload. The following subset of best practices should apply to Vulkan SC on Orin; a brief sketch follows the list.


  • kVUID_BestPractices_CreateImage_TilingLinear

Use VK_IMAGE_TILING_OPTIMAL instead of VK_IMAGE_TILING_LINEAR.

  • kVUID_BestPractices_Submission_ReduceNumberOfSubmissions

Submitting command buffers has CPU and GPU overhead. Submit fewer times to incur less overhead.

  • kVUID_BestPractices_Pipeline_SortAndBind

Keep pipeline state changes to a minimum, for example, by sorting draw calls by pipeline.

  • kVUID_BestPractices_ClearColor_NotCompressed

The clear color is not compressed. This can be fixed by using a clear color of VkClearColorValue{0.0f, 0.0f, 0.0f, 0.0f} or VkClearColorValue{1.0f, 1.0f, 1.0f, 1.0f}.

Note

Contact NVIDIA for details of additional optimal VkClearColorValue configured on QNX Safety Platform.

  • kVUID_BestPractices_CreateImage_Depth32Format

Use VK_FORMAT_D24_UNORM_S8_UINT or VK_FORMAT_D16_UNORM unless the extra precision of a 32-bit depth format is needed.

  • kVUID_BestPractices_CreatePipelineLayout_LargePipelineLayout

Pipeline layout size is too large; prefer using pipeline-specific descriptor set layouts.

  • kVUID_BestPractices_CreatePipelineLayout_SeparateSampler

Consider using combined image samplers instead of separate samplers for marginally better performance.
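
As a brief sketch of the tiling and clear-color practices above, with hypothetical format, extent, and usage values:

   VkImageCreateInfo imageInfo = {
       .sType         = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
       .imageType     = VK_IMAGE_TYPE_2D,
       .format        = VK_FORMAT_R8G8B8A8_UNORM,
       .extent        = { 1920, 1080, 1 },
       .mipLevels     = 1,
       .arrayLayers   = 1,
       .samples       = VK_SAMPLE_COUNT_1_BIT,
       .tiling        = VK_IMAGE_TILING_OPTIMAL,  /* not VK_IMAGE_TILING_LINEAR */
       .usage         = VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT,
       .sharingMode   = VK_SHARING_MODE_EXCLUSIVE,
       .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
   };
   VkImage image;
   vkCreateImage(device, &imageInfo, NULL, &image);

   /* One of the compressible clear colors recommended above: */
   VkClearColorValue clearColor = { .float32 = { 0.0f, 0.0f, 0.0f, 0.0f } };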

Consistent GPU Scheduling#

GPU scheduling latency is the time from when your application calls vkQueueSubmit until that batch of VkCommandBuffer objects begins executing on the iGPU. The GPU scheduling latency precedes the execution duration of the submission.

To optimize for consistent GPU scheduling latency with high GPU utilization, your application should synchronize all vkQueueSubmit calls to different VkQueue objects during the runtime state, using the signal and wait-for synchronization provided by VkSemaphore, to totally order all command buffer submissions. Additionally, because CUDA and Vulkan SC context-switch to share the iGPU, synchronize each submission to each VkQueue in a total order with each CUDA stream.
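
A minimal sketch of this ordering follows, assuming cmdBufA, cmdBufB, queueA, queueB, and a binary VkSemaphore orderSem already exist; the batch submitted to queueB waits for the batch submitted to queueA.

   VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_ALL_COMMANDS_BIT;

   VkSubmitInfo submitA = {
       .sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO,
       .commandBufferCount   = 1,
       .pCommandBuffers      = &cmdBufA,
       .signalSemaphoreCount = 1,
       .pSignalSemaphores    = &orderSem,  /* signaled when batch A completes */
   };
   VkSubmitInfo submitB = {
       .sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO,
       .waitSemaphoreCount = 1,
       .pWaitSemaphores    = &orderSem,    /* batch B starts after batch A */
       .pWaitDstStageMask  = &waitStage,
       .commandBufferCount = 1,
       .pCommandBuffers    = &cmdBufB,
   };
   vkQueueSubmit(queueA, 1, &submitA, VK_NULL_HANDLE);
   vkQueueSubmit(queueB, 1, &submitB, VK_NULL_HANDLE);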

If, instead, you call vkQueueSubmit concurrently and without synchronization on VkQueue objects, potentially from multiple VkDevice objects, then the latency until each work submission completes, and the duration of that GPU work, become non-deterministic. The completion order of the pushbuffer submissions to concurrent VkQueue objects also becomes non-deterministic, because the submits are not synchronized. If multiple GPU channels run concurrently, the round-robin hardware scheduler can activate and can preempt and context-switch contexts without software intervention. In that event, the latency and duration perceived for an individual VkQueue can become longer. Still, for the entire system, the efficiency of GPU throughput remains high even if timeslicing occurs.

For additional information, refer to “Application-Level Scheduling” in the NVIDIA DriveOS 7.0 Safety Manual.