Fault Handling and Logging

This topic explains how NVIDIA’s Vulkan SC fault handling and logging work in production environments and what application developers should expect.

Preventing Faults

Important: Applications must not cause Vulkan SC to report faults at VK_FAULT_LEVEL_CRITICAL during normal operation. The absence of faults indicates proper API usage and system operation.

The Vulkan SC implementation is designed to report zero faults through the fault handling interface when all of the following conditions are met:

- Applications correctly use the Vulkan SC API.
- NVIDIA DriveOS SEooC operates normally.
- The NVIDIA DRIVE AGX Orin™ and DRIVE AGX Thor™ SoCs function as expected.

This zero-fault state increases confidence that the application is correctly using the Vulkan SC API.

During development and testing, occasional faults may occur as part of the normal debugging process. These should be addressed before deployment to production environments.

Why Critical Faults Are Problematic

Critical faults indicate potential uncorrectable errors in the iGPU. When a VK_FAULT_LEVEL_CRITICAL fault occurs:

- The affected VkDevice becomes lost (as described in Vulkan SC Specification section 5.2.3).
- API functions that submit to or wait on a VkQueue return VK_ERROR_DEVICE_LOST.
- On Orin and Thor, the system’s error response affects:
  - Other VkDevice instances, even in separate processes.
  - CUDA devices, which will also return errors.
- System availability is degraded.
- For the Safety DriveOS platform, the NvGPU’s ASIL software resource manager reports an asynchronous error to Safety Services.
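
Because every later queue operation on a lost device also fails, an application may want to latch the first VK_ERROR_DEVICE_LOST it sees and stop submitting work. The fragment below is a minimal sketch of that idea, not a prescribed recovery procedure. The VkResult enum is a stand-in so the fragment compiles without the Vulkan SC SDK, and AppDeviceState and app_note_queue_result are hypothetical application-side names.

```c
#include <stdbool.h>

/* Stand-in for the two result codes used below, so that this sketch compiles
   without the Vulkan SC SDK. In a real build, remove this enum and include
   <vulkan/vulkan_sc.h>, which is authoritative. */
typedef enum VkResult {
    VK_SUCCESS = 0,
    VK_ERROR_DEVICE_LOST = -4
} VkResult;

/* Hypothetical per-device bookkeeping kept by the application. */
typedef struct AppDeviceState {
    bool device_lost;   /* latched once VK_ERROR_DEVICE_LOST is seen */
} AppDeviceState;

/* Pass every VkResult from a queue submit or wait call through this helper.
   Returns true while the device is still usable for further submissions. */
bool app_note_queue_result(AppDeviceState *state, VkResult result)
{
    if (result == VK_ERROR_DEVICE_LOST) {
        state->device_lost = true;
    }
    return !state->device_lost;
}
```

A caller would wrap each queue submit or wait call with this helper and, once it returns false, stop issuing work to that VkDevice and hand control to whatever system-level error response the deployment defines.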

During development, if faults occur, developers should investigate and correct their Vulkan SC API usage. The VK_NV_private_vendor_info extension provides a VkFaultDataDescriptionNV structure with description strings to assist debugging.

Required Fault Monitoring

Applications must implement fault monitoring in automotive deployments through these three steps:

1. Register a callback when creating each VkDevice:
   - Provide a PFN_vkFaultCallbackFunction in VkFaultCallbackInfo::pfnFaultCallback.
   - Include the VkFaultCallbackInfo structure in the pNext chain of VkDeviceCreateInfo.
2. Call vkGetFaultData for each VkDevice after completing all initialization API calls.
3. Check regularly by calling vkGetFaultData within one second after each vkQueueSubmit.
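
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not NVIDIA reference code. The stand-in type definitions exist only so that the fragment compiles without the Vulkan SC SDK, and AppFaultTally, g_faults, app_record_faults, app_fault_callback, and app_poll_is_clean are hypothetical application-side names. The callback path and the polling path feed the same tally, so the application has a single place to check for the zero-fault state.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so that this sketch compiles without the Vulkan SC SDK. In a
   real build, delete this section and include <vulkan/vulkan_sc.h>; the
   header is authoritative for enum values and for the VkFaultData members
   omitted here (sType, pNext, faultType). */
typedef uint32_t VkBool32;
typedef enum VkFaultLevel {
    VK_FAULT_LEVEL_UNASSIGNED,
    VK_FAULT_LEVEL_CRITICAL,
    VK_FAULT_LEVEL_RECOVERABLE,
    VK_FAULT_LEVEL_WARNING
} VkFaultLevel;
typedef struct VkFaultData {
    VkFaultLevel faultLevel;
} VkFaultData;

/* Hypothetical application-side tally. The callback receives no user-data
   pointer, so the sketch keeps file-scope state; a real application must
   make this safe for whichever thread invokes the callback. */
typedef struct AppFaultTally {
    uint32_t critical;
    uint32_t recoverable;
    uint32_t other;        /* WARNING and UNASSIGNED */
    VkBool32 unrecorded;   /* nonzero if some faults were not recorded */
} AppFaultTally;

static AppFaultTally g_faults;

/* Shared by the callback path (step 1) and the polling path (steps 2 and 3). */
static void app_record_faults(VkBool32 unrecordedFaults, uint32_t faultCount,
                              const VkFaultData *pFaults)
{
    if (unrecordedFaults) {
        g_faults.unrecorded = 1;
    }
    for (uint32_t i = 0; i < faultCount; ++i) {
        switch (pFaults[i].faultLevel) {
        case VK_FAULT_LEVEL_CRITICAL:    g_faults.critical++;    break;
        case VK_FAULT_LEVEL_RECOVERABLE: g_faults.recoverable++; break;
        default:                         g_faults.other++;       break;
        }
    }
}

/* Step 1: same parameter list as PFN_vkFaultCallbackFunction. A real build
   would also apply the VKAPI_PTR calling-convention macro. */
void app_fault_callback(VkBool32 unrecordedFaults, uint32_t faultCount,
                        const VkFaultData *pFaults)
{
    app_record_faults(unrecordedFaults, faultCount, pFaults);
}

/* Steps 2 and 3: call with the values written by vkGetFaultData.
   Returns 1 only if no fault of any level has been seen so far. */
int app_poll_is_clean(VkBool32 unrecordedFaults, uint32_t faultCount,
                      const VkFaultData *pFaults)
{
    app_record_faults(unrecordedFaults, faultCount, pFaults);
    return g_faults.critical == 0 && g_faults.recoverable == 0 &&
           g_faults.other == 0 && g_faults.unrecorded == 0;
}
```

In a real build, app_fault_callback would be stored in VkFaultCallbackInfo::pfnFaultCallback before vkCreateDevice, and app_poll_is_clean would receive the outputs of vkGetFaultData, once after initialization and again within one second after each vkQueueSubmit.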

This monitoring is necessary because many Vulkan SC API functions return void rather than a VkResult. The Khronos Group chose this design for CPU efficiency, because it removes branches from performance-critical code paths. The fault handling interface is how errors in these functions are reported.

Handling Recoverable Faults

Applications should not generate faults even at lower criticality levels (VK_FAULT_LEVEL_RECOVERABLE, VK_FAULT_LEVEL_WARNING, or VK_FAULT_LEVEL_UNASSIGNED) in production. This prevents error accumulation and multi-point failures.

Recommended handling for VK_FAULT_LEVEL_RECOVERABLE faults:

- Command buffer recording faults (VK_FAULT_TYPE_COMMAND_BUFFER_FULL or VK_FAULT_TYPE_INVALID_API_USAGE):
  - Expect vkEndCommandBuffer to return an error.
  - Clear the error state with vkResetCommandPool.
  - Note that subsequent vkCmd functions are silently ignored until the reset.
- Non-command-recording API faults (VK_FAULT_TYPE_INVALID_API_USAGE):
  - The function skips its normal behavior.
  - Retrying with the same parameters repeats the fault.
  - Applications should fall back to alternate behavior instead of retrying.
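
The recommendations above amount to a small decision table, sketched below. This is an illustration under stated assumptions rather than a definitive policy. The stand-in enums and struct exist only so that the fragment compiles without the Vulkan SC SDK, and AppFaultAction and app_choose_action are hypothetical names. Because VK_FAULT_TYPE_INVALID_API_USAGE appears in both cases, the sketch assumes the application knows from its own context whether it was recording a command buffer when the fault was raised.

```c
#include <stdbool.h>

/* Stand-ins so that this sketch compiles without the Vulkan SC SDK. In a
   real build, delete this section and include <vulkan/vulkan_sc.h>; only
   the fault types used below are listed, and the numeric values here are
   illustrative. */
typedef enum VkFaultLevel {
    VK_FAULT_LEVEL_UNASSIGNED,
    VK_FAULT_LEVEL_CRITICAL,
    VK_FAULT_LEVEL_RECOVERABLE,
    VK_FAULT_LEVEL_WARNING
} VkFaultLevel;
typedef enum VkFaultType {
    VK_FAULT_TYPE_COMMAND_BUFFER_FULL,
    VK_FAULT_TYPE_INVALID_API_USAGE
} VkFaultType;
typedef struct VkFaultData {
    VkFaultLevel faultLevel;
    VkFaultType  faultType;
} VkFaultData;

/* Hypothetical application-side actions. */
typedef enum AppFaultAction {
    APP_ACTION_INVESTIGATE,         /* log the fault and fix the root cause */
    APP_ACTION_RESET_COMMAND_POOL,  /* call vkResetCommandPool, then re-record */
    APP_ACTION_USE_ALTERNATE_PATH,  /* do not retry with the same parameters */
    APP_ACTION_DEVICE_LOST          /* critical fault: the VkDevice is lost */
} AppFaultAction;

/* during_recording: true if the fault was raised while the application was
   recording a command buffer (known from application context, not from
   VkFaultData). */
AppFaultAction app_choose_action(const VkFaultData *fault, bool during_recording)
{
    if (fault->faultLevel == VK_FAULT_LEVEL_CRITICAL) {
        return APP_ACTION_DEVICE_LOST;
    }
    if (fault->faultLevel != VK_FAULT_LEVEL_RECOVERABLE) {
        return APP_ACTION_INVESTIGATE;
    }
    switch (fault->faultType) {
    case VK_FAULT_TYPE_COMMAND_BUFFER_FULL:
        return APP_ACTION_RESET_COMMAND_POOL;
    case VK_FAULT_TYPE_INVALID_API_USAGE:
        return during_recording ? APP_ACTION_RESET_COMMAND_POOL
                                : APP_ACTION_USE_ALTERNATE_PATH;
    default:
        return APP_ACTION_INVESTIGATE;
    }
}
```

Every action still implies investigation before production deployment, since faults at any level are not expected during normal operation.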

Fault Handling Limitations

The fault handling API is tied to a valid VkDevice, which means faults cannot be reported in the following cases:

- The function does not involve a VkDevice (for example, VkInstance operations).
- The parameter containing the VkDevice or related object is invalid.
- The fault occurs during vkDestroyDevice.

Functions that return a VkResult can communicate failures through result codes. However, many functions return void and thus cannot report failures directly through the API. These failures can be detected using:

- The Validation Layer during development.
- A review of the DriveOS system log (all profiles).

System Logging

All detected faults and errors are logged to the system log file. Since the system log is a limited shared resource, applications must avoid causing Vulkan SC to generate logging messages during normal operation.

The system log captures failures that aren’t reported through the fault handling interface. Application developers must inspect the log file to ensure no failures will occur during normal operation, rather than relying solely on fault handling and error result values.