Safety Framework and Error Reporting

Faults detected by hardware safety mechanisms in Orin are reported to the HSM module in the FSI.

Some software-detected errors are also reported to the FSI through software-based error reporting mechanisms.

Each reported error raises an interrupt to the Cortex-R52 core in the FSI, which then receives the error. The interrupt mechanism differs between hardware-reported errors and software-reported hardware errors.

Software running on the FSI is responsible for communicating the failure to the MCU via SPI and the SOC ERROR PIN.

The Hardware Software Interface (HSI) for Orin defines the actions that software must perform to initialize and enable the mechanisms that detect hardware errors, either directly in hardware or with software support.

More details on error reporting can be found in the Error Reporting section under Safety Extension.

The list of errors supported in DriveOS Linux is available in the Error ID Reference document.

The following sections detail specific interfaces.

PCIe

Error Reporting: Uncorrected Errors

For PCIe, uncorrected errors are reported only through software-based reporting (EPD). Once an uncorrected error occurs on a controller, that controller instance is deemed unreliable. The uncorrected error is therefore reported only once for a given PCIe instance.
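The report-once behavior described above can be pictured with a small conceptual model. This is only a sketch of the semantics, not the driver's implementation; the class name `UncorrectedErrorLatch`, its methods, and the second instance name are hypothetical and chosen for illustration.

```python
class UncorrectedErrorLatch:
    """Conceptual model: forward an uncorrected error at most once per PCIe instance."""

    def __init__(self):
        # Instances that have already reported an uncorrected error.
        self._reported = set()

    def report(self, instance: str) -> bool:
        """Return True if the error is forwarded, False if it is suppressed."""
        if instance in self._reported:
            # The instance is already deemed unreliable, so nothing new is reported.
            return False
        self._reported.add(instance)
        return True

    def is_unreliable(self, instance: str) -> bool:
        """Return True once the instance has reported an uncorrected error."""
        return instance in self._reported


latch = UncorrectedErrorLatch()
first = latch.report("pcie@14100000")            # first error on this instance
second = latch.report("pcie@14100000")           # repeat on the same instance
other = latch.report("hypothetical-instance-2")  # a different instance
```

In this model the first call for an instance is forwarded, a repeat on the same instance is suppressed, and other instances are tracked independently.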

Initialization and Enabling

To enable PCIe uncorrected error reporting:

  • Add the following device tree properties under each PCIe controller node. This applies to both root port mode and endpoint mode.

    • snps,enable-cdm-check and nvidia,enable-safety

  • Add the following argument to the kernel boot arguments:

    • pci=ecrc=on

An example device tree node:

pcie@14100000 {
        status = "okay";
        snps,enable-cdm-check;
        nvidia,enable-safety;
        phys = <&p2u_hsio_3>;
        phy-names = "p2u-0";
};
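After boot, the configuration above could be sanity-checked from user space. The sketch below is a minimal example and rests on two assumptions. First, the kernel command line is readable from /proc/cmdline. Second, the PCIe node is exposed as a directory under /proc/device-tree, where boolean properties appear as empty files; the exact node path depends on the platform device tree and is not given here. The helper names `has_boot_arg` and `missing_properties` are hypothetical.

```python
import os

# Boolean properties that the PCIe controller node is expected to carry.
REQUIRED_PROPERTIES = ("snps,enable-cdm-check", "nvidia,enable-safety")


def has_boot_arg(cmdline: str, arg: str) -> bool:
    """Return True if `arg` appears as a whole token in a kernel command line."""
    return arg in cmdline.split()


def missing_properties(node_dir: str, required=REQUIRED_PROPERTIES) -> list:
    """Return the required properties that are absent from a device tree node directory.

    Each property of a node is exposed as a file named after the property,
    so a simple existence check is enough for boolean properties.
    """
    return [p for p in required if not os.path.exists(os.path.join(node_dir, p))]


def check_system(node_dir: str) -> bool:
    """Check the running system; `node_dir` must be supplied by the caller."""
    with open("/proc/cmdline") as f:
        ecrc_on = has_boot_arg(f.read(), "pci=ecrc=on")
    return ecrc_on and not missing_properties(node_dir)
```

Calling `check_system` with the directory of a PCIe controller node returns True only when `pci=ecrc=on` is on the command line and both properties are present in that node.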

MGBE/EQoS (Ethernet)

Error Reporting

For MGBE and EQoS, corrected and uncorrected errors are reported directly to the HSM. In addition, software-based errors are reported, per controller, for any functional uncorrected error detected. Once an uncorrected error occurs on a controller, that controller instance is deemed unreliable and a controller reset is required.

Initialization and Enabling

HSI error reporting from the EQoS/MGBE software is enabled by default; no extra configuration is required from user space.

Known Limitations

The error reporting frameworks used for eMMC and QSPI support a queue size of only one (1), so when multiple errors are detected at the same time, some of them may be dropped. In particular, a critical error that occurs while a non-critical error is being reported may be dropped. The application must take this into account in its eMMC/QSPI error handling: one option is to treat every error from eMMC/QSPI the same as the most critical error that eMMC/QSPI can report.
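The mitigation suggested above could look like the following application-side policy. This is a minimal sketch and not part of any DriveOS API. The `Severity` levels, the source names `"emmc"` and `"qspi"`, and the function `effective_severity` are all hypothetical and exist only to illustrate the idea.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity levels; real error classes come from the platform.
    NON_CRITICAL = 1
    CRITICAL = 2


# Hypothetical names for sources whose reporting queue holds a single entry.
SINGLE_SLOT_SOURCES = {"emmc", "qspi"}


def effective_severity(source: str, reported: Severity) -> Severity:
    """Return the severity the application should act on.

    Any error from a single-slot source is escalated to the highest level,
    because a more critical error may have been dropped while this one was
    being reported.
    """
    if source in SINGLE_SLOT_SOURCES:
        return max(Severity)
    return reported
```

With this policy, a non-critical eMMC or QSPI report is handled as critical, while reports from other sources keep their reported severity. The trade-off is a more conservative reaction to errors that may in fact be benign.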