Error Reporting#

The following is a list of acronyms and abbreviations used in this section:

| Acronym / Abbreviation | Definition |
|---|---|
| CCPLEX | Carmel CPU Complex. Set of Arm cores in NVIDIA DRIVE AGX Thor for user applications. |
| FSI | Functional Safety Island, running the AUTOSAR Safety Application. |
| MCU | Microcontroller unit. Infineon AURIX TC397 on the NVIDIA DRIVE AGX Thor platform. |
| HSP | Hardware synchronization primitives. Hardware block providing mailbox functionality. |
| EPD | Error Propagation Daemon. |
| EPL | Error Propagation Library. |
| EPS | Error Propagation Server. |
| SEH | System Error Handler. |
| MCUFOH | Failover Handler on Microcontroller Unit. |
| FSIFOH | Failover Handler on Functional Safety Island. |
| GOS | Guest OS partition on CCPLEX. |
| Update VM | DRIVE Update Virtual Machine (a Guest OS partition on CCPLEX). |
| ATF | Arm Trusted Firmware. |
| BPMP | Boot and power management processor. |
| HSI | Hardware software interface. |
| DT | Device Tree. |

Overview#

Safety Services is a software framework that detects and reads hardware errors through the HSMs and ECs defined by the Thor HSI. It also enables software elements to report software-detected errors.

The high-level block diagram below shows the interaction between the sub-elements within Safety Services:

Figure 1. Block Diagram

The Error Propagation Library (EPL) provides an interface for clients running on CCPLEX to report errors. It is a dynamically linked shared object that forwards error report packets to the daemon (EPD).
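As a rough illustration of client-side reporting, the sketch below builds an error report packet and hands it to a transport callback. Every name in it (`demo_error_report`, `demo_report_error`, `demo_capture_send`) and the field layout are hypothetical and are not the EPL API; refer to the EPL headers shipped with the DriveOS release for the real interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error report packet; the real EPL frame layout may differ. */
typedef struct {
    uint16_t reporter_id;     /* identifies the reporting software element */
    uint32_t error_code;      /* client-defined error code */
    uint32_t error_attribute; /* extra context for the error */
    uint32_t timestamp;       /* time of detection, units chosen by the client */
} demo_error_report;

/* Transport callback standing in for the path towards the daemon/driver. */
typedef int (*demo_send_fn)(const demo_error_report *report, void *ctx);

/* Build a packet and forward it. Returns 0 on success, -1 on a missing
 * transport or a transport failure. */
int demo_report_error(demo_send_fn send, void *ctx, uint16_t reporter_id,
                      uint32_t error_code, uint32_t error_attribute,
                      uint32_t timestamp)
{
    demo_error_report report;

    if (send == NULL) {
        return -1;
    }
    report.reporter_id = reporter_id;
    report.error_code = error_code;
    report.error_attribute = error_attribute;
    report.timestamp = timestamp;
    return (send(&report, ctx) == 0) ? 0 : -1;
}

/* Example transport for experiments: stores the last packet in *ctx. */
int demo_capture_send(const demo_error_report *report, void *ctx)
{
    if (report == NULL || ctx == NULL) {
        return -1;
    }
    *(demo_error_report *)ctx = *report;
    return 0;
}
```

Passing the transport as a callback keeps the packet-building logic testable without the real daemon or driver being present.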

The Error Propagation Daemon (EPD) on AV + L is responsible for parsing the DT and sending it to FSI.

The tegra-epl kernel driver receives data from the user-space EPL. It provides an interface to which the different instances of the user-space Error Propagation Library connect, serializes the error reports from all EPL instances, and sends them to EPS (on FSI) via TOP3_HSP.
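The serialization step described above can be pictured as a single first-in, first-out queue that reports from several EPL instances pass through before being sent on. The sketch below is a minimal, single-threaded ring buffer under that assumption; the names (`demo_report_queue`, `demo_queue_push`, `demo_queue_pop`) and the queue depth are hypothetical and do not describe the tegra-epl driver internals, which would also need locking for concurrent callers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_QUEUE_DEPTH 8u /* hypothetical depth, chosen for illustration */

typedef struct {
    uint16_t reporter_id; /* which EPL instance/client sent the report */
    uint32_t error_code;
} demo_report;

typedef struct {
    demo_report slots[DEMO_QUEUE_DEPTH];
    size_t head;  /* index of the oldest queued report */
    size_t count; /* number of queued reports */
} demo_report_queue;

/* Append a report at the tail; returns false if the queue is full. */
bool demo_queue_push(demo_report_queue *q, demo_report r)
{
    size_t tail;

    if (q == NULL || q->count == DEMO_QUEUE_DEPTH) {
        return false;
    }
    tail = (q->head + q->count) % DEMO_QUEUE_DEPTH;
    q->slots[tail] = r;
    q->count++;
    return true;
}

/* Remove the oldest report; returns false if the queue is empty. */
bool demo_queue_pop(demo_report_queue *q, demo_report *out)
{
    if (q == NULL || out == NULL || q->count == 0u) {
        return false;
    }
    *out = q->slots[q->head];
    q->head = (q->head + 1u) % DEMO_QUEUE_DEPTH;
    q->count--;
    return true;
}
```

Reports come out in the order they were accepted, regardless of which instance submitted them, which is the property the downstream receiver relies on in this sketch.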

The Error Propagation Server (EPS) is the central entity running on FSI and is responsible for collecting all hardware (HSM) and software errors. EPS connects to TOP3_HSP to receive the error reports that CCPLEX clients send via EPL/tegra-epl. EPS also receives interrupts on HSM errors and reads the HSM/EC registers to determine the hardware error source. EPS then reports these errors to the System Error Handler (SEH).

SEH runs on FSI and is responsible for handling all errors reported to FSI. DriveOS provides a sample dummy implementation of SEH named “SEH placeholder”. For demonstration purposes, SEH placeholder treats some error reports as critical failures and forwards them to FOH. The concrete SEH implementation is expected to be provided by the end software integrator, that is, DRIVE AV or the OEM/Tier 1.
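The placeholder behavior described above, treating a subset of error reports as critical and forwarding only those, could be sketched roughly as follows. The error codes in `demo_critical_codes`, the report layout, and all function names are hypothetical and chosen for illustration; they are not taken from the SEH placeholder sources.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint16_t reporter_id;
    uint32_t error_code;
} demo_error_report;

/* Hypothetical set of error codes treated as critical for demonstration. */
static const uint32_t demo_critical_codes[] = { 0x0001u, 0x00A5u };

/* Returns true if the report's error code is in the critical set. */
bool demo_is_critical(const demo_error_report *report)
{
    size_t i;
    const size_t n = sizeof demo_critical_codes / sizeof demo_critical_codes[0];

    if (report == NULL) {
        return false;
    }
    for (i = 0u; i < n; i++) {
        if (demo_critical_codes[i] == report->error_code) {
            return true;
        }
    }
    return false;
}

/* Callback standing in for the interface towards the failover handler. */
typedef void (*demo_foh_forward_fn)(const demo_error_report *report, void *ctx);

/* Forward the report only if it is critical; returns true if forwarded. */
bool demo_seh_handle(const demo_error_report *report,
                     demo_foh_forward_fn forward, void *ctx)
{
    if (forward == NULL || !demo_is_critical(report)) {
        return false;
    }
    forward(report, ctx);
    return true;
}

/* Example forwarder for experiments: counts forwarded reports in *ctx. */
void demo_count_forward(const demo_error_report *report, void *ctx)
{
    (void)report;
    if (ctx != NULL) {
        (*(unsigned *)ctx)++;
    }
}
```

A production SEH would replace the static table with the integrator's own error classification policy.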

The Failover Handler on Functional Safety Island (FSI FOH) is responsible for communicating failures to the MCU via SPI and the SOC ERROR PIN. FSI FOH receives critical failure reports from SEH placeholder, sends the failure report data to MCU FOH via SPI, and asserts the SOC ERROR PIN.
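The two actions described above, sending the failure data over SPI and asserting the error pin, can be sketched behind a small hardware abstraction so the sequence is testable off target. The structure `demo_foh_hw`, the function names, the ordering of the two actions, and the choice to assert the pin even when the SPI transfer fails are all assumptions made for this sketch, not documented FSI FOH behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware abstraction for the SPI link and the error pin. */
typedef struct {
    int (*spi_send)(const uint8_t *data, size_t len, void *ctx);
    void (*assert_error_pin)(void *ctx);
    void *ctx;
} demo_foh_hw;

/* Send the failure report over SPI, then assert the error pin.
 * Returns 0 if the SPI transfer succeeded, -1 otherwise. */
int demo_foh_report_failure(const demo_foh_hw *hw, const uint8_t *data,
                            size_t len)
{
    int rc;

    if (hw == NULL || hw->spi_send == NULL || hw->assert_error_pin == NULL ||
        data == NULL || len == 0u) {
        return -1;
    }
    rc = hw->spi_send(data, len, hw->ctx);
    /* Assumption: assert the pin regardless of the SPI result so the
     * receiver still gets a failure indication. */
    hw->assert_error_pin(hw->ctx);
    return (rc == 0) ? 0 : -1;
}

/* Fake hardware for experiments: records what was requested. */
typedef struct {
    size_t bytes_sent;
    int pin_asserted;
} demo_fake_hw;

int demo_fake_spi_send(const uint8_t *data, size_t len, void *ctx)
{
    (void)data;
    ((demo_fake_hw *)ctx)->bytes_sent += len;
    return 0;
}

void demo_fake_assert_pin(void *ctx)
{
    ((demo_fake_hw *)ctx)->pin_asserted = 1;
}
```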

The Failover Handler on Microcontroller Unit (MCU FOH) runs on the MCU and is responsible for communication with Safety Services running on the Thor SoC. MCU FOH receives critical error reports from FSI FOH via SPI.

The error reporting use case is demonstrated by the demo app DemoAppSwErr, which runs on CCPLEX and shows how to use EPL to report errors.

SehDriveOS#

DriveOS provides a sample dummy implementation of SEH named “SehDriveOS.” The interface block diagram for SehDriveOS is shown below:

[Figure: SehDriveOS interface block diagram]

List of Interfaces#

| Interface Name | Description |
|---|---|
| P_SEH_ReportCriticalFailure | Critical failure information is provided to FSI FOH through this interface to be sent to the MCU. |
| R_ClearError | Client port on SEH used to clear any active error. The interface is provided by EPS. |
| R_ErrorReport | Receiver port on SEH used to receive Error Report frames from EPS. |
| R_DeInitNotif | Receiver port on SEH used to receive SC7 entry and shutdown notifications from EPS. |
| P_Deinit_Flag | Sender port on SEH used to notify FSI FOH that all critical failure reports have been sent to FSI FOH. |
| P_McuPeriodicReport | Sender port on SEH used to send periodic status to the MCU. The data type for this interface is a 252-byte array. The SEH implementation determines the data structure of the periodic report. |
| R_StateChange_Request_Rx | Receiver port on SEH used to receive state change notifications from NvCCPLEX_FSI_App. |
| P_StateChange_Reply_Tx | Sender port on SEH used to respond to state change notifications received from NvCCPLEX_FSI_App. It is not used in the current placeholder implementation. |
| P_TriggerHsmReset | Server port on SEH used by the DRAM ECC error handler to notify SEH to trigger an L1 reset for DRAM page retirement. (Refer to the FSI-SW Delivery Package for details on the DRAM ECC error handling sequence.) |
| R_PerformHsmReset | Client port on SEH used to perform an L1 reset by triggering the interface provided by EPS. |
| R_SocErrPinAssrt | Client port on SEH used to assert the SOC error pin for software-reported errors. |
| R_DramEccEh | Client port on SEH used to trigger the DRAM ECC uncorrected error handler when SEH receives a DRAM ECC uncorrected error notification. |
| R_McuData | Receiver port for receiving data from the MCU Application. The data from the MCU Application could be data from persistent memory on the MCU. |
| R_Swt_Service_SEH | Client port used to read the timestamp from Swt Cdd. |
| R_NvHspMailboxConsume | Client port on SEH used for polling QM errors from CCPLEX. |
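Because the P_McuPeriodicReport payload is a 252-byte array whose internal layout is left to the SEH implementation, an integrator has to define how status fields map onto those bytes. The sketch below shows one possible approach: explicit little-endian packing into a zero-filled buffer. The status fields, their offsets, the byte order, and all names (`demo_periodic_status`, `demo_pack_periodic_report`) are hypothetical; only the 252-byte size comes from the interface description.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PERIODIC_REPORT_SIZE 252u /* size stated for P_McuPeriodicReport */

/* Hypothetical status content; the real layout is chosen by the integrator. */
typedef struct {
    uint32_t sequence;           /* increments with every periodic report */
    uint32_t active_error_count; /* number of errors currently active */
    uint8_t  state;              /* implementation-defined state value */
} demo_periodic_status;

/* Write a 32-bit value in little-endian byte order. */
static void demo_put_u32_le(uint8_t *dst, uint32_t value)
{
    dst[0] = (uint8_t)(value & 0xFFu);
    dst[1] = (uint8_t)((value >> 8) & 0xFFu);
    dst[2] = (uint8_t)((value >> 16) & 0xFFu);
    dst[3] = (uint8_t)((value >> 24) & 0xFFu);
}

/* Pack the status into the 252-byte report; unused bytes are zeroed.
 * Returns 0 on success, -1 on NULL arguments. */
int demo_pack_periodic_report(const demo_periodic_status *status,
                              uint8_t out[DEMO_PERIODIC_REPORT_SIZE])
{
    if (status == NULL || out == NULL) {
        return -1;
    }
    memset(out, 0, DEMO_PERIODIC_REPORT_SIZE);
    demo_put_u32_le(&out[0], status->sequence);           /* bytes 0..3 */
    demo_put_u32_le(&out[4], status->active_error_count); /* bytes 4..7 */
    out[8] = status->state;                               /* byte 8 */
    return 0;
}
```

Packing byte by byte, rather than copying the struct with `memcpy`, avoids depending on compiler padding and on the endianness of either processor.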

Service Ports#

| Port Name | Description |
|---|---|
| P_ESH_DOS_State_Request_BswM | Interface to request DOS states of BswM. |
| P_ESH_User_Request_BswM | Interface to release the current BswM state. |
| R_ESH_Mode | Interface to read the current BswM state. |

Note

Refer to the FSI Software release document for details on BswM states. These service port interfaces are optional.