Datasheet

Reliability, Availability, Serviceability (RAS)
Intel® Xeon® Processor C5500/C3500 Series
Datasheet, Volume 1 February 2010
Order Number: 323103-001
11.2 System Level RAS
11.2.1 Inband System Management
Inband system management is accomplished by firmware running in a highly privileged mode (SMM) and
accessing system configuration registers for system event services. In the event of an error, fault, or hot
add/remove, firmware is required to determine the system condition and service the event
accordingly. Firmware may enter SMM for these events so that it has the privilege to access
the OS-invisible configuration registers.
11.2.2 Outband System Management
Outband system management relies on out-of-band agents to access system configuration
registers via outband signals. The outband signals, such as SMBus, are assumed to be secure and
to have the right to access all registers within a component.
The SMBus is connected globally to the CPUs, IIOs, and PCHs through a common shared bus hierarchy.
By using the outband signals, an outband agent can handle events such as hot plug or error
recovery. Outband signals provide the Baseboard Management Controller (BMC) with a global path to
access the CSRs in the system components, even when the CSRs become inaccessible to the CPUs through
the inband mechanisms. The SMBus is mastered by the BMC through a platform-specific
mechanism.
To support outband system management, the IIO provides an SMBus interface with access to the
configuration registers in the IIO itself or in the downstream IO devices (PCICFG).
11.3 IIO Error Reporting
The IIO logs the errors it detects and reports them by generating “system events”. In the context of error
reporting, a system event is an event that notifies the system of the error. Two types of system
events can be generated: an inband message to the CPU or an outband signal to the platform.
In the case of inband messaging, the CPU is notified of the error by the inband message (interrupt,
failed response, etc.). The CPU responds to the inband message and takes the appropriate action to
handle the error.
Outband signaling (error pins) informs an external agent of the error events. An external agent, such
as the BMC, may collect the errors from the error pins to determine the health of the system and send
interrupts to the CPU accordingly. In some cases of severe errors, when the system is no longer
responding to inband messages, the outband signaling provides a way to notify the outband system
manager of the error. The system manager can then perform a system reset to restore system
functionality.
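The choice an outband system manager faces in the paragraph above can be sketched as a small decision function. This is only an illustrative reading of the text, not BMC firmware behavior defined by this datasheet: the names `Action` and `outband_action`, and the three boolean inputs, are hypothetical.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical actions available to an outband system manager."""
    NONE = "none"
    INTERRUPT_CPU = "interrupt_cpu"
    SYSTEM_RESET = "system_reset"


def outband_action(error_pin_asserted: bool,
                   severe: bool,
                   inband_responsive: bool) -> Action:
    """Pick an action from the observed error-pin state.

    Assumed policy: a severe error on a system that no longer answers
    inband messages leads to a reset; any other reported error is
    forwarded to the CPU as an interrupt.
    """
    if not error_pin_asserted:
        return Action.NONE
    if severe and not inband_responsive:
        return Action.SYSTEM_RESET
    return Action.INTERRUPT_CPU
```

For example, `outband_action(True, True, False)` selects `Action.SYSTEM_RESET`, while the same error on a system that still responds inband is forwarded as an interrupt.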
The IIO detects errors from the PCIe link, DMI link, Intel® QuickPath Interconnect link, or the IIO core
itself. An error is first logged and mapped to an error severity, and then mapped to one or more system
events for error reporting.
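The log, severity, and system-event flow described above can be modeled as two lookup tables applied in sequence. The sketch below is a software model for illustration only, not the IIO register interface: the class and member names are hypothetical, the `CORRECTABLE` severity is an assumption (this section names only Fatal and Non-Fatal), and the table contents are supplied by the caller.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Source(Enum):
    """Error sources named in the text."""
    PCIE = auto()
    DMI = auto()
    QPI = auto()
    IIO_CORE = auto()


class Severity(Enum):
    CORRECTABLE = auto()  # assumption: not named in this section
    NON_FATAL = auto()
    FATAL = auto()


class SystemEvent(Enum):
    INBAND_MESSAGE = auto()      # e.g. an interrupt to the CPU
    OUTBAND_ERROR_PIN = auto()   # signal to an external agent


@dataclass
class ErrorReporter:
    """Two-stage mapping: error -> severity -> system events."""
    severity_map: dict  # (Source, error name) -> Severity
    event_map: dict     # Severity -> set of SystemEvent
    log: list = field(default_factory=list)

    def report(self, source, error):
        # Stage 1: map the error to a severity and log it.
        severity = self.severity_map[(source, error)]
        self.log.append((source, error, severity))
        # Stage 2: map the severity to zero or more system events.
        return set(self.event_map.get(severity, set()))
```

Because both tables are plain data, changing how an error is reported means editing a table entry rather than the reporting logic, which mirrors the flexible mapping the feature list describes.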
IIO error reporting features are summarized below and detailed in the following sections:
• Detects and logs Coherency Interface, PCIe/DMI, Intel® QuickData Technology DMA, and IIO core
errors.
• Provides First and Next error detection and logging for Fatal and Non-Fatal errors.
• Allows flexible mapping of the detected errors to different error severities.
• Allows flexible mapping of the error severities to different reporting mechanisms.
• Supports the PCIe error reporting mechanism.
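First and Next error logging, listed above, can be pictured as a pair of logs per severity class. The sketch below assumes one plausible behavior that this section does not spell out: the first error seen since the last clear is held in the "first" log, and every later error accumulates in the "next" log. The class name `FirstNextLog` and its methods are hypothetical.

```python
class FirstNextLog:
    """Hypothetical first/next log pair for one severity class."""

    def __init__(self):
        self.first = None   # first error observed since the last clear
        self.next = set()   # errors observed after the first one

    def record(self, error):
        """Log an error into 'first' if it is empty, else into 'next'."""
        if self.first is None:
            self.first = error
        else:
            self.next.add(error)

    def clear(self):
        """Reset both logs, as software would after servicing the errors."""
        self.first = None
        self.next.clear()
```

Keeping the first error separate preserves the likely root cause even when it triggers a cascade of follow-on errors.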