Specifications

QSSC-S4R Technical Product Specification BIOS Initialization
16.2.12.1.3 Faulty Data Paths
DDR-3 DIMM technology includes data paths from the DIMMs to the memory controller. Therefore, errors or failures
can occur on the serial path between DDR-3 DIMMs.
These errors are different from ECC errors, and do not necessarily occur as a result of faulty DRAM cells. These errors
are most commonly due to errors or incompatibilities in the SPD information on the DIMM, which cause the memory
channel to fail to train properly.
The BIOS keeps track of such link-level failures using the same hardware Memory BIST engine described in Section
16.2.5.1.
During Memory BIST, when a link failure occurs, the DDR3 DIMMs installed on that channel become unavailable and
are treated as “failed”. The action taken after Memory BIST has completed depends on whether any usable memory
remains. This is described in Section 2.2.9.1.1.
If a fatal link failure occurs during normal operation at runtime (after POST), the ECC engine reports a regular ECC
error.
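The Memory BIST behavior described above can be illustrated with a minimal sketch. It is not BIOS code: the `Channel` class, the `usable_dimms` helper, and the channel and DIMM names are hypothetical, chosen only to show that every DIMM on a channel with a failed link is treated as failed and that the follow-up action depends on whether any usable memory remains.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    """Hypothetical model of one memory channel and the DIMMs installed on it."""
    name: str
    dimms: list
    link_ok: bool = True  # False if a link failure occurred during Memory BIST


def usable_dimms(channels):
    """Return the DIMMs on channels whose link passed Memory BIST.

    DIMMs on a channel with a failed link are treated as failed and skipped.
    """
    usable = []
    for channel in channels:
        if channel.link_ok:
            usable.extend(channel.dimms)
    return usable


# Illustrative configuration (names are assumptions, not from the specification).
channels = [
    Channel("CH0", ["DIMM_A1", "DIMM_A2"]),
    Channel("CH1", ["DIMM_B1"], link_ok=False),
]
remaining = usable_dimms(channels)
any_usable = bool(remaining)  # decides which post-BIST action applies
```

In this example only the DIMMs on `CH0` remain usable, so the system would still have memory available after Memory BIST completes.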
16.2.12.1.4 Error Counters and Thresholds
The BIOS handles memory errors through a variety of platform-specific policies. Each of these policies is aimed at
providing comprehensive diagnostic support to the system administrator for system recovery following a failure.
The BIOS uses error counters on the Intel® Xeon® 7500 processor to track the number of correctable and multi-bit
correctable errors that occur at runtime. The Intel® Xeon® 7500 processor’s IMC increments these error counters each
time an error occurs.
16.2.12.1.5 Correctable Error Handling
The BIOS programs a configurable threshold value for correctable errors. The Intel® Xeon® 7500 processor is
programmed to generate a notification to the BIOS when the number of errors crosses this threshold. On receiving this
notification, the BIOS logs a SEL entry to indicate the correctable error. In addition, the following steps occur:
1. If DIMM sparing is enabled, the BIOS initiates a spare failover to the spare DIMM. In all memory configurations,
future correctable errors are masked and no longer reported to the SEL.
2. The BIOS logs a single correctable error SEL event.
3. The DIMM is not disabled on reaching the CE threshold. Only SMI generation is stopped (to avoid impact to system
performance). If mirroring is enabled, redundancy is not lost when the CE threshold is reached.
4. The BIOS instructs the BMC to light the System Fault LEDs to indicate memory performance degradation and an
assertion of the failed DDR3 DIMM.
5. The BIOS also sends the BMC the location of the faulty DDR3 DIMM. The BMC then responds by lighting the
DIMM Fault LED for that DDR3 DIMM.
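The threshold policy above can be sketched as a small model. This is an illustration under stated assumptions, not the BIOS implementation: the class name, the action strings, and the SEL text are hypothetical, and "crossing" the threshold is modeled as the count reaching the programmed value.

```python
class CorrectableErrorTracker:
    """Hypothetical model of the correctable-error (CE) threshold policy."""

    def __init__(self, threshold, sparing_enabled=False):
        self.threshold = threshold
        self.sparing_enabled = sparing_enabled
        self.count = 0
        self.masked = False         # True once CE reporting is stopped
        self.dimm_disabled = False  # stays False: the DIMM is not disabled
        self.sel = []               # stand-in for the BMC system event log
        self.actions = []           # stand-in for commands sent to the BMC

    def record_error(self, dimm):
        """Count one correctable error on `dimm`; act when the threshold is reached."""
        if self.masked:
            return  # further CEs are masked and no longer reported
        self.count += 1
        if self.count >= self.threshold:
            self._on_threshold(dimm)

    def _on_threshold(self, dimm):
        # A single correctable error SEL event is logged.
        self.sel.append(f"Correctable ECC error: {dimm}")
        # With DIMM sparing enabled, fail over to the spare DIMM.
        if self.sparing_enabled:
            self.actions.append("spare_failover")
        # Stop further CE reporting in all memory configurations.
        self.masked = True
        # Ask the BMC to light the System Fault LED and the DIMM Fault LED.
        self.actions.append("system_fault_led_on")
        self.actions.append(f"dimm_fault_led_on:{dimm}")


tracker = CorrectableErrorTracker(threshold=3, sparing_enabled=True)
for _ in range(5):
    tracker.record_error("DIMM_A1")
```

After five errors against a threshold of three, the model holds exactly one SEL entry, reporting is masked, and the DIMM remains enabled.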
16.2.12.1.6 Uncorrectable Error Handling
The BIOS programs the Intel® Xeon® 7500 processor to report uncorrectable errors to the BIOS via SMI whenever an
uncorrectable error occurs. The OS may handle the error once the BIOS exits the SMI. Optionally, the BIOS can be
configured to generate an NMI instead of exiting the SMI. The BIOS SMI handler takes the following actions for
uncorrectable errors:
1. The BIOS logs an Uncorrectable Memory ECC Error SEL entry in the BMC SEL.
2. The BIOS then sends a command to the BMC to light the System Fault LED and the DIMM Fault LED for the
faulty DDR3 DIMM.
3. If mirroring is enabled, the BIOS logs a Redundancy Lost event and transitions the system to degraded mode on an
uncorrectable error.
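The SMI handler actions listed above can be summarized in a short sketch. The function name, the returned dictionary, and the LED labels are hypothetical and serve only to show the order of the steps and the extra handling when mirroring is enabled. The optional NMI path is not modeled.

```python
def handle_uncorrectable_error(dimm, mirroring_enabled):
    """Hypothetical model of the SMI handler actions for an uncorrectable error."""
    # Step 1: log an Uncorrectable Memory ECC Error entry in the BMC SEL.
    sel = [f"Uncorrectable Memory ECC Error: {dimm}"]
    # Step 2: ask the BMC to light the System Fault LED and the DIMM Fault LED.
    leds = ["System Fault LED", f"DIMM Fault LED: {dimm}"]
    # Step 3: with mirroring enabled, log Redundancy Lost and enter degraded mode.
    degraded = False
    if mirroring_enabled:
        sel.append("Redundancy Lost")
        degraded = True
    return {"sel": sel, "leds": leds, "degraded": degraded}


mirrored = handle_uncorrectable_error("DIMM_A1", mirroring_enabled=True)
independent = handle_uncorrectable_error("DIMM_A1", mirroring_enabled=False)
```

The two calls differ only in the Redundancy Lost entry and the degraded-mode flag, which apply in the mirrored case alone.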
16.2.12.2 Mechanisms of Memory Error Reporting
Memory errors are reported through a variety of platform-specific elements, as described in this section.
Table 83. Memory Error Reporting Agent Summary
Platform Element | Description
Event Logging | When a memory error occurs at runtime, the BIOS logs the error into the system event log (SEL) in the Baseboard Management Controller’s (BMC) repository.