Technical Product Specification

Table Of Contents

Intel Server Board S5400SF TPS: Functional Architecture

Revision 2.02

Intel order number: D92944-007

The BIOS initializes the correctable error leaky bucket counters to a value of ten for correctable ECC errors. These counters are maintained per rank, where a rank is a pair of FBDIMMs on adjacent channels operating in lock-stepped mode.

3.2.3.9.3.1 BIOS Policies on Correctable Errors

For each correctable error that occurs before the threshold is reached, the BIOS logs a “Correctable Error” SEL entry. No other action is taken, and the system continues to function normally.

When the error count reaches the threshold of ten, the BIOS logs a SEL entry to indicate the correctable error. In addition, the following steps occur:

1. If sparing is enabled, the chipset initiates a spare fail-over to a spare FBDIMM. In all other memory configurations, future correctable errors are masked and no longer reported to the SEL.

2. The BIOS logs a “Max Threshold Reached” SEL event.

3. The BIOS sends a “DIMM Failed” event to the Integrated BMC. This causes the Integrated BMC to light the System Fault LEDs to indicate memory performance degradation and assertion of the failed FBDIMM.

4. The BIOS lights the DIMM Fault LED for the faulty FBDIMM.
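
The policy above can be sketched as follows. This is a minimal illustration in Python, not BIOS code: the names `Platform`, `RankState`, and `on_correctable_error` are hypothetical, and the SEL, BMC event queue, and DIMM Fault LEDs are modeled as plain Python containers. The leak (decay) behavior of the leaky bucket is omitted because this section does not describe it. The sketch also assumes that the threshold actions fire only once per rank, on the tenth error, which the text does not state.

```python
from dataclasses import dataclass, field

THRESHOLD = 10  # per-rank correctable ECC error threshold given in the text


@dataclass
class RankState:
    """Hypothetical per-rank bookkeeping; not a real BIOS structure."""
    errors: int = 0
    masked: bool = False       # correctable errors no longer reported to the SEL
    failed_over: bool = False  # spare fail-over has been initiated


@dataclass
class Platform:
    """Hypothetical stand-in for the BIOS, SEL, Integrated BMC, and LEDs."""
    sparing_enabled: bool
    sel: list = field(default_factory=list)
    bmc_events: list = field(default_factory=list)
    dimm_fault_leds: set = field(default_factory=set)
    ranks: dict = field(default_factory=dict)

    def on_correctable_error(self, rank: str) -> None:
        state = self.ranks.setdefault(rank, RankState())
        if state.masked:
            return  # non-sparing configurations stop reporting after the threshold
        state.errors += 1
        self.sel.append(("Correctable Error", rank))
        if state.errors != THRESHOLD:
            # Below the threshold: log only. Assumption: errors after the
            # tenth do not repeat the threshold actions.
            return
        # Step 1: fail over if sparing is enabled, otherwise mask future errors.
        if self.sparing_enabled:
            state.failed_over = True
        else:
            state.masked = True
        # Step 2: log the threshold event.
        self.sel.append(("Max Threshold Reached", rank))
        # Step 3: notify the Integrated BMC, which drives the System Fault LEDs.
        self.bmc_events.append(("DIMM Failed", rank))
        # Step 4: light the DIMM Fault LED for the faulty FBDIMM.
        self.dimm_fault_leds.add(rank)
```

In a non-sparing configuration, feeding ten errors for one rank into this sketch produces ten “Correctable Error” entries followed by one “Max Threshold Reached” entry, and an eleventh error is silently dropped.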

3.2.3.9.4 Multi-bit Correctable Error Counter Threshold

Due to the internal design of the chipset, the same threshold value for correctable errors also applies to the multi-bit correctable errors. However, maintaining a tolerance level of ten for multi-bit correctable errors is undesirable because these are critical errors. Therefore, the BIOS programs the threshold for multi-bit correctable errors based on the following alternate logic:

• Automatic retries on memory errors: The chipset automatically retries memory reads that return an uncorrectable error. If the retry returns good data, the event is termed a multi-bit correctable error. If the data is still bad, it is an uncorrectable error. A second kind of memory error is a CRC error on the FBDIMM serial path. CRC errors are retried in a similar manner. The retry eliminates transient errors that occur on memory packets transferred over the FBDIMM serial links between the chipset and the FBDIMMs.

• Internal error reporting by the chipset: The chipset records an uncorrectable error both when it first occurs and when the subsequent retry fails. Both events are reported to the BIOS independently. The BIOS reports a successful retry as “Correctable Memory Error” in the SEL regardless of whether the originating error was a CRC error or an ECC error.
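
The retry classification described in the two points above can be sketched as follows. This is an illustrative Python model only: `Outcome`, `classify_read`, and `sel_label` are hypothetical names, and `attempt` is an assumed callable that returns `True` when a read yields good data. The chipset performs the retry in hardware, so this shows the decision logic, not an actual interface.

```python
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    CLEAN = "no error"
    MULTI_BIT_CORRECTABLE = "multi-bit correctable error"
    UNCORRECTABLE = "uncorrectable error"


def classify_read(attempt: Callable[[], bool]) -> Outcome:
    """Classify one memory read using a single automatic retry.

    `attempt` is a hypothetical callable: it performs the read and
    returns True if the data is good (no ECC or CRC error).
    """
    if attempt():
        return Outcome.CLEAN
    # First attempt failed: retry once.
    if attempt():
        return Outcome.MULTI_BIT_CORRECTABLE
    return Outcome.UNCORRECTABLE


def sel_label(outcome: Outcome) -> Optional[str]:
    """SEL text for a successful retry, the same for CRC and ECC origins."""
    if outcome is Outcome.MULTI_BIT_CORRECTABLE:
        return "Correctable Memory Error"
    return None  # labels for other outcomes are not given in this section
```

For example, a read that fails once and then succeeds on retry is classified as `Outcome.MULTI_BIT_CORRECTABLE` and maps to the “Correctable Memory Error” SEL text.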

3.2.3.9.5 FBD Fatal Error Threshold

In addition to standard ECC errors, the BIOS monitors FBD protocol errors reported by the chipset. FBD protocol errors degrade system memory, so the BIOS does not tolerate any of them. The BIOS maintains an internal software counter for FBD errors, with a threshold of one. When the threshold is met, the error is treated as an uncorrectable error and follows the same policy as outlined below.
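
A threshold of one means the first FBD protocol error is already escalated. A minimal Python sketch of such a software counter, with the hypothetical names `FBD_THRESHOLD` and `FbdErrorCounter` (the actual BIOS data structure is not described here):

```python
FBD_THRESHOLD = 1  # threshold of the software counter, as stated in the text


class FbdErrorCounter:
    """Hypothetical software counter for FBD protocol errors."""

    def __init__(self) -> None:
        self.count = 0

    def record(self) -> bool:
        """Record one FBD protocol error.

        Returns True when the threshold is met, meaning the error must be
        handled as an uncorrectable error.
        """
        self.count += 1
        return self.count >= FBD_THRESHOLD
```

With this threshold, the first call to `record()` already returns `True`.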