Server Board Family Datasheet

System BIOS Intel® S5000 Server Board Family Datasheet

Revision 1.3

Intel order number D38960-006

The BIOS uses error counters on the Intel® 5000 Series Chipsets and internal software counters to track the number of correctable and multi-bit correctable errors that occur at runtime. The chipset increments these counters when an error occurs. The count also decays at a given rate, programmable by the BIOS. Because of this behavior, the counters are termed leaky bucket counters.
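The increment-and-decay behavior described above can be illustrated with a small software model. This is only a sketch and not the chipset's register-level implementation: the class name `LeakyBucketCounter`, the `decay_interval` parameter, and the abstract "tick" time unit are assumptions introduced for illustration.

```python
class LeakyBucketCounter:
    """Toy model of an error counter whose count leaks (decays) over time."""

    def __init__(self, decay_interval):
        # decay_interval is a hypothetical parameter: the number of time
        # ticks that must pass for the count to decay by one.
        self.decay_interval = decay_interval
        self.count = 0
        self._ticks_since_decay = 0

    def record_error(self):
        """Increment the count when an error occurs."""
        self.count += 1

    def tick(self, ticks=1):
        """Advance time; the count drops by one per full decay interval."""
        self._ticks_since_decay += ticks
        leaked, self._ticks_since_decay = divmod(
            self._ticks_since_decay, self.decay_interval
        )
        # The count never decays below zero.
        self.count = max(0, self.count - leaked)


counter = LeakyBucketCounter(decay_interval=10)
for _ in range(3):
    counter.record_error()
counter.tick(25)  # two full decay intervals elapse
print(counter.count)
```

In this model, errors that arrive slowly are drained before they accumulate, while errors that arrive faster than the decay rate cause the count to grow.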

The leaky bucket counters provide a measurement of the frequency of errors. The BIOS configures the leaky bucket counters and the decay rate so that it is notified of a failing FBDIMM. A failing FBDIMM typically generates a burst of errors in a short period of time, which the leaky bucket algorithm detects. The chipset maintains separate internal leaky bucket counters for correctable errors and for multi-bit correctable errors.
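To see why a leaky bucket distinguishes a burst from occasional errors, the following sketch replays two lists of error timestamps against the same threshold. The function name `first_threshold_crossing`, the timestamps, and the `decay_interval` of 100 ticks are hypothetical values chosen for illustration; only the threshold of 10 comes from this document.

```python
def first_threshold_crossing(error_times, threshold=10, decay_interval=100):
    """Return the time at which the count first reaches the threshold, or None.

    error_times: timestamps (in abstract ticks) at which errors occur.
    decay_interval: hypothetical number of ticks per one-count decay.
    """
    count = 0
    last_time = 0
    carry = 0  # ticks already accumulated toward the next decay step
    for t in sorted(error_times):
        # Apply the decay that happened since the previous error.
        leaked, carry = divmod(t - last_time + carry, decay_interval)
        count = max(0, count - leaked)
        last_time = t
        # Count the new error and check the threshold.
        count += 1
        if count >= threshold:
            return t
    return None


burst = list(range(1000, 1010))     # 10 errors within 10 ticks
sparse = list(range(0, 5000, 250))  # 20 errors, one every 250 ticks

print(first_threshold_crossing(burst))   # 1009
print(first_threshold_crossing(sparse))  # None
```

The sparse sequence contains twice as many errors as the burst, yet it never reaches the threshold because each error leaks away before the next one arrives.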

The BIOS initializes the leaky bucket counters for correctable ECC errors to a value of 10. These counters are maintained on a per-rank basis. A rank is a pair of FBDIMMs on adjacent channels operating in lock-step mode.

3.3.10.1.3.1 BIOS Policies on Correctable Errors

For each correctable error that occurs before the threshold is reached, the BIOS logs a Correctable Error SEL entry. No other action is taken, and the system continues to function normally.

When the error count reaches the threshold of 10, the BIOS logs a SEL entry to indicate the correctable error. In addition, the following steps occur:

1. If sparing is enabled, the chipset initiates a spare fail-over to a spare FBDIMM. In all other memory configurations, future correctable errors are masked and no longer reported to the SEL.

2. The BIOS logs a Max Threshold Reached SEL event.

3. The BIOS sends a DIMM Failed event to the BMC. This causes the BMC to light the system fault LEDs to indicate memory performance degradation and the assertion of a failed FBDIMM.

4. The BMC lights the DIMM fault LED for the faulty FBDIMM.
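The policy above can be summarized as a small decision function. This is a sketch of the described flow and not actual BIOS code: the function name `handle_correctable_error`, its parameters, and the action strings are hypothetical labels, and the function assumes the caller passes the rank's error count after the current error has been counted.

```python
THRESHOLD = 10  # correctable error threshold stated in this section


def handle_correctable_error(count, sparing_enabled):
    """Return the ordered list of actions for one correctable error.

    count: the rank's error count including the current error (assumed).
    sparing_enabled: whether memory sparing is configured.
    """
    # Every correctable error produces a SEL entry.
    actions = ["log Correctable Error SEL entry"]
    if count < THRESHOLD:
        # Below the threshold, nothing else happens.
        return actions

    # Step 1: fail over if sparing is enabled, otherwise mask further errors.
    if sparing_enabled:
        actions.append("chipset fails over to spare FBDIMM")
    else:
        actions.append("mask future correctable errors")
    # Step 2
    actions.append("log Max Threshold Reached SEL event")
    # Step 3
    actions.append("send DIMM Failed event to BMC")
    actions.append("BMC lights system fault LEDs")
    # Step 4
    actions.append("BMC lights DIMM fault LED")
    return actions


for step in handle_correctable_error(count=10, sparing_enabled=False):
    print(step)
```

Returning a list of action labels keeps the sketch free of any real SEL or BMC interface, which this section does not specify.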

3.3.10.1.4 Multi-bit Correctable Error Counter Threshold

Due to the internal design of the chipset, the same threshold value for correctable errors also applies to multi-bit correctable errors. However, maintaining a tolerance level of 10 for multi-bit correctable errors is undesirable because these are critical errors. Therefore, the BIOS programs the threshold for multi-bit correctable errors based on the following alternate logic:

• Automatic retries on memory errors: The chipset automatically retries memory reads that return uncorrectable errors. If the retry returns good data, the event is termed a multi-bit correctable error. If the data is still bad and the memory controller is not configured for memory mirroring mode, it is an uncorrectable error. The retry eliminates transient CRC errors that occur on memory packets transferred over the FBDIMM serial links between the chipset and the FBDIMMs.

• Internal error reporting by the chipset: The chipset records the occurrence of uncorrectable errors both at the time of the occurrence and on the subsequent failure on retry. Both errors are independently reported to the BIOS. The BIOS will report a