System BIOS Intel® S5000 Server Board Family Datasheet
Revision 1.3
Intel order number D38960-006
The BIOS uses error counters on the Intel® 5000 Series Chipsets and internal software counters
to track the number of correctable and multi-bit correctable errors that occur at runtime. The
chipset increments these counters when an error occurs. The count also decays at
a given rate, programmable by the BIOS. Because of this behavior, the counters
are termed leaky bucket counters.
The leaky bucket counters provide a measurement of the frequency of errors. The BIOS
configures and uses the leaky bucket counters and the decay rate such that it can be notified of
a failing FBDIMM. A failing FBDIMM will typically generate a burst of errors in a short period of
time, which is detected by the leaky bucket algorithm. The chipset maintains separate internal
leaky bucket counters for correctable errors and for multi-bit correctable errors.
The BIOS initializes the leaky bucket counters for correctable ECC errors to a value of 10.
These counters are maintained on a per-rank basis. Here, a rank refers to a pair of FBDIMMs on
adjacent channels functioning in lock-step mode.
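The leaky bucket behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the chipset's actual counter logic: the class name `LeakyBucketCounter`, the `decay_period` parameter, and the abstract time units are all hypothetical. The threshold of 10 is the only value taken from the text.

```python
class LeakyBucketCounter:
    """Illustrative software model of a leaky bucket error counter.

    Assumption: the count drops by one for every full `decay_period`
    that elapses. The real decay rate is programmed by the BIOS and
    is not given in this section.
    """

    def __init__(self, threshold, decay_period):
        self.threshold = threshold        # count at which notification fires
        self.decay_period = decay_period  # time units per one-count decay
        self.count = 0
        self.last_decay = 0               # time up to which decay was applied

    def _leak(self, now):
        # Remove one count for each full decay period since the last leak.
        leaked = (now - self.last_decay) // self.decay_period
        if leaked > 0:
            self.count = max(0, self.count - leaked)
            self.last_decay += leaked * self.decay_period

    def record_error(self, now):
        """Record one error at time `now`; return True at or above threshold."""
        self._leak(now)
        self.count += 1
        return self.count >= self.threshold


# A burst of errors in a short period reaches the threshold.
burst = LeakyBucketCounter(threshold=10, decay_period=100)
burst_results = [burst.record_error(t) for t in range(10)]

# Widely spaced errors leak away before they can accumulate.
sparse = LeakyBucketCounter(threshold=10, decay_period=100)
sparse_results = [sparse.record_error(t) for t in range(0, 3000, 150)]
```

In this model only the tenth error of the burst reports the threshold as reached, while the sparse stream never does. That is the property that lets the counter measure error frequency rather than the total error count.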
3.3.10.1.3.1 BIOS Policies on Correctable Errors
For each correctable error that occurs before the threshold is reached, the BIOS will log a
Correctable Error SEL entry. No other action will be taken, and the system will continue to
function normally.
When the error count reaches the threshold of 10, the BIOS logs a SEL entry to indicate the
correctable error. In addition, the following steps occur:
1. If sparing is enabled, the chipset initiates a spare fail-over to a spare FBDIMM. In all
other memory configurations, future correctable errors are masked and no longer
reported to the SEL.
2. The BIOS logs a Max Threshold Reached SEL event.
3. The BIOS sends a DIMM Failed event to the BMC. This causes the BMC to light the
system fault LEDs to indicate memory performance degradation and the assertion of the
failed FBDIMM.
4. The BMC lights the DIMM fault LED for the faulty FBDIMM.
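The policy above can be summarized as a small decision function. This is a minimal sketch for illustration, not BIOS code: the function name `handle_correctable_error`, its parameters, and the action strings are hypothetical, and `count` is assumed to be the per-rank counter value after the current error has been recorded.

```python
THRESHOLD = 10  # correctable error threshold stated in the text


def handle_correctable_error(count, sparing_enabled):
    """Return the ordered actions for one correctable error.

    `count` is assumed to be the counter value after this error.
    """
    # A SEL entry is logged for every correctable error, including
    # the one that reaches the threshold.
    actions = ["BIOS: log Correctable Error SEL entry"]
    if count < THRESHOLD:
        return actions  # below threshold: no other action is taken

    # Step 1: fail over if sparing is enabled, otherwise mask.
    if sparing_enabled:
        actions.append("chipset: spare fail-over to spare FBDIMM")
    else:
        actions.append("mask future correctable errors from the SEL")
    # Steps 2 to 4.
    actions.append("BIOS: log Max Threshold Reached SEL event")
    actions.append("BIOS: send DIMM Failed event to BMC")
    actions.append("BMC: light DIMM fault LED for faulty FBDIMM")
    return actions


below = handle_correctable_error(count=3, sparing_enabled=False)
at_limit_spared = handle_correctable_error(count=10, sparing_enabled=True)
at_limit_masked = handle_correctable_error(count=10, sparing_enabled=False)
```

Returning a list of actions rather than performing them keeps the sketch independent of any real SEL or BMC interface, none of which is specified in this section.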
3.3.10.1.4 Multi-bit Correctable Error Counter Threshold
Due to the internal design of the chipset, the same threshold value for correctable errors also
applies to the multi-bit correctable errors. However, maintaining a tolerance level of 10 for multi-
bit correctable errors is undesirable because these are critical errors. Therefore, the BIOS
programs the threshold for multi-bit correctable errors based on the following alternate logic:
Automatic retries on memory errors: The chipset automatically performs a retry of
memory reads for uncorrectable errors. If the retry results in good data, this is termed a
multi-bit correctable error. If the data is still bad and the memory controller is not
configured for memory mirroring mode, it is an uncorrectable error. The retry eliminates
transient CRC errors that occur on memory packets transacted over the FBDIMM serial
links between the chipset and the FBDIMMs.
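The retry-based classification just described can be sketched as follows. The function `classify_read` and its boolean parameters are hypothetical names introduced for illustration. This section does not say how a failed retry is handled when memory mirroring is configured, so the sketch marks that case as unspecified rather than guessing.

```python
def classify_read(first_read_ok, retry_ok, mirroring_enabled=False):
    """Classify a memory read according to the retry rule in the text.

    first_read_ok: False if the initial read hit an uncorrectable error.
    retry_ok: True if the automatic retry returned good data.
    """
    if first_read_ok:
        return "no error"
    if retry_ok:
        # The retry cleared a transient fault, such as a CRC error
        # on the FBDIMM serial link.
        return "multi-bit correctable"
    if mirroring_enabled:
        # Assumption: handling in mirroring mode is outside this section.
        return "retry failed (mirroring mode, handling not specified here)"
    return "uncorrectable"


transient = classify_read(first_read_ok=False, retry_ok=True)
persistent = classify_read(first_read_ok=False, retry_ok=False)
```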
Internal error reporting by the chipset: The chipset records the occurrence of
uncorrectable errors both at the time of the occurrence, and on the subsequent failure
on retry. Both errors are independently reported to the BIOS. The BIOS will report a