Technical Product Specification

Intel® Server Board S5400SF TPS Functional Architecture
Revision 2.02
Intel order number: D92944-007
The BIOS initializes the correctable error leaky bucket counters to a value of ten for correctable
ECC errors. These counters are on a per-rank basis. A rank applies to a pair of FBDIMMs on
adjacent channels functioning in lock-stepped mode.
3.2.3.9.3.1 BIOS Policies on Correctable Errors
For each correctable error that occurs before the threshold is reached, the BIOS logs a
“Correctable Error” SEL entry. No other action is taken, and the system continues to function
normally.
When the correctable error count reaches the threshold of ten, the BIOS logs a SEL entry to
indicate the correctable error. In addition, the following steps occur:
1. If sparing is enabled, the chipset initiates a spare fail-over to a spare FBDIMM. In all
other memory configurations, future correctable errors are masked and no longer
reported to the SEL.
2. The BIOS logs a “Max Threshold Reached” SEL event.
3. The BIOS sends a “DIMM Failed” event to the Integrated BMC. This causes the
Integrated BMC to light the System Fault LEDs to indicate memory performance
degradation and an assertion of the failed FBDIMM.
4. The BIOS lights the DIMM Fault LED for the faulty FBDIMM.
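The policy above can be sketched in a few lines of Python. This is only an illustration of the described sequence, not BIOS code: the names RankState and on_correctable_error and the action strings are hypothetical, and the sketch assumes the counter simply counts up to the threshold of ten. The text does not describe what happens after a spare fail-over, so the sketch does not model it.

```python
from dataclasses import dataclass

CORRECTABLE_THRESHOLD = 10  # threshold value stated in the text


@dataclass
class RankState:
    """Hypothetical per-rank bookkeeping; not an actual BIOS structure."""
    count: int = 0
    masked: bool = False


def on_correctable_error(rank: RankState, sparing_enabled: bool) -> list:
    """Return the actions taken for one correctable ECC error on a rank."""
    if rank.masked:
        # Once masked, correctable errors are no longer reported to the SEL.
        return []

    rank.count += 1
    actions = ["SEL: Correctable Error"]

    if rank.count == CORRECTABLE_THRESHOLD:
        if sparing_enabled:
            # Step 1 with sparing: the chipset fails over to the spare FBDIMM.
            actions.append("chipset: spare fail-over")
        else:
            # Step 1 otherwise: mask future correctable errors.
            rank.masked = True
            actions.append("mask future correctable errors")
        actions.append("SEL: Max Threshold Reached")          # step 2
        actions.append("BMC: DIMM Failed event")              # step 3
        actions.append("light DIMM Fault LED")                # step 4
    return actions
```

With sparing disabled, the first nine errors on a rank each yield only the “Correctable Error” entry, the tenth yields the full action list, and later errors yield nothing because they are masked.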
3.2.3.9.4 Multi-bit Correctable Error Counter Threshold
Due to the internal design of the chipset, the threshold value for correctable errors also
applies to multi-bit correctable errors. However, maintaining a tolerance level of ten for
multi-bit correctable errors is undesirable because these are critical errors. Therefore, the BIOS
programs the threshold for multi-bit correctable errors based on the following alternate logic:
Automatic retries on memory errors: The chipset automatically performs a retry of
memory reads for uncorrectable errors. If the retry results in good data, this is termed a
multi-bit correctable error. If the data is still bad, then it is an uncorrectable error.
Another type of memory error is a CRC error on the FBDIMM serial path. CRC errors are also
retried in a similar manner. The retry eliminates transient errors that occur on memory
packets transferred over the FBDIMM serial links between the chipset and the FBDIMMs.
Internal error reporting by the chipset: The chipset records the occurrence of
uncorrectable errors both at the time of the occurrence, and on the subsequent failure
on retry. Both errors are independently reported to the BIOS. The BIOS reports a
successful retry as “Correctable Memory Error” in the SEL regardless of whether the
originating error was a CRC error or an ECC error.
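The retry classification described above can be illustrated with a minimal Python sketch. The names ReadResult and classify_read are hypothetical, and read_is_good is an assumed stand-in for the chipset performing one memory read and reporting whether it completed without an uncorrectable ECC or CRC error. The sketch assumes exactly one retry.

```python
from enum import Enum
from typing import Callable


class ReadResult(Enum):
    OK = "ok"
    MULTI_BIT_CORRECTABLE = "multi-bit correctable"
    UNCORRECTABLE = "uncorrectable"


def classify_read(read_is_good: Callable[[], bool]) -> ReadResult:
    """Classify one memory read using a single retry.

    read_is_good is a hypothetical callable: each call models one read
    attempt and returns True when the data came back good.
    """
    if read_is_good():
        return ReadResult.OK
    # The first attempt failed, so the read is retried once.
    if read_is_good():
        # Good data on retry: termed a multi-bit correctable error.
        return ReadResult.MULTI_BIT_CORRECTABLE
    # Data is still bad after the retry: an uncorrectable error.
    return ReadResult.UNCORRECTABLE
```

For example, an attempt sequence of bad then good is classified as MULTI_BIT_CORRECTABLE, which the BIOS would report as “Correctable Memory Error” in the SEL.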
3.2.3.9.5 FBD Fatal Error Threshold
In addition to standard ECC errors, the BIOS monitors FBD protocol errors reported by the
chipset. FBD protocol errors cause degradation of system memory, so the BIOS does not
tolerate any of them. The BIOS maintains an internal software counter to handle FBD
errors. The threshold of this software counter is one. When the threshold is met, the error is
treated as an uncorrectable error and follows the same policy as outlined below.
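As a rough illustration, a software counter with a threshold of one could look like the Python sketch below. The class name FbdErrorCounter and its record method are hypothetical, and the uncorrectable error policy itself is not modeled.

```python
FBD_ERROR_THRESHOLD = 1  # threshold value stated in the text


class FbdErrorCounter:
    """Hypothetical software counter for FBD protocol errors."""

    def __init__(self) -> None:
        self.count = 0

    def record(self) -> bool:
        """Count one FBD protocol error.

        Returns True when the threshold is met, meaning the caller should
        treat the error as an uncorrectable error.
        """
        self.count += 1
        return self.count >= FBD_ERROR_THRESHOLD
```

Because the threshold is one, the first recorded FBD protocol error already returns True.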