Advanced memory protection for HP ProLiant 300 series G4 servers

While correctable errors do not affect the normal operation of the system, an uncorrectable memory
error immediately results in a system crash or shutdown unless the system is configured for
Mirroring or RAID AMP mode. ProLiant 300-series servers detect uncorrectable errors but cannot
correct them. ProLiant 500-series and 700-series platforms with Mirroring or RAID AMP support
can protect against uncorrectable memory errors. Uncorrectable errors are always multi-bit
memory errors, but not every multi-bit error is uncorrectable. On systems with Advanced ECC
support, a multi-bit error confined to a single DRAM device on the DIMM is correctable. However,
if bits fail on different DRAM devices on a DIMM, the error is uncorrectable.
When a system receives an uncorrectable error and is not in an AMP mode that protects against
such errors, the system generates an NMI. The internal Health LED indicates a critical condition,
and on most systems, the LEDs next to the failed DIMMs are illuminated. In addition, the error is
logged if the Systems Management Driver is loaded. In certain cases (typically when the failed
memory is in the first Bank of memory), the NMI handler cannot run because the memory where it
resides is corrupted. In these cases, the system typically hard locks without any further
indication of the failure. Uncorrectable memory errors can typically be isolated only to a failed
Bank of DIMMs, not to an individual DIMM.
Protection from memory failures
HP supports six levels of protection from memory errors. This whitepaper focuses on the levels
supported by the 300-series G4 class of servers. Each level of protection requires server
support.
The base level of memory protection available is parity protection. All ProLiant 300-series platforms
provide memory protection beyond that provided by parity. Parity can detect when a single-bit error
occurs, but cannot correct it. When a single-bit error occurs on a system with parity protection, the
system generates a non-maskable interrupt (NMI) and hard locks. Thus, single-bit errors are
uncorrectable errors on a system with parity protection. In parity mode, the system is not protected
from memory failures of any kind, because it has no ability to correct them.
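The parity behavior described above can be illustrated with a minimal Python sketch. It assumes a
single even-parity bit over an 8-bit word; the word width and the function names are illustrative
choices, not details of any ProLiant memory controller. The check yields only a yes/no answer, so
an error can be flagged but the failed bit cannot be located or repaired.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def check(word: int, stored_parity: int) -> bool:
    """Return True if the word is consistent with its stored parity bit."""
    return parity_bit(word) == stored_parity


# Hypothetical 8-bit data word and the parity bit stored alongside it.
data = 0b10110010
p = parity_bit(data)

# A single flipped bit changes the parity, so the error is detected.
flipped_one = data ^ (1 << 3)
single_detected = not check(flipped_one, p)   # True

# Two flipped bits cancel out in the parity sum and go unnoticed.
flipped_two = data ^ (1 << 3) ^ (1 << 5)
double_detected = not check(flipped_two, p)   # False

print("single-bit error detected:", single_detected)
print("double-bit error detected:", double_detected)
```

Because the only information available is a mismatch, the system has nothing to correct with,
which is why a parity-only system treats a single-bit error as uncorrectable.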
The next level of protection is Standard ECC. Standard ECC requires chipset and DIMM-level support
and provides the capability to detect and correct a single-bit error on a memory access. When a
single-bit error occurs, the system will detect the error and correct the data. Thus, the system will
continue to operate normally. With Standard ECC, all multi-bit memory errors will be detected, but not
corrected. Multi-bit errors are uncorrectable and will result in a system crash and NMI.
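The correct-one, detect-more behavior of Standard ECC can be sketched with a toy code. The example
below assumes an extended Hamming (8,4) code protecting a 4-bit value; this is an illustrative
stand-in, not the code used by any HP chipset, and real memory controllers typically use much
wider code words. The function names are hypothetical. Note that this toy code guarantees
detection only for double-bit errors.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode a 4-bit value into 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d      # data bits
    code[1] = code[3] ^ code[5] ^ code[7]       # check bit covering positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]       # check bit covering positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]       # check bit covering positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2                 # overall parity over positions 1..7
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, value); value is None when the error cannot be corrected."""
    c = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos                     # XOR of the positions holding a 1
    overall_mismatch = sum(c) % 2               # 1 if an odd number of bits flipped
    if syndrome == 0 and overall_mismatch == 0:
        status = "ok"
    elif overall_mismatch == 1:
        c[syndrome] ^= 1                        # single-bit error: syndrome names the bad bit
        status = "corrected"
    else:
        return "uncorrectable", None            # even number of flips: detected, not fixable
    value = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, value


word = encode(0b1011)

one_bad = list(word)
one_bad[6] ^= 1                                 # inject a single-bit error
print(decode(one_bad))

two_bad = list(word)
two_bad[2] ^= 1
two_bad[6] ^= 1                                 # inject a double-bit error
print(decode(two_bad))
```

In this sketch the "uncorrectable" result corresponds to the case in which a real system would
raise an NMI rather than return data.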
A more robust level of protection is provided by Advanced ECC, also known in the industry as
“Chipkill.” Advanced ECC requires chipset and DIMM support and provides a higher level of
protection than Standard ECC. Like Standard ECC, Advanced ECC will detect and correct single-bit
errors. However, Advanced ECC will also detect and correct multi-bit errors if all failed bits are within
a single DRAM device on the DIMM. An entire DRAM device on the DIMM can fail, and the
system will continue to operate normally. If bits fail on more than one DRAM device on the DIMM,
the error cannot be corrected with Advanced ECC support, and the system will crash and NMI.
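The Advanced ECC rule above depends only on how many DRAM devices the failed bits fall in. The
following sketch expresses that rule; it assumes x4 DRAM devices and a simple layout in which
consecutive groups of four bits belong to the same device. Both the device width and the bit-to-device
mapping are hypothetical, chosen only to make the rule concrete.

```python
from typing import Iterable

# Assumption for illustration: each DRAM device contributes 4 adjacent bits.
BITS_PER_DEVICE = 4


def device_of(bit_index: int) -> int:
    """Map a failed bit position to the (hypothetical) DRAM device that supplies it."""
    return bit_index // BITS_PER_DEVICE


def classify(failed_bits: Iterable[int]) -> str:
    """Classify an error by the number of distinct DRAM devices that contain failed bits."""
    devices = {device_of(b) for b in failed_bits}
    if not devices:
        return "no error"
    if len(devices) == 1:
        return "correctable"       # single-bit error, or multi-bit within one device
    return "uncorrectable"         # failed bits span more than one device


print(classify([5]))               # one bit in device 1
print(classify([4, 5, 6, 7]))      # an entire device has failed
print(classify([3, 4]))            # adjacent bits, but in devices 0 and 1
```

The last case shows why bit count alone is not the deciding factor: two failed bits in different
devices are uncorrectable, while four failed bits in one device are not.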
HP offers memory protection beyond those features listed above. ProLiant 300-series servers support
Online Spare Mode. With Online Spare enabled, the system still takes advantage of Advanced ECC.
In Online Spare Mode, one bank of memory is designated as the spare bank and does not count
toward total available system memory. If a DIMM in a particular bank of memory exceeds the
correctable error threshold, that bank is taken offline and the spare bank is activated in its
place. Once the original bank is deactivated, the system no longer uses the memory that exhibited
the failure. After switching to the spare bank of memory, the system continues to monitor
correctable errors against the threshold and log any failures. If an uncorrectable memory error occurs
before or after the online spare switchover, the system will crash and NMI. However, the memory