Advanced memory protection for HP ProLiant 300 series G4 servers


While correctable errors do not affect the normal operation of the system, uncorrectable memory errors immediately cause a system crash or shutdown unless the system is configured for Mirroring or RAID AMP mode. ProLiant 300-series servers detect uncorrectable errors but cannot correct them. ProLiant 500-series and 700-series platforms with Mirroring or RAID AMP support can protect against uncorrectable memory errors. Uncorrectable errors are always multi-bit memory errors. On systems with Advanced ECC support, multi-bit memory errors confined to a single DRAM device on the DIMM are correctable. However, if bits fail on different DRAM devices on a DIMM, the error is uncorrectable. When a system receives an uncorrectable error and is not in an AMP mode that protects against these errors, the system generates a non-maskable interrupt (NMI). The internal Health LED indicates a critical condition, and on most systems the LEDs next to the failed DIMMs illuminate. In addition, the error is logged if the Systems Management Driver is loaded. In certain cases (typically when the failed memory is in the first bank of memory), the NMI handler cannot run because the memory where it resides is corrupted. In these cases, the system typically hard locks without any further indication of the failure. Uncorrectable memory errors can typically be isolated only to a failed bank of DIMMs, not to an individual DIMM.

Protection from memory failures

HP supports six levels of protection from memory errors. This whitepaper focuses on the levels supported by the 300-series G4 class of servers. Each level of protection requires server support.

The base level of memory protection available is parity protection. All ProLiant 300-series platforms provide memory protection beyond that provided by parity. Parity can detect a single-bit error but cannot correct it. When a single-bit error occurs on a system with parity protection, the system generates a non-maskable interrupt (NMI) and hard locks. Thus, single-bit errors are uncorrectable errors on a system with parity protection. Parity mode offers no protection from memory failures of any kind, because it has no ability to correct them.
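
The detect-but-not-correct behavior of parity can be illustrated with a minimal Python sketch. It assumes even parity over a single 8-bit word; the function names and the word width are illustrative choices, not a description of the actual memory controller.

```python
def parity_bit(data: int) -> int:
    """Return the even-parity bit: 1 if the number of set bits is odd."""
    return bin(data).count("1") % 2


def check_parity(data: int, stored_parity: int) -> bool:
    """Return True when the data still agrees with its stored parity bit."""
    return parity_bit(data) == stored_parity


word = 0b10110010
p = parity_bit(word)        # parity is stored alongside the data on a write

flipped = word ^ (1 << 3)   # a single bit flips while the word is in memory
double = word ^ 0b11        # two bits flip while the word is in memory

print(check_parity(word, p))     # clean read passes the check
print(check_parity(flipped, p))  # single-bit error is detected
print(check_parity(double, p))   # two flips cancel out and go unnoticed
```

A failed check says only that some bit changed, not which one, so there is nothing the system can repair. This is why a single-bit error is already uncorrectable under parity.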

The next level of protection is Standard ECC. Standard ECC requires chipset and DIMM support and can detect and correct a single-bit error on a memory access. When a single-bit error occurs, the system detects the error, corrects the data, and continues to operate normally. With Standard ECC, multi-bit memory errors are detected but not corrected. Multi-bit errors are uncorrectable and result in an NMI and a system crash.
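
The single-error-correct, double-error-detect behavior described above can be sketched with a toy extended Hamming code. This is an illustration only: it protects 4 data bits with 4 check bits, whereas ECC memory protects much wider words, and the whitepaper does not specify the code used by the chipset. The `encode` and `decode` names are hypothetical.

```python
from __future__ import annotations


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits as an 8-bit codeword (index 0 is overall parity)."""
    code = [0] * 8
    # Data bits occupy the positions that are not powers of two.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Each check bit covers the positions whose index has that bit set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # The overall parity bit is what makes double errors detectable.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: list[int]) -> tuple[str, int | None]:
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i          # XOR of set positions is 0 for a clean word
    overall = sum(code) % 2        # 0 for a clean word
    fixed = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: treat as a single flip located by the syndrome
        # (syndrome 0 means the overall parity bit itself flipped).
        fixed[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data


codeword = encode(0b1011)

one_flip = list(codeword)
one_flip[5] ^= 1                   # single-bit error

two_flips = list(codeword)
two_flips[3] ^= 1                  # double-bit error
two_flips[6] ^= 1

print(decode(codeword))
print(decode(one_flip))
print(decode(two_flips))
```

In this toy code the guarantee covers at most two flipped bits per word. The "uncorrectable" result corresponds to the case in which the server raises an NMI.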

A more robust level of protection is provided by Advanced ECC, also known in the industry as “Chipkill.” Advanced ECC requires chipset and DIMM support and provides a higher level of protection than Standard ECC. Like Standard ECC, Advanced ECC detects and corrects single-bit errors. However, Advanced ECC also detects and corrects multi-bit errors if all failed bits are within a single DRAM device on the DIMM. An entire DRAM device on the DIMM can fail, and the system will continue to operate normally. If bits fail on multiple DRAM devices on the DIMM, Advanced ECC cannot correct the error, and the system generates an NMI and crashes.
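
The difference between the two ECC levels comes down to where the failed bits sit. The following Python sketch models only that decision rule, not the underlying code. It assumes, purely for illustration, that each DRAM device supplies 4 adjacent bits of the memory word; `BITS_PER_DEVICE`, `devices_hit`, `classify`, and the mode strings are hypothetical names.

```python
BITS_PER_DEVICE = 4  # assumption: each DRAM device supplies 4 adjacent bits


def devices_hit(failed_bits: set) -> set:
    """Map failed bit positions to the DRAM devices that contain them."""
    return {bit // BITS_PER_DEVICE for bit in failed_bits}


def classify(failed_bits: set, mode: str) -> str:
    """Classify an error under 'standard_ecc' or 'advanced_ecc'."""
    if mode not in ("standard_ecc", "advanced_ecc"):
        raise ValueError("unknown mode: " + mode)
    if not failed_bits:
        return "no error"
    if len(failed_bits) == 1:
        return "corrected"          # both modes correct single-bit errors
    if mode == "advanced_ecc" and len(devices_hit(failed_bits)) == 1:
        return "corrected"          # multi-bit, but confined to one device
    return "uncorrectable"          # the system would raise an NMI here


print(classify({5}, "standard_ecc"))           # single bit
print(classify({4, 5}, "standard_ecc"))        # two bits, one device
print(classify({4, 5, 6, 7}, "advanced_ecc"))  # whole device failed
print(classify({3, 4}, "advanced_ecc"))        # two bits, two devices
```

Under this model, a complete device failure is survivable with Advanced ECC, while two failed bits that straddle a device boundary are not.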

HP offers memory protection beyond the features listed above. ProLiant 300-series servers support Online Spare Mode. With Online Spare enabled, the system still takes advantage of Advanced ECC. In Online Spare Mode, one bank of memory is designated as the spare bank and is not counted toward total available system memory. If a DIMM in a particular bank of memory exceeds the correctable error threshold, that bank is taken offline and the spare bank is activated in its place. Once the original bank is deactivated, the system no longer uses the memory that exhibited the failure. After switching to the spare bank, the system continues to monitor correctable errors against the threshold and to log any failures. If an uncorrectable memory error occurs before or after the online spare switchover, the system generates an NMI and crashes. However, the memory