Advanced memory protection for HP ProLiant 300 series G4 servers

Introduction
Advanced Memory Protection (AMP) consists of memory features that provide increased tolerance of and
protection from memory failures. There are varying levels of AMP that are supported on ProLiant
servers, depending on the class of server. Refer to the product QuickSpecs for specific information on
the level of features supported on each ProLiant server.
AMP features include Advanced ECC, Online Spare Memory, Memory Mirroring and RAID.
Advanced ECC and Online Spare are supported on 300 series platforms. This whitepaper details
Advanced ECC and Online Spare support for the 300 series platforms: how these features are
enabled, the configuration rules for using them, which utilities can be used to monitor failures,
and how failures can be repaired.
Memory failures defined
Memory failures vary in severity and in their impact on the state of the server.
Memory errors can be classified as correctable or uncorrectable.
Correctable errors can be detected and corrected if the chipset and DIMM support this functionality.
Error detection and correction is implemented by storing data and ECC bits on the DIMM. By utilizing
the data and ECC bits, the system can detect memory errors and correct certain types of failures.
Correctable errors are generally single-bit errors. All ProLiant 300-series servers are capable of
detecting and correcting single-bit errors. In addition, ProLiant servers with Advanced ECC support
can detect and correct some multi-bit errors. HP’s Advanced ECC allows detection and correction of
multi-bit failures if all failed bits are contained within a single DRAM device on the DIMM.
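As an illustration of how stored check bits make single-bit detection and correction possible, the following minimal Python sketch uses a textbook Hamming(7,4) code. It is not the code used by ProLiant chipsets or by HP’s Advanced ECC, which operate on much wider memory words; the function names are hypothetical and exist only for this example.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7-bit codeword laid out as
    [p1, p2, d1, p3, d2, d3, d4] (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected_codeword, error_position).

    error_position is 0 when no error is detected; otherwise it is the
    1-based position of the single flipped bit, which is flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome names the failing position
    if position:
        c[position - 1] ^= 1
    return c, position


def hamming74_data(codeword):
    """Extract the 4 data bits from a 7-bit codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]


# Usage: store a word, flip one bit to simulate a memory error, then read it back.
stored = hamming74_encode([1, 0, 1, 1])
damaged = list(stored)
damaged[4] ^= 1  # single-bit error at position 5
repaired, position = hamming74_correct(damaged)
print(position, hamming74_data(repaired))  # prints: 5 [1, 0, 1, 1]
```

This simple code corrects exactly one flipped bit per codeword. Correcting several failed bits confined to one DRAM device, as described above for Advanced ECC, requires a stronger code than this sketch shows.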
Correctable errors can be classified as “hard” and “soft” errors. With a hard error, every access to
the memory location will return an error. A hard error typically indicates a problem with the DIMM.
With a soft error, the data and/or ECC bits on the DIMM are incorrect, but the error will not continue
to occur once the data and/or ECC bits on the DIMM have been corrected. Soft errors are typically
caused by cosmic rays. They are rare but expected occurrences.
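The distinction between hard and soft errors can be sketched with a toy model of a single memory bit. This is an illustration only, under simplifying assumptions: the `Cell` class, its `stuck_at` parameter, and the `classify` helper are hypothetical, and in a real system the expected value would come from ECC correction rather than being passed in by the caller.

```python
class Cell:
    """Toy model of one memory bit. A cell with stuck_at set models a
    hardware fault: it returns that value no matter what is written."""

    def __init__(self, value, stuck_at=None):
        self.stored = value
        self.stuck_at = stuck_at

    def write(self, value):
        self.stored = value

    def read(self):
        return self.stored if self.stuck_at is None else self.stuck_at


def classify(cell, expected):
    """Classify the state of a cell given the value it should hold."""
    if cell.read() == expected:
        return "no error"
    cell.write(expected)  # write the corrected data back
    # A soft error disappears after the write-back; a hard error persists.
    return "soft" if cell.read() == expected else "hard"


# A healthy cell whose contents were flipped once (soft error).
upset = Cell(1)
upset.stored = 0
print(classify(upset, 1))  # prints: soft
print(classify(upset, 1))  # prints: no error

# A cell that is stuck at 0 (hard error): every access returns the wrong value.
faulty = Cell(1, stuck_at=0)
print(classify(faulty, 1))  # prints: hard
```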
Although hard correctable memory errors are corrected by the system and will not result in system
downtime or data corruption, they indicate a problem with the hardware. Soft errors, by contrast,
do not indicate any issue with the hardware. For this reason, HP ProLiant servers track the rate of
correctable errors through correctable error thresholding, which allows the system to differentiate
between hard and soft errors. A soft error will not typically cause a DIMM to exceed HP’s correctable
error threshold, whereas a hard error typically will. As a result, the user is warned about hard
correctable errors but is not notified about soft errors, which do not indicate any issue with the
hardware. HP suggests that corrective action be taken if a DIMM is receiving correctable errors at a
rate higher than HP’s correctable error threshold rate. Even though a DIMM has exceeded the
correctable threshold, future errors will continue to be corrected. The system will not shut down or
crash due to additional correctable errors. However, a DIMM that is receiving correctable errors at a
high rate has a higher probability of receiving an uncorrectable error, which would result in a system
crash or shutdown for systems not configured for the Mirroring or RAID AMP modes.
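The idea of rate-based thresholding can be sketched as a per-DIMM sliding-window counter. This is a minimal illustration, not HP’s implementation: the source does not state the actual threshold or time window, so `max_errors`, `window_seconds`, and the `CorrectableErrorTracker` class are hypothetical names and values chosen for the example.

```python
from collections import defaultdict, deque


class CorrectableErrorTracker:
    """Track correctable errors per DIMM within a sliding time window."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors          # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window length
        self._events = defaultdict(deque)     # DIMM id -> recent timestamps

    def record(self, dimm, timestamp):
        """Record one correctable error on a DIMM.

        Returns True when the DIMM has more than max_errors errors inside
        the window ending at timestamp, meaning a warning is due.
        """
        events = self._events[dimm]
        events.append(timestamp)
        # Discard errors that have aged out of the window.
        while timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) > self.max_errors


tracker = CorrectableErrorTracker(max_errors=3, window_seconds=60)

# Isolated soft errors far apart in time never exceed the threshold.
print(tracker.record("DIMM 1", 0))      # prints: False
print(tracker.record("DIMM 1", 5000))   # prints: False

# A hard error recurs on every access, so errors pile up inside one window.
print([tracker.record("DIMM 2", t) for t in (0, 1, 2, 3)])
# prints: [False, False, False, True]
```

Counting within a window, rather than keeping a lifetime total, is what lets occasional soft errors pass without a warning while a repeating hard error is flagged quickly.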
The user is warned about a DIMM exceeding the correctable error threshold in multiple ways. The
system’s internal Health LED will indicate a caution condition. On most ProLiant 300-series servers, an
LED next to the DIMM exceeding the threshold will be illuminated. In addition, if the System
Management Driver and agents are loaded, a message will be logged to both the console and
Systems Insight Manager. Correctable memory errors can typically be isolated to the actual failed
DIMM.