Advanced memory protection for HP ProLiant 300 series G4 servers

Because the data and ECC bits are written to the DIMM on memory writes and checked on memory

reads, a soft error could result in multiple correctable memory errors if the processor

continually read a memory location containing a soft memory error. If a write to that memory location

occurred, the error would disappear. However, once the soft error leaves the data and ECC bits

out of sync, every read of that memory location would result in a correctable error until a

write to that memory location occurred. As a result, a single soft error could cause a DIMM to exceed

the correctable error threshold.
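
The behavior described above can be illustrated with a minimal sketch. It is a toy model, not the actual memory controller logic: the `Dimm` class, its method names, and the `THRESHOLD` value are all hypothetical, and ECC is abstracted as "exactly one flipped bit is correctable."

```python
class Dimm:
    """Toy DIMM that tracks which stored bits are wrong at each address."""

    def __init__(self):
        self.bad_bits = {}          # address -> set of flipped bit positions
        self.correctable_count = 0  # correctable errors reported so far

    def inject_soft_error(self, addr, bit):
        self.bad_bits.setdefault(addr, set()).add(bit)

    def read(self, addr):
        # ECC corrects a single flipped bit on the fly but, with no
        # scrubbing, the stored copy in the DIMM stays wrong.
        if len(self.bad_bits.get(addr, ())) == 1:
            self.correctable_count += 1

    def write(self, addr):
        # A write stores fresh data and ECC bits, clearing the soft error.
        self.bad_bits.pop(addr, None)


THRESHOLD = 10  # hypothetical correctable-error threshold

dimm = Dimm()
dimm.inject_soft_error(addr=0x1000, bit=3)
for _ in range(25):
    dimm.read(0x1000)            # every read reports a correctable error
errors_before_write = dimm.correctable_count

dimm.write(0x1000)               # the write makes the error disappear
dimm.read(0x1000)
errors_after_write = dimm.correctable_count
threshold_exceeded = errors_before_write > THRESHOLD
```

In this sketch one soft error produces 25 correctable errors, which is enough to pass the assumed threshold even though nothing is wrong with the DIMM.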

Memory scrubbing is a method of solving this problem. There are two types of memory scrubbing

supported by the ProLiant 300-series G4 platforms. Both the G4 systems and previous generations

support demand scrubbing. The G4 systems are the first ProLiant servers to

add support for background scrubbing, also known as patrol scrubbing.

Demand scrubbing solves the problem of obtaining multiple correctable errors due to a single soft

error, and thus the problem of potentially reporting a correctable threshold error due to soft errors.

Whenever the system detects a correctable error, the system will correct the data and pass the data to

the requester, whether that is the processor or a DMA-capable device. With demand scrubbing, the

correct data and ECC check bits will also be written back to memory. In other words, when the system

detects a correctable error via the data and ECC bits, it writes back the proper data and ECC bits to

memory. Thus, subsequent reads of the same memory location will not result in a correctable error if

the error was simply a soft error. If there is a hard error, meaning something is actually wrong with the

DIMM, writing the correct data and ECC bits back to memory typically does not correct the problem,

and additional correctable errors will occur on subsequent reads.
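
As a rough illustration of the difference between soft and hard errors under demand scrubbing, consider the following sketch. The `flipped` and `stuck` flags and both function names are assumptions made for this example; real hardware sees only the data and ECC bits.

```python
def read_with_demand_scrub(cell):
    """Read one location; return True if a correctable error was reported.

    `cell` is a dict with two hypothetical flags:
      'flipped' - the stored bit currently differs from what was written
      'stuck'   - the DRAM cell is physically faulty (a hard error)
    """
    error = cell["flipped"]
    if error:
        # Demand scrub: write the corrected data and ECC bits back.
        # A healthy cell keeps the corrected value; a stuck cell does not.
        cell["flipped"] = cell["stuck"]
    return error


def count_errors(cell, reads):
    """Count correctable errors reported over a number of reads."""
    return sum(read_with_demand_scrub(cell) for _ in range(reads))


soft = {"flipped": True, "stuck": False}
hard = {"flipped": True, "stuck": True}

soft_errors = count_errors(soft, 5)  # only the first read reports an error
hard_errors = count_errors(hard, 5)  # every read reports an error
```

Under these assumptions, the soft error is reported once and then scrubbed away, while the hard error continues to count toward the correctable error threshold.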

Background scrubbing (also known as patrol scrubbing) is a very similar process. Instead of only

reading the data and ECC bits, correcting them, and writing them back to memory when a

correctable memory error occurs, the system will constantly be reading and writing memory locations.

Thus, the system will be constantly scrubbing all of the contents of memory in an effort to correct soft

errors before a correctable error even occurs. Even if a particular section of memory is not being

accessed by software or DMA-capable devices, background scrubbing will correct any soft errors that

exist in the memory. Background scrubbing occurs at a very slow rate and only when the memory bus

is available. Thus, the memory accesses due to background scrubbing do not affect normal system

operation or system performance. If background scrubbing detects an uncorrectable memory error, it

does not cause the system to crash or result in an NMI.
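
A single patrol pass can be sketched as follows. This is a simplified model under stated assumptions: the `patrol_scrub` function and the `bad_bits` bookkeeping are hypothetical, a single flipped bit is treated as correctable and two or more as uncorrectable, and the slow pacing and bus-idle checks described above are left out.

```python
def patrol_scrub(bad_bits, num_addresses):
    """Run one full patrol pass over a toy memory.

    `bad_bits` maps address -> set of flipped bit positions (hypothetical
    bookkeeping for this sketch). Returns two lists of addresses:
    those that were corrected and those found to be uncorrectable.
    """
    corrected, uncorrectable = [], []
    for addr in range(num_addresses):
        flips = bad_bits.get(addr, set())
        if len(flips) == 1:
            del bad_bits[addr]          # write back corrected data and ECC
            corrected.append(addr)
        elif len(flips) > 1:
            uncorrectable.append(addr)  # noted only: no crash, no NMI
    return corrected, uncorrectable


memory_errors = {2: {5}, 6: {0, 7}}
corrected, uncorrectable = patrol_scrub(memory_errors, num_addresses=8)
```

Here address 2 is repaired before any software read touches it, and address 6 is recorded as uncorrectable without raising an exception, mirroring the no-crash behavior described above.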

Background scrubbing serves two purposes. First, it reduces the chances of the system receiving a

correctable error on memory reads initiated by software or DMA-capable devices. While demand

scrubbing prevents multiple correctable errors due to a soft error, one correctable error will occur on

the initial memory read access. If a soft error occurs in memory (for example, if a cosmic ray inverts a bit on a

DRAM device), the background scrub may correct the error before any normal memory read occurs to

the memory location that had been affected. Second and more importantly, background scrubbing

reduces the chances of an uncorrectable error occurring due to a soft error. Although rare, it is

possible that a portion of memory that is not being accessed for a long time could have multiple bit

positions inverted by cosmic rays. For instance, if a bit in memory is inverted by a cosmic ray, but the

memory is never read or written for a relatively long period of time, this would leave a window where

an additional bit in the same memory location could be inverted by cosmic rays. In this case, multiple

bits in the memory location could be inverted, which could potentially result in a system crash and

NMI when the memory is read (with Advanced ECC, the system crash would occur only if the inverted bits

were not all in the same DRAM device on the DIMM).
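
The exposure window described above can be illustrated with a small, deterministic sketch. The function name, the event format, and the time units are assumptions for this example. It models plain single-bit correction, so it ignores the Advanced ECC case in which multiple inverted bits in one DRAM device remain correctable.

```python
def uncorrectable_locations(flip_events, scrub_interval=None):
    """Return addresses that hold two or more inverted bits at once.

    flip_events: list of (time, address, bit) tuples sorted by time.
    scrub_interval: time between patrol passes, or None for no scrubbing.
    """
    pending = {}   # address -> set of currently inverted bit positions
    failed = set()
    next_scrub = scrub_interval
    for time, addr, bit in flip_events:
        # Apply every patrol pass that completed before this bit flip.
        while scrub_interval is not None and time >= next_scrub:
            # A pass repairs single-bit errors; multi-bit errors remain.
            pending = {a: b for a, b in pending.items() if len(b) > 1}
            next_scrub += scrub_interval
        pending.setdefault(addr, set()).add(bit)
        if len(pending[addr]) > 1:
            failed.add(addr)
    return failed


# Two cosmic-ray hits on the same, rarely accessed location.
events = [(10, 0x40, 3), (500, 0x40, 9)]

without_scrub = uncorrectable_locations(events)
with_scrub = uncorrectable_locations(events, scrub_interval=100)
```

Without scrubbing, the second hit lands on a location that still holds the first inverted bit. With a patrol pass every 100 time units, the first bit is repaired long before the second hit, so the location never holds more than one inverted bit.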