Advanced memory protection for HP ProLiant 300 series G4 servers

Because the data and ECC bits are written to the DIMM on memory writes and checked on memory
reads, a single soft error can produce multiple correctable memory errors if the processor
repeatedly reads the affected memory location. A write to that location would clear the error.
Until such a write occurs, however, the data and ECC bits remain out of sync, and every read of
that location results in a correctable error. In this way, a single soft error could cause a DIMM to
exceed the correctable error threshold.
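
The effect described above can be illustrated with a small toy model. This is only a sketch under
stated assumptions: the Location class, the read_without_scrub function, and the THRESHOLD value are
hypothetical names and numbers chosen for illustration, and the model tracks the "good" value
directly instead of implementing a real ECC code.

```python
# Toy model: one memory location holding a latent single-bit soft error.
# THRESHOLD is a hypothetical value; the source does not state the real one.
THRESHOLD = 10


class Location:
    def __init__(self, value):
        self.good = value      # value that ECC is able to reconstruct
        self.stored = value    # bits actually held in the DRAM cells

    def flip_bit(self, bit):
        self.stored ^= 1 << bit    # soft error: one cell changes state

    def write(self, value):
        self.good = value
        self.stored = value        # a write refreshes data and ECC together


def read_without_scrub(loc, counter):
    """Return corrected data; memory itself is left unchanged."""
    if loc.stored != loc.good:
        counter["correctable"] += 1
    return loc.good


loc = Location(0b1010)
loc.flip_bit(0)                    # a single soft error
counter = {"correctable": 0}
for _ in range(THRESHOLD + 1):     # the processor keeps reading the location
    read_without_scrub(loc, counter)

threshold_exceeded = counter["correctable"] > THRESHOLD
```

In this model every read of the damaged location adds to the correctable error count, so one soft
error is enough to cross the threshold. Calling loc.write() stops the count from growing.
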
Memory scrubbing solves this problem. The ProLiant 300-series G4 platforms support two types of
memory scrubbing. Demand scrubbing is supported by the G4 systems and by previous generations.
Background scrubbing, also called patrol scrubbing, is supported for the first time on ProLiant
servers by the G4 systems.
Demand scrubbing solves the problem of obtaining multiple correctable errors due to a single soft
error, and thus the problem of potentially reporting a correctable threshold error due to soft errors.
Whenever the system detects a correctable error, the system will correct the data and pass the data to
the requester, whether that be the processor or a DMA capable device. With demand scrubbing, the
correct data and ECC check bits will also be written back to memory. In other words, when the system
detects a correctable error via the data and ECC bits, it writes back the proper data and ECC bits to
memory. Thus, subsequent reads of the same memory location will not result in a correctable error if
the error was a soft error. If the error was a hard error (something is physically wrong with the
DIMM), writing the correct data and ECC bits back to memory typically does not correct the problem,
and additional correctable errors occur on subsequent reads.
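
The difference between soft and hard errors under demand scrubbing can be sketched as follows. This
is an illustrative model, not the platform's implementation: the Cell class, the stuck_bit parameter
(a bit that always reads as 1, standing in for a hard fault), and read_with_demand_scrub are all
hypothetical names.

```python
class Cell:
    def __init__(self, value, stuck_bit=None):
        self.good = value
        self.stuck_bit = stuck_bit             # hard fault: this bit always reads as 1
        self.stored = self._apply_fault(value)

    def _apply_fault(self, value):
        if self.stuck_bit is None:
            return value
        return value | (1 << self.stuck_bit)

    def flip_bit(self, bit):
        self.stored ^= 1 << bit                # transient soft error

    def write(self, value):
        self.good = value
        self.stored = self._apply_fault(value)  # a hard fault survives the write


def read_with_demand_scrub(cell, stats):
    """Correct the data for the requester and write it back to memory."""
    if cell.stored != cell.good:
        stats["correctable"] += 1
        cell.write(cell.good)                  # demand scrub: rewrite data and ECC
    return cell.good


# Soft error: one correctable error, then clean reads.
soft = Cell(0b0100)
soft.flip_bit(0)
soft_stats = {"correctable": 0}
for _ in range(5):
    read_with_demand_scrub(soft, soft_stats)

# Hard error: the write-back cannot repair the cell, so every read reports.
hard = Cell(0b0100, stuck_bit=0)
hard_stats = {"correctable": 0}
for _ in range(5):
    read_with_demand_scrub(hard, hard_stats)
```

With five reads each, the soft-error cell reports a single correctable error while the hard-error
cell reports one per read, which is why only the latter should approach the correctable error
threshold.
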
Background scrubbing (also known as patrol scrubbing) is a very similar process. Rather than
reading the data and ECC bits, correcting them, and writing them back to memory only when a
correctable memory error occurs, the system continually reads and rewrites memory locations.
The system thus scrubs the entire contents of memory in an effort to correct soft errors before
a correctable error ever occurs. Even if a particular section of memory is not being
accessed by software or DMA-capable devices, background scrubbing will correct any soft errors that
exist in that memory. Background scrubbing occurs at a very slow rate and only when the memory bus
is available, so the memory accesses it generates do not affect normal system
operation or system performance. If background scrubbing detects an uncorrectable memory error, it
does not cause the system to crash or result in an NMI.
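
A patrol scrubber of this kind can be sketched as a loop that visits one address whenever the bus is
idle. The Memory and PatrolScrubber classes and the bus_idle flag below are hypothetical, and the
model makes a simplifying assumption that exactly one flipped bit per location is correctable and
more than one is not.

```python
class Memory:
    def __init__(self, size):
        self.good = [0] * size      # values ECC is able to reconstruct
        self.stored = [0] * size    # bits actually held in DRAM

    def flip_bit(self, addr, bit):
        self.stored[addr] ^= 1 << bit


def bits_wrong(mem, addr):
    """Number of bit positions that differ from the good value."""
    return bin(mem.good[addr] ^ mem.stored[addr]).count("1")


class PatrolScrubber:
    def __init__(self, mem):
        self.mem = mem
        self.next_addr = 0
        self.fixed = 0
        self.uncorrectable_seen = 0

    def step(self, bus_idle):
        if not bus_idle:
            return                   # never compete with normal traffic
        addr = self.next_addr
        wrong = bits_wrong(self.mem, addr)
        if wrong == 1:
            self.mem.stored[addr] = self.mem.good[addr]   # scrub the soft error
            self.fixed += 1
        elif wrong > 1:
            self.uncorrectable_seen += 1   # noted only; no crash or NMI here
        self.next_addr = (addr + 1) % len(self.mem.stored)


mem = Memory(8)
mem.flip_bit(5, 3)                   # soft error in a location nobody reads
scrubber = PatrolScrubber(mem)
for tick in range(16):
    scrubber.step(bus_idle=(tick % 2 == 0))   # bus is free every other tick
```

After one full pass the soft error at address 5 has been repaired without any software or DMA read
ever touching that location.
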
Background scrubbing serves two purposes. First, it reduces the chances of the system receiving a
correctable error on memory reads initiated by software or DMA-capable devices. While demand
scrubbing prevents multiple correctable errors due to a soft error, one correctable error will occur on
the initial memory read access. If a soft error occurs in memory (for example, if a cosmic ray
inverts a bit on a DRAM device), the background scrub may correct the error before any normal
memory read reaches the affected location. Second, and more importantly, background scrubbing
reduces the chances of an uncorrectable error occurring due to soft errors. Although rare, it is
possible for a portion of memory that is not accessed for a long time to have multiple bit
positions inverted by cosmic rays. For instance, if a bit in memory is inverted by a cosmic ray but
the memory is not read or written for a relatively long period of time, there is a window in which
an additional bit in the same memory location could also be inverted. In this case, multiple
bits in the memory location would be inverted, which could result in a system crash and
NMI when the memory is read (in Advanced ECC, the system crash would occur if both inverted bits
were not in the same DRAM device on the DIMM).
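
The window described above can be illustrated with a short timeline model. This is a sketch under
assumptions that are not in the source: BITS_PER_DEVICE is a hypothetical device width, the
classify rule simply restates the Advanced ECC condition above (flips confined to one DRAM device
are treated as correctable, flips spread across devices as uncorrectable), and time is measured in
arbitrary units.

```python
BITS_PER_DEVICE = 4   # hypothetical: each DRAM device supplies 4 bits of the word


def device_of(bit):
    return bit // BITS_PER_DEVICE


def classify(flipped_bits):
    """Classify a location by which DRAM devices hold inverted bits."""
    devices = {device_of(b) for b in flipped_bits}
    if not devices:
        return "clean"
    if len(devices) == 1:
        return "correctable"
    return "uncorrectable"


def outcome_on_read(flip_events, scrub_times, read_time):
    """flip_events: list of (time, bit); scrub_times: times a patrol scrub visits."""
    pending = set()
    events = [(t, 0, bit) for t, bit in flip_events]
    events += [(t, 1, None) for t in scrub_times]
    for t, kind, bit in sorted(events, key=lambda e: (e[0], e[1])):
        if t > read_time:
            break
        if kind == 0:
            pending.symmetric_difference_update({bit})   # a cosmic ray inverts a bit
        elif classify(pending) == "correctable":
            pending.clear()                              # the scrub repairs the location
    return classify(pending)


flips = [(0, 2), (50, 9)]            # two flips, in different devices, 50 units apart
without_scrub = outcome_on_read(flips, scrub_times=[], read_time=100)
with_scrub = outcome_on_read(flips, scrub_times=[25], read_time=100)
```

Here without_scrub evaluates to "uncorrectable" because both flips are still present at read time,
while with_scrub evaluates to "correctable" because the scrub at time 25 removes the first flip
before the second one arrives.
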