Concept Guide
4 Memory Errors and Dell PowerEdge YX4X Server Memory RAS Features
o As DRAM-based memory shrinks in geometry to grow in capacity, an increasing number
of correctable errors is expected to occur as a natural part of uniform scaling.
Additionally, due to various other DRAM scaling factors (e.g., decreasing cell
capacitance), an increase is expected in error-generating phenomena such as
Variable Retention Time (VRT) [1] and Random Telegraph Noise (RTN) [2].
o Within the server industry, it is an increasingly accepted understanding, shared by Dell,
that some correctable errors per DIMM are unavoidable and do not inherently warrant
a memory module replacement. However, some server competitors go as far as to
say that an unlimited number of correctable errors is acceptable, a belief that is not
shared by Dell Engineering. Instead, PowerEdge server firmware intelligently
monitors the health of memory and recommends a self-healing action or module
replacement based on a variety of factors, including DIMM capacity, rates of correctable
errors, and effectiveness of available self-healing. The intent behind Dell’s proprietary
predictive failure algorithms is to proactively identify the DIMMs that are most likely to
continue to degrade and potentially generate uncorrectable errors.
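As a purely illustrative sketch of the kind of policy described above, the following Python models a per-DIMM recommendation based on capacity, correctable-error rate, and whether self-healing has already been tried. This is not Dell's proprietary algorithm, whose details are not given here. Every name, the threshold value, the capacity scaling rule, and the single-attempt limit are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Recommendation(Enum):
    NO_ACTION = "no action"
    SELF_HEAL = "attempt self-healing"
    REPLACE = "replace module"


@dataclass
class DimmHealth:
    capacity_gib: int        # DIMM capacity in GiB
    ce_per_day: float        # observed correctable-error rate
    self_heal_attempts: int  # self-healing actions already performed


def recommend(d: DimmHealth,
              base_threshold: float = 10.0,
              max_attempts: int = 1) -> Recommendation:
    """Hypothetical policy; the numbers are illustrative, not vendor values."""
    # Assumption: the tolerated error rate scales with capacity,
    # normalized to a 16 GiB module.
    threshold = base_threshold * (d.capacity_gib / 16)
    if d.ce_per_day <= threshold:
        return Recommendation.NO_ACTION
    # Rate is too high: try self-healing first if attempts remain.
    if d.self_heal_attempts < max_attempts:
        return Recommendation.SELF_HEAL
    # Self-healing was tried and the rate is still high, so treat it
    # as ineffective and recommend replacement.
    return Recommendation.REPLACE
```

For example, under these assumed values a 16 GiB DIMM logging 50 correctable errors per day would first be recommended for self-healing, and for replacement only if the rate stayed high afterwards.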
o Uncorrectable Errors (UCEs)
o Uncorrectable errors are multi-bit errors that cannot be corrected by the server
platform. They can be caused by any combination of soft and hard errors, but typically
occur as a result of multiple hard errors.
o Not all multi-bit errors are uncorrectable. CPUs that support Advanced ECC can correct
some types of multi-bit errors, depending on the bit error pattern.
o An uncorrectable error can be classified as being:
▪ Detectable and consumed
▪ Detectable and unconsumed
▪ Silent and consumed
▪ Silent and unconsumed
o Consumed vs unconsumed refers to whether the data has been loaded into the CPU
execution path. Unconsumed errors are typically found during Memory Patrol Scrub but
may also be found during a CPU prefetch.
o Detectable vs silent refers to whether the CPU’s ECC scheme can detect the existence of
the error. Silent errors are exceptionally rare and require the problematic cache line to
match a very specific bit error pattern that bypasses the CPU’s ECC scheme.
▪ Unless otherwise specified, references to uncorrectable errors in this
whitepaper will refer to those classified as detectable.
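The two independent axes above (detectable vs silent, consumed vs unconsumed) can be modeled as a pair of booleans. The Python below is a minimal sketch of that classification. The class and function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncorrectableError:
    detectable: bool  # can the CPU's ECC scheme detect the error?
    consumed: bool    # has the data been loaded into the CPU execution path?

    def label(self) -> str:
        d = "Detectable" if self.detectable else "Silent"
        c = "consumed" if self.consumed else "unconsumed"
        return f"{d} and {c}"


def all_classes() -> list[UncorrectableError]:
    """Enumerate the four combinations in the order listed in the text."""
    return [UncorrectableError(d, c)
            for d in (True, False)
            for c in (True, False)]


for err in all_classes():
    print(err.label())
```

Treating the two properties as separate fields, rather than one four-valued enum, mirrors the text: each axis is defined independently of the other.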
Error class   Detectable                               Silent
Consumed      Poisoned upon detection, then Machine    Depends on data usage. It can result in
              Check Exception after consumption;       either incorrect application data or
              outcome based on OS error containment    system service outage