Concept Guide

4 Memory Errors and Dell PowerEdge YX4X Server Memory RAS Features

o As DRAM-based memory shrinks in geometry to grow in capacity, an increasing number

of correctable errors are expected to occur as a natural part of uniform scaling.

Additionally, due to various other DRAM scaling factors (e.g., decreasing cell

capacitance), there is an expected increase in the number of error-generating

phenomena such as Variable Retention Time (VRT) [1] and Random Telegraph Noise

(RTN) [2].

o Within the server industry, it is an increasingly accepted understanding shared by Dell

that some correctable errors per DIMM are unavoidable and do not inherently warrant

a memory module replacement. However, some server competitors will go as far as to

say that an indefinite number of correctable errors are acceptable – a belief that is not

shared by Dell Engineering. Instead, PowerEdge server firmware will intelligently

monitor the health of memory and recommend self-healing action or module

replacement based on a variety of factors including DIMM capacity, rates of correctable

errors, and effectiveness of available self-healing. The intent behind Dell’s proprietary

predictive failure algorithms is to proactively identify DIMMs that are most likely to

continue to degrade and potentially generate uncorrectable errors.

o Uncorrectable Errors (UCEs)

o Uncorrectable errors are multi-bit errors that could not be corrected by the server

platform. These can be caused by any combination of soft or hard errors, but typically

occur as a result of multiple hard errors.

o Not all multi-bit errors are uncorrectable. CPUs that support Advanced ECC can correct

some types of multi-bit errors, depending on the bit error pattern.

o An uncorrectable error can be classified as being:

▪ Detectable and consumed

▪ Detectable and unconsumed

▪ Silent and consumed

▪ Silent and unconsumed

o Consumed vs unconsumed refers to whether the data has been loaded into the CPU

execution path. Unconsumed errors are typically found during Memory Patrol Scrub but

may also be found during a CPU prefetch.

o Detectable vs silent refers to whether the CPU’s ECC scheme can detect the existence of

the error. Silent errors are exceptionally rare and require the problematic cache line to

meet a very specific bit error pattern to bypass the CPU’s ECC scheme.

▪ Unless otherwise specified, references to uncorrectable errors in this

whitepaper will refer to those classified as detectable.
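The two axes described above (detectable vs. silent, consumed vs. unconsumed) combine into four classes. The following is a minimal illustrative sketch of that taxonomy only. The names UceClass and classify_uce are hypothetical and chosen for this example; they do not represent Dell firmware, a CPU interface, or any operating system API.

```python
from enum import Enum


class UceClass(Enum):
    """The four uncorrectable-error classes listed in the text."""
    DETECTABLE_CONSUMED = "detectable and consumed"
    DETECTABLE_UNCONSUMED = "detectable and unconsumed"
    SILENT_CONSUMED = "silent and consumed"
    SILENT_UNCONSUMED = "silent and unconsumed"


def classify_uce(detected_by_ecc: bool, consumed: bool) -> UceClass:
    """Map the two axes onto one of the four classes.

    detected_by_ecc: whether the CPU's ECC scheme can detect the error.
    consumed: whether the data was loaded into the CPU execution path.
    Both parameter names are assumptions made for this sketch.
    """
    if detected_by_ecc:
        if consumed:
            return UceClass.DETECTABLE_CONSUMED
        return UceClass.DETECTABLE_UNCONSUMED
    if consumed:
        return UceClass.SILENT_CONSUMED
    return UceClass.SILENT_UNCONSUMED


# Example: an error found by Memory Patrol Scrub is detected by ECC
# but has not yet entered the CPU execution path.
patrol_scrub_hit = classify_uce(detected_by_ecc=True, consumed=False)
print(patrol_scrub_hit.value)
```

In this sketch, an error found during Memory Patrol Scrub maps to the detectable and unconsumed class, matching the description above.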

              Detectable                         Silent

Consumed      Poisoned upon detection, then      Depends on data usage. It can
              Machine Check Exception after      result in either incorrect
              consumption; outcome based on      application data or a system
              OS error containment               service outage