Concept Guide

16 Memory Errors and Dell PowerEdge YX4X Server Memory RAS Features
o Benefit: Patrol scrub runs every four hours (instead of every 24); the increased
frequency reduces the accumulation of errors in areas of memory that see low utilization
and are therefore not being corrected by demand scrub
Users should also keep their PowerEdge server firmware up to date, especially the server
BIOS. Even after products ship, PowerEdge server development continues to improve its RAS
algorithms and behaviors for an optimal customer experience. Keeping BIOS up to date also
gives users the regular maintenance releases of their platform memory reference code.
FYI: Memory Reference Code (MRC) is the BIOS code that performs memory training,
configuration, and link optimization.
As an example, since the version 1.0 publication of this whitepaper, several new features have been
introduced in the latest versions of PowerEdge BIOS. See “What’s New in BIOS 2.8.2” for additional
details.
Recommended User Actions When Encountering Memory Errors
Reminder: The common memory errors and recommended response actions
detailed below apply to customers running PowerEdge BIOS version 2.8.2 or
higher. Customers with earlier BIOS versions should refer to v1.0 of the RAS
whitepaper.
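The version cutoff in the reminder above can be checked with a simple comparison. The following is a minimal sketch, assuming the BIOS version is available as a dotted numeric string such as "2.8.2"; the function names are hypothetical and are not part of any Dell tool.

```python
from typing import Tuple

# Cutoff taken from the reminder above: guidance applies to BIOS 2.8.2 or higher.
MIN_BIOS_FOR_THIS_GUIDANCE = (2, 8, 2)


def parse_bios_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version string into a tuple of integers.

    Assumption: the version contains only digits separated by dots.
    """
    return tuple(int(part) for part in version.strip().split("."))


def guidance_applies(version: str) -> bool:
    """Return True if the response actions in this section apply to the given BIOS."""
    # Tuples compare element by element, so (2, 10, 0) sorts after (2, 8, 2),
    # which a plain string comparison would get wrong.
    return parse_bios_version(version) >= MIN_BIOS_FOR_THIS_GUIDANCE
```

Comparing integer tuples rather than raw strings avoids treating "2.10.0" as older than "2.8.2".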
The following is a list of the most common memory errors (as reported in the system event log) and the
recommended user response actions:
MEM0001 This is an indication that the system has consumed an uncorrectable memory error
at the DIMM location specified in the event message. Depending on the OS error containment
(MCA Recovery) process, the server may see one of three possible outcomes:
1. Kernel panic
2. Application or VM termination
3. Application or VM recovery
o Recommended Response Action: Perform a cold reboot of the server if it has not
rebooted automatically. PowerEdge server BIOS will perform self-healing at the affected
DIMM location (note that BIOS may initiate more reboots during this process). Do not
remove or swap the DIMM at the location specified in the event message. Look for the
system to report MEM0804 or MEM0805 for next steps. Repeat the cold reboot if neither
of these events is reported. Contact Dell technical support if neither event is
reported after a second attempt.
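The MEM0001 response action above can be read as a small decision procedure. The sketch below is illustrative only: the `Action` enum, the function name, and the idea of counting cold reboots and collecting event IDs seen in the system event log are assumptions made for this example, not a Dell interface.

```python
from enum import Enum
from typing import Iterable


class Action(Enum):
    COLD_REBOOT = "perform a cold reboot"
    REPEAT_COLD_REBOOT = "repeat the cold reboot"
    FOLLOW_UP_EVENT = "follow the guidance for MEM0804 / MEM0805"
    CONTACT_SUPPORT = "contact Dell technical support"


# Follow-up events named in the MEM0001 response action above.
FOLLOW_UP_EVENTS = {"MEM0804", "MEM0805"}


def mem0001_next_step(cold_reboots_done: int, events_seen: Iterable[str]) -> Action:
    """Suggest the next step after MEM0001 (hypothetical helper).

    cold_reboots_done: cold reboots completed since MEM0001 was logged,
                       counting an automatic reboot if one occurred.
    events_seen:       event IDs reported since MEM0001 was logged.
    """
    if FOLLOW_UP_EVENTS & set(events_seen):
        # Self-healing has reported a result; next steps depend on that event.
        return Action.FOLLOW_UP_EVENT
    if cold_reboots_done == 0:
        return Action.COLD_REBOOT
    if cold_reboots_done == 1:
        return Action.REPEAT_COLD_REBOOT
    # Neither event was reported after a second attempt.
    return Action.CONTACT_SUPPORT
```

For example, `mem0001_next_step(1, [])` suggests repeating the cold reboot, while `mem0001_next_step(2, [])` suggests contacting support. In every branch the DIMM stays in place, as the response action requires.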
MEM0802 This is an indication that the system is encountering correctable errors at the
specified DIMM location and would benefit from Dell memory self-healing.
o Recommended Response Action: Perform a cold reboot of the server at the earliest
convenience. PowerEdge server BIOS will perform self-healing at the affected DIMM