Concept Guide

16 Memory Errors and Dell PowerEdge YX4X Server Memory RAS Features

o Benefit: Patrol scrub runs every four hours (instead of every 24). The increased frequency
reduces the accumulation of errors in areas of memory with low utilization, which are rarely
accessed and therefore not corrected by demand scrub.

It is also recommended that users keep their PowerEdge server firmware up to date, especially the
server BIOS. Even after products ship, PowerEdge server development continues to improve its RAS
algorithms and behaviors for an optimal customer experience. Keeping BIOS up to date also gives
users regular maintenance releases of their platform memory reference code.

FYI: Memory Reference Code (MRC) is the BIOS code that performs memory training,
configuration, and link optimization.

As an example, since the version 1.0 publication of this whitepaper, several new features have been

introduced in the latest versions of PowerEdge BIOS. See “What’s New in BIOS 2.8.2” for additional

details.

Recommended User Actions When Encountering Memory Errors

Reminder: The common memory errors and recommended response actions listed below
apply to customers running PowerEdge BIOS version 2.8.2 or higher. Customers with
earlier BIOS versions should refer to v1.0 of the RAS whitepaper.

The following is a list of the most common memory errors (as reported in the system event log) and the

recommended user response actions:

• MEM0001 – This indicates that the system has consumed an uncorrectable memory error
at the DIMM location specified in the event message. Depending on the OS error containment
(MCA Recovery) process, the server may see one of three possible outcomes:

1. Kernel panic

2. Application or VM termination

3. Application or VM recovery

o Recommended Response Action: Perform a cold reboot of the server if it has not
rebooted automatically. PowerEdge server BIOS will perform self-healing at the affected
DIMM location (note that BIOS may initiate additional reboots during this process). Do
not remove or swap the DIMM at the location specified in the event message. Look for
the system to report MEM0804 or MEM0805 for next steps. If neither event is reported,
repeat the cold reboot. If neither event is reported after the second attempt, contact
Dell technical support.
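
For readers who script their event handling, the MEM0001 response sequence above can be sketched as a small decision function. This is an illustrative sketch only, not a Dell tool or API: the function name, the Action enum, and the way reboots are counted are all assumptions made for the example. It assumes the caller already has the message IDs logged since the most recent cold reboot and knows how many cold reboot attempts have been made so far (an automatic reboot counts as the first attempt; additional reboots that BIOS initiates during self-healing are not counted).

```python
from enum import Enum


class Action(Enum):
    """Hypothetical labels for the next step; not Dell-defined values."""
    COLD_REBOOT = "perform a cold reboot"
    FOLLOW_EVENT = "follow the guidance for the reported MEM0804/MEM0805 event"
    CONTACT_SUPPORT = "contact Dell technical support"


# Follow-up events the guidance says to look for after the reboot.
FOLLOW_UP_EVENTS = {"MEM0804", "MEM0805"}


def next_action_after_mem0001(events_since_reboot, cold_reboot_attempts):
    """Return the next recommended step after a MEM0001 event.

    events_since_reboot: message IDs logged since the most recent cold reboot.
    cold_reboot_attempts: cold reboot attempts so far (assumed bookkeeping).
    """
    if cold_reboot_attempts == 0:
        # The server has not rebooted yet, so start the self-healing cycle.
        return Action.COLD_REBOOT
    if FOLLOW_UP_EVENTS & set(events_since_reboot):
        # Self-healing reported a result; that event determines next steps.
        return Action.FOLLOW_EVENT
    if cold_reboot_attempts == 1:
        # Neither event was reported after the first attempt: repeat once.
        return Action.COLD_REBOOT
    # Neither event was reported after the second attempt.
    return Action.CONTACT_SUPPORT
```

In every branch the DIMM stays in place; the sketch never recommends removing or swapping it, in line with the guidance above.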

• MEM0802 – This is an indication that the system is encountering correctable errors at the

specified DIMM location and would benefit from Dell memory self-healing.

o Recommended Response Action: Perform a cold reboot of the server at the earliest

convenience. PowerEdge server BIOS will perform self-healing at the affected DIMM