Managing ProLiant servers with Linux HOWTO

Appendix A – Error messages
Table 12 lists the messages that are displayed or logged when an Automatic Server Recovery (ASR) event occurs.
Table 12. Error messages
Message number    Details
Message 1 NMI - Automatic Server Recovery timer expiration - Hour %d - %d/%d/%d
Description This message indicates that the Health Monitor detected an ASR timeout and is
attempting to gracefully shut down the operating system. Absence of this message can
indicate a critical hardware failure (such as a non-correctable ECC error on a memory
DIMM) or some other severe event. This is the first of a series of messages displayed to
the console. This message is not logged to the IML and most likely will not be listed in
any system logs.
Recommended action Review all the messages logged to the IML to see if any previous errors have been
logged (for example, a corrected single-bit memory error might have been logged).
Message 2 ASR Lockup Detected: %s
Description This message indicates that the Health Monitor detected an ASR timeout and is
attempting to gracefully shut down the operating system. Absence of this ASR message
can indicate a critical hardware failure (such as a non-correctable ECC error on a
memory DIMM) or some other severe event. This is the first ASR message logged to the
IML (if logging is possible).
Recommended action Review all the messages logged to the IML to see if any previous errors have been
logged.
Message 3 casm: ASR performed a successful OS shutdown
Description This ASR message indicates that the Health Monitor detected an ASR timeout and has
gracefully shut down the operating system. Absence of this message can indicate a
hardware failure (such as a non-correctable ECC error on a memory DIMM), a high-priority
process consuming all the available CPU cycles (software failure), or a device,
such as a storage or network controller, flooding the system with interrupts. This is the
second ASR message logged to the IML if logging is possible.
Recommended action This ASR message usually indicates a software error, such as a high-priority process
consuming all the available CPU cycles. Linux tools such as sar (System Activity Reporter)
can be used in conjunction with the ASR facility to locate the process causing the
problem.
Message 4 ASR Detected by System ROM
Description This message indicates that the ProLiant Server ROM detected an ASR timeout. This
message is almost always present in the IML when an ASR timeout occurs. If this is the
only ASR message logged to the IML, it can indicate a hardware failure (such as a
non-correctable ECC error on a memory DIMM). The ASR feature on a ProLiant server
resets the server when the timeout expires, with no software intervention required.
Recommended action If this is the only ASR message present, it usually indicates a hardware error (such as
an unrecoverable memory error). Try moving the server memory DIMMs to different
slots to see if more information can be logged. Review all earlier IML messages to see
whether any other component has indicated a failure or whether temperatures might
have exceeded normal operating thresholds.
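
The guidance in Table 12 can be applied mechanically to captured log text. The following Python sketch is illustrative only and is not part of any HP tool: it assumes the console output or IML entries have already been saved as plain text lines (how they are obtained is left open), and the names ASR_PATTERNS, classify_line, messages_seen, and interpret are hypothetical. The patterns are derived from the message templates in the table, with %d treated as a number and %s as arbitrary text. The returned hints only restate the table and are not a diagnosis.

```python
import re

# Patterns derived from the Table 12 templates (%d -> digits, %s -> any text).
ASR_PATTERNS = {
    1: re.compile(r"NMI - Automatic Server Recovery timer expiration - Hour \d+ - \d+/\d+/\d+"),
    2: re.compile(r"ASR Lockup Detected: .*"),
    3: re.compile(r"casm: ASR performed a successful OS shutdown"),
    4: re.compile(r"ASR Detected by System ROM"),
}


def classify_line(line):
    """Return the Table 12 message number found in the line, or None."""
    for number, pattern in ASR_PATTERNS.items():
        if pattern.search(line):
            return number
    return None


def messages_seen(lines):
    """Return the set of Table 12 message numbers found in the given lines."""
    return {n for n in map(classify_line, lines) if n is not None}


def interpret(seen):
    """Return a rough hint that restates the table's guidance."""
    if not seen:
        return "no ASR messages found"
    # Message 1 is displayed on the console only, so ignore it when
    # reasoning about what was logged to the IML.
    iml = seen - {1}
    if iml == {4}:
        return "only the System ROM message: possible hardware failure"
    if 3 in iml:
        return "OS shutdown completed: a software error is the usual cause"
    return "ASR timeout detected: review earlier IML entries"


if __name__ == "__main__":
    sample = [
        "ASR Lockup Detected: example text",
        "casm: ASR performed a successful OS shutdown",
        "ASR Detected by System ROM",
    ]
    print(interpret(messages_seen(sample)))
```

When Message 3 is among the matches, the table points toward a software cause, and tools such as sar are the suggested next step. When Message 4 is the only match, the table points toward hardware.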