HP Insight Management Agents 8.40 Managing ProLiant Servers with Linux HOW TO Whitepaper

Appendix A – Error messages
Messages logged if an ASR event occurs are listed in Table 14.
Table 14: Error messages
Message Number Details
Message 1 NMI-Automatic Server Recovery timer expiration – Hour %d-%d/%d/%d
Description This message indicates that the Health Monitor detected an ASR timeout and is attempting
to gracefully shut down the Operating System. Absence of this message can indicate a
critical hardware failure (such as a non-correctable ECC error on a memory DIMM) or
some other severe event. This is the first of a series of messages displayed on the console.
This message is not logged to the IML and most likely does not appear in any system logs.
Recommended action Review all the messages logged to the IML to see if any previous errors have been
logged. For example, a corrected single-bit memory error might have been logged.
Message 2 ASR Lockup Detected: %s
Description This message indicates that the Health Monitor detected an ASR timeout and is attempting
to gracefully shut down the Operating System. Absence of this message can indicate a
critical hardware failure (such as a non-correctable ECC error on a memory DIMM) or
some other severe event. This is the first ASR message logged to the IML, if logging is
possible.
Recommended action Review all the messages logged to the IML to see if any previous errors have been
logged.
Message 3 casm: ASR performed a successful OS shutdown
Description This ASR message indicates that the Health Monitor detected an ASR timeout and has
gracefully shut down the operating system. Absence of this message can indicate a
critical hardware failure (such as a non-correctable ECC error on a memory DIMM), a
high-priority process consuming all the available CPU cycles (software failure), or a
device such as a storage or network controller flooding the system with interrupts. This is
the second ASR message logged to the IML, if logging is possible.
Recommended action This ASR message usually indicates a software error such as a high-priority process
consuming all the available CPU cycles. Linux tools such as “sar” (system activity report)
can be used in conjunction with the ASR facility to locate the process causing the
problem.
Message 4 ASR Detected by System ROM
Description This message indicates that the ProLiant Server ROM detected an ASR timeout. This
message is almost always present in the IML when an ASR timeout occurs. If this is the
only ASR message logged to the IML, this can indicate a hardware failure such as a
non-correctable ECC error on a memory DIMM. The ASR feature on a ProLiant server resets
the server when the timeout expires, with no software intervention required.
Recommended action If this is the only ASR message present, this usually indicates a hardware error (such as an
unrecoverable memory error). Try moving the server memory DIMMs to different slots to
see if more information can be logged. Review all earlier IML messages to see whether any
other component has indicated a failure, or whether temperatures might have exceeded
normal operating thresholds.
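
The reasoning in Table 14 about which ASR messages are present can be sketched as a small script. This is a minimal illustration, not an HP tool: it assumes the IML and console text have already been exported to plain lines by some other means, and the names MESSAGE_PATTERNS, find_asr_messages, and interpret are hypothetical. The patterns match only the fixed text of each message, so the %d and %s fields are not parsed.

```python
import re

# Fixed text of the four messages in Table 14. The printf-style fields
# (%d, %s) are left unparsed; only the constant part is matched.
MESSAGE_PATTERNS = {
    1: re.compile(r"NMI-Automatic Server Recovery timer expiration"),
    2: re.compile(r"ASR Lockup Detected:"),
    3: re.compile(r"casm: ASR performed a successful OS shutdown"),
    4: re.compile(r"ASR Detected by System ROM"),
}


def find_asr_messages(lines):
    """Return the set of Table 14 message numbers found in the given lines."""
    found = set()
    for line in lines:
        for number, pattern in MESSAGE_PATTERNS.items():
            if pattern.search(line):
                found.add(number)
    return found


def interpret(found):
    """Map the set of message numbers to the hint given in Table 14."""
    if not found:
        return "no ASR messages found"
    if found == {4}:
        # Only the ROM message is present: Table 14 suggests hardware.
        return "only the System ROM message: possible hardware failure"
    if 3 in found:
        # A graceful shutdown was logged: Table 14 suggests software.
        return "graceful OS shutdown: software error is the usual cause"
    return "ASR timeout detected: review earlier IML entries"


if __name__ == "__main__":
    # Hypothetical sample lines; a real export may carry extra fields.
    sample = [
        "ASR Lockup Detected: example detail",
        "casm: ASR performed a successful OS shutdown",
        "ASR Detected by System ROM",
    ]
    print(interpret(find_asr_messages(sample)))
```

The hints are deliberately coarse. They only restate the table's guidance and do not replace a review of the full IML.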