HP Insight Management Agents 9.20 Managing ProLiant Servers with Linux HOW TO Whitepaper
The Health Monitor does the following:
• Displays a message on the console stating the problem
• Makes an entry in the system health log
This server feature is configured via RBSU. On ProLiant servers that do not support AMP mirroring,
an uncorrectable (double bit) memory error causes the operating system to halt abruptly. Logging
of the error might not be possible if the error occurs in memory that the Health Monitor uses.
Automatic server recovery
You can configure Automatic Server Recovery (ASR) by using RBSU during the initial boot of the
server, by pressing the F9 key when prompted. This feature is implemented via a "heartbeat" timer
that continually counts down. The Health Monitor frequently reloads the counter to prevent it from
counting down to zero. If the ASR timer counts down to zero, it is assumed that the operating system
has locked up and the system automatically attempts to reboot. Events that can contribute to the
operating system locking up include the following:
• A peripheral device, such as a PCI adapter, generates numerous spurious interrupts when it
fails.
• A high-priority software application consumes all the available CPU cycles and does not allow
the operating system scheduler to run the ASR timer reset process.
• A software or kernel application consumes all available memory, including the virtual memory
space (for example, swap). This can cause the operating system scheduler to cease functioning.
• A critical operating system component, such as a file system, fails and causes the operating
system scheduler to cease functioning.
• Any event other than an ASR timeout generates a Non-Maskable Interrupt (NMI).
The ASR feature is a hardware-based timer. If a true hardware failure occurs, the Health Monitor
might not be called, but the server resets as if the power switch were pressed. The ProLiant ROM
code might log an event to the IML when the server reboots.
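
The heartbeat countdown described above can be modeled in a few lines of Python. This is only a toy simulation for illustration, not HP's implementation: the class name HeartbeatTimer, the simulate helper, and the reload value are all hypothetical, and real ASR runs in hardware rather than in software.

```python
class HeartbeatTimer:
    """Toy model of a countdown ("heartbeat") timer; not the real ASR hardware."""

    def __init__(self, reload_value):
        if reload_value <= 0:
            raise ValueError("reload_value must be positive")
        # Hypothetical tick count chosen for the example, not an actual ASR setting.
        self.reload_value = reload_value
        self.remaining = reload_value
        self.expired = False

    def reload(self):
        """Reset the countdown; stands in for the monitor reloading the counter."""
        if not self.expired:
            self.remaining = self.reload_value

    def tick(self):
        """Advance the countdown one step and report whether it has expired."""
        if not self.expired:
            self.remaining -= 1
            if self.remaining <= 0:
                # Real hardware would attempt the recovery reboot at this point.
                self.expired = True
        return self.expired


def simulate(timer, monitor_alive):
    """Run one tick per entry; return the step at which the timer expired, or None.

    Each entry in monitor_alive says whether the monitor got a chance to run
    (and therefore reload the timer) during that step.
    """
    for step, alive in enumerate(monitor_alive):
        if alive:
            timer.reload()
        if timer.tick():
            return step
    return None
```

For example, simulate(HeartbeatTimer(3), [True, True, False, False, False]) models a monitor that stops being scheduled after the second step; the timer then runs down without being reloaded and expires.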
The Health Monitor is notified of an ASR timeout through an NMI. If possible, the driver attempts
to perform the following actions:
• Display a message on the console stating the problem
• Make an entry in the IML
• Gracefully shut down the operating system to close the file systems
There is no guarantee that the operating system will gracefully shut down. This shutdown depends
on the type of error condition (software or hardware) and its severity. The Health Monitor logs a
series of messages when an ASR event occurs. The presence or absence of these messages can
provide some insight into the reason for the ASR event. The order of the messages is important,
because the ASR event is always a symptom of another error condition.
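
Because the ASR event is a symptom, the messages logged just before it are usually the interesting ones. The following Python sketch shows one way to pull that context out of a list of log lines. The function name messages_before and the marker argument are hypothetical; the actual wording of ASR messages is not given here, so the caller must supply a marker string that matches the system's real log output.

```python
from collections import deque


def messages_before(lines, marker, count=5):
    """Return up to `count` lines that precede the first line containing `marker`.

    `marker` is a caller-supplied substring identifying the ASR message; its
    real wording is an assumption to be checked against the actual logs.
    Returns an empty list if the marker never appears.
    """
    recent = deque(maxlen=count)  # keeps only the most recent `count` lines
    for line in lines:
        if marker in line:
            return list(recent)
        recent.append(line.rstrip("\n"))
    return []
```

The returned lines keep their original order, so the earliest message in the window comes first.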
Console messages
When events occur outside normal operations, the Health Monitor might display a console message
or log a message to the IML. Operational messages, such as fan failures or temperature violations,
are logged to the standard /var/log/messages file. Messages specific to device drivers (such
as NMI-type messages) can be viewed via dmesg if the system is not completely locked up.
The hp-health manpage documents how to interpret the messages that the Health Monitor produces.
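
As a rough illustration of checking both message sources, the Python sketch below reads /var/log/messages and the output of dmesg and keeps only lines that contain caller-supplied keywords. The helper names (matching_lines, read_syslog, read_dmesg) are hypothetical, and the keywords are up to the caller: the exact text of Health Monitor messages is not listed here, so consult the hp-health manpage before relying on any particular search term.

```python
import subprocess


def matching_lines(lines, keywords):
    """Yield lines containing any keyword (case-insensitive), in original order."""
    lowered = [k.lower() for k in keywords]
    for line in lines:
        text = line.rstrip("\n")
        if any(k in text.lower() for k in lowered):
            yield text


def read_syslog(path="/var/log/messages"):
    """Read the system log; this usually requires root privileges."""
    with open(path, errors="replace") as handle:
        return handle.readlines()


def read_dmesg():
    """Capture the kernel ring buffer by running the dmesg command."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout.splitlines()
```

A call such as list(matching_lines(read_dmesg(), ["nmi"])) would then list kernel messages mentioning an NMI, assuming that the system is still responsive enough to run dmesg.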
HP Integrated Management Logging Utility (hplog)
The HP ProLiant Integrated Management Logging utility (hplog) allows system administrators to
view IML pages. Commands are listed in Table 3: “hplog options”.