HP Insight Management Agents 8.40 Managing ProLiant Servers with Linux HOW TO Whitepaper

Makes an entry in the system health log and the operating system log
Shuts down the system (optionally) to avoid hardware damage. Use RBSU to control the shutdown
option.
If a secondary or redundant fan is present when a fan fails, the Health Monitor does the following:
Activates the redundant fan if not already running
Displays a message on the console stating the problem
Makes an entry in the system health log and the operating system log
1-1-1-3 Monitoring the system fault-tolerant power supply
If the server contains a redundant power supply, the power load is shared equally between the power
supplies. Check the status of the power supplies by running hplog -p. If a primary power supply fails, the
server automatically switches over to a backup power supply. The Health Monitor does the following:
Monitors the system for power failure and for physical presence of power supplies
Reports when the power supplies experience a change in shared power load
Displays a message on the console stating the problem
Makes an entry in the system health log and the operating system log
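The monitoring behavior listed above can be illustrated with a minimal sketch that compares two snapshots of power supply state and reports presence changes, failures, and changes in shared load. The PowerSupplyState type, the power_supply_events function, and the message wording are hypothetical names chosen for illustration; they are not part of the Health Monitor or of hplog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerSupplyState:
    """Hypothetical snapshot of one power supply bay (illustration only)."""
    present: bool
    failed: bool
    load_share_pct: int


def power_supply_events(previous, current):
    """Compare two snapshots (dict: bay number -> PowerSupplyState).

    Returns a list of human-readable event messages. Both snapshots are
    assumed to describe the same set of bays.
    """
    events = []
    for bay in sorted(current):
        old, new = previous[bay], current[bay]
        if old.present != new.present:
            action = "inserted" if new.present else "removed"
            events.append(f"Power supply {bay} {action}")
        if new.failed and not old.failed:
            events.append(f"Power supply {bay} failed")
        if old.load_share_pct != new.load_share_pct:
            events.append(
                f"Power supply {bay} load share changed from "
                f"{old.load_share_pct}% to {new.load_share_pct}%"
            )
    return events


# Example: supply 1 fails and supply 2 takes over the full load.
before = {1: PowerSupplyState(True, False, 50), 2: PowerSupplyState(True, False, 50)}
after = {1: PowerSupplyState(True, True, 0), 2: PowerSupplyState(True, False, 100)}
for message in power_supply_events(before, after):
    print(message)
```

In a real deployment each message would also be written to the system health log and the operating system log; the sketch only prints to the console.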
1-1-1-4 ECC memory monitoring and advanced memory protection
If a correctable ECC memory error occurs, the Health Monitor logs the error in the health log, including the
memory address causing the error. If too many errors occur at the same memory location, the driver
disables the ECC error interrupts to prevent flooding the console with warnings (the hardware
automatically corrects the ECC error).
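The throttling behavior described above can be sketched as a small model: each correctable error is logged with its address, and once one address has produced too many errors, interrupts are disabled so that later errors no longer reach the log. The EccErrorTracker class and the threshold of 3 are assumptions for illustration; the actual driver threshold is not stated in this document.

```python
from collections import Counter


class EccErrorTracker:
    """Toy model of correctable-ECC handling; not the actual driver logic."""

    def __init__(self, max_errors_per_address=3):  # hypothetical threshold
        self.max_errors_per_address = max_errors_per_address
        self.counts = Counter()
        self.interrupts_enabled = True
        self.health_log = []

    def on_correctable_error(self, address):
        """Handle one correctable error; return True if it was logged."""
        if not self.interrupts_enabled:
            # The hardware still corrects the error, but the driver is
            # no longer interrupted, so nothing is logged.
            return False
        self.counts[address] += 1
        self.health_log.append(f"Correctable ECC error at 0x{address:08x}")
        if self.counts[address] >= self.max_errors_per_address:
            self.interrupts_enabled = False
            self.health_log.append("ECC error interrupts disabled")
        return True


# Example: five errors at the same address; only the first three are logged.
tracker = EccErrorTracker(max_errors_per_address=3)
results = [tracker.on_correctable_error(0x1000) for _ in range(5)]
print(results)
print(tracker.health_log)
```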
On servers with AMP, the driver attempts to log an error if a memory board has been inserted, removed, or
incorrectly configured, and optionally if an Online Spare Switchover or Mirrored Memory engaged event
occurs.
The Health Monitor does the following:
Displays a message on the console stating the problem
Makes an entry in the system health log
This server feature is configured using RBSU. On ProLiant servers that do not support AMP mirroring, an
uncorrectable (double bit) memory error causes the operating system to halt abruptly. Logging of the error
might not be possible if the error occurs in memory used by the Health Monitor.
1-1-1-5 Automatic server recovery
Automatic Server Recovery (ASR) is configured using RBSU, which is available during the initial boot of
the server by pressing the F9 key when prompted. This feature is implemented using a "heartbeat" timer
that continually counts down. The Health Monitor frequently reloads the counter to prevent it from counting
down to zero. If the ASR timer counts down to zero, the operating system is assumed to have locked up and
the system automatically attempts to reboot. Events that can contribute to the operating system locking up include:
A peripheral device, such as a Peripheral Component Interconnect (PCI) adapter,
generates numerous spurious interrupts when it fails.
A high priority software application consumes all the available central processing unit (CPU) cycles and
does not allow the operating system scheduler to run the ASR timer reset process.
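The heartbeat mechanism can be illustrated with a short simulation: a countdown timer is reloaded periodically while the operating system is responsive, and a reboot is requested once reloads stop and the count reaches zero. The AsrTimer class, the simulate function, and the tick values are hypothetical; the real timer is implemented in hardware and its timeout is set through RBSU.

```python
class AsrTimer:
    """Toy model of the ASR countdown timer (illustration only)."""

    def __init__(self, timeout_ticks=10):  # hypothetical timeout
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.reboot_requested = False

    def reload(self):
        """Called by the health monitor while the OS is responsive."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance the countdown by one tick."""
        if self.reboot_requested:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.reboot_requested = True


def simulate(timer, total_ticks, hang_at=None, reload_every=3):
    """Run the timer; return the tick at which a reboot is requested, or None.

    hang_at is the tick at which the simulated OS stops scheduling the
    reload process (None means the OS never hangs).
    """
    for t in range(total_ticks):
        os_alive = hang_at is None or t < hang_at
        if os_alive and t % reload_every == 0:
            timer.reload()
        timer.tick()
        if timer.reboot_requested:
            return t
    return None


# A healthy system never triggers a reboot; a system that hangs at tick 20 does.
print(simulate(AsrTimer(10), 100))
print(simulate(AsrTimer(10), 100, hang_at=20))
```

Either of the events listed above has the same effect in this model as hang_at: the reload process stops running, and the timer expires a short time later.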