Managing ProLiant servers with Linux HOWTO

1-1-1-2 System fan monitoring
A ProLiant server can contain fan sensors. On ProLiant servers with intelligent fan sensors, check the
status of the fans by running hplog -f.
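A script can collect the same status text that the command prints. The following is a minimal sketch, assuming the hplog utility is installed and on the PATH. The helper name run_hplog and its executable parameter are illustrative and not part of any HP tool. The output is returned unparsed because no particular layout is assumed.

```python
import shutil
import subprocess


def run_hplog(flag, executable="hplog"):
    """Run hplog with a single flag and return its text output.

    Returns None when the executable cannot be found. The function name
    and the 'executable' parameter are hypothetical, for illustration only.
    """
    path = shutil.which(executable)
    if path is None:
        # The utility is not installed or is not on PATH.
        return None
    result = subprocess.run(
        [path, flag],
        capture_output=True,
        text=True,
        check=False,
    )
    # Return the raw text; no output format is assumed or parsed here.
    return result.stdout


if __name__ == "__main__":
    fans = run_hplog("-f")
    print(fans if fans is not None else "hplog not found on this system")
```

The same helper could be called with "-p" for the power supply status described in the power supply section.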
If a cooling fan fails and there is no secondary redundant fan, the Health Monitor does the following:
• Displays a message on the console stating the problem
• Makes an entry in the system health log and the operating system log
• Optionally shuts down the system to avoid hardware damage
Use the ROM-Based Setup Utility (RBSU) to enable or disable the shutdown option.
If a secondary or redundant fan is present when a fan fails, the Health Monitor does the following:
• Activates the redundant fan if not already running
• Displays a message on the console stating the problem
• Makes an entry in the system health log and the operating system log
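The two fan failure cases above can be summarized as a small decision function. This is only an illustrative sketch of the behavior described in this section, not the Health Monitor's actual code. The function name, parameters, and action strings are hypothetical.

```python
def fan_failure_actions(redundant_fan_present,
                        redundant_fan_running=False,
                        shutdown_enabled=False):
    """Return the actions described in the text for a failed fan.

    All names and strings here are hypothetical. 'shutdown_enabled'
    stands in for the shutdown option that is set in RBSU.
    """
    actions = []
    if redundant_fan_present and not redundant_fan_running:
        # A redundant fan exists but is idle, so it is started first.
        actions.append("activate redundant fan")
    actions.append("display console message")
    actions.append("log to system health log and operating system log")
    if not redundant_fan_present and shutdown_enabled:
        # Without a redundant fan, the optional shutdown protects the hardware.
        actions.append("shut down system")
    return actions
```

For example, fan_failure_actions(False, shutdown_enabled=True) ends with the shutdown action, while fan_failure_actions(True) starts by activating the redundant fan and never shuts the system down.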
1-1-1-3 Monitoring the system fault-tolerant power supply
If the server contains a redundant power supply, the power load is shared equally between the power
supplies. Check the status of the power supplies by running hplog -p. If a primary power supply fails,
the server automatically switches over to a backup power supply. The Health Monitor does the
following:
• Monitors the system for power failure and for physical presence of power supplies
• Reports when the power supplies experience a change in shared power load
• Displays a message on the console stating the problem
• Makes an entry in the system health log and the operating system log
1-1-1-4 ECC memory monitoring and advanced memory protection
If a correctable ECC memory error occurs, the Health Monitor logs the error in the health log,
including the memory address causing the error. If too many errors occur at the same memory
location, the driver disables the ECC error interrupts to prevent flooding the console with warnings
(the hardware automatically corrects the ECC error).
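The throttling behavior described above can be modeled as a per-address error counter. The sketch below is a simplified model, not the driver's implementation. The class name, the threshold value of 3, and the log message format are assumptions, since the text does not say how many errors count as "too many".

```python
from collections import Counter


class EccErrorTracker:
    """Toy model of correctable ECC error logging with throttling.

    The class name, the threshold, and the log message format are
    hypothetical; the source does not specify them.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold        # assumed value, not from the source
        self.counts = Counter()           # number of errors seen per address
        self.interrupts_enabled = True
        self.health_log = []

    def report(self, address):
        """Record one correctable error; return True if it was logged."""
        if not self.interrupts_enabled:
            # Interrupts are off, so no further errors reach the log.
            return False
        self.counts[address] += 1
        self.health_log.append(f"correctable ECC error at {address:#x}")
        if self.counts[address] >= self.threshold:
            # Too many errors at one location: stop flooding the console.
            self.interrupts_enabled = False
        return True
```

With the default threshold, the third error reported at the same address is still logged and then disables further reporting, so a fourth call to report returns False.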
On servers with Advanced Memory Protection (AMP), the driver attempts to log an error if a memory
board has been inserted, removed, or incorrectly configured, and optionally if an Online Spare
Switchover or Mirrored Memory engaged event occurs.
The Health Monitor does the following:
• Displays a message on the console stating the problem
• Makes an entry in the system health log
This server feature is configured using RBSU. On ProLiant servers that do not support AMP mirroring,
an uncorrectable (double-bit) memory error causes the operating system to halt abruptly. Logging of
the error might not be possible if the error occurs in memory used by the Health Monitor.
1-1-1-5 Automatic server recovery
Automatic Server Recovery (ASR) is configured using RBSU, which is available during the initial boot
of the server by pressing the F9 key when prompted. ASR is implemented using a "heartbeat" timer
that continually counts down. The Health Monitor frequently reloads the timer to prevent it from
counting down to zero. If the timer reaches zero, the operating system is assumed to have locked up
and the system automatically attempts to reboot.
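The heartbeat mechanism is a watchdog timer pattern. The following sketch simulates it in software to show how reload and countdown interact. It is not the ASR hardware or driver interface; the class name, the tick-based timing, and the reboot_requested flag are assumptions made for illustration.

```python
class HeartbeatTimer:
    """Simulated countdown timer in the style of the ASR heartbeat.

    All names are hypothetical. A real ASR timer is implemented by the
    server hardware and reloaded by the Health Monitor.
    """

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.remaining = timeout_ticks
        self.reboot_requested = False

    def reload(self):
        """Called by a healthy system to restart the countdown."""
        self.remaining = self.timeout_ticks

    def tick(self):
        """Advance the countdown by one step; return True once it has expired."""
        if self.remaining > 0:
            self.remaining -= 1
        if self.remaining == 0:
            # No reload arrived in time, so the system is presumed hung.
            self.reboot_requested = True
        return self.reboot_requested
```

As long as reload is called more often than every timeout_ticks ticks, the timer never expires. If the reloads stop, as they would when the operating system locks up, tick eventually returns True.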