Managing Serviceguard 13th Edition, February 2007

Troubleshooting Your Cluster
Monitoring Hardware
Good standard practice in handling a high availability system includes
careful fault monitoring so as to prevent failures if possible or at least to
react to them swiftly when they occur. The following should be monitored
for errors or warnings of all kinds:
• Disks
• CPUs
• Memory
• LAN cards
• Power sources
• All cables
• Disk interface cards
Some monitoring can be done through simple physical inspection, but for
the most comprehensive monitoring, you should examine the system log
file (/var/adm/syslog/syslog.log) periodically for reports on all configured
HA devices. The presence of errors relating to a device will show the need
for maintenance.
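The periodic log check described above can be partly automated. The following is a minimal sketch in Python, not a Serviceguard or HP-UX tool: the keyword list and the function names find_suspect_lines and scan_syslog are assumptions made for illustration, and the keywords should be adjusted to the messages your devices and drivers actually write to the log.

```python
import re
from typing import Iterable, List

# Assumed keywords; adjust them to the messages your hardware actually logs.
SUSPECT = re.compile(r"\b(error|warning|fail(ed|ure)?|timeout)\b", re.IGNORECASE)


def find_suspect_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that contain any of the assumed keywords."""
    return [line.rstrip("\n") for line in lines if SUSPECT.search(line)]


def scan_syslog(path: str = "/var/adm/syslog/syslog.log") -> List[str]:
    """Read the system log at `path` and return its suspect lines."""
    # errors="replace" keeps the scan going if the log holds undecodable bytes.
    with open(path, "r", errors="replace") as log:
        return find_suspect_lines(log)
```

Calling scan_syslog() from a scheduled job and mailing any returned lines to an operator is one way to make the check periodic. A keyword match only marks a line for human review; it does not by itself establish that a device needs maintenance.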
When the proper redundancy has been configured, failures can occur
with no external symptoms, which is why proper monitoring is important.
For example, if a Fibre Channel switch in a redundant mass storage
configuration fails, LVM will automatically fail over to the alternate path
through another Fibre Channel switch. Without monitoring, however,
you may not know that the failure has occurred, since the applications
are still running normally. At this point no redundant path remains,
so a second failure would leave the mass storage unreachable and the
configuration is vulnerable.
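The reasoning in this example reduces to a simple rule: a device is at risk as soon as fewer than two of its paths are working, even though it is still reachable. The sketch below illustrates only that rule. The function name devices_without_redundancy and the input layout (device name mapped to path name mapped to a working flag) are hypothetical; how the path states are collected is outside the sketch and depends on your storage tools.

```python
from typing import Dict, List


def devices_without_redundancy(paths: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the devices that have fewer than two working paths.

    `paths` maps a device name to a mapping of path name -> True if the
    path is currently working. Both layers of names are hypothetical.
    """
    at_risk = []
    for device, path_states in paths.items():
        working = sum(1 for ok in path_states.values() if ok)
        if working < 2:
            at_risk.append(device)
    return sorted(at_risk)
```

For instance, a device whose path through one switch is down and whose path through the other switch is up would be reported, even though applications that use it continue to run normally.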
Using Event Monitoring Service
Event Monitoring Service (EMS) allows you to configure monitors of
specific devices and system resources. You can direct alerts to an
administrative workstation where operators can be notified of further