Managing Serviceguard 13th Edition, February 2007

Troubleshooting Your Cluster

Monitoring Hardware

Good practice in managing a high availability system includes careful
fault monitoring, so that you can prevent failures where possible or at
least react to them swiftly when they occur. Monitor the following for
errors and warnings of all kinds:

• Disks

• CPUs

• Memory

• LAN cards

• Power sources

• All cables

• Disk interface cards

Some monitoring can be done through simple physical inspection, but for
the most comprehensive monitoring, you should periodically examine the
system log file (/var/adm/syslog/syslog.log) for reports on all configured
HA devices. Errors relating to a device indicate that it needs
maintenance.
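A periodic check of the system log can be scripted. The following is a minimal sketch, not a supported tool: the function names and the keyword pattern (error, warn, fail) are assumptions chosen for illustration, and you would need to adjust the pattern to the messages your own devices actually write.

```python
import re
from typing import Iterable, List

# Assumed keywords; adjust them to the messages your devices actually log.
SUSPECT = re.compile(r"error|warn|fail", re.IGNORECASE)


def suspect_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that contain any of the assumed keywords."""
    return [line.rstrip("\n") for line in lines if SUSPECT.search(line)]


def scan_log(path: str = "/var/adm/syslog/syslog.log") -> List[str]:
    """Read the system log at `path` and return its suspect lines."""
    # errors="replace" keeps the scan going if the log holds odd bytes.
    with open(path, errors="replace") as log:
        return suspect_lines(log)
```

Calling scan_log() with no argument reads the log file named above; printing the returned lines from a scheduled job is one way to make the review periodic. A plain keyword match is deliberately broad, so expect to review the hits by hand.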

When the proper redundancy has been configured, failures can occur
with no external symptoms, which is why proper monitoring is important.
For example, if a Fibre Channel switch in a redundant mass storage
configuration fails, LVM automatically fails over to the alternate path
through another Fibre Channel switch. Without monitoring, however,
you may not know that the failure has occurred, because the applications
are still running normally. At this point there is no redundant path left
if another failure occurs, so the mass storage configuration is
vulnerable.
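The reasoning in this example reduces to a simple rule: a device that is still reachable but has fewer than two working paths has lost its redundancy. The sketch below illustrates only that rule. The function name and the input mapping are hypothetical, and how you collect the state of each path is left to your own tools.

```python
from typing import Dict, List


def devices_without_redundancy(paths: Dict[str, List[bool]]) -> List[str]:
    """Return the devices that have fewer than two working paths.

    `paths` is a hypothetical mapping from a device name to one boolean
    per configured path (True means the path is working).
    """
    # sum() over booleans counts the True values, i.e. the working paths.
    return sorted(name for name, states in paths.items() if sum(states) < 2)
```

A device with one working path out of two is still serving I/O, so applications see nothing wrong, yet it appears in the returned list because one more failure would make it unreachable.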

Using Event Monitoring Service

Event Monitoring Service (EMS) allows you to configure monitors of

specific devices and system resources. You can direct alerts to an

administrative workstation where operators can be notified of further