Managing Serviceguard Fifteenth Edition, reprinted May 2008

Troubleshooting Your Cluster

Solving Problems


Adding a set -x statement as the second line of your control script
causes additional detail to be logged to the package log file, which can
help you pinpoint where the script is failing.
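As a minimal sketch of this effect, the following snippet writes a small stand-in script with set -x on its second line, runs it, and captures both its output and the trace in one file. The stand-in is not a real Serviceguard control script, and the names control.sh, pkg.log, and PKG_NAME are hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Illustration only: control.sh and pkg.log are hypothetical stand-ins for a
# package control script and its package log file.
demo_dir="${TMPDIR:-/tmp}/sg_setx_demo.$$"
mkdir -p "$demo_dir"
log_file="$demo_dir/pkg.log"

# Write a tiny stand-in script with set -x as its second line.
cat > "$demo_dir/control.sh" <<'EOF'
#!/bin/sh
set -x
PKG_NAME=demo_pkg
echo "Starting $PKG_NAME"
false || echo "step failed"
EOF

# Send normal output and the set -x trace (written to stderr) to the same log.
sh "$demo_dir/control.sh" > "$log_file" 2>&1

# Show the result; the demo directory is left in place for inspection.
cat "$log_file"
```

With the shell's default trace prompt, each traced command appears in the log on a line beginning with "+", next to the script's normal output, so the last traced line before a failure shows which command was running.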

Node and Network Failures

These failures cause Serviceguard to transfer control of a package to
another node. This is normal Serviceguard behavior, but you must be
able to recognize when a transfer has taken place and decide whether to
leave the cluster in its current condition or restore it to its original
condition.

Node failures can be caused by the following conditions:

• HPMC. This is a High Priority Machine Check, a system panic
caused by a hardware error.

• TOC (Transfer of Control)

• Panics

• Hangs

• Power failures

In the event of a TOC, a system dump is performed on the failed node
and numerous messages are displayed on the console.

You can use the following commands to check the status of your network
and subnets:

• netstat -in - to display LAN status and check whether the package
IP address is stacked on the LAN card.

• lanscan - to see whether the LAN is on the primary interface or has
switched to the standby interface.

• arp -a - to check the ARP tables.

• lanadmin - to display, test, and reset the LAN cards.
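The checks above can be collected into a small wrapper so that they run in one pass and their output can be saved next to the log files. This is only a sketch: the function names run_check and check_network and the --run argument are hypothetical, and lanadmin is left out on the assumption that it is normally used interactively. A command that is not installed on the current system is reported as skipped rather than treated as an error.

```shell
#!/bin/sh
# Hypothetical helper: run one diagnostic command under a heading,
# or report that the command is not installed on this system.
run_check() {
    label=$1
    shift
    echo "=== $label ==="
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1 || echo "($1 exited with a non-zero status)"
    else
        echo "($1 not found on this system; skipped)"
    fi
}

# Hypothetical helper: run the non-interactive checks listed above.
check_network() {
    run_check "LAN status and stacked package IPs" netstat -in
    run_check "Primary or standby interface in use" lanscan
    run_check "ARP table" arp -a
}

# Run the checks only when asked, so the file can also be sourced
# for its functions without executing anything.
if [ "${1:-}" = "--run" ]; then
    check_network
fi
```

Saved under a name of your choice and started with the --run argument on a cluster node, the script prints one headed section per command. Redirecting that output to a file keeps a record to compare against after a failover.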

Because every cluster is unique, there is no cookbook solution for every
possible problem. But if you apply these checks and commands and
work your way through the log files, you should be able to identify
and solve most problems.