Managing Serviceguard Fifteenth Edition, reprinted May 2008

Troubleshooting Your Cluster

Solving Problems

Chapter 8

System Administration Errors

There are a number of errors you can make when configuring Serviceguard that will not show up when you start the cluster. Your cluster can be running, and everything can appear to be fine, until there is a hardware or software failure and control of your packages is not transferred to another node as you expected.

These problems are caused specifically by errors in the cluster configuration file and package configuration scripts. Examples include:

• Volume groups not defined on adoptive node.

• Mount point does not exist on adoptive node.

• Network errors on adoptive node (configuration errors).

• User information not correct on adoptive node.
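
One of the errors listed above, a mount point that does not exist on the adoptive node, can be checked ahead of a failure. The following is a minimal sketch, not part of Serviceguard: check_mount_points is a hypothetical helper name, and the paths in the example are placeholders. It is meant to be run on the adoptive node with the mount points your package expects.

```shell
#!/bin/sh
# Hypothetical helper (not a Serviceguard command): report which of the
# given mount points are missing on the node where this script runs.
# Returns 0 if every directory exists, 1 if at least one is missing.
check_mount_points() {
    missing=0
    for mp in "$@"; do
        if [ -d "$mp" ]; then
            echo "ok: $mp"
        else
            echo "MISSING: $mp"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (placeholder paths, not taken from any real package):
# check_mount_points /mnt/pkg1 /mnt/pkg1/data
```

A nonzero return status indicates that at least one directory has to be created before the package can fail over to this node.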

You can use the following commands to check the status of your disks:

• bdf - to see if the file systems in your package's volume group are mounted.

• vgdisplay -v - to see if all volumes are present.

• lvdisplay -v - to see if the mirrors are synchronized.

• strings /etc/lvmtab - to ensure that the configuration is correct.

• ioscan -fnC disk - to see physical disks.

• diskinfo -v /dev/rdsk/cxtydz - to display information about a disk.

• lssf /dev/d*/* - to check logical volumes and paths.

• vxdg list - to list Veritas disk groups.

• vxprint - to show Veritas disk group details.
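
Several of the commands above can be collected into one report. The following is a sketch under stated assumptions: run_if_present and disk_status_report are hypothetical helper names, and the wrapper skips any command that is not installed (for example, the Veritas commands on a system without VxVM). The commands that take a device or logical volume path (lvdisplay, diskinfo, lssf) are left out because the path depends on your configuration.

```shell
#!/bin/sh
# Hypothetical helper: run a command only if it is installed on this
# system; otherwise print a note that it was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@"
    else
        echo "== $1: not found, skipped"
    fi
}

# Hypothetical wrapper: run the argument-free disk checks listed above.
disk_status_report() {
    run_if_present bdf
    run_if_present vgdisplay -v
    run_if_present strings /etc/lvmtab
    run_if_present ioscan -fnC disk
    run_if_present vxdg list
    run_if_present vxprint
}
```

For example, running disk_status_report > /tmp/disk_status.txt 2>&1 on each node produces files that can be compared between the primary and adoptive nodes. The output file name is a placeholder.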

Package Control Script Hangs or Failures

When a RUN_SCRIPT_TIMEOUT or HALT_SCRIPT_TIMEOUT value is set and the control script hangs long enough to exceed the timeout, Serviceguard kills the script and marks the package “Halted.” Similarly, when a package control script fails, Serviceguard kills the script and marks the package “Halted.” In both cases, the following also apply:

• Control of a failover package will not be transferred.

• The run or halt instructions may not run to completion.
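
The timeout behavior described above can be pictured with a generic shell sketch. This is only an illustration of the idea of killing a script that exceeds its time limit, not Serviceguard's implementation. The function name run_with_timeout and the messages it prints are assumptions made for the example.

```shell
#!/bin/sh
# Illustration only (not Serviceguard code): run a command with a time
# limit in seconds. If the command is still running when the limit
# expires, it is killed. Prints "Completed" on success, or "Halted" if
# the command failed or was killed.
run_with_timeout() {
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: kill the command if it outlives the limit.
    ( sleep "$limit"; kill -9 "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog_pid=$!
    wait "$cmd_pid"
    status=$?
    # The command has ended; stop the watchdog if it is still waiting.
    kill "$watchdog_pid" 2>/dev/null
    wait "$watchdog_pid" 2>/dev/null
    if [ "$status" -eq 0 ]; then
        echo "Completed"
    else
        echo "Halted (exit status $status)"
    fi
    return "$status"
}

# Example call (placeholder script path):
# run_with_timeout 600 /path/to/control_script start
```

As in the list above, a command killed this way may have finished only part of its work, so whatever it was starting or stopping can be left in an intermediate state.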