Managing Serviceguard Eighteenth Edition, September 2010
confident that it won't recur, you need take no further action; otherwise you should
increase MEMBER_TIMEOUT as instructed in item 1.
3. Member node_name seems unhealthy, not receiving heartbeats
from it.
This is the message that indicates that the node has been found “unhealthy” as
described in the previous item.
What to do: See item 2.
For more information, including requirements and recommendations, see the
MEMBER_TIMEOUT discussion under “Cluster Configuration Parameters” (page 143).
See also “Modifying the MEMBER_TIMEOUT Parameter” (page 251) and “Cluster
Daemon: cmcld” (page 55).
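Deciding whether to increase MEMBER_TIMEOUT depends on whether the condition recurs. One way to check is to count how often the item 3 message appears for each node. The Python sketch below does this. It matches only the message text quoted above. It assumes the whole message is logged on a single line, and it ignores the syslog prefix (timestamp, host, daemon tag), whose format is assumed to vary. The function name unhealthy_counts is hypothetical.

```python
import re
from collections import Counter

# Message text as quoted in the manual; whatever syslog prefix precedes it
# (timestamp, host name, daemon tag) is assumed to vary and is not matched.
UNHEALTHY = re.compile(
    r"Member (?P<node>\S+) seems unhealthy, not receiving heartbeats from it"
)


def unhealthy_counts(lines):
    """Count 'seems unhealthy' messages per member node name."""
    counts = Counter()
    for line in lines:
        match = UNHEALTHY.search(line)
        if match:
            counts[match.group("node")] += 1
    return counts
```

For example, assuming the default HP-UX syslog location, you could run `unhealthy_counts(open("/var/adm/syslog/syslog.log")).most_common()` to see which nodes are reported and how often. Repeated hits for the same node suggest the problem is not a one-time event.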
System Administration Errors
There are a number of errors you can make when configuring Serviceguard that will
not show up when you start the cluster. Your cluster can be running, and everything
can appear to be fine, until a hardware or software failure occurs and control of your
packages is not transferred to another node as you expected.
These errors are caused specifically by mistakes in the cluster configuration file and
package configuration scripts. Examples include:
• Volume groups not defined on adoptive node.
• Mount point does not exist on adoptive node.
• Network errors on adoptive node (configuration errors).
• User information not correct on adoptive node.
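Several of these errors can be caught ahead of time by running a simple check on each adoptive node before a failover is ever needed. The Python sketch below is a hypothetical pre-check and is not a Serviceguard tool. It tests three things: whether each volume group has its HP-UX LVM group file in /dev/<vg>, whether each mount point directory exists, and whether each user account is known to the node. Network configuration is not checked. The function name, its parameters, and any volume group names you pass to it are assumptions for illustration.

```python
import os
import pwd


def check_adoptive_node(volume_groups, mount_points, users, dev_root="/dev"):
    """Return a list of problems that could block package failover to this node.

    volume_groups: volume group names, e.g. "vgpkg" (hypothetical name).
    mount_points:  directories the package mounts its file systems on.
    users:         accounts that the package's applications run as.
    dev_root:      device directory; parameterized only to ease testing.
    """
    problems = []
    for vg in volume_groups:
        # HP-UX LVM keeps a 'group' device file in /dev/<vg>; if it is
        # missing, the volume group was probably never imported on this node.
        group_file = os.path.join(dev_root, vg, "group")
        if not os.path.exists(group_file):
            problems.append(f"volume group {vg} not defined: {group_file} missing")
    for mount_point in mount_points:
        if not os.path.isdir(mount_point):
            problems.append(f"mount point {mount_point} does not exist")
    for user in users:
        try:
            pwd.getpwnam(user)  # raises KeyError for unknown accounts
        except KeyError:
            problems.append(f"user {user} not defined on this node")
    return problems
```

An empty list means only that these particular checks passed. It does not show that the package will fail over successfully.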
You can use the following commands to check the status of your disks:
• bdf - to see if your package's volume group is mounted.
• vgdisplay -v - to see if all volumes are present.
• lvdisplay -v - to see if the mirrors are synchronized.
• strings /etc/lvmtab - to ensure that the configuration is correct.
• ioscan -fnC disk - to see physical disks.
• diskinfo -v /dev/rdsk/cxtydz - to display information about a disk.
• lssf /dev/d*/* - to check logical volumes and paths.
• vxdg list - to list Veritas disk groups.
• vxprint - to show Veritas disk group details.
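If you run these checks often, it can help to run them all in one pass and collect the results. The Python sketch below does that. It takes the commands from the list above and expands the lssf wildcard itself, because it does not use a shell. It reports a command that is not installed on the node rather than failing. The logical volume and disk paths are hypothetical placeholders, and the sketch assumes lvdisplay is given an explicit logical volume path. The function names are also illustrative.

```python
import glob
import shlex
import subprocess
import sys

# Commands from the disk-check list, paired with what each one verifies.
# /dev/vgpkg/lvol1 and /dev/rdsk/c0t1d0 are hypothetical placeholders.
DISK_CHECKS = [
    ("bdf", "is the package's volume group mounted?"),
    ("vgdisplay -v", "are all volumes present?"),
    ("lvdisplay -v /dev/vgpkg/lvol1", "are the mirrors synchronized?"),
    ("strings /etc/lvmtab", "is the LVM configuration correct?"),
    ("ioscan -fnC disk", "physical disks"),
    ("diskinfo -v /dev/rdsk/c0t1d0", "information about one disk"),
    ("lssf /dev/d*/*", "logical volumes and paths"),
    ("vxdg list", "Veritas disk groups"),
    ("vxprint", "Veritas disk group details"),
]


def expand(command):
    """Split a command line and expand wildcards, keeping unmatched words."""
    args = []
    for word in shlex.split(command):
        matches = sorted(glob.glob(word)) if any(c in word for c in "*?[") else []
        args.extend(matches or [word])
    return args


def run_checks(checks):
    """Run each check; return (command, purpose, exit_code, output) tuples.

    exit_code is None when the command is not installed on this node.
    """
    results = []
    for command, purpose in checks:
        try:
            proc = subprocess.run(expand(command), capture_output=True,
                                  text=True, stdin=subprocess.DEVNULL)
            results.append((command, purpose, proc.returncode,
                            proc.stdout + proc.stderr))
        except FileNotFoundError:
            results.append((command, purpose, None, "command not found"))
    return results


def report(results, out=sys.stdout):
    """Print one header line per check, followed by its output."""
    for command, purpose, code, output in results:
        status = "not installed" if code is None else f"exit {code}"
        print(f"== {command} ({purpose}): {status}", file=out)
        if output.strip():
            print(output.rstrip(), file=out)
```

On a cluster node you would call `report(run_checks(DISK_CHECKS))` and read the output of each command as described in the list above. A nonzero exit code does not always mean a configuration error. For example, vxdg and vxprint fail on nodes that do not use Veritas.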
Package Control Script Hangs or Failures
When a RUN_SCRIPT_TIMEOUT or HALT_SCRIPT_TIMEOUT value is set, and the
control script hangs, causing the timeout to be exceeded, Serviceguard kills the script
and marks the package “Halted.” Similarly, when a package control script fails,