Managing Serviceguard A.11.20, March 2013

Package control script hangs
Problems with VxVM disk groups
Package movement errors
Node and network failures
Quorum Server problems
The first two categories of problems result from incorrect configuration of Serviceguard. The
last category contains “normal” failures to which Serviceguard is designed to react in order to
ensure the availability of your applications.
Serviceguard Command Hangs
If you are having trouble starting Serviceguard, it is possible that someone has accidentally
deleted or modified files in, or added files to, the directory reserved for Serviceguard's exclusive
use: $SGCONF/rc (by default /etc/cmcluster/rc on an HP-UX system).
Networking and Security Configuration Errors
Many Serviceguard commands, including cmviewcl, depend on name resolution services to look
up the addresses of cluster nodes. When name services are not available (for example, if a name
server is down), Serviceguard commands may hang, or may return a network-related error message.
If this happens, use the nslookup command on each cluster node to see whether name resolution
is correct. For example:
nslookup ftsys9
Name Server: server1.cup.hp.com
Address: 15.13.168.63
Name: ftsys9.cup.hp.com
Address: 15.13.172.229
If the output of this command does not include the correct IP address of the node, then check your
name resolution services further.
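To automate that check, a small POSIX-shell sketch can extract the answer from nslookup-style output and compare it on each node. The node names and the helper function below are illustrative assumptions, not part of Serviceguard:

```shell
#!/bin/sh
# Sketch: verify that each cluster node resolves to the expected address.
# The first "Address:" line in nslookup output belongs to the name server
# itself; the last one is the answer for the queried host, so keep the last.
resolved_addr() {
  printf '%s\n' "$1" | awk '/^Address:/ { addr = $2 } END { print addr }'
}

# Example usage (node names ftsys9/ftsys10 are hypothetical):
#   for node in ftsys9 ftsys10; do
#     addr=$(resolved_addr "$(nslookup "$node")")
#     echo "$node resolves to ${addr:-<no answer>}"
#   done
```

Run the loop on every cluster node, since each node performs its own lookups and a single misconfigured resolver can cause hangs on that node only.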
In many cases, a symptom such as Permission denied... or Connection refused...
is the result of an error in the networking or security configuration. Most such problems can be
resolved by correcting the entries in /etc/hosts. See “Configuring Name Resolution” (page 173)
for more information.
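An /etc/hosts correction typically means one line per cluster node, giving the IP address followed by the fully qualified and short host names. A hypothetical fragment (the second node's address is invented for illustration):

```
# /etc/hosts -- one entry per cluster node
15.13.172.229   ftsys9.cup.hp.com    ftsys9
15.13.172.230   ftsys10.cup.hp.com   ftsys10
```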
Cluster Re-formations Caused by Temporary Conditions
You may see Serviceguard error messages, such as the following, which indicate that a node is
having problems:
Member node_name seems unhealthy, not receiving heartbeats from it.
This may indicate a serious problem, such as a node failure, but the underlying cause is probably
a too-aggressive setting for the MEMBER_TIMEOUT parameter; see the next section, “Cluster
Re-formations Caused by MEMBER_TIMEOUT Being Set too Low”. Alternatively, it may be a
transitory problem, such as excessive network traffic or a heavy system load.
What to do: If you find that cluster nodes are failing because of temporary network or system-load
problems (which in turn cause heartbeat messages to be delayed in the network or during
processing), solve the networking or load problem if you can. Failing that, increase the value of
MEMBER_TIMEOUT, as described in the next section.
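As a sketch of the procedure, the change is made in the cluster configuration (ASCII) file and then re-applied; the cluster name, file name, and timeout value below are examples only, not recommendations:

```
# Obtain the current configuration, e.g.:  cmgetconf -c cluster1 clust.conf
# MEMBER_TIMEOUT is specified in microseconds; 14000000 = 14 seconds.
MEMBER_TIMEOUT  14000000
# After editing, verify and apply the configuration:
#   cmcheckconf -C clust.conf
#   cmapplyconf -C clust.conf
```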