Managing Serviceguard Nineteenth Edition, Reprinted June 2011

Networking and Security Configuration Errors
Many Serviceguard commands, including cmviewcl, depend on name resolution services to look
up the addresses of cluster nodes. When name services are not available (for example, if a name
server is down), Serviceguard commands may hang, or may return a network-related error message.
If this happens, use the nslookup command on each cluster node to see whether name resolution
is correct. For example:
nslookup ftsys9
Name Server: server1.cup.hp.com
Address: 15.13.168.63
Name: ftsys9.cup.hp.com
Address: 15.13.172.229
If the output of this command does not include the correct IP address of the node, then check your
name resolution services further.
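The check above can be scripted so that it is easy to repeat on every node. The following is a minimal sketch, not a Serviceguard tool: resolved_address and check_node are hypothetical helper names, and the parsing assumes nslookup output shaped like the example above (a "Name:" line followed by an "Address:" line).

```shell
#!/bin/sh
# Sketch: verify that a node name resolves to the address you expect.
# resolved_address and check_node are hypothetical helpers, not
# Serviceguard commands.

# Read nslookup output on stdin and print the address that follows the
# "Name:" line. The "Name Server:" block at the top is skipped because
# it does not start with "Name:".
resolved_address() {
    awk '/^Name:/ { seen = 1; next }
         seen && /^Address/ { print $NF; exit }'
}

# check_node NODE EXPECTED_IP
# Returns 0 when NODE resolves to EXPECTED_IP, 1 otherwise.
check_node() {
    node=$1
    expected=$2
    actual=$(nslookup "$node" 2>/dev/null | resolved_address)
    if [ "$actual" = "$expected" ]; then
        echo "OK   $node -> $actual"
    else
        echo "FAIL $node -> ${actual:-no answer} (expected $expected)"
        return 1
    fi
}

# Example (run on each cluster node):
#   check_node ftsys9 15.13.172.229
```

If nslookup on your system formats its output differently, adjust the awk patterns accordingly.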
In many cases, a symptom such as Permission denied... or Connection refused...
is the result of an error in the networking or security configuration. Most such problems can be
resolved by correcting the entries in /etc/hosts. See “Configuring Name Resolution” (page 159)
for more information.
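As a quick first look at /etc/hosts, you can list every address the file assigns to a node name. This is only a sketch under stated assumptions: hosts_addresses is a hypothetical helper name, and the file is assumed to use the usual "address name [alias ...]" layout with # comments.

```shell
#!/bin/sh
# Sketch: print every address that a hosts file assigns to a given name.
# hosts_addresses is a hypothetical helper, not a Serviceguard command.

# hosts_addresses NAME [FILE]
# FILE defaults to /etc/hosts. NAME is matched exactly against the
# official host name and each alias on a line.
hosts_addresses() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        { sub(/#.*/, "") }      # strip comments before splitting fields
        NF >= 2 {
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; next }
        }' "$file"
}

# Example:
#   hosts_addresses ftsys9
```

No output means the name has no entry in the file; more than one line of output means the name is mapped to several addresses, which is worth reviewing.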
Cluster Re-formations Caused by Temporary Conditions
You may see Serviceguard error messages, such as the following, which indicate that a node is
having problems:
Member node_name seems unhealthy, not receiving heartbeats from it.
This may indicate a serious problem, such as a node failure, whose underlying cause is probably
a MEMBER_TIMEOUT value that is set too low; see the next section, “Cluster
Re-formations Caused by MEMBER_TIMEOUT Being Set too Low”. Alternatively, it may indicate a
transitory problem, such as excessive network traffic or system load.
What to do: If you find that cluster nodes are failing because of temporary network or system-load
problems (which in turn cause heartbeat messages to be delayed in the network or during processing),
you should solve the networking or load problem if you can. Failing that, you can increase the
value of MEMBER_TIMEOUT, as described in the next section.
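To judge whether the heartbeat problem is transitory or recurring, it can help to count how often the message above has been logged for each node. The following is a rough sketch: unhealthy_counts is a hypothetical helper name, the default log path is an assumption about where syslog writes on your system, and only the message text quoted above is matched.

```shell
#!/bin/sh
# Sketch: count "seems unhealthy" heartbeat warnings per node.
# unhealthy_counts is a hypothetical helper, not a Serviceguard command.

# unhealthy_counts [LOGFILE]
# LOGFILE defaults to /var/adm/syslog/syslog.log (an assumption; pass
# the path your system actually uses). Prints one "node count" line
# for each node named in the warnings.
unhealthy_counts() {
    log=${1:-/var/adm/syslog/syslog.log}
    sed -n 's/.*Member \([^ ]*\) seems unhealthy, not receiving heartbeats from it.*/\1/p' "$log" |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Example:
#   unhealthy_counts /var/adm/syslog/syslog.log
```

A node that appears only once or twice around a known period of heavy load points to a temporary condition; a node that appears repeatedly suggests reviewing MEMBER_TIMEOUT.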
Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low
If you have set the MEMBER_TIMEOUT parameter too low, the cluster daemon, cmcld, will write
warnings to syslog that indicate the problem. There are three in particular that you should watch
for:
1. Warning: cmcld was unable to run for the last <n.n> seconds. Consult
the Managing Serviceguard manual for guidance on setting
MEMBER_TIMEOUT, and information on cmcld.
This means that cmcld was unable to get access to a CPU for a significant amount of time.
If this occurred while the cluster was re-forming, one or more nodes could have failed. Some
commands (such as cmhaltnode (1m), cmrunnode (1m), cmapplyconf (1m)) cause
the cluster to re-form, so there's a chance that running one of these commands could precipitate
a node failure; that chance is greater the longer the hang.
What to do: If this message appears once a month or more often, increase MEMBER_TIMEOUT
to more than 10 times the largest reported delay. For example, if the message that reports the