Managing HP Serviceguard A.12.00.00 for Linux, June 2014
• Package Movement Errors.
• Node and Network Failures.
• Quorum Server Messages.
10.8.1 Name Resolution Problems
Many Serviceguard commands, including cmviewcl, depend on name resolution services to look
up the addresses of cluster nodes. When name services are not available (for example, if a name
server is down), Serviceguard commands may hang, or may return a network-related error message.
If this happens, use the host command on each cluster node to see whether name resolution is
correct. For example:
host ftsys9
ftsys9.cup.hp.com has address 15.13.172.229
If the output of this command does not include the correct IP address of the node, then check your
name resolution services further.
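The check above can be repeated for every node with a small script. The following is a minimal sketch, not a Serviceguard tool: the function names host_addresses and check_node are hypothetical, and the expected address for each node is something you must supply yourself.

```shell
#!/bin/sh
# Hypothetical helper: read `host` output on stdin and print only the
# IPv4 addresses from lines of the form "<name> has address <ip>".
host_addresses() {
    awk '/ has address / { print $NF }'
}

# Hypothetical helper: succeed only if `host NAME` reports EXPECTED_IP.
# The exit status is that of grep, the last command in the pipeline.
check_node() {
    name=$1
    expected=$2
    host "$name" 2>/dev/null | host_addresses | grep -qxF "$expected"
}
```

For example, running check_node ftsys9 15.13.172.229 || echo "ftsys9: unexpected address" on each cluster node reports any node on which the name resolves to a different address or does not resolve at all.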
10.8.1.1 Networking and Security Configuration Errors
In many cases, a symptom such as Permission denied... or Connection refused...
is the result of an error in the networking or security configuration. Most such problems can be
resolved by correcting the entries in /etc/hosts. See “Configuring Name Resolution” (page 145)
for more information.
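When checking /etc/hosts, it can help to print the address that the file assigns to a given node name and compare the result across nodes. The sketch below assumes the usual hosts file layout (address first, then names, with # starting a comment); the function name hosts_lookup is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every address that a hosts file maps to NAME.
# Usage: hosts_lookup NAME [FILE]   (FILE defaults to /etc/hosts)
hosts_lookup() {
    awk -v name="$1" '
        { sub(/#.*/, "") }    # drop comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }
    ' "${2:-/etc/hosts}"
}
```

For example, hosts_lookup ftsys9 prints nothing if the node has no entry, and prints more than one line if the name is listed with conflicting addresses.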
10.8.2 Halting a Detached Package
When you attempt to halt a detached package using the cmhaltpkg command and the given node is
not reachable, you will get an error message as follows:
Unable to halt the detached package <package_name> on node <node_name>
as the node is not reachable. Retry once the node is reachable.
In such a case, make sure the node is powered up and accessible, and then rerun the
cmhaltpkg command.
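If you want to script the retry, one possible approach is to wait until the node answers a probe and only then rerun the command. This is a sketch under assumptions: the function name retry_when_reachable and the PROBE, TRIES, and DELAY variables are hypothetical, the default probe uses Linux ping options, and a successful ping does not guarantee that Serviceguard on that node is ready.

```shell
#!/bin/sh
# Hypothetical helper: run a command once NODE answers a probe.
# Usage: retry_when_reachable NODE COMMAND [ARGS...]
# PROBE (default: one ICMP ping), TRIES (default 10) and DELAY in seconds
# (default 30) are illustrative settings, not Serviceguard parameters.
retry_when_reachable() {
    node=$1
    shift
    tries=${TRIES:-10}
    while [ "$tries" -gt 0 ]; do
        if ${PROBE:-ping -c 1 -W 2} "$node" >/dev/null 2>&1; then
            "$@"            # node answered: run the command once
            return $?
        fi
        tries=$((tries - 1))
        sleep "${DELAY:-30}"
    done
    echo "$node is still not reachable; giving up" >&2
    return 1
}
```

For example, retry_when_reachable ftsys9 cmhaltpkg pkg1 reruns the halt once ftsys9 responds, where pkg1 stands for the name of your detached package.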
10.8.3 Cluster Re-formations Caused by Temporary Conditions
You may see Serviceguard error messages, such as the following, which indicate that a node is
having problems:
Member node_name seems unhealthy, not receiving heartbeats from it.
This may indicate a serious problem, such as a node failure, whose underlying cause is probably
a too-aggressive setting for the MEMBER_TIMEOUT parameter; see the next section, “Cluster
Re-formations Caused by MEMBER_TIMEOUT Being Set too Low”. Or it may be a transitory problem,
such as excessive network traffic or system load.
What to do: If you find that cluster nodes are failing because of temporary network or system-load
problems (which in turn cause heartbeat messages to be delayed in the network or during processing),
you should solve the networking or load problem if you can. Failing that, you can increase the
value of MEMBER_TIMEOUT, as described in the next section.
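To judge whether the problem is transitory or persistent, it can help to count how often each node is reported as unhealthy. The sketch below matches the message text shown above and reads log lines on standard input; the function name unhealthy_counts is hypothetical, and the location of your syslog file (for example /var/log/messages) is an assumption that depends on the distribution.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print
# "<node> <count>" for each node named in a
# "Member <node> seems unhealthy, not receiving heartbeats" message.
unhealthy_counts() {
    awk '/seems unhealthy, not receiving heartbeats/ {
        # the node name is the word that follows "Member"
        for (i = 1; i < NF; i++)
            if ($i == "Member") { count[$(i + 1)]++; break }
    }
    END { for (n in count) print n, count[n] }' | sort
}
```

For example, unhealthy_counts < /var/log/messages lists each affected node with the number of times the message was logged. A count that rises only during periods of heavy load suggests a transitory cause.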
10.8.4 Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low
If you have set the MEMBER_TIMEOUT parameter too low, the cluster daemon, cmcld, will write
warnings to syslog that indicate the problem. There are three in particular that you should watch
for: