Managing Serviceguard Eighteenth Edition, September 2010

This may indicate a serious problem, such as a node failure, whose underlying cause
is probably a too-aggressive setting for the MEMBER_TIMEOUT parameter; see the
next section, “Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too
Low”. Or it may be a transitory problem, such as excessive network traffic or system
load.
What to do: If you find that cluster nodes are failing because of temporary network or
system-load problems (which in turn cause heartbeat messages to be delayed in the network
or during processing), solve the networking or load problem if you can.
Failing that, you can increase the value of MEMBER_TIMEOUT, as described in the
next section.

Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low

If you have set the MEMBER_TIMEOUT parameter too low, the cluster daemon, cmcld,
will write warnings to syslog that indicate the problem. There are three in particular
that you should watch for:
1. Warning: cmcld was unable to run for the last <n.n> seconds.
Consult the Managing Serviceguard manual for guidance on
setting MEMBER_TIMEOUT, and information on cmcld.
This means that cmcld was unable to get access to a CPU for a significant amount
of time. If this occurred while the cluster was re-forming, one or more nodes could
have failed. Some commands (such as cmhaltnode (1m), cmrunnode (1m),
cmapplyconf (1m)) cause the cluster to re-form, so there's a chance that running
one of these commands could precipitate a node failure; that chance is greater the
longer the hang.
What to do: If this message appears once a month or more often, increase
MEMBER_TIMEOUT to more than 10 times the largest reported delay. For example,
if the message that reports the largest number says that cmcld was unable to run
for the last 1.6 seconds, increase MEMBER_TIMEOUT to more than 16 seconds.
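The rule of thumb above can be sketched as a small script. This is an illustrative sketch only, not part of Serviceguard: the function names largest_delay and member_timeout_floor are hypothetical, and the regular expression assumes that the phrase "cmcld was unable to run for the last <n.n> seconds" appears on a single syslog line exactly as quoted in the warning above. The result is in seconds; convert it to whatever unit your cluster configuration file uses for MEMBER_TIMEOUT.

```python
import re

# Assumption: the warning appears on one syslog line, worded as quoted above.
DELAY_RE = re.compile(
    r"cmcld was unable to run for the last (\d+(?:\.\d+)?) seconds"
)


def largest_delay(lines):
    """Return the largest delay (in seconds) reported by cmcld, or None."""
    delays = []
    for line in lines:
        match = DELAY_RE.search(line)
        if match:
            delays.append(float(match.group(1)))
    return max(delays) if delays else None


def member_timeout_floor(delay_seconds):
    """Return 10 times the delay; MEMBER_TIMEOUT should exceed this value."""
    return 10 * delay_seconds


# Hypothetical sample lines; in practice, read them from your syslog file.
sample = [
    "Warning: cmcld was unable to run for the last 0.9 seconds.",
    "Warning: cmcld was unable to run for the last 1.6 seconds.",
    "an unrelated syslog line",
]

worst = largest_delay(sample)
if worst is None:
    print("No cmcld delay warnings found.")
else:
    floor = member_timeout_floor(worst)
    print(f"Largest delay: {worst} s; set MEMBER_TIMEOUT above {floor} s.")
```

To use the sketch against a real log, replace sample with the lines of your syslog file (on HP-UX this is typically /var/adm/syslog/syslog.log; verify the location on your system). Remember that the rule applies only if the message appears once a month or more often.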
2. This node is at risk of being evicted from the running
cluster. Increase MEMBER_TIMEOUT.
This means that the hang was long enough for other nodes to have noticed the
delay in receiving heartbeats and marked the node “unhealthy”. This is the
beginning of the process of evicting the node from the cluster; see “What Happens
when a Node Times Out” (page 117) for an explanation of that process.
What to do: In isolation, this could indicate a transitory problem, as described in
the previous section. If you have diagnosed and fixed such a problem and are