Managing Serviceguard Seventeenth Edition, First Reprint December 2009

Name Server: server1.cup.hp.com
Address: 15.13.168.63
Name: ftsys9.cup.hp.com
Address: 15.13.172.229
If the output of this command does not include the correct IP address of the node, then
check your name resolution services further.
In many cases, a symptom such as Permission denied... or Connection
refused... is the result of an error in the networking or security configuration. Most
such problems can be resolved by correcting the entries in /etc/hosts. See
“Configuring Name Resolution” (page 199) for more information.
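As a quick sanity check on name resolution, you can confirm that the node has an entry in /etc/hosts before digging deeper. A minimal sketch (the node name ftsys9 is taken from the nslookup example above; the script only reports what it finds and changes nothing):

```shell
# Sketch: check whether a cluster node appears in /etc/hosts.
# NODE is the node from the example above; adjust for your cluster.
NODE=ftsys9
HOSTS_FILE=/etc/hosts

if grep -q "$NODE" "$HOSTS_FILE"; then
    RESULT="found"
else
    RESULT="missing"
fi
echo "$NODE: $RESULT in $HOSTS_FILE"
```

If the node is missing, add its IP address and name to /etc/hosts on every node, as described in “Configuring Name Resolution”.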
Cluster Re-formations Caused by Temporary Conditions
You may see Serviceguard error messages, such as the following, which indicate that
a node is having problems:
Member node_name seems unhealthy, not receiving heartbeats from it.
This may indicate a serious problem, such as a node failure whose underlying cause
is probably a too-aggressive (that is, too low) setting of the MEMBER_TIMEOUT
parameter; see the next section, “Cluster Re-formations Caused by MEMBER_TIMEOUT
Being Set too Low”. Or it may be a transitory problem, such as excessive network
traffic or system load.
What to do: If you find that cluster nodes are failing because of temporary network or
system-load problems (which in turn cause heartbeat messages to be delayed in the
network or during processing), you should solve the networking or load problem if
you can. Failing that, you can increase the value of MEMBER_TIMEOUT, as described
in the next section.
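Raising MEMBER_TIMEOUT means editing the cluster configuration file and re-applying it with the Serviceguard configuration commands. The sketch below demonstrates only the edit itself, on a stand-in file; the cmgetconf/cmcheckconf/cmapplyconf steps are shown as comments because they require a live cluster, and the cluster name and timeout value (14 seconds, expressed in microseconds) are illustrative assumptions:

```shell
# On a live cluster you would first export the running configuration:
#   cmgetconf -c clusterA clusterA.config
# Here a stand-in temporary file illustrates the edit itself.
CONFIG=$(mktemp)
echo "MEMBER_TIMEOUT 10000000" > "$CONFIG"   # 10 s, in microseconds

# Raise the timeout to 14 s (14,000,000 microseconds; illustrative value).
sed -i 's/^MEMBER_TIMEOUT.*/MEMBER_TIMEOUT 14000000/' "$CONFIG"
NEW_VALUE=$(awk '/^MEMBER_TIMEOUT/ {print $2}' "$CONFIG")
echo "MEMBER_TIMEOUT is now $NEW_VALUE microseconds"

# Then validate and apply the new configuration on the live cluster:
#   cmcheckconf -C clusterA.config
#   cmapplyconf -C clusterA.config
rm -f "$CONFIG"
```

Note that applying the configuration causes the cluster to re-form, so schedule the change for a time when a re-formation is acceptable.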
Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low
If you have set the MEMBER_TIMEOUT parameter too low, the cluster daemon, cmcld,
will write warnings to syslog that indicate the problem. There are three warnings in
particular that you should watch for:
1. Warning: cmcld was unable to run for the last <n.n> seconds.
Consult the Managing Serviceguard manual for guidance on
setting MEMBER_TIMEOUT, and information on cmcld.
This means that cmcld was unable to get access to a CPU for a significant amount
of time. If this occurred while the cluster was re-forming, one or more nodes could
have failed. Some commands (such as cmhaltnode (1m), cmrunnode (1m),
cmapplyconf (1m)) cause the cluster to re-form, so there's a chance that running
one of these commands could precipitate a node failure; that chance is greater the
longer the hang.
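To see whether this warning is occurring on a node, you can search syslog for it. A sketch, using a fabricated sample log as a stand-in (on HP-UX the real file is typically /var/adm/syslog/syslog.log; the timestamps and node names below are invented for illustration):

```shell
# Sketch: count cmcld CPU-starvation warnings in syslog.
# A fabricated sample log stands in for /var/adm/syslog/syslog.log here.
SYSLOG=$(mktemp)
cat > "$SYSLOG" <<'EOF'
Jan 10 12:00:01 ftsys9 cmcld: Warning: cmcld was unable to run for the last 1.2 seconds.
Jan 10 12:05:44 ftsys9 cmcld: Member ftsys10 seems unhealthy, not receiving heartbeats from it.
EOF

# Frequent hits suggest MEMBER_TIMEOUT is too low, or that the node is
# being starved of CPU by system load or excessive network traffic.
HITS=$(grep -c "cmcld was unable to run" "$SYSLOG")
echo "CPU-starvation warnings found: $HITS"
rm -f "$SYSLOG"
```

Running the same grep against the real syslog file on each node shows how often, and at what times, cmcld was starved of CPU.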