Managing Serviceguard A.11.20, March 2013

Package control script hangs
Problems with VxVM disk groups
Package movement errors
Node and network failures
Quorum Server problems
The first two categories of problems result from incorrect configuration of Serviceguard. The
last category contains “normal” failures to which Serviceguard is designed to react in order to
ensure the availability of your applications.
Serviceguard Command Hangs
If you are having trouble starting Serviceguard, it is possible that someone has accidentally
deleted or modified files in, or added files to, the directory reserved for Serviceguard's exclusive
use: $SGCONF/rc (by default /etc/cmcluster/rc on an HP-UX system).
Networking and Security Configuration Errors
Many Serviceguard commands, including cmviewcl, depend on name resolution services to look
up the addresses of cluster nodes. When name services are not available (for example, if a name
server is down), Serviceguard commands may hang, or may return a network-related error message.
If this happens, use the nslookup command on each cluster node to see whether name resolution
is correct. For example:
nslookup ftsys9
Name Server: server1.cup.hp.com
Address: 15.13.168.63
Name: ftsys9.cup.hp.com
Address: 15.13.172.229
If the output of this command does not include the correct IP address of the node, then check your
name resolution services further.
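To automate that check, a small POSIX-shell sketch can extract the answer from nslookup-style output and compare it on each node. The node names and the helper function below are illustrative assumptions, not part of Serviceguard:

```shell
#!/bin/sh
# Sketch: verify that each cluster node resolves to the expected address.
# The first "Address:" line in nslookup output belongs to the name server
# itself; the last one is the answer for the queried host, so keep the last.
resolved_addr() {
  printf '%s\n' "$1" | awk '/^Address:/ { addr = $2 } END { print addr }'
}

# Example usage (node names ftsys9/ftsys10 are hypothetical):
#   for node in ftsys9 ftsys10; do
#     addr=$(resolved_addr "$(nslookup "$node")")
#     echo "$node resolves to ${addr:-<no answer>}"
#   done
```

Run the loop on every cluster node, since each node performs its own lookups and a single misconfigured resolver can cause hangs on that node only.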
In many cases, a symptom such as Permission denied... or Connection refused...
is the result of an error in the networking or security configuration. Most such problems can be
resolved by correcting the entries in /etc/hosts. See “Configuring Name Resolution” (page 173)
for more information.
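An /etc/hosts correction typically means one line per cluster node, giving the IP address followed by the fully qualified and short host names. A hypothetical fragment (the second node's address is invented for illustration):

```
# /etc/hosts -- one entry per cluster node
15.13.172.229   ftsys9.cup.hp.com    ftsys9
15.13.172.230   ftsys10.cup.hp.com   ftsys10
```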
Cluster Re-formations Caused by Temporary Conditions
You may see Serviceguard error messages, such as the following, which indicate that a node is
having problems:
Member node_name seems unhealthy, not receiving heartbeats from it.
This may indicate a serious problem, such as a node failure, but the underlying cause is probably
a too-aggressive setting for the MEMBER_TIMEOUT parameter; see the next section, “Cluster
Re-formations Caused by MEMBER_TIMEOUT Being Set too Low”. Alternatively, it may be a
transitory problem, such as excessive network traffic or a heavy system load.
What to do: If you find that cluster nodes are failing because of temporary network or system-load
problems (which in turn cause heartbeat messages to be delayed in the network or during
processing), solve the networking or load problem if you can. Failing that, increase the value of
MEMBER_TIMEOUT, as described in the next section.
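As a sketch of the procedure, the change is made in the cluster configuration (ASCII) file and then re-applied; the cluster name, file name, and timeout value below are examples only, not recommendations:

```
# Obtain the current configuration, e.g.:  cmgetconf -c cluster1 clust.conf
# MEMBER_TIMEOUT is specified in microseconds; 14000000 = 14 seconds.
MEMBER_TIMEOUT  14000000
# After editing, verify and apply the configuration:
#   cmcheckconf -C clust.conf
#   cmapplyconf -C clust.conf
```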