
What to do: If this message appears once a month or more often, increase
MEMBER_TIMEOUT to more than 10 times the largest reported delay. For example,
if the message that reports the largest number says that cmcld was unable to run
for the last 1.6 seconds, increase MEMBER_TIMEOUT to more than 16 seconds. (A
sketch of the reconfiguration workflow appears at the end of this discussion.)
2. This node is at risk of being evicted from the running
cluster. Increase MEMBER_TIMEOUT.
This means that the hang was long enough for other nodes to have noticed the
delay in receiving heartbeats and marked the node “unhealthy”. This is the
beginning of the process of evicting the node from the cluster; see “What Happens
when a Node Times Out” (page 117) for an explanation of that process.
What to do: In isolation, this could indicate a transitory problem, as described in
the previous section. If you have diagnosed and fixed such a problem and are
confident that it won't recur, you need take no further action; otherwise you should
increase MEMBER_TIMEOUT as instructed in item 1.
3. Member node_name seems unhealthy, not receiving heartbeats
from it.
This is the message that indicates that the node has been found “unhealthy” as
described in the previous bullet.
What to do: See item 2.
For more information, including requirements and recommendations, see the
MEMBER_TIMEOUT discussion under “Cluster Configuration Parameters” (page 139).
See also “Modifying the MEMBER_TIMEOUT Parameter” (page 227) and “Cluster
Daemon: cmcld” (page 55).
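If you do decide to increase MEMBER_TIMEOUT, the usual workflow is to fetch the
cluster configuration file, edit the value, and re-apply the configuration. The
following is a minimal sketch only; the cluster name and file path are examples,
and MEMBER_TIMEOUT is specified in microseconds, so 16 seconds is entered as
16000000:

    cmgetconf -c mycluster /etc/cmcluster/cluster.ascii
    # Edit the file and set, for example:
    #   MEMBER_TIMEOUT   16000000
    cmcheckconf -C /etc/cmcluster/cluster.ascii
    cmapplyconf -C /etc/cmcluster/cluster.ascii

cmcheckconf verifies the edited file before cmapplyconf distributes the new binary
configuration to the cluster nodes.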
System Administration Errors
There are a number of errors you can make when configuring Serviceguard that will
not show up when you start the cluster. Your cluster can be running, and everything
can appear to be fine, until there is a hardware or software failure and control of
your packages is not transferred to another node as you would have expected.
These problems are caused specifically by mistakes in the cluster configuration file
and in the package configuration scripts. Examples include:
Volume groups not defined on adoptive node.
Mount point does not exist on adoptive node.
Network errors on adoptive node (configuration errors).
User information not correct on adoptive node.
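For example, you might verify the configuration on the adoptive node with commands
like the following; the volume group, mount point, and package names shown here are
purely illustrative:

    vgdisplay /dev/vg01          # is the volume group defined on this node?
    ls -ld /mnt/pkg1             # does the mount point exist?
    netstat -in                  # are the expected network interfaces configured?
    cmcheckconf -v -P /etc/cmcluster/pkg1/pkg1.conf  # re-validate the package configuration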
You can use the following commands to check the status of your disks:
bdf - to see if your package's volume group is mounted.
vgdisplay -v - to see if all volumes are present.
lvdisplay -v - to see if the mirrors are synchronized.
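For example (the volume group, logical volume, and mount point shown are hypothetical):

    bdf /mnt/pkg1                  # is the package's file system mounted?
    vgdisplay -v /dev/vg01         # are all logical and physical volumes present?
    lvdisplay -v /dev/vg01/lvol1   # are the mirror copies current (not stale)?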