1. Warning: cmcld was unable to run for the last <n.n> seconds. Consult
the Managing Serviceguard manual for guidance on setting
MEMBER_TIMEOUT, and information on cmcld.
This means that cmcld was unable to get access to a CPU for a significant amount of time.
If this occurred while the cluster was re-forming, one or more nodes could have failed. Some
commands (such as cmhaltnode (1m), cmrunnode (1m), and cmapplyconf (1m)) cause
the cluster to re-form, so there is a chance that running one of these commands could precipitate
a node failure; the longer the hang, the greater that chance.
What to do: If this message appears once a month or more often, increase MEMBER_TIMEOUT
to more than 10 times the largest reported delay. For example, if the message that reports the
largest delay says that cmcld was unable to run for the last 1.6 seconds, increase
MEMBER_TIMEOUT to more than 16 seconds (see the configuration sketch following this list).
2. This node is at risk of being evicted from the running cluster.
Increase MEMBER_TIMEOUT.
This means that the hang was long enough for other nodes to have noticed the delay in
receiving heartbeats and marked the node “unhealthy”. This is the beginning of the process
of evicting the node from the cluster; see “What Happens when a Node Times Out” (page 73)
for an explanation of that process.
What to do: In isolation, this could indicate a transitory problem, as described in item 1.
If you have diagnosed and fixed such a problem and are confident that it won't recur,
you need take no further action; otherwise, increase MEMBER_TIMEOUT as instructed
there.
3. Member node_name seems unhealthy, not receiving heartbeats from it.
This message indicates that the node has been found “unhealthy”, as described in
the previous item.
What to do: See item 2.
For more information, including requirements and recommendations, see the MEMBER_TIMEOUT
discussion under “Cluster Configuration Parameters” (page 89).
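As a minimal sketch of the adjustment described in item 1, assuming a cluster named cluster1
(a hypothetical name) and noting that MEMBER_TIMEOUT is specified in microseconds, so
“more than 16 seconds” corresponds to a value above 16,000,000:
    cmgetconf -c cluster1 cluster1.ascii   # fetch the current cluster configuration
    # In cluster1.ascii, raise MEMBER_TIMEOUT above 10x the largest delay, for example:
    #     MEMBER_TIMEOUT    17000000
    cmcheckconf -C cluster1.ascii          # verify the edited file
    cmapplyconf -C cluster1.ascii          # apply the new configuration
Remember that cmapplyconf causes the cluster to re-form (see item 1), so it is safest to apply
the change at a time when no hangs are being reported.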
10.8.5 System Administration Errors
There are a number of errors you can make when configuring Serviceguard that will not show up
when you start the cluster. Your cluster can be running, and everything can appear to be fine, until
there is a hardware or software failure and control of your packages is not transferred to another
node as you expected.
These errors are caused specifically by mistakes in the cluster configuration file and package
configuration scripts. Examples include:
• Volume groups not defined on adoptive node.
• Mount point does not exist on adoptive node.
• Network errors on adoptive node (configuration errors).
• User information not correct on adoptive node.
You can use the following commands to check the status of your disks:
• df - to see if your package’s volume group is mounted.
• vgdisplay -v - to see if all volumes are present.
• strings /etc/lvmconf/*.conf - to ensure that the configuration is correct.
• fdisk -l /dev/sdx - to list partition information for a disk.
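For example, a quick check that an adoptive node can actually take over a package might look
like this, where the volume group vgpkg1, the mount point /pkg1, and the disk /dev/sdc are
hypothetical names standing in for your own configuration:
    vgdisplay -v vgpkg1        # is the volume group defined here, with all volumes present?
    df /pkg1                   # is the package filesystem currently mounted?
    test -d /pkg1 || echo "mount point /pkg1 missing"
    fdisk -l /dev/sdc          # partition information for one of the package's disks
Running such checks on each adoptive node before a failure occurs catches the configuration
errors listed above while they are still harmless.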