Managing Serviceguard Eighteenth Edition, September 2010

Responding to Cluster Events

Serviceguard does not require much ongoing system administration intervention. As long as there are no failures, your cluster will be monitored and protected. In the event of a failure, those packages that you have designated to be transferred to another node will be transferred automatically. Your ongoing responsibility as the system administrator is to monitor the cluster and determine whether a package transfer has occurred. If one has, you need to determine the cause and take corrective action.

The Event Monitoring Service and its HA monitors can provide monitoring for disks, LAN cards, and some system events. Refer to the manual Using HA Monitors for more information.

The typical corrective actions to take in the event of a package transfer include:

• Determining when a transfer has occurred.
• Determining the cause of a transfer.
• Repairing any hardware failures.
• Correcting any software problems.
• Restarting nodes.
• Transferring packages back to their original nodes.
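
The first and last items in this list can be sketched with standard Serviceguard commands. The following is a minimal sketch under stated assumptions, not a definitive procedure: the package name pkg1 and the node name node1 are hypothetical, and the helper function failback_plan is introduced here only for illustration. It prints the commands instead of running them, so that the sequence can be reviewed before anything is changed on the cluster.

```shell
#!/bin/sh
# Sketch only: pkg1 and node1 are hypothetical names, and failback_plan
# is an illustrative helper, not part of Serviceguard.

# Print the commands that would move a package back to its original node.
# Usage: failback_plan <package> <original_node>
failback_plan() {
    pkg=$1
    node=$2
    echo "cmviewcl -v"              # check where packages are currently running
    echo "cmhaltpkg $pkg"           # halt the package on the adoptive node
    echo "cmrunpkg -n $node $pkg"   # start the package on the original node
    echo "cmmodpkg -e $pkg"         # re-enable package switching
}

failback_plan pkg1 node1
```

Printing the plan first is a deliberate choice in this sketch: halting a package interrupts the application, so you would normally run these commands by hand at a time that suits your users.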

Single-Node Operation

In a multi-node cluster, all but one node may fail, or you may shut down all but one node, leaving your cluster in single-node operation. The remaining node will probably have applications running on it. As long as the Serviceguard daemon cmcld is active, other nodes can rejoin the cluster.

If the Serviceguard daemon fails during single-node operation, the node stays up and your applications keep running. (This is different from the loss of the Serviceguard daemon in a multi-node cluster, which halts the node with a TOC and causes packages to be switched to adoptive nodes.) It is not necessary to halt the single node in this scenario, since the application is still running and no other node is currently available for package switching.
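
One way to see whether the daemon is still active on the remaining node is to look for cmcld in the process list. The following is a minimal sketch: the helper cmcld_running is hypothetical (it is not a Serviceguard command), and it assumes that the last column of the ps -e output is the command name.

```shell
#!/bin/sh
# Sketch only: cmcld_running is an illustrative helper, not a
# Serviceguard command.

# Read a process listing on standard input and succeed (exit status 0)
# if a process named cmcld appears in the last column.
cmcld_running() {
    awk '$NF == "cmcld" { found = 1 } END { exit !found }'
}

if ps -e | cmcld_running; then
    echo "cmcld is running"
else
    echo "cmcld is not running"
fi
```

The check only reports the state of the daemon; it does not attempt to restart anything.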

You should not try to restart Serviceguard, since data corruption might occur if another node were to attempt to start up a new instance of the application that is still running on the single node.

Instead of restarting the cluster, choose an appropriate time to shut down the applications and reboot the node; this will allow Serviceguard to restart the cluster after the reboot.
