Installation guide

B.3 Failover and Recovery Scenarios

Understanding how the cluster behaves when significant events occur helps you manage it properly.
Note that cluster behavior depends on whether power switches are employed in the configuration.
Power switches enable the cluster to maintain complete data integrity under all failure conditions.

The following sections describe how the system will respond to various failure and error scenarios.

B.3.1 System Hang

In a cluster configuration that uses power switches, if a system hangs, the cluster behaves as follows:

1. The functional cluster system detects that the hung cluster system is not updating its timestamp
   on the quorum partitions and is not communicating over the heartbeat channels.

2. The functional cluster system power-cycles the hung system. Alternatively, if watchdog timers are
   in use, a failed system will reboot itself.

3. The functional cluster system restarts any services that were running on the hung system.

4. If the previously hung system reboots, and can join the cluster (that is, the system can write to
   both quorum partitions), services are re-balanced across the member systems, according to each
   service’s placement policy.
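Steps 1 through 3 above can be illustrated with a minimal Python sketch. It is not the cluster software's implementation. The names Member, is_hung, and handle_hang_with_power_switch, and the 10-second timeout, are hypothetical and chosen only for illustration. The sketch assumes a member is treated as hung only when both signals from step 1 are stale.

```python
from dataclasses import dataclass, field

# Hypothetical timeout; this section does not state the real interval.
HANG_TIMEOUT_SECONDS = 10.0


@dataclass
class Member:
    """Illustrative model of a cluster member (not a real API)."""
    name: str
    last_quorum_timestamp: float  # last timestamp update on the quorum partitions
    last_heartbeat: float         # last message seen on the heartbeat channels
    services: list = field(default_factory=list)


def is_hung(member, now, timeout=HANG_TIMEOUT_SECONDS):
    """Step 1: hung means the quorum timestamp is stale AND heartbeats are silent."""
    quorum_stale = now - member.last_quorum_timestamp > timeout
    heartbeat_silent = now - member.last_heartbeat > timeout
    return quorum_stale and heartbeat_silent


def handle_hang_with_power_switch(functional, hung, now):
    """Return the ordered actions the functional member takes (steps 2 and 3)."""
    actions = []
    if not is_hung(hung, now):
        return actions
    # Step 2: power-cycle the hung member through its power switch.
    actions.append(f"power-cycle {hung.name}")
    # Step 3: restart the hung member's services on the functional member.
    for service in hung.services:
        actions.append(f"restart {service} on {functional.name}")
    functional.services.extend(hung.services)
    hung.services = []
    return actions
```

In this sketch the power-cycle is recorded before any service restart, mirroring the order of steps 2 and 3. Re-balancing after the member rejoins (step 4) is left out because it depends on each service's placement policy.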

In a cluster configuration that does not use power switches, if a system hangs, the cluster behaves
as follows:

1. The functional cluster system detects that the hung cluster system is not updating its timestamp
   on the quorum partitions and is not communicating over the heartbeat channels.

2. Optionally, if watchdog timers are used, the failed system will reboot itself.

3. The functional cluster system sets the status of the hung system to DOWN on the quorum
   partitions, and then restarts the hung system’s services.

4. If the hung system becomes active, it notices that its status is DOWN, and initiates a system
   reboot.

   If the system remains hung, manually power-cycle the hung system in order for it to resume
   cluster operation.

5. If the previously hung system reboots, and can join the cluster, services are re-balanced across
   the member systems, according to each service’s placement policy.
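The DOWN status handshake in steps 3 and 4 above can be sketched as follows. This is an illustrative model only. QuorumRecord, take_over, and on_wake are hypothetical names, and a plain dictionary stands in for the status area on the quorum partitions, whose real on-disk format is not described in this section.

```python
from dataclasses import dataclass, field


@dataclass
class QuorumRecord:
    """Illustrative per-member record kept on the quorum partitions."""
    status: str = "UP"
    services: list = field(default_factory=list)


def take_over(quorum, functional, hung):
    """Step 3: mark the hung member DOWN, then move its services over."""
    quorum[hung].status = "DOWN"
    quorum[functional].services.extend(quorum[hung].services)
    quorum[hung].services = []


def on_wake(quorum, member):
    """Step 4: a member that becomes active checks its own recorded status."""
    return "reboot" if quorum[member].status == "DOWN" else "continue"
```

For example, after take_over(quorum, "node-a", "node-b"), a call to on_wake(quorum, "node-b") returns "reboot". Without a power switch, this self-check is what keeps a revived member from resuming services that have already been restarted elsewhere. If the member never becomes active, the check never runs, which is why a manual power-cycle is needed.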