Managing Serviceguard Sixteenth Edition, March 2009

Failure. Only one LAN has been configured for both heartbeat and data traffic. During the course of operations, heavy application traffic monopolizes the bandwidth of the network, preventing heartbeat packets from getting through.

Since SystemA does not receive heartbeat messages from SystemB, SystemA attempts to re-form as a one-node cluster. Likewise, since SystemB does not receive heartbeat messages from SystemA, SystemB also attempts to re-form as a one-node cluster.

During the election protocol, each node votes for itself, giving both nodes 50 percent of the vote. Because both nodes have 50 percent of the vote, both nodes now vie for the cluster lock. Only one node will get the lock.
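The tie-break described above can be pictured with a small toy model. This is only an illustrative sketch, not Serviceguard's implementation: the `ClusterLock` class and the `decide` function are hypothetical names, and the branch for a partition holding more than half of the nodes is an assumption added to make the function complete. The only behavior taken from the text is that a partition with exactly 50 percent of the vote must win the cluster lock, and that only one claimant can win it.

```python
import threading


class ClusterLock:
    """Hypothetical stand-in for a cluster lock: the first claimant wins."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.owner = None

    def try_acquire(self, node):
        # Check and claim under a mutex so two claimants cannot both win.
        with self._mutex:
            if self.owner is None:
                self.owner = node
                return True
            return False


def decide(partition, cluster_size, lock):
    """Return "re-form" or "halt" for one group of nodes that can still
    hear each other's heartbeats."""
    votes = len(partition)
    if 2 * votes > cluster_size:
        # Assumption: a clear majority does not need the lock.
        return "re-form"
    if 2 * votes == cluster_size and lock.try_acquire(partition[0]):
        # Exactly 50 percent of the vote: the lock breaks the tie.
        return "re-form"
    return "halt"


# Two-node cluster split by the heartbeat failure; SystemA asks first.
lock = ClusterLock()
outcome_a = decide(["SystemA"], 2, lock)
outcome_b = decide(["SystemB"], 2, lock)
print(outcome_a, outcome_b, lock.owner)
```

In this run SystemA claims the lock first and re-forms, while SystemB finds the lock taken and halts. Swapping the order of the two `decide` calls swaps the outcome, which is the point of the lock: exactly one side survives, whichever it is.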

Outcome. Assume SystemA gets the cluster lock. SystemA re-forms as a one-node cluster. After re-formation, SystemA will make sure all applications configured to run on an existing cluster node are running. When SystemA discovers Package2 is not running in the cluster, it will try to start Package2 if Package2 is configured to run on SystemA.

SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the cluster. To release all resources related to Package2 (such as exclusive access to volume group vg02 and the Package2 IP address) as quickly as possible, SystemB halts (system reset).

NOTE: If AUTOSTART_CMCLD in /etc/rc.conf.d/cmcluster ($SGAUTOSTART) is set to zero, the node will not attempt to join the cluster when it comes back up.
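As a rough illustration of the check this note describes, the sketch below scans the text of a shell-style settings file for an AUTOSTART_CMCLD assignment. The function name `autostart_disabled` and the sample file contents are hypothetical, and the parsing is deliberately simplistic (plain NAME=VALUE lines with optional `#` comments). It reports only the one case the note covers, an explicit value of zero.

```python
def autostart_disabled(text):
    """Return True only if the last AUTOSTART_CMCLD assignment is 0."""
    value = None
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line.startswith("AUTOSTART_CMCLD="):
            value = line.split("=", 1)[1].strip().strip('"')
    return value == "0"


# Hypothetical file contents, for illustration only.
sample_off = "# cluster autostart\nAUTOSTART_CMCLD=0\n"
sample_on = "# cluster autostart\nAUTOSTART_CMCLD=1\n"

print(autostart_disabled(sample_off))
print(autostart_disabled(sample_on))
```

On a real node the text would come from reading /etc/rc.conf.d/cmcluster rather than from an inline string.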

For more information on cluster failover, see the white paper Optimizing Failover Time in a Serviceguard Environment (version A.11.19 and later) at http://www.docs.hp.com -> High Availability -> Serviceguard -> White Papers. For troubleshooting information, see “Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low” (page 376).

Responses to Hardware Failures

If a serious system problem occurs, such as a system panic or physical disruption of the SPU's circuits, Serviceguard recognizes a node failure and transfers the failover packages currently running on that node to an adoptive node elsewhere in the cluster. (System multi-node and multi-node packages do not fail over.)

The new location for each failover package is determined by that package's configuration file, which lists primary and alternate nodes for the package. Transfer of a package to another node does not transfer the program counter. Processes in a transferred package will restart from the beginning. In order for an application to be swiftly restarted after a failure, it must be “crash-tolerant”; that is, all processes in the package must be written so that they can detect such a restart. This is the same application design required for restart after a normal system crash.
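One common way for a process to detect that it is being restarted after a failure is a "running" marker file that is written at startup and removed only on a clean shutdown. The sketch below shows the idea under stated assumptions: the `start` and `clean_shutdown` functions and the `app.running` file name are hypothetical, and nothing here is a Serviceguard API. For a failover package, the marker would have to live on storage that moves with the package so that the adoptive node can see it; a temporary directory stands in for that storage here.

```python
import os
import tempfile


def start(marker_path):
    """Record that the application is running. Return True if the previous
    run left its marker behind, meaning it did not shut down cleanly."""
    restarted_after_failure = os.path.exists(marker_path)
    with open(marker_path, "w") as marker:
        marker.write(str(os.getpid()))
    return restarted_after_failure


def clean_shutdown(marker_path):
    """Remove the marker so that the next start is treated as a fresh start."""
    os.remove(marker_path)


# Stand-in for storage that moves with the package (assumption).
workdir = tempfile.mkdtemp()
marker = os.path.join(workdir, "app.running")

first = start(marker)    # fresh start: no marker yet
second = start(marker)   # restart with no clean shutdown in between
clean_shutdown(marker)
third = start(marker)    # start after a clean shutdown

print(first, second, third)
```

When `start` returns True, the application would run whatever recovery it needs (for example, rolling back incomplete work) before resuming service.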

In the event of a LAN interface failure, a local switch is done to a standby LAN interface if one exists. If a heartbeat LAN interface fails and no standby or redundant heartbeat
