Managing Serviceguard Sixteenth Edition, March 2009

Failure. Only one LAN has been configured for both heartbeat and data traffic. During
the course of operations, heavy application traffic monopolizes the bandwidth of the
network, preventing heartbeat packets from getting through.
Since SystemA does not receive heartbeat messages from SystemB, SystemA attempts
to re-form as a one-node cluster. Likewise, since SystemB does not receive heartbeat
messages from SystemA, SystemB also attempts to re-form as a one-node cluster.
During the election protocol, each node votes for itself, giving both nodes 50 percent
of the vote. Because both nodes have 50 percent of the vote, both nodes now vie for the
cluster lock. Only one node will get the lock.
Outcome. Assume SystemA gets the cluster lock. SystemA re-forms as a one-node
cluster. After re-formation, SystemA makes sure all applications configured to run
on an existing cluster node are running. When SystemA discovers that Package2 is not
running in the cluster, it tries to start Package2 if Package2 is configured to run
on SystemA.
SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the
cluster. To release all resources related to Package2 (such as exclusive access to volume
group vg02 and the Package2 IP address) as quickly as possible, SystemB halts
(system reset).
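The tie-break described above can be sketched as a small simulation. This is an illustrative model only, not Serviceguard code: the ClusterLock class and the arbitrate function are hypothetical names, and the branches for partitions holding more or less than half of the votes are assumptions added for completeness. Only the exactly-50-percent case comes from the scenario in the text.

```python
import threading


class ClusterLock:
    """Toy stand-in for a cluster lock: the first claimant wins, all others lose."""

    def __init__(self):
        self._guard = threading.Lock()
        self.owner = None

    def try_acquire(self, node):
        with self._guard:
            if self.owner is None:
                self.owner = node
                return True
            return False


def arbitrate(partition_votes, total_votes, node, lock):
    """Return the action a node takes after it stops receiving heartbeats."""
    if 2 * partition_votes > total_votes:
        return "re-form"   # assumption: a clear majority does not need the lock
    if 2 * partition_votes < total_votes:
        return "halt"      # assumption: a minority cannot re-form the cluster
    # Exactly half of the votes: the cluster lock breaks the tie.
    return "re-form" if lock.try_acquire(node) else "halt"


lock = ClusterLock()
outcome = {}
# Each node votes for itself, so each one-node partition holds 1 of 2 votes.
# SystemA is assumed to reach the lock first, as in the text.
for node in ("SystemA", "SystemB"):
    outcome[node] = arbitrate(1, 2, node, lock)

print(outcome)  # {'SystemA': 're-form', 'SystemB': 'halt'}
```

In this model the losing node's "halt" corresponds to the system reset that releases vg02 and the Package2 IP address.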
NOTE: If AUTOSTART_CMCLD in /etc/rc.conf.d/cmcluster ($SGAUTOSTART)
is set to zero, the node will not attempt to join the cluster when it comes back up.
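As a rough illustration of checking this setting, the sketch below reads a value from text in shell-style NAME=value form. It assumes the file uses plain shell variable assignments with optional # comments; read_setting and the SAMPLE contents are hypothetical and are not part of Serviceguard.

```python
def read_setting(config_text, key):
    """Return the last value assigned to key, or None if key is never set."""
    result = None
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            result = value.strip().strip("\"'")  # tolerate quoted values
    return result


# Hypothetical file contents, for illustration only.
SAMPLE = """\
# cluster daemon autostart
AUTOSTART_CMCLD=0
"""

if read_setting(SAMPLE, "AUTOSTART_CMCLD") == "0":
    print("node will not attempt to join the cluster at boot")
```

On a real system the text would come from reading /etc/rc.conf.d/cmcluster. The sketch takes a string so that it can be tried anywhere.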
For more information on cluster failover, see the white paper Optimizing Failover Time
in a Serviceguard Environment (version A.11.19 and later) at http://www.docs.hp.com
-> High Availability -> Serviceguard -> White Papers. For
troubleshooting information, see “Cluster Re-formations Caused by
MEMBER_TIMEOUT Being Set too Low” (page 376).
Responses to Hardware Failures
If a serious system problem occurs, such as a system panic or physical disruption of
the SPU's circuits, Serviceguard recognizes a node failure and transfers the failover
packages currently running on that node to an adoptive node elsewhere in the cluster.
(System multi-node and multi-node packages do not fail over.)
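A minimal sketch of this relocation decision follows. It assumes the simplest selection rule, the first node in the package's configured list that is still up. Actual Serviceguard failover policies may differ. The Package class, its fields, and the names Package1 and mnp1 are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Package:
    name: str
    kind: str          # "failover", "multi-node", or "system multi-node"
    nodes: list        # configured nodes, primary first (hypothetical field)
    running_on: str


def relocate(packages, failed_node, up_nodes):
    """Map each failover package on the failed node to an adoptive node."""
    moves = {}
    for pkg in packages:
        if pkg.running_on != failed_node or pkg.kind != "failover":
            continue   # unaffected, or a package type that does not fail over
        candidates = [n for n in pkg.nodes if n in up_nodes and n != failed_node]
        moves[pkg.name] = candidates[0] if candidates else None
    return moves


packages = [
    Package("Package1", "failover", ["SystemA", "SystemB"], "SystemA"),
    Package("Package2", "failover", ["SystemB", "SystemA"], "SystemB"),
    Package("mnp1", "multi-node", ["SystemA", "SystemB"], "SystemB"),
]
moves = relocate(packages, failed_node="SystemB", up_nodes={"SystemA"})
print(moves)  # {'Package2': 'SystemA'}
```

A value of None in the result would mean that no configured node is available to adopt the package.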
The new location for each failover package is determined by that package's configuration
file, which lists primary and alternate nodes for the package. Transfer of a package to
another node does not transfer the program counter. Processes in a transferred package
will restart from the beginning. In order for an application to be swiftly restarted after
a failure, it must be “crash-tolerant”; that is, all processes in the package must be written
so that they can detect such a restart. This is the same application design required for
restart after a normal system crash.
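One common way to make a process detect such a restart is to record progress in a checkpoint file and look for that file at startup. The sketch below illustrates the idea under several assumptions: the run function, the progress.json file name, and the crash_after parameter (which only simulates a failure) are hypothetical, and a real application would keep the checkpoint on storage that moves with the package.

```python
import json
import os
import tempfile


def save_progress(state_path, done):
    """Write the checkpoint to a temporary file, then swap it into place."""
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"done": done}, f)
    os.replace(tmp, state_path)


def run(state_path, items, sink, crash_after=None):
    """Process items in order; return True if this run resumed after a crash."""
    done = 0
    restarted = os.path.exists(state_path)
    if restarted:
        with open(state_path) as f:
            done = json.load(f)["done"]
    handled = 0
    for i in range(done, len(items)):
        if crash_after is not None and handled == crash_after:
            raise RuntimeError("simulated crash")
        sink.append(items[i])          # stand-in for the real work
        save_progress(state_path, i + 1)
        handled += 1
    os.remove(state_path)              # clean finish leaves no checkpoint
    return restarted


items = ["a", "b", "c", "d"]
sink = []
with tempfile.TemporaryDirectory() as workdir:
    state = os.path.join(workdir, "progress.json")
    try:
        run(state, items, sink, crash_after=2)   # first start fails part way
    except RuntimeError:
        pass
    resumed = run(state, items, sink)            # restart from the beginning

print(resumed, sink)  # True ['a', 'b', 'c', 'd']
```

A crash between doing the work and saving the checkpoint would cause that item to be repeated, so in this design each unit of work should be safe to repeat.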
In the event of a LAN interface failure, a local switch is done to a standby LAN interface
if one exists. If a heartbeat LAN interface fails and no standby or redundant heartbeat