Election of Cluster Membership
After a node believes another node has failed, it begins the cluster re-formation process. The cluster
nodes elect the nodes that will be members of a newly re-formed cluster.
Each healthy node tries to take over the work of the cluster. It tries to change cluster membership to
include the nodes it can communicate with and exclude the nodes it cannot reach.
If one node has failed but all the others can still communicate with each other, the others quickly
form a group that excludes that node. However, it could be that a group of healthy nodes cannot
communicate with some other healthy nodes. In this case, several groups could try to form a cluster.
The group that achieves quorum will become the new cluster. There are two ways for a group to
achieve quorum:
• If the group includes more than half of the nodes that were active the last time the cluster was
formed, it has quorum because it has the majority.
• If two groups each have exactly half of the nodes that were active the last time the cluster was
formed, the group that acquires the cluster lock achieves quorum. (See “Lock Acquisition” for more
about the cluster lock.)
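The quorum rules above can be pictured with a small sketch. The following Python fragment is
illustrative only and is not a Serviceguard interface; the function and the try_acquire_cluster_lock
helper are hypothetical.

    def group_has_quorum(group_size, prior_membership_size, try_acquire_cluster_lock):
        if 2 * group_size > prior_membership_size:
            return True                        # majority of the prior membership
        if 2 * group_size == prior_membership_size:
            return try_acquire_cluster_lock()  # exact tie: the cluster lock decides
        return False                           # a minority group never forms the cluster

For example, if a four-node cluster splits into two groups of two, neither group has a majority, so each
calls try_acquire_cluster_lock and only the group that wins the lock re-forms the cluster.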
The group that achieves quorum takes over the work of the cluster. The excluded nodes are not
allowed to proceed in cluster re-formation, and they will be rebooted.
The time required for the election depends primarily on the values of the NODE_TIMEOUT and
HEARTBEAT_INTERVAL parameters. If there is a series of temporary interruptions and recoveries,
nodes may lose and then regain contact several times; in that case the process takes longer because
the election is repeated.
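The role of these two parameters can be illustrated with a simple failure-detector sketch. This is a toy
model, not Serviceguard's implementation; the class and method names are invented.

    import time

    class PeerMonitor:
        def __init__(self, node_timeout, heartbeat_interval):
            self.node_timeout = node_timeout              # silence threshold (NODE_TIMEOUT)
            self.heartbeat_interval = heartbeat_interval  # how often peers send heartbeats
            self.last_heard = {}

        def heartbeat_received(self, peer):
            # Called whenever a heartbeat arrives from a peer.
            self.last_heard[peer] = time.monotonic()

        def suspects(self, peer):
            # A peer silent for longer than node_timeout is suspected failed,
            # which triggers cluster re-formation. A short node_timeout detects
            # real failures sooner, but turns brief network interruptions into
            # repeated elections, as described above.
            last = self.last_heard.get(peer)
            return last is None or time.monotonic() - last > self.node_timeout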
Lock Acquisition
If two equal-sized groups try to re-form the cluster, the cluster lock acts as arbitrator or tie-breaker.
Whichever group acquires the cluster lock will achieve quorum and form the new cluster membership.
Serviceguard uses three types of cluster locks:
• Quorum server (HP-UX and Linux®)
• LVM lock disk (HP-UX only)
• Lock LUN (HP-UX and Linux®)
A two-node cluster is required to have a cluster lock. In clusters of three or more nodes, a lock is
strongly recommended. Lock disks can be used for clusters of two, three, or four nodes; a quorum
server can be used in a cluster of any size.
As of Serviceguard 11.18, acquiring a lock takes about the same amount of time whichever type of
lock is used.
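The tie-breaking role of the lock can be seen in a minimal sketch in which two equal-sized groups race
for a single shared lock and exactly one can win. The ClusterLock class below is a hypothetical
stand-in for a quorum server, LVM lock disk, or lock LUN, not an actual Serviceguard interface.

    import threading

    class ClusterLock:
        def __init__(self):
            self._claimed_by = None
            self._mutex = threading.Lock()

        def try_acquire(self, group_id):
            # First-come, first-served: the first group to claim the lock keeps it.
            with self._mutex:
                if self._claimed_by is None:
                    self._claimed_by = group_id
                return self._claimed_by == group_id

    lock = ClusterLock()
    print(lock.try_acquire("group_A"))   # True:  this half achieves quorum
    print(lock.try_acquire("group_B"))   # False: this half is excluded and its nodes reboot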
Quiescence
Quiescence is a quiet waiting time after new cluster membership is determined. Nodes that are not in
the new membership are forcibly rebooted. The waiting time protects against data corruption: it
ensures that the reboot of an excluded node has completed, so that the node is no longer trying to run
a package or issue any I/O.
Quiescence is important when some nodes in the cluster cannot communicate with the others but
could still run applications, particularly if the nodes have access to a common database.
Quiescence is calculated by Serviceguard, and the user cannot directly change it.
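As a minimal illustration of the idea (the function names and parameters are hypothetical, and the
actual quiescence period is computed internally by Serviceguard):

    import time

    def complete_reformation(new_membership, quiescence_seconds, start_packages):
        # Wait out the quiescence period so that any node excluded from
        # new_membership has finished rebooting and can no longer run a
        # package or issue I/O against shared storage.
        time.sleep(quiescence_seconds)
        # Only now is it safe to start packages on the surviving nodes.
        start_packages(new_membership)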