Arbitration For Data Integrity in Serviceguard Clusters, July 2007

Cluster Membership Concepts
Quorum
Cluster re-formation takes place when there is some change in the
cluster membership. In general, the algorithm for cluster re-formation
requires the new cluster to achieve a cluster quorum of a strict
majority (that is, more than 50%) of the nodes previously running. If
each half (exactly 50%) of a previously running cluster were allowed to
re-form on its own, there would be a split-brain situation in which two
instances of the same cluster were running.
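The strict-majority test can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Serviceguard's implementation; the function name has_quorum and its parameters are hypothetical.

```python
def has_quorum(surviving_nodes: int, previous_nodes: int) -> bool:
    """Return True if the survivors form a strict majority (more than 50%)
    of the nodes that were previously running.

    Hypothetical helper for illustration; not part of Serviceguard.
    """
    if previous_nodes <= 0 or not 0 <= surviving_nodes <= previous_nodes:
        raise ValueError("node counts out of range")
    # Integer arithmetic avoids any rounding question at exactly 50%.
    return 2 * surviving_nodes > previous_nodes


print(has_quorum(3, 4))  # 3 of 4 nodes: a strict majority
print(has_quorum(2, 4))  # exactly half: not a strict majority
```

Under this rule, a group holding exactly half of the previous membership never re-forms on the majority test alone.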
Split-Brain
How could a split-brain situation arise? Suppose a two-node cluster
experiences the loss of all network connections between the nodes. This
means that the cluster heartbeat ceases. Each node will then try to re-form
the cluster separately. If this were allowed to occur, the same application
could run in two different locations and corrupt its data. In a
split-brain scenario, different incarnations
of an application could end up simultaneously accessing the same disks.
One incarnation might well be initiating recovery activity while the
other is modifying the state of the disks. Serviceguard’s quorum
requirement is designed to prevent a split-brain situation.
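To see how the quorum requirement blocks this scenario, the following sketch models a network partition as a list of disjoint node groups and reports which groups hold a strict majority of the previous membership. The function name groups_allowed_to_reform and the node labels are hypothetical, and the model deliberately ignores timing and any arbitration mechanism.

```python
from typing import List, Set


def groups_allowed_to_reform(previous: Set[str],
                             groups: List[Set[str]]) -> List[Set[str]]:
    """Return the groups that contain more than half of the previous members.

    Hypothetical model for illustration; not Serviceguard's algorithm.
    """
    return [g for g in groups if 2 * len(g & previous) > len(previous)]


# Two-node cluster that loses all network connections between the nodes:
# each node is alone, so each holds exactly 50% and neither qualifies.
two_node = {"node1", "node2"}
split = [{"node1"}, {"node2"}]
print(groups_allowed_to_reform(two_node, split))  # []

# Five-node cluster split 3/2: only the three-node group qualifies.
five_node = {"a", "b", "c", "d", "e"}
print(groups_allowed_to_reform(five_node, [{"a", "b", "c"}, {"d", "e"}]))
```

Because two disjoint groups cannot both contain more than half of the same membership, the returned list never has more than one entry, which is why a strict-majority rule rules out two simultaneous instances of the cluster.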
How likely is a split-brain situation? The answer depends partly
on the types of inter-node communication the cluster is using: some
types are more robust than others. For example, the use of the older
coaxial cable technology makes communication loss a significant
problem. In that technology, the loss of termination would frequently
result in the loss of an entire LAN. On the other hand, the use of
redundant groups of current Ethernet hubs makes the loss of
communication between nodes extremely unlikely, but it is still possible.
In general, with mission-critical data, it is worth the cost to eliminate
even small risks associated with split-brain scenarios.
A split-brain situation is more likely to occur in a two-node cluster than
in a larger local cluster that splits into two even-sized sub-groups.
Split-brain is also more likely to occur in a disaster-tolerant cluster
where separate groups of nodes are located in different data centers.