Arbitration For Data Integrity in Serviceguard Clusters, July 2007

Arbitration in Disaster-Tolerant Clusters

Note that if the first lock disk is located in the first data center,
then when the heartbeat is lost the first data center will normally
obtain the lock first because it is closest to the disk. Thus, in this
scenario, the first data center will re-form the cluster.

3. If a node in one data center succeeds in obtaining the first lock
disk, but the disk link is not viable because the other data center is
down, then that data center will not be able to obtain the second lock
disk. Because the second lock was not refused, however, it will still be
allowed to re-form the cluster. This is the expected behavior when there
is a disaster.

4. If there is a loss of both heartbeat and disk link, there is a danger
of split brain because each sub-cluster, attempting to acquire both lock
disks, is able to obtain the lock in its own data center, and is not
refused the other lock. It is important to minimize or eliminate this
slight danger by ensuring that data and heartbeat links are separately
routed between data centers.

NOTE A dual lock disk configuration does not provide a redundant cluster
lock. In fact, the dual lock is a compound lock, and both disks have to
participate in the protocol of lock acquisition by the two equal-sized
sets of nodes. Even when mirrored LVM is used via MirrorDisk/UX, the
lock disk area is not mirrored.

At cluster formation time, a set of nodes must gain access to one disk,
and must either gain access to the other disk or not be denied access to
it. (“Not being denied” occurs when a disk is not accessible to a set of
nodes.) The group of nodes that gains access to at least one disk and is
not denied access by any disk is allowed to form the new cluster.
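
The formation rule above can be sketched in a few lines of Python. This
is an illustrative model only, not Serviceguard code: the names
LockResult and may_form_cluster are hypothetical, and the three possible
outcomes per disk are an assumption drawn from the wording of the rule.

```python
from enum import Enum


class LockResult(Enum):
    """Hypothetical outcomes of one attempt to take a lock disk."""
    ACQUIRED = "acquired"        # this set of nodes gained access to the disk
    DENIED = "denied"            # disk was reachable, but access was refused
    UNREACHABLE = "unreachable"  # disk not accessible, so "not denied"


def may_form_cluster(first_disk, second_disk):
    """Apply the dual-lock rule: gain at least one disk, be denied by none."""
    results = (first_disk, second_disk)
    gained_one = LockResult.ACQUIRED in results
    denied_any = LockResult.DENIED in results
    return gained_one and not denied_any


# Item 3: the other data center is down, so its lock disk is unreachable.
disaster = may_form_cluster(LockResult.ACQUIRED, LockResult.UNREACHABLE)

# Heartbeat lost but disk link intact: the slower set is refused one lock.
loser = may_form_cluster(LockResult.DENIED, LockResult.ACQUIRED)

# Item 4: heartbeat and disk link both lost. Each sub-cluster takes its
# local lock and is not refused the remote one, so both may form.
site_a = may_form_cluster(LockResult.ACQUIRED, LockResult.UNREACHABLE)
site_b = may_form_cluster(LockResult.UNREACHABLE, LockResult.ACQUIRED)
split_brain_possible = site_a and site_b
```

In this model, the split-brain case in item 4 falls directly out of the
rule: an unreachable disk counts as "not denied" for both sub-clusters,
which is why separately routed data and heartbeat links matter.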

If one of the dual lock disks fails, Serviceguard will detect this when
it carries out periodic checking, and it will write a message to the
syslog file. After the loss of one of the lock disks, if the failure of a
cluster node results in the need for arbitration, the cluster will go
down.