Veritas Storage Foundation 5.1 SP1 Cluster File System Administrator's Guide (5900-1738, April 2011)
Example of a pre-existing network partition (split-brain)
Figure 7-1 shows a two-node cluster in which the severed cluster interconnect
poses a potential split-brain condition.
Figure 7-1 Pre-existing network partition (split-brain)
[Figure: Node 0 and Node 1 race for three coordinator disks. The callouts show this sequence:]
First: Interconnect failure causes both nodes to race.
Second: Node 0 ejects key B for disk 1 and succeeds. Node 1 fails to eject key A for disk 1 and rereads keys.
Third: Node 0 ejects key B for disk 2 and succeeds. Node 1 fails to eject key A for disk 2 and rereads keys.
Fourth: Node 0 ejects key B for disk 3 and succeeds. Node 1 fails to eject keys for disk 3.
Fifth: Node 0 continues and performs recovery.
Finally: Node 1 panics and reboots.
Because the fencing module operates identically on each system, both nodes
assume the other has failed and carry out fencing operations to ensure the other
node is ejected. The VCS GAB module on each node determines the peer has failed
due to loss of heartbeats and passes the membership change to the fencing module.
Each side “races” to gain control of the coordinator disks. Only a registered node
can eject the registration of another node, so only one side successfully completes
the command on each disk.
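
The per-disk rule can be illustrated with a small simulation. This is only a toy model of the behavior described above, not the vxfen implementation: the CoordinatorDisk class and its eject method are hypothetical names introduced for illustration, and the keys "A" and "B" follow the labels in Figure 7-1.

```python
class CoordinatorDisk:
    """Toy model (hypothetical) of one coordinator disk holding node keys."""

    def __init__(self, keys):
        self.keys = set(keys)

    def eject(self, own_key, victim_key):
        """Remove victim_key only if own_key is still registered on the disk."""
        if own_key not in self.keys:
            return False  # caller was already ejected, so its command fails
        self.keys.discard(victim_key)
        return True


disk = CoordinatorDisk({"A", "B"})
first = disk.eject("A", "B")   # Node 0 (key A) reaches the disk first
second = disk.eject("B", "A")  # Node 1 (key B) arrives after losing its key
print(first, second, sorted(disk.keys))
```

Whichever node reaches a disk first removes the other's key, so the later eject attempt on that disk fails.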
The side that successfully ejects the peer from a majority of the coordinator disks
wins. The fencing module on the winning side then passes the membership change
up to VCS and other higher-level packages registered with the fencing module,
allowing VCS to invoke recovery actions. The losing side forces a kernel panic and
reboots.
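
The majority rule can be sketched as follows. This is a minimal illustration under assumptions, not the actual fencing code: decide_race, the node names, and the "recover"/"panic" labels are hypothetical and chosen only to mirror the outcome described above.

```python
from collections import Counter


def decide_race(nodes, disk_winners):
    """Map each node to 'recover' or 'panic'.

    disk_winners lists, for each coordinator disk, the node whose eject
    command succeeded on that disk (hypothetical representation).
    """
    needed = len(disk_winners) // 2 + 1  # strict majority of the disks
    wins = Counter(disk_winners)         # missing nodes count as zero wins
    return {n: "recover" if wins[n] >= needed else "panic" for n in nodes}


# Scenario from Figure 7-1: Node 0 wins all three coordinator disks.
outcome = decide_race(["node0", "node1"], ["node0", "node0", "node0"])
print(outcome)
```

With an odd number of coordinator disks, exactly one side can reach a majority, so the two nodes never both continue.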
Recovering from a pre-existing network partition (split-brain)
The fencing module vxfen prevents a node from starting up after a network
partition and subsequent panic and reboot of a node.
Troubleshooting SFCFS
Troubleshooting fenced configurations