Managing Serviceguard Sixteenth Edition, March 2009

Cluster Configuration Planning
A cluster should be designed to provide the quickest possible recovery from failures.
The actual time required to recover from a failure depends on several factors:
• The value of the cluster MEMBER_TIMEOUT. See MEMBER_TIMEOUT under
  “Cluster Configuration Parameters” (page 138) for recommendations.
• The availability of raw disk access. Applications that use raw disk access should
  be designed with crash recovery services.
• The application and database recovery time. Applications and databases should be
  designed for the shortest possible recovery time.
In addition, you must provide consistency across the cluster so that:
• User names are the same on all nodes.
• UIDs are the same on all nodes.
• GIDs are the same on all nodes.
• Applications in the system area are the same on all nodes.
• System time is consistent across the cluster.
• Files that could be used by more than one node, such as files in the /usr directory,
  are the same on all nodes.
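The user-name, UID, and GID requirements above lend themselves to a mechanical check. The following is a minimal sketch, not a Serviceguard tool: it assumes you have already collected the contents of each node's passwd (or group) file as text, and it reports every name whose numeric ID differs between nodes or is missing on some node. The function names, node names, and sample entries are hypothetical and are used only for illustration.

```python
def parse_ids(text, id_field=2):
    """Map each name to its numeric ID from passwd- or group-style text.

    In both formats the name is field 0 and the numeric ID (UID for passwd,
    GID for group) is field 2 of each colon-separated line.
    """
    ids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) > id_field and fields[id_field].isdigit():
            ids[fields[0]] = int(fields[id_field])
    return ids


def find_mismatches(files_by_node):
    """Return {name: {node: id_or_None}} for every name that is inconsistent.

    A name is inconsistent if its ID differs between nodes, or if it exists
    on some nodes but not on others (reported as None for the missing node).
    """
    parsed = {node: parse_ids(text) for node, text in files_by_node.items()}
    all_names = set().union(*parsed.values())
    mismatches = {}
    for name in sorted(all_names):
        per_node = {node: ids.get(name) for node, ids in parsed.items()}
        if len(set(per_node.values())) > 1:
            mismatches[name] = per_node
    return mismatches


# Hypothetical sample data: "oracle" has a different UID on each node.
NODE1_PASSWD = "root:x:0:3::/:/sbin/sh\noracle:x:201:200::/home/oracle:/usr/bin/sh\n"
NODE2_PASSWD = "root:x:0:3::/:/sbin/sh\noracle:x:305:200::/home/oracle:/usr/bin/sh\n"

if __name__ == "__main__":
    print(find_mismatches({"node1": NODE1_PASSWD, "node2": NODE2_PASSWD}))
```

Because group files also carry the numeric ID in the third field, the same functions can be applied to group data to compare GIDs. An empty result means that no inconsistencies were found in the data supplied.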
Heartbeat Subnet and Cluster Re-formation Time
The speed of cluster re-formation depends on the number of heartbeat subnets.
If the cluster has only a single heartbeat network, and a network card on that network
fails, heartbeats will be lost while the failure is being detected and the IP address is
being switched to a standby interface. The cluster may treat these lost heartbeats as a
failure and re-form without one or more nodes. To prevent this, a minimum
MEMBER_TIMEOUT value of 14 seconds is required for clusters with a single heartbeat
network.
If there is more than one heartbeat subnet, and there is a failure on one of them,
heartbeats will go through another, so you can configure a smaller MEMBER_TIMEOUT
value.
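The rule described above can be expressed as a simple planning check. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the timeout is passed in seconds for readability (the units used in the cluster configuration file are not addressed here), and the only limit encoded is the 14-second minimum that this section gives for a single heartbeat network. No minimum is encoded for clusters with more than one heartbeat subnet, because this section does not state one.

```python
# Minimum given in this section for a cluster with one heartbeat network.
SINGLE_SUBNET_MIN_TIMEOUT_S = 14


def check_member_timeout(member_timeout_s, heartbeat_subnets):
    """Return a list of problems; an empty list means the rule is satisfied."""
    problems = []
    if heartbeat_subnets < 1:
        # Sanity check only: a cluster cannot exchange heartbeats without a subnet.
        problems.append("at least one heartbeat subnet is required")
    elif heartbeat_subnets == 1 and member_timeout_s < SINGLE_SUBNET_MIN_TIMEOUT_S:
        problems.append(
            f"MEMBER_TIMEOUT of {member_timeout_s} s is below the "
            f"{SINGLE_SUBNET_MIN_TIMEOUT_S} s minimum for a single heartbeat network"
        )
    return problems


if __name__ == "__main__":
    # One heartbeat subnet with a 10-second timeout violates the rule.
    for problem in check_member_timeout(10, 1):
        print(problem)
```

A planned value that passes this check still needs to be reviewed against the MEMBER_TIMEOUT recommendations under “Cluster Configuration Parameters”.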