Managing Serviceguard Sixteenth Edition, March 2009
Cluster Configuration Planning
A cluster should be designed to provide the quickest possible recovery from failures.
The actual time required to recover from a failure depends on several factors:
• The value of the cluster MEMBER_TIMEOUT.
See MEMBER_TIMEOUT under “Cluster Configuration Parameters” (page 138)
for recommendations.
• The availability of raw disk access. Applications that use raw disk access should
be designed with crash recovery services.
• The application and database recovery time. Applications and databases should be
designed for the shortest possible recovery time.
In addition, you must provide consistency across the cluster so that:
• User names are the same on all nodes.
• UIDs are the same on all nodes.
• GIDs are the same on all nodes.
• Applications in the system area are the same on all nodes.
• System time is consistent across the cluster.
• Files that could be used by more than one node, such as files in the /usr directory,
are the same on all nodes.
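As an illustration of the user-name and UID consistency requirement above, the following is a minimal sketch, not a Serviceguard tool. It assumes you have already collected the contents of /etc/passwd from each node by some means of your own; the function names parse_passwd and find_uid_mismatches are hypothetical. The same approach could be applied to /etc/group to compare GIDs.

```python
def parse_passwd(text):
    """Map user name -> UID from passwd-format text (name:passwd:uid:gid:...)."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Skip malformed entries and entries without a numeric UID field.
        if len(fields) >= 3 and fields[2].isdigit():
            users[fields[0]] = int(fields[2])
    return users


def find_uid_mismatches(passwd_by_node):
    """Return {user: {node: uid or None}} for users that differ between nodes.

    passwd_by_node is a hypothetical input: a dict mapping each node name
    to the text of that node's passwd file. A UID of None means the user
    does not exist on that node.
    """
    parsed = {node: parse_passwd(text) for node, text in passwd_by_node.items()}
    all_users = set()
    for users in parsed.values():
        all_users.update(users)

    mismatches = {}
    for user in sorted(all_users):
        per_node = {node: users.get(user) for node, users in parsed.items()}
        if len(set(per_node.values())) > 1:
            mismatches[user] = per_node
    return mismatches
```

An empty result means every user name maps to the same UID on every node supplied; any entry in the result identifies a user that is missing on a node or has a different UID there.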
Heartbeat Subnet and Cluster Re-formation Time
The speed of cluster re-formation depends on the number of heartbeat subnets.
If the cluster has only a single heartbeat network, and a network card on that network
fails, heartbeats will be lost while the failure is being detected and the IP address is
being switched to a standby interface. The cluster may treat these lost heartbeats as a
failure and re-form without one or more nodes. To prevent this, a minimum
MEMBER_TIMEOUT value of 14 seconds is required for clusters with a single heartbeat
network.
If there is more than one heartbeat subnet, and there is a failure on one of them,
heartbeats will go through another, so you can configure a smaller MEMBER_TIMEOUT
value.
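The rule above can be expressed as a simple planning check. This is a sketch only: the function name check_member_timeout is hypothetical, the timeout is expressed in seconds for readability (check the parameter description for the unit used in the cluster configuration file), and the only limit it encodes is the 14-second minimum stated above for a single heartbeat network. It does not model any lower bound that applies to clusters with several heartbeat subnets.

```python
# Minimum MEMBER_TIMEOUT, in seconds, for a cluster with a single
# heartbeat network, as stated in the text above.
MIN_SINGLE_HEARTBEAT_TIMEOUT_S = 14


def check_member_timeout(member_timeout_s, heartbeat_subnets):
    """Return True if the planned timeout satisfies the single-heartbeat rule.

    member_timeout_s:  planned MEMBER_TIMEOUT, in seconds (assumed unit).
    heartbeat_subnets: number of heartbeat subnets planned for the cluster.
    """
    if heartbeat_subnets < 1:
        raise ValueError("a cluster needs at least one heartbeat subnet")
    if heartbeat_subnets == 1:
        return member_timeout_s >= MIN_SINGLE_HEARTBEAT_TIMEOUT_S
    # With more than one heartbeat subnet a smaller value may be configured;
    # this sketch applies no further limit in that case.
    return True
```

For example, check_member_timeout(10, 1) reports a problem, while the same 10-second value passes this check for a cluster planned with two heartbeat subnets.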