Managing Serviceguard Sixteenth Edition, March 2009

Cluster Configuration Planning

A cluster should be designed to provide the quickest possible recovery from failures.

The actual time required to recover from a failure depends on several factors:

• The value of the cluster MEMBER_TIMEOUT. See MEMBER_TIMEOUT under “Cluster Configuration Parameters” (page 138) for recommendations.

• The availability of raw disk access. Applications that use raw disk access should be designed with crash recovery services.

• The application and database recovery time. Applications and databases should be designed for the shortest possible recovery time.

In addition, you must provide consistency across the cluster so that:

• User names are the same on all nodes.

• UIDs are the same on all nodes.

• GIDs are the same on all nodes.

• Applications in the system area are the same on all nodes.

• System time is consistent across the cluster.

• Files that could be used by more than one node, such as files in the /usr directory, are the same on all nodes.
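As a rough illustration of the UID and GID consistency requirement above, the following Python sketch compares passwd-style (or group-style) entries gathered from several nodes and reports any name whose numeric ID differs or is missing on some node. It is only a sketch: the function names, node names, and sample entries are hypothetical, and how the file contents are collected from each node is left out.

```python
def parse_ids(text):
    """Map each name to its numeric ID from passwd- or group-style text.

    Both formats are colon-separated, with the name in the first field
    and the numeric ID (UID or GID) in the third field.
    """
    ids = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip malformed entries instead of guessing
        ids[fields[0]] = int(fields[2])
    return ids


def find_mismatches(files_by_node):
    """Return {name: {node: id or None}} for names that differ between nodes.

    A value of None means the name is missing on that node.
    """
    per_node = {node: parse_ids(text) for node, text in files_by_node.items()}
    all_names = set()
    for ids in per_node.values():
        all_names.update(ids)
    mismatches = {}
    for name in sorted(all_names):
        seen = {node: ids.get(name) for node, ids in per_node.items()}
        if len(set(seen.values())) > 1:
            mismatches[name] = seen
    return mismatches


# Hypothetical passwd contents collected from two nodes (assumed sample data).
node1 = "root:x:0:0:root:/:/sbin/sh\noracle:x:201:200::/home/oracle:/bin/sh\n"
node2 = "root:x:0:0:root:/:/sbin/sh\noracle:x:305:200::/home/oracle:/bin/sh\n"

for name, seen in find_mismatches({"node1": node1, "node2": node2}).items():
    print(name, seen)
```

The same function can be run on group files to compare GIDs, since the name and numeric ID occupy the same field positions there.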

Heartbeat Subnet and Cluster Re-formation Time

The speed of cluster re-formation depends on the number of heartbeat subnets.

If the cluster has only a single heartbeat network, and a network card on that network fails, heartbeats will be lost while the failure is being detected and the IP address is being switched to a standby interface. The cluster may treat these lost heartbeats as a failure and re-form without one or more nodes. To prevent this, a minimum MEMBER_TIMEOUT value of 14 seconds is required for clusters with a single heartbeat network.

If there is more than one heartbeat subnet, and there is a failure on one of them, heartbeats will go through another, so you can configure a smaller MEMBER_TIMEOUT value.
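The rule described in this section can be sketched as a small planning check. The Python below is an illustration only, not a Serviceguard tool: the function name and its inputs are hypothetical, and the timeout is expressed in seconds for readability, so it may need converting to whatever unit the cluster configuration file expects (see MEMBER_TIMEOUT under “Cluster Configuration Parameters”).

```python
# Minimum stated in the text for clusters with a single heartbeat network.
SINGLE_HEARTBEAT_MIN_SECONDS = 14.0


def check_member_timeout(timeout_seconds, heartbeat_subnets):
    """Return a list of warnings for a planned MEMBER_TIMEOUT value.

    Both arguments are hypothetical planning inputs supplied by the
    caller; nothing is read from a live cluster.
    """
    if heartbeat_subnets < 1:
        raise ValueError("a cluster needs at least one heartbeat subnet")
    warnings = []
    if heartbeat_subnets == 1 and timeout_seconds < SINGLE_HEARTBEAT_MIN_SECONDS:
        warnings.append(
            "single heartbeat network: MEMBER_TIMEOUT of "
            f"{timeout_seconds} s is below the required minimum of "
            f"{SINGLE_HEARTBEAT_MIN_SECONDS} s"
        )
    return warnings


# A single heartbeat network with a short timeout triggers a warning.
for message in check_member_timeout(10, heartbeat_subnets=1):
    print(message)

# With two heartbeat subnets, a smaller value is not flagged by this check.
print(check_member_timeout(10, heartbeat_subnets=2))
```

The check encodes only the single-network minimum given here. With several heartbeat subnets it reports nothing, because the text allows a smaller value in that case without stating a lower bound.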
