Optimizing Failover Time in a Serviceguard Environment, June 2007

Cluster parameter considerations

There are risks associated with the shorter failover time of Serviceguard Extension for Faster
Failover (SGeFF). If there is a temporary interruption, the timeout should be long enough for the
interruption to recover. Each administrator must decide how much time the cluster needs to confirm
the likelihood (if not certainty) of a failure. On a busy system where the networks, I/O, or CPU have
frequent or large spikes in activity, transient problems are likely to cause delayed heartbeats.

With SGeFF, a minimum value of 1.6 seconds for NODE_TIMEOUT, with 0.8 seconds for
HEARTBEAT_INTERVAL, is supported. With these values, the Serviceguard component of failover time
can be reduced to 5 seconds. Such low values are not suitable for all SGeFF environments.
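The minimums quoted above can be checked before a configuration is applied. The following is a minimal sketch, not a Serviceguard tool: the function names are hypothetical, and the conversion helper assumes that the cluster configuration file expresses both parameters in microseconds, which should be confirmed against the Serviceguard documentation for your release.

```python
# Minimum values supported with SGeFF, as stated in the text (in seconds).
MIN_NODE_TIMEOUT_S = 1.6
MIN_HEARTBEAT_INTERVAL_S = 0.8


def check_sgeff_timing(node_timeout_s, heartbeat_interval_s):
    """Return a list of problems; an empty list means both minimums are met.

    Hypothetical helper: it checks only the two minimums quoted in the text,
    not any other rule that Serviceguard itself may enforce.
    """
    problems = []
    if node_timeout_s < MIN_NODE_TIMEOUT_S:
        problems.append(
            f"NODE_TIMEOUT {node_timeout_s} s is below the "
            f"{MIN_NODE_TIMEOUT_S} s minimum"
        )
    if heartbeat_interval_s < MIN_HEARTBEAT_INTERVAL_S:
        problems.append(
            f"HEARTBEAT_INTERVAL {heartbeat_interval_s} s is below the "
            f"{MIN_HEARTBEAT_INTERVAL_S} s minimum"
        )
    return problems


def to_microseconds(seconds):
    """Convert seconds to whole microseconds.

    Assumption: the configuration file uses microseconds for these values.
    """
    return round(seconds * 1_000_000)


if __name__ == "__main__":
    print(check_sgeff_timing(1.6, 0.8))
    print(to_microseconds(1.6), to_microseconds(0.8))
```

Passing this check only means the values are within the supported range. As the text notes, values this low still need to be tested against the transient load of the actual cluster.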

NODE_TIMEOUT must be carefully set and carefully tested in an SGeFF cluster. The Faster Failover
cluster re-formation process finishes much more quickly than a standard re-formation, so it becomes
more likely that a healthy node with transient problems will be timed out of the cluster.

Node timeout should not be so short that it cannot tolerate transient problems or temporary
interruptions. Each administrator must determine the optimal time that Serviceguard should wait for
interruptions to recover before it times out and acts.

Node timeout should not be so short that a delayed heartbeat from a healthy node causes the cluster
to begin re-forming. If the timeout is too short, a node may be taken out of the cluster
unnecessarily, or the node may recover in time to rejoin the cluster. In the second case, the cluster
will re-form, but the membership will be just the same as before the re-formation.

The value of NODE_TIMEOUT will have a large effect in a Faster Failover cluster. For example,
two identical two-node clusters have a configuration valid for Faster Failover. One has a
standard Serviceguard implementation; the other has SGeFF enabled. They both have
QS_TIMEOUT_EXTENSION set to zero and NODE_TIMEOUT set to 2 seconds. If there is a transient
problem that lasts for 8 seconds, the two clusters will have different results:

• With a standard Serviceguard implementation, re-formation will take about 28 seconds. The
  transient problem will recover before re-formation is done, so the node will stay up and will be
  able to rejoin the cluster.
• With SGeFF, the re-formation will complete in 6 seconds, less than the time needed for the
  problem to recover. The cluster will re-form without the affected node, and that node will be
  rebooted.
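The two outcomes above can be sketched as a small decision function. This is an illustrative model only, not Serviceguard's actual logic. The function name is hypothetical, and the sketch assumes that the quoted re-formation times are measured from the onset of the problem and that a problem shorter than NODE_TIMEOUT never triggers re-formation.

```python
def transient_outcome(transient_s, node_timeout_s, reformation_s):
    """Classify what happens to a node that has a transient problem.

    transient_s    -- how long the transient problem lasts, in seconds
    node_timeout_s -- NODE_TIMEOUT, in seconds
    reformation_s  -- time until cluster re-formation completes, in seconds
                      (assumed to be measured from the start of the problem)
    """
    if transient_s < node_timeout_s:
        # Heartbeats resume before the node times out.
        return "no re-formation"
    if transient_s < reformation_s:
        # The node recovers while re-formation is still in progress.
        return "node rejoins, membership unchanged"
    # Re-formation finishes first; the cluster re-forms without the node.
    return "node removed and rebooted"


if __name__ == "__main__":
    # Figures from the example: 8-second problem, NODE_TIMEOUT of 2 seconds.
    print("standard:", transient_outcome(8, 2, 28))
    print("SGeFF:   ", transient_outcome(8, 2, 6))
```

With the figures from the example, the standard cluster falls into the second branch and the SGeFF cluster into the third, which matches the two bullets.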

Failover in an SGeFF cluster is much faster than failover in a standard Serviceguard cluster. Using
the example clusters discussed above, the SGeFF cluster can be set to twice the node timeout of the
standard cluster and still require less than half the time for the Serviceguard component of failover:

• With the standard Serviceguard cluster configured as discussed above, with a node timeout of
  2 seconds, the Serviceguard component of failover completes in 28 seconds.
• With SGeFF installed and the node timeout changed to 4 seconds, the Serviceguard component of
  failover completes in 12 seconds.
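The "twice the timeout, less than half the time" comparison can be verified with the figures quoted above. The short sketch below only restates that arithmetic. The dictionary keys and the helper name are illustrative, and the numbers apply to this example rather than to clusters in general.

```python
# Figures quoted in the text for the two example clusters (in seconds).
standard = {"node_timeout_s": 2, "failover_s": 28}
sgeff = {"node_timeout_s": 4, "failover_s": 12}


def ratio(new, old, key):
    """Return new[key] / old[key] as a float."""
    return new[key] / old[key]


timeout_ratio = ratio(sgeff, standard, "node_timeout_s")
failover_ratio = ratio(sgeff, standard, "failover_s")

if __name__ == "__main__":
    print(f"node timeout ratio:  {timeout_ratio:.2f}")
    print(f"failover time ratio: {failover_ratio:.2f}")
```

The node timeout ratio is 2 and the failover time ratio is 12/28, which is below one half.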

Virtual partitions may have different latency characteristics than independent nodes due to hardware
and firmware sharing. If you will be using virtual partitions in your cluster, there may be additional
considerations when testing the node timeout value for your configuration. For more information, see
the white paper “Serviceguard Cluster Configuration for Partitioned Systems”, available from