Optimizing Failover Time in a Serviceguard Environment, June 2007
Cluster parameter considerations
There are risks associated with the shorter failover time of Serviceguard Extension for Faster Failover
(SGeFF). The timeout should be long enough for a temporary interruption to recover.
Each administrator must decide how much time the cluster needs to confirm the likelihood (if not
certainty) of a failure. On a busy system where the networks, I/O, or CPU have frequent or large
spikes in activity, transient problems are likely to cause delayed heartbeats.
With SGeFF, a minimum value of 1.6 seconds for NODE_TIMEOUT, with 0.8 seconds for
HEARTBEAT_INTERVAL, is supported. With these values, the Serviceguard component of failover time
can be reduced to 5 seconds. Such low values are not suitable for all SGeFF environments.
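The supported minimums above can be expressed as a simple pre-check. The following Python sketch is illustrative only: the function name and the choice of seconds as the unit are assumptions made for this example, it is not a Serviceguard tool, and it checks nothing beyond the two minimums quoted in this paper.

```python
# Illustrative pre-check only; not part of Serviceguard.
# The two constants are the SGeFF minimums quoted in the text, in seconds.
SGEFF_MIN_NODE_TIMEOUT = 1.6
SGEFF_MIN_HEARTBEAT_INTERVAL = 0.8


def check_sgeff_minimums(node_timeout, heartbeat_interval):
    """Return a list of problems; an empty list means both minimums are met."""
    problems = []
    if node_timeout < SGEFF_MIN_NODE_TIMEOUT:
        problems.append(
            f"NODE_TIMEOUT {node_timeout}s is below {SGEFF_MIN_NODE_TIMEOUT}s"
        )
    if heartbeat_interval < SGEFF_MIN_HEARTBEAT_INTERVAL:
        problems.append(
            f"HEARTBEAT_INTERVAL {heartbeat_interval}s is below "
            f"{SGEFF_MIN_HEARTBEAT_INTERVAL}s"
        )
    return problems


print(check_sgeff_minimums(1.6, 0.8))  # no problems reported
print(check_sgeff_minimums(1.0, 0.8))  # NODE_TIMEOUT is flagged
```

Passing this check means only that the values are supported, not that they suit a particular environment.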
NODE_TIMEOUT must be carefully set and carefully tested in a SGeFF cluster. The Faster Failover
cluster re-formation process finishes immediately, so it becomes more likely that a healthy node with
transient problems will be timed out of the cluster.
Node timeout should not be so short that it cannot tolerate transient problems or temporary
interruptions. Each administrator must determine the optimal time that Serviceguard should wait for
interruptions to recover before it times out and acts.
Node timeout should not be so short that a delayed heartbeat from a healthy node causes the cluster
to begin re-forming. If it is, a node may be taken out of the cluster unnecessarily, or the node may
recover in time to rejoin the cluster. In the second case, the cluster will re-form, but the membership
will be just the same as before the re-formation.
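One way to reason about a candidate node timeout is to compare it against gaps between heartbeats. The Python sketch below uses a deliberately simplified model that is an assumption of this example: a node is considered late whenever the gap between two consecutive heartbeats exceeds the timeout. The arrival times are hypothetical, and actual Serviceguard heartbeat handling is more involved.

```python
def gaps_exceeding_timeout(heartbeat_times, node_timeout):
    """Return (start_time, gap) for every gap between consecutive
    heartbeats that is longer than node_timeout (all values in seconds)."""
    late = []
    for prev, cur in zip(heartbeat_times, heartbeat_times[1:]):
        gap = cur - prev
        if gap > node_timeout:
            late.append((prev, gap))
    return late


# Hypothetical arrival times: one heartbeat is delayed by a transient problem.
arrivals = [0.0, 1.0, 2.0, 4.5, 5.5]

print(gaps_exceeding_timeout(arrivals, 2.0))  # [(2.0, 2.5)]
print(gaps_exceeding_timeout(arrivals, 4.0))  # []
```

In this model the 2.5-second gap would trigger a re-formation with a 2-second timeout but would be tolerated with a 4-second timeout.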
The value of NODE_TIMEOUT has a large effect in a Faster Failover cluster. For example, consider
two identical two-node clusters, each with a configuration valid for Faster Failover. One has a
standard Serviceguard implementation; the other has SGeFF enabled. Both have
QS_TIMEOUT_EXTENSION set to zero and NODE_TIMEOUT set to 2 seconds. If there is a transient problem
that lasts for 8 seconds, the two clusters will have different results:
• With a standard Serviceguard implementation, re-formation will take about 28 seconds. The
transient problem will recover before re-formation is done, so the node will stay up and will be able
to rejoin the cluster.
• With SGeFF, the re-formation will complete in 6 seconds, less than the 8 seconds needed for the
problem to recover. The cluster will re-form without the affected node, and the node will be rebooted.
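The two outcomes above follow from comparing the duration of the transient problem with the re-formation time. The Python sketch below restates that comparison using the figures from this example. The function name and the simple "shorter than re-formation" rule are assumptions made for illustration; they do not describe how Serviceguard decides cluster membership.

```python
def node_outcome(transient_seconds, reformation_seconds):
    """Simplified model: the node stays in the cluster only if the transient
    problem clears before re-formation completes."""
    if transient_seconds < reformation_seconds:
        return "node recovers and rejoins"
    return "cluster re-forms without the node"


TRANSIENT = 8  # seconds, from the example in the text

print("standard:", node_outcome(TRANSIENT, 28))  # re-formation takes about 28 s
print("SGeFF:   ", node_outcome(TRANSIENT, 6))   # re-formation takes 6 s
```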
Failover in a SGeFF cluster is much faster than failover in a standard Serviceguard cluster. Using the
example SGeFF cluster discussed above, a SGeFF cluster can be set to twice the node timeout and still
require less than half the time for the Serviceguard component of failover:
• With the standard Serviceguard cluster configured as discussed above, with a timeout value of
2 seconds, the Serviceguard component of failover completes in about 28 seconds.
• With SGeFF installed and the timeout value changed to 4 seconds, the Serviceguard component of
failover completes in 12 seconds.
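The arithmetic behind the "twice the node timeout, less than half the time" comparison can be checked directly. The short Python sketch below uses only the figures quoted above; the dictionary layout is an assumption made for this example.

```python
# Figures quoted in the text (seconds).
standard = {"node_timeout": 2, "failover": 28}
sgeff = {"node_timeout": 4, "failover": 12}

timeout_ratio = sgeff["node_timeout"] / standard["node_timeout"]
failover_ratio = sgeff["failover"] / standard["failover"]

print(timeout_ratio)             # 2.0: SGeFF uses twice the node timeout
print(round(failover_ratio, 2))  # 0.43: yet failover takes under half the time
```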
Virtual partitions may have different latency characteristics than independent nodes due to hardware
and firmware sharing. If you will be using virtual partitions in your cluster, there may be additional
considerations when testing the node timeout value for your configuration. For more information, see
the white paper “Serviceguard Cluster Configuration for Partitioned Systems”, available from
www.docs.hp.com/hpux/ha.