Optimizing Failover Time in a Serviceguard Environment, June 2007

How you can optimize failover time
You can tune the failover process for your environment to reduce the time a
package is unavailable. If failover takes longer than necessary, you are not getting
maximum availability. If failover is triggered too quickly, however, you may get
unnecessary failovers that reduce cluster performance, possibly reducing availability instead
of improving it. It is important to find a balance between the extremes.
The optimal failover time is long enough to allow for recoverable interruptions, but no longer
than that.
The time required for the Serviceguard portion of failover depends largely on the node timeout value,
but it also depends on the heartbeat interval, the number of nodes in the cluster, and whether you are
using standby heartbeat interfaces. There are ways to fine-tune these factors to help optimize failover.
To set the Serviceguard parameters, you need to determine how likely transient interruptions are
and how long the cluster takes to recover from them. If your cluster is in a busy environment,
the parameters need to tolerate such interruptions, or you will get unnecessary, and possibly
repeated, failovers. If, however, your networks and systems do not get overloaded, you can set
your failover parameters more aggressively.
Try to tune the environment so there are fewer interruptions and it takes less time to recover from them.
Consider ways to distribute the workload across the cluster. Consider adding a node to the cluster.
Because your timeout value has to allow for the longest recovery times, you can reduce the timeout
value if you can smooth out the peaks.
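The link between peak recovery times and the timeout value can be sketched in Python. This is a minimal illustration, not a Serviceguard tool: the function name suggest_timeout, the sample durations, and the 25 percent safety margin are all assumptions chosen for the example.

```python
def suggest_timeout(interruptions_s, margin=1.25):
    """Return a timeout (seconds) that covers the longest observed interruption.

    interruptions_s: durations, in seconds, of transient interruptions that
        recovered on their own (hypothetical measurements).
    margin: assumed safety factor applied to the longest interruption.
    """
    if not interruptions_s:
        raise ValueError("need at least one observed interruption")
    return max(interruptions_s) * margin


# Hypothetical measurements before and after smoothing workload peaks.
before_tuning = [0.5, 1.0, 6.0]  # one 6-second peak dominates
after_tuning = [0.5, 1.0, 2.0]   # the peak has been reduced to 2 seconds

print(suggest_timeout(before_tuning))  # 7.5
print(suggest_timeout(after_tuning))   # 2.5
```

Only the longest interruption matters in this model, which is why reducing a single peak allows a much shorter timeout.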
Also consider the application-dependent component of failover. If your applications
recover and restart quickly, you can afford to set your failover parameters more
aggressively, because a quick reaction to failures is an advantage. However, if your
applications take a long time to restart or your databases take a long time to recover,
set the timeout value more conservatively. Before you start a lengthy failover,
it is an advantage to wait a bit for a transient problem to recover on its own.
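The trade-off described above can be made concrete with a simplified model, sketched here in Python. The model and its numbers are assumptions for illustration only: an interruption either clears before the timeout (no failover) or outlasts it (a full failover follows), and failover_s stands for everything that happens after the timeout expires, including application recovery and restart.

```python
import math


def downtime(interruption_s, timeout_s, failover_s):
    """Estimate package downtime in seconds for a single interruption.

    interruption_s: how long the interruption lasts (math.inf = real failure).
    timeout_s: how long the cluster waits before starting a failover.
    failover_s: assumed time to complete failover once it starts,
        including application recovery and restart.
    """
    if interruption_s <= timeout_s:
        # The interruption clears on its own; no failover is started.
        return interruption_s
    # The timeout expires first, so a full failover is performed.
    return timeout_s + failover_s


# Hypothetical scenarios: a 5-second transient and a permanent failure.
for label, failover_s in (("quick restart", 10), ("slow restart", 300)):
    for timeout_s in (2, 8):
        transient = downtime(5, timeout_s, failover_s)
        permanent = downtime(math.inf, timeout_s, failover_s)
        print(f"{label}, timeout {timeout_s}s: "
              f"transient {transient}s, permanent {permanent}s")
```

Under these assumed numbers, the slow-restart application loses 302 seconds to a 5-second transient with a 2-second timeout, but only 5 seconds with an 8-second timeout, at the cost of 6 extra seconds on a real failure (308 instead of 302). For the quick-restart application the same 6 seconds is a much larger share of the total, so the aggressive setting is easier to justify.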
Some help in estimating time for failover
The following table can help you estimate the total failover time for your Serviceguard cluster.
Failover component name                 Time estimate
Resource failure detection
  Network failure detection             6 x NETWORK_POLLING_INTERVAL
  EMS resource failure detection        RESOURCE_POLLING_INTERVAL
  Service failure detection             Immediately
Node failure detection                  NODE_TIMEOUT
Cluster membership re-formation time    This depends mostly on NODE_TIMEOUT. It is also
(Serviceguard component of failover     affected by the use of standby heartbeat networks
time in case of node failure)           and by HEARTBEAT_INTERVAL. (In releases before
                                        A.11.18, it is also affected by the type of
                                        cluster lock.)
                                        For example, the re-formation time is 28 seconds
                                        if NODE_TIMEOUT is set to 2 seconds, there is more
                                        than one heartbeat subnet, and a quorum server is
                                        configured with QS_TIMEOUT_EXTENSION set to 0.
                                        On HP-UX, check the time with the cmquerycl
                                        command: after configuring a cluster, issue the
                                        command cmquerycl -c <cluster_name> and observe
                                        the output. The value next to the cluster lock is
                                        the cluster membership re-formation time.
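The estimates in the table above can be combined in a short Python sketch. It assumes that the components simply add up, that cluster re-formation counts only when you pass a value for it, and that all timers are expressed in seconds. The class and function names and the sample timer values are hypothetical and not part of Serviceguard.

```python
from dataclasses import dataclass


@dataclass
class ClusterTimers:
    """Timer values in seconds (hypothetical container, not a Serviceguard API)."""
    network_polling_interval: float
    resource_polling_interval: float
    node_timeout: float


def detection_time(failure, timers):
    """Return the failure-detection estimate from the table, in seconds."""
    estimates = {
        "network": 6 * timers.network_polling_interval,
        "ems_resource": timers.resource_polling_interval,
        "service": 0.0,  # the table lists service failures as detected immediately
        "node": timers.node_timeout,
    }
    return estimates[failure]


def estimate_failover(failure, timers, reformation=0.0, app_restart=0.0):
    """Sum detection, re-formation, and application restart (assumed additive)."""
    return detection_time(failure, timers) + reformation + app_restart


# Illustrative values only; read the real ones from your cluster configuration.
timers = ClusterTimers(network_polling_interval=2.0,
                       resource_polling_interval=60.0,
                       node_timeout=2.0)

print(detection_time("network", timers))  # 12.0
# Node failure: 28 s re-formation (the table's example) plus an assumed 45 s restart.
print(estimate_failover("node", timers, reformation=28.0, app_restart=45.0))  # 75.0
```

The table does not say whether the components overlap, so treat the sum as a rough estimate and prefer the re-formation time that cmquerycl reports for your own cluster over the example value.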