Optimizing Failover Time in a Serviceguard Environment, June 2007

How you can optimize failover time

There are ways you can optimize the failover process for your environment to reduce the time a package is unavailable. If the failover process takes longer than necessary, you are not getting maximum availability. If the failover starts and completes too quickly, however, you may get unnecessary failovers that reduce the performance of the cluster, possibly reducing availability instead of improving it. It is important to find a balance between the extremes.

The optimal failover time is long enough to allow for recoverable interruptions, but no longer than that.

The time required for the Serviceguard portion of failover depends largely on the node timeout value, but it also depends on the heartbeat interval, the number of nodes in the cluster, and whether you are using standby heartbeat interfaces. There are ways to fine-tune these factors to help optimize failover.

To set the Serviceguard parameters, you need to determine the likelihood of transient interruptions and how long it takes to recover from them. If your cluster is in a busy environment, you need to tolerate interruptions or you will get unnecessary, and possibly repeated, failovers. If, however, your networks and systems do not get overloaded, you can set your failover parameters more aggressively.

Try to tune the environment so there are fewer interruptions and it takes less time to recover from them. Consider ways to distribute the workload across the cluster. Consider adding a node to the cluster. Because your timeout value has to allow for the longest recovery times, you can reduce the timeout value if you can smooth out the peaks.
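
The relationship between peak recovery times and the timeout value can be sketched in a few lines of Python. This is only an illustration: the function name, the sample measurements, and the safety_margin factor are hypothetical and are not Serviceguard parameters or recommendations.

```python
def suggest_timeout(observed_interruptions_s, safety_margin=1.5):
    """Return a timeout (seconds) that tolerates the longest observed interruption.

    observed_interruptions_s: durations of transient interruptions you measured.
    safety_margin: hypothetical multiplier for headroom (an assumption, not a
    Serviceguard rule).
    """
    if not observed_interruptions_s:
        raise ValueError("need at least one observed interruption")
    # The timeout has to cover the worst case, so only the peak matters.
    return max(observed_interruptions_s) * safety_margin


# Hypothetical measurements before and after smoothing out the workload peak.
before_tuning = [0.5, 1.0, 4.0]
after_tuning = [0.5, 1.0, 2.0]

timeout_before = suggest_timeout(before_tuning)
timeout_after = suggest_timeout(after_tuning)
```

With these sample numbers, lowering the worst interruption from 4 seconds to 2 seconds halves the timeout the sketch suggests, even though the typical interruptions are unchanged.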

Consider the time it takes for the application-dependent component of failover. If your applications recover and restart quickly, you can afford to set your failover parameters more aggressively, because a quick reaction to failures is then an advantage. However, if it takes a long time for your applications to restart or for your databases to recover, set the timeout value more conservatively. Before you start a lengthy failover, it is an advantage to wait a bit for a transient problem to recover on its own.
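
The trade-off described above can be made concrete with a deliberately simplified model. It assumes that a failure is detected exactly when the timeout expires and that the rest of the failover (re-formation plus application restart) takes a fixed time. The function name and all of the numbers are hypothetical.

```python
def outage_seconds(interruption_s, timeout_s, failover_s):
    """Approximate how long a package is unavailable under a simplified model.

    interruption_s: how long the problem lasts (float("inf") for a real failure).
    timeout_s: how long the cluster waits before starting a failover.
    failover_s: time for re-formation plus application restart (assumed fixed).
    """
    if interruption_s <= timeout_s:
        # The transient problem clears on its own; no failover is started.
        return interruption_s
    # The timeout expires first, so the full failover has to run.
    return timeout_s + failover_s


# Slow-restarting application (300 s), 5 s transient interruption.
slow_aggressive = outage_seconds(5, timeout_s=2, failover_s=300)
slow_conservative = outage_seconds(5, timeout_s=8, failover_s=300)

# Fast-restarting application (10 s), permanent failure.
fast_aggressive = outage_seconds(float("inf"), timeout_s=2, failover_s=10)
fast_conservative = outage_seconds(float("inf"), timeout_s=8, failover_s=10)
```

In this model the slow application is unavailable for 302 seconds with the aggressive timeout but only 5 seconds with the conservative one, while the fast application loses 12 seconds instead of 18 by reacting quickly to a real failure.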

Some help in estimating time for failover

The following table can help you estimate the total failover time for your Serviceguard cluster.

Failover component name                   Time estimate

Resource failure detection:
  Network failure detection               6 x NETWORK_POLLING_INTERVAL
  EMS resource failure detection          RESOURCE_POLLING_INTERVAL
  Service failure detection               Immediately

Node failure detection                    NODE_TIMEOUT

Cluster membership re-formation time      This depends mostly on NODE_TIMEOUT. It is also
(Serviceguard component of failover       affected by the use of standby heartbeat networks
time in case of node failure)             and by HEARTBEAT_INTERVAL. (In releases before
                                          A.11.18, it is also affected by the type of
                                          cluster lock.)

                                          For example, the re-formation time is 28 seconds
                                          if NODE_TIMEOUT is set to 2 seconds, there is more
                                          than one heartbeat subnet, and a quorum server is
                                          configured with QS_TIMEOUT_EXTENSION set to 0.

                                          On HP-UX, check the time with the cmquerycl
                                          command: after configuring a cluster, issue the
                                          command cmquerycl -c <cluster_name> and observe
                                          the output. The value next to the cluster lock is
                                          the cluster membership re-formation time.
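
As a rough aid, the detection-time rows of the table can be turned into a small calculation. The sketch below assumes all parameters have already been converted to seconds. The function name and the sample values are hypothetical, and the re-formation time is left out because the table gives no formula for it.

```python
def estimate_detection_times(network_polling_interval_s,
                             resource_polling_interval_s,
                             node_timeout_s):
    """Return the failure-detection estimates from the table, in seconds."""
    return {
        # Network failure detection: 6 x NETWORK_POLLING_INTERVAL
        "network failure": 6 * network_polling_interval_s,
        # EMS resource failure detection: RESOURCE_POLLING_INTERVAL
        "EMS resource failure": resource_polling_interval_s,
        # Service failure detection: immediately
        "service failure": 0,
        # Node failure detection: NODE_TIMEOUT
        "node failure": node_timeout_s,
    }


# Hypothetical settings, chosen only to illustrate the arithmetic.
estimates = estimate_detection_times(
    network_polling_interval_s=2,
    resource_polling_interval_s=60,
    node_timeout_s=2,
)
for component, seconds in estimates.items():
    print(f"{component}: {seconds} s")
```

To estimate the total failover time for a node failure, add the cluster membership re-formation time reported for your cluster and your application's own recovery and restart time to the detection estimate.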