Optimizing Failover Time in a Serviceguard Environment, June 2007

Cluster component recovery
This depends on the number of EMS resources, packages,
nodes, etc. For environments using LVM or CVM 3.5, this time
is usually less than a second. For environments using VERITAS
CVM 4.1 or VERITAS CFS, recovery of the VERITAS components
can, depending on the failure type, range from 5 seconds to an
additional cluster re-formation time.
Resource recovery
This depends on the number of volume groups, IP addresses,
services, etc. It usually ranges from a low of less than
1 second to a high of several minutes.
Application startup and recovery time
This is totally dependent on the application.
Node timeout value
To help optimize failover time, first consider fine-tuning the setting for NODE_TIMEOUT in your cluster
configuration file. Changing this value will probably make the greatest difference in your cluster failover
time. When a node times out, Serviceguard declares the node failed and begins cluster re-formation.
For Serviceguard, the range of supported values of NODE_TIMEOUT is 2 to 30 seconds. The
recommended value is from 5 to 8 seconds for most clusters.
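As a minimal sketch of the ranges just described, the following Python helper classifies a candidate NODE_TIMEOUT value expressed in seconds. The function name and the labels it returns are hypothetical and are not part of Serviceguard; only the bounds (2 to 30 seconds supported, 5 to 8 seconds recommended) come from the text above. Check the units that your cluster configuration file expects before transferring a value to it.

```python
# Bounds taken from the text above; all names and labels here are illustrative only.
SUPPORTED_RANGE_S = (2.0, 30.0)    # supported NODE_TIMEOUT range, in seconds
RECOMMENDED_RANGE_S = (5.0, 8.0)   # recommended range for most clusters, in seconds


def classify_node_timeout(seconds: float) -> str:
    """Return 'unsupported', 'aggressive', 'recommended', or 'conservative'."""
    low, high = SUPPORTED_RANGE_S
    if not low <= seconds <= high:
        return "unsupported"
    rec_low, rec_high = RECOMMENDED_RANGE_S
    if seconds < rec_low:
        return "aggressive"      # supported, but below the recommended range
    if seconds > rec_high:
        return "conservative"    # supported, but above the recommended range
    return "recommended"


if __name__ == "__main__":
    for value in (1, 2, 6, 12, 45):
        print(value, classify_node_timeout(value))
```

A value flagged as "aggressive" is still supported; the label only signals that the risks described below deserve extra testing.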
Reducing the node timeout decreases the time to detect node failures, which can decrease the total
failover time. However, a small node timeout value also introduces a risk. If there are temporary
interruptions and you set the timeout value so low that the node cannot recover communication in time,
the node might fail unnecessarily or the cluster might re-form unnecessarily.
Setting the parameters too low may cause failovers that you could avoid. If you set your parameters
so low that re-formation can complete before an unreachable node can recover from a temporary
interruption, the node will be forcibly rebooted. Any packages running on it will be failed over to
another node.
If the unreachable node can recover communication before the re-formation completes, the node will
rejoin the cluster. If you see that re-formation has taken place but nothing has changed, your timeout
has already approached its limit; do not try to shorten the overall failover time any further. Another indication that
you have approached the limit is a message in syslog that says, “Warning: cmcld process was
unable to run for the last <xxx> seconds.”
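One way to watch for this warning is to scan the syslog for it after each test run. The sketch below is an illustration only: the function name is hypothetical, and the regular expression assumes that the `<xxx>` placeholder in the message is a plain integer or decimal number of seconds, which may not match every Serviceguard release.

```python
import re
from typing import Iterable, List

# Assumption: <xxx> in the syslog message is a number such as "4" or "4.6".
CMCLD_STALL = re.compile(
    r"cmcld process was unable to run for the last (\d+(?:\.\d+)?) seconds"
)


def find_cmcld_stalls(lines: Iterable[str]) -> List[float]:
    """Return the stall durations (in seconds) reported in the given log lines."""
    stalls = []
    for line in lines:
        match = CMCLD_STALL.search(line)
        if match:
            stalls.append(float(match.group(1)))
    return stalls
```

To use it, open your system's syslog file and pass the file object to `find_cmcld_stalls`. A non-empty result suggests that the node timeout is already near its practical limit for that node.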
If your application takes a long time to recover and restart, set the node timeout conservatively. If your
database takes several minutes to recover, it isn’t worth risking an unnecessary failover to shave a
few seconds off the Serviceguard failover time.
If your application restarts quickly, you can afford to set the node timeout more aggressively. When
your application only takes a few seconds to restart, there is little benefit in waiting a few seconds for
an interruption to recover. You can afford to try a short timeout when you have short recovery and
restart times.
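The trade-off in the two paragraphs above can be made concrete with a rough calculation. The sketch below assumes, as a simplification, that total failover time is the node timeout plus everything else (re-formation, resource recovery, and application recovery). The function and parameter names are hypothetical, and the numbers in the usage note are examples rather than measurements.

```python
def relative_saving(old_timeout_s: float, new_timeout_s: float,
                    other_failover_s: float) -> float:
    """Fraction of total failover time saved by lowering the node timeout.

    Simplifying assumption: total failover time = node timeout + other_failover_s.
    """
    if old_timeout_s <= 0 or new_timeout_s <= 0 or other_failover_s < 0:
        raise ValueError("timeouts must be positive and other time non-negative")
    if new_timeout_s > old_timeout_s:
        raise ValueError("new timeout must not exceed the old timeout")
    old_total = old_timeout_s + other_failover_s
    return (old_timeout_s - new_timeout_s) / old_total


if __name__ == "__main__":
    # Database needing about 5 minutes to recover: the saving is about 1%.
    print(f"{relative_saving(8, 5, 300):.1%}")
    # Application restarting in about 5 seconds: the saving is about 23%.
    print(f"{relative_saving(8, 5, 5):.1%}")
```

Under these assumptions, lowering the timeout from 8 to 5 seconds saves about 1% of the total when recovery takes five minutes, and about 23% when the application restarts in five seconds. This is why an aggressive timeout pays off only for applications that restart quickly.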
Small, lightly loaded systems are likely to have fewer interruptions, and they are likely to recover more
quickly. Highly loaded systems with a large number of disks are likely to have more frequent
interruptions, and they are likely to take longer to recover. Try to spread the load and avoid spikes
in activity. Set the node timeout to allow recovery time for interruptions when the load is heaviest.
Virtual partitions may have different latency characteristics than independent nodes due to hardware
and firmware sharing. If you will be using virtual partitions in your cluster, there may be additional
considerations when testing the node timeout value for your configuration. For more information, see
the white paper “Serviceguard Cluster Configuration for Partitioned Systems”, available from
www.docs.hp.com/hpux/ha.