Optimizing Failover Time in a Serviceguard Environment, June 2007

Testing
To fine-tune the parameters, it is important to test the cluster in an environment that imitates the actual
production environment. Test the cluster, running all of its packages, with the heaviest expected loads
on networks, CPU, and I/O.
To time failover, force each package to fail over to another node. One way to force a failover is to
power off the node where the package is running. Read the logs of the failover, noting the time
stamps.
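As a rough illustration of timing a failover from log time stamps, the Python sketch below computes the seconds between two log entries. The sample log lines, the marker strings, and the syslog-style "Mon DD HH:MM:SS" time stamp layout are all hypothetical assumptions made for the example; they are not actual Serviceguard output, so adapt them to what your own logs contain.

```python
from datetime import datetime

# Hypothetical log excerpt: the message text and time stamp layout are
# illustrative assumptions, not actual Serviceguard output.
SAMPLE_LOG = """\
Jun 12 14:03:07 node2 example: node1 stopped responding
Jun 12 14:03:21 node2 example: cluster re-formed
Jun 12 14:03:39 node2 example: package pkg1 started
"""


def parse_timestamp(line, year=2007):
    """Parse a leading 'Mon DD HH:MM:SS' time stamp (the year is assumed)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(log_text, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first
    later line containing end_marker."""
    start = None
    for line in log_text.splitlines():
        if start is None and start_marker in line:
            start = parse_timestamp(line)
        elif start is not None and end_marker in line:
            return (parse_timestamp(line) - start).total_seconds()
    raise ValueError("markers not found in the expected order")


print(elapsed_seconds(SAMPLE_LOG, "stopped responding", "package pkg1 started"))
```

Recording the same two events for every test run makes it easier to compare failover times across different parameter settings.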
Change parameters in small increments, then re-evaluate and re-test. Check the system log after each
change. If there are indications of interruptions or transient problems, try to determine the recovery
time. Look for re-formations that occurred without a change in cluster membership; these indicate that
you have reached the lower time limit.
Try different settings for the node timeout until you find the optimal one. You want a value that results
in the shortest failover time without any unnecessary re-formations or failovers from recoverable
temporary problems.
Allow a margin of safety for the tested node timeout value. How much time to allow depends on how
closely your test environment reflects your actual environment at its busiest.
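One simple way to apply a margin of safety is to scale the tested value by a fixed fraction. The Python sketch below assumes a proportional margin; the function name and the sample numbers are hypothetical, and the right margin for your cluster depends on how representative your test environment is.

```python
def production_node_timeout(tested_seconds, margin_fraction):
    """Return the tested node timeout plus a proportional safety margin.

    Both arguments are illustrative: tested_seconds is the best value found
    in testing, margin_fraction is the extra allowance (0.25 means 25%).
    """
    if tested_seconds <= 0 or margin_fraction < 0:
        raise ValueError("tested_seconds must be positive and margin_fraction non-negative")
    return tested_seconds * (1 + margin_fraction)


# Hypothetical example: 4 seconds worked in testing, allow 25% headroom.
print(production_node_timeout(4.0, 0.25))
```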
Re-test and re-evaluate your settings periodically, especially when new disks, new networks, or new
applications are added to the cluster. Monitor traffic on heartbeat networks; watch for increases in
traffic, especially ones that could cause temporary spikes.
Lock acquisition (cluster lock, also called tie-breaker or arbitrator)
Serviceguard A.11.17 and earlier
Your choice of cluster lock may help you optimize failover time. If you set the node timeout to less
than 6 seconds, a quorum server, lock LUN, or Logical Volume Manager (LVM) SCSI lock disk will
shorten the time for lock acquisition. If the node timeout is more than 6 seconds, the type of lock will
probably not affect the total failover time.
Different lock disks have different lock acquisition times. If you use a lock disk or plan to use one,
consult its documentation to see how long acquisition takes.
It takes about 10 seconds to acquire an LVM SCSI lock disk.
It takes about 32 seconds to acquire a Fibre Channel LVM lock disk.
The time to acquire a lock LUN SCSI or a quorum server increases with the value of
NODE_TIMEOUT, but it is not directly proportional. For example, if the node timeout is
2 seconds, acquisition takes about 8 seconds; when the node timeout is set to 4 seconds,
it takes about 14 seconds.
The failover time with a quorum server is identical to the failover time with no cluster lock, unless
you specify QS_TIMEOUT_EXTENSION; see below.
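The two lock LUN and quorum server figures quoted above show why the relationship is not directly proportional. The short Python sketch below uses only those two data points and computes the ratio of acquisition time to node timeout; it does not attempt to predict acquisition times for other settings.

```python
# (node timeout in seconds, approximate acquisition time in seconds),
# taken from the two examples given in the text above.
SAMPLES = [(2, 8), (4, 14)]


def acquisition_ratios(samples):
    """Return acquisition time divided by node timeout for each sample."""
    return [acquire / timeout for timeout, acquire in samples]


for (timeout, acquire), ratio in zip(SAMPLES, acquisition_ratios(SAMPLES)):
    print(f"node timeout {timeout} s: acquisition about {acquire} s, ratio {ratio:.1f}")
```

If acquisition time were directly proportional to the node timeout, both ratios would be equal; here they differ (4.0 and 3.5).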
Serviceguard A.11.18 and later
Your choice of cluster lock may help you optimize failover time, but only if you set the node timeout
to less than 2.5 seconds. If the node timeout is more than 2.5 seconds, the type of lock will
probably not affect the total failover time.
It takes about 10 seconds to acquire an LVM lock disk or a lock LUN.
The failover time with a quorum server is identical to the failover time with no cluster lock, unless
you specify QS_TIMEOUT_EXTENSION; see below.
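The version-specific thresholds described above can be summarized in a small helper. This Python sketch is only an illustration: the function name and the boolean version flag are hypothetical, and because the text does not say what happens when the node timeout equals the threshold exactly, the sketch assumes the lock type does not matter in that case.

```python
def lock_type_may_matter(a_11_18_or_later, node_timeout_seconds):
    """Return True if the type of cluster lock may affect failover time.

    Thresholds come from the text: 6 seconds for Serviceguard A.11.17 and
    earlier, 2.5 seconds for A.11.18 and later. Treating a timeout exactly
    equal to the threshold as "does not matter" is an assumption.
    """
    threshold = 2.5 if a_11_18_or_later else 6.0
    return node_timeout_seconds < threshold


# A 4-second node timeout falls below the older threshold but not the newer one.
print(lock_type_may_matter(False, 4.0))
print(lock_type_may_matter(True, 4.0))
```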