Optimizing Failover Time in a Serviceguard Environment, June 2007

Number of nodes and number of packages
Serviceguard’s Package Manager starts the package control scripts using the exec process. The time
this takes is noticeable only when you have a large number of packages configured on a small number of
nodes. On most systems, for example, if 50 packages failed over to one node, it would take a few
seconds to start the scripts.
Slight effects are seen in two other places:
During re-formation, clusters of more than eight nodes might need extra time to synchronize cluster
information.
During resource recovery, the Package Manager has two tasks. First, it checks the packages to
determine which ones failed. The more packages there are, the more time this could take. Then it
needs to determine which nodes should adopt the packages and run them after re-formation. The
more packages each node has, the more time this could take.
EMS resources
There are two factors to consider about your EMS resources:
EMS resource monitor detection time—This depends entirely on the EMS resource monitor and how
it works. Look at the monitor’s documentation; usually you can set this time.
The time for the EMS message to get to Serviceguard—At most, this takes as long as the time set for
RESOURCE_POLLING_INTERVAL. In the package configuration file, you want to set the interval low
enough to discover failure quickly. However, if you set it too low, frequent polling just makes the
network and the system busier.
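The trade-off between the two factors above can be sketched numerically. The following Python fragment is only an illustration, not part of Serviceguard: the function names and the sample values (a 5-second monitor detection time and three candidate intervals) are assumptions chosen for the example. It treats the worst-case delay as the monitor's detection time plus one full RESOURCE_POLLING_INTERVAL, and shows how the number of polls per hour grows as the interval shrinks.

```python
def worst_case_ems_delay(monitor_detection_s, polling_interval_s):
    """Upper bound (seconds) before Serviceguard learns of a resource failure.

    Assumes the two delays simply add: the monitor first detects the failure,
    then the message takes at most one polling interval to reach Serviceguard.
    """
    if monitor_detection_s < 0 or polling_interval_s <= 0:
        raise ValueError("detection time must be >= 0 and interval must be > 0")
    return monitor_detection_s + polling_interval_s


def polls_per_hour(polling_interval_s):
    """Number of polls issued per hour at the given interval."""
    if polling_interval_s <= 0:
        raise ValueError("interval must be > 0")
    return 3600 / polling_interval_s


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    monitor_detection_s = 5
    for interval_s in (10, 60, 300):
        delay = worst_case_ems_delay(monitor_detection_s, interval_s)
        print(f"interval={interval_s}s  worst-case delay={delay}s  "
              f"polls/hour={polls_per_hour(interval_s):.0f}")
```

A shorter interval lowers the worst-case delay but raises the polling load, which is the balance the package configuration has to strike.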
Package control scripts
When many storage units are involved, you might be able to reduce resource recovery time to help
optimize failover. Refer to the section “Optimizing for Large Numbers of Storage Units” in chapter 6
of the Managing Serviceguard manual for your version of Serviceguard. Manuals are available from
www.docs.hp.com/hpux/ha –> Serviceguard.
The type of file system can greatly affect the time it takes for file consistency checks. For packages on
HP-UX, VxFS is faster than HFS. On Linux, a journaled file system (such as ReiserFS or ext3) is
faster than a non-journaled file system (ext2).
Adding or removing IP addresses takes some time and affects failover time. On HP-UX, it takes at least
one second longer to add an IPv6 address if you enable duplicate address detection (DAD). For more
information, see “IPv6 Relocatable Address and Duplicate Address Detection” in the Managing
Serviceguard manual for your version of Serviceguard, available from
www.docs.hp.com/hpux/ha –> Serviceguard.
Make your control scripts as efficient as possible. The time needed to start and stop services adds to
the total failover time. Streamline any customer-defined functions to help save time.
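One way to decide what to streamline is to time each step of a control script and rank the results. The Python fragment below is a minimal sketch of that bookkeeping only. The step names and durations are hypothetical placeholders, not measurements from a real package, and would have to be replaced with timings gathered from your own control script.

```python
from typing import Dict, List, Tuple


def total_time(step_seconds: Dict[str, float]) -> float:
    """Total time contributed by all control script steps, in seconds."""
    return sum(step_seconds.values())


def rank_steps(step_seconds: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (step, seconds) pairs sorted from slowest to fastest."""
    return sorted(step_seconds.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical timings (seconds) for illustration only.
    measured = {
        "activate_volume_groups": 4.0,
        "fsck_and_mount": 12.5,
        "add_relocatable_ip": 1.5,
        "customer_defined_functions": 8.0,
        "start_services": 3.0,
    }
    print(f"total: {total_time(measured):.1f}s")
    for step, seconds in rank_steps(measured):
        share = 100 * seconds / total_time(measured)
        print(f"{step:28s} {seconds:5.1f}s  {share:4.1f}%")
```

The steps at the top of the ranking are the ones where tuning is most likely to shorten failover.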
System Restart Options
Generally, shutdown(1M) is the preferable option for restarting the system. shutdown halts all user
applications and invokes cmhaltnode to halt the Serviceguard cluster on the system. The system buffers are
also flushed to disk, so almost all data is written to disk.
Restarting the system with the reboot(1M) command has a big impact on the Serviceguard component of failover
time when VERITAS CVM 4.1 or VERITAS CFS is used. To optimize this failover time, make sure that
Serviceguard is cleanly halted on a node before rebooting it by using cmhaltcl (to halt the entire