Optimizing Failover Time in a Serviceguard Environment, June 2007

Number of nodes and number of packages

Serviceguard’s Package Manager starts the package control scripts using the exec process. The time this takes is noticeable only when you have a large number of packages configured on a small number of nodes. On most systems, for example, if 50 packages failed over to one node, it would take a few seconds to start the scripts.

Slight effects are seen in two other places:

• During re-formation, clusters of more than eight nodes might need extra time to synchronize cluster information.

• During resource recovery, the Package Manager has two tasks. First, it checks the packages to determine which ones failed. The more packages there are, the more time this could take. Then it needs to determine which nodes should adopt the packages and run them after re-formation. The more packages each node has, the more time this could take.

EMS resources

There are two factors to consider about your EMS resources:

• EMS resource monitor detection time: This depends entirely on the EMS resource monitor and how it works. Check the monitor’s documentation; usually you can set this time.

• The time for the EMS message to get to Serviceguard: At most, this takes as long as the time set for RESOURCE_POLLING_INTERVAL in the package configuration file. Set the interval low enough to discover a failure quickly. However, if you set it too low, the frequent polling makes the network and the system busier without a matching benefit.
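
The trade-off between the two factors above can be sketched with a little arithmetic. The following Python fragment is only an illustration: the function names and the sample values (a 10-second monitor detection time and a few candidate polling intervals) are assumptions made for this example, not figures from Serviceguard or from any EMS monitor. It adds the monitor's detection time to the polling interval to get an upper bound on how long a resource failure can go unnoticed, and shows how the polling load grows as the interval shrinks.

```python
def worst_case_ems_detection(monitor_detection_s, polling_interval_s):
    """Upper bound, in seconds, on the time from a resource failure until
    Serviceguard learns of it: the monitor's own detection time plus, at
    most, one full RESOURCE_POLLING_INTERVAL."""
    if monitor_detection_s < 0 or polling_interval_s <= 0:
        raise ValueError("detection time must be >= 0 and interval must be > 0")
    return monitor_detection_s + polling_interval_s


def polls_per_hour(polling_interval_s):
    """Number of polls issued per hour for one resource at this interval."""
    if polling_interval_s <= 0:
        raise ValueError("interval must be > 0")
    return 3600 / polling_interval_s


# Hypothetical values, chosen only to show the trade-off.
MONITOR_DETECTION_S = 10
for interval in (5, 30, 120):
    bound = worst_case_ems_detection(MONITOR_DETECTION_S, interval)
    load = polls_per_hour(interval)
    print(f"interval={interval}s  worst case={bound}s  polls/hour={load:.0f}")
```

A shorter interval tightens the worst-case bound but multiplies the number of polls, which is the extra network and system load described above.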

Package control scripts

When many storage units are involved, you might be able to reduce resource recovery time to help optimize failover. Refer to the section “Optimizing for Large Numbers of Storage Units” in chapter 6 of the Managing Serviceguard manual for your version of Serviceguard. Manuals are available from www.docs.hp.com/hpux/ha –> Serviceguard.

The type of file system can greatly affect the time it takes for file consistency checks. For packages on HP-UX, VxFS is faster than HFS. On Linux, a journaled file system (such as Reiser FS or ext3 FS) is faster than a non-journaled file system (ext2).

Adding or removing IP addresses takes some time and affects failover time. On HP-UX, it takes at least one second longer to add an IPv6 address if you enable duplicate address detection (DAD). For more information, see “IPv6 Relocatable Address and Duplicate Address Detection” in the Managing Serviceguard manual for your version of Serviceguard, available from www.docs.hp.com/hpux/ha –> Serviceguard.

Make your control scripts as efficient as possible. The time needed to start and stop services adds to the total failover time. Streamline any customer-defined functions to help save time.

System Restart Options

Generally, shutdown(1M) is the preferable option for restarting the system. shutdown halts all user applications and invokes cmhaltnode to halt the Serviceguard cluster on the system. The system buffer is also flushed to disk, so almost all data is saved.

Restarting the system with the reboot(1M) command has a big impact on the Serviceguard component of failover time when VERITAS CVM 4.1 or VERITAS CFS is used. To optimize this failover time, make sure that Serviceguard is cleanly halted on a node before rebooting it by using cmhaltcl (to halt the entire