Optimizing Failover Time in a Serviceguard Environment, June 2007

Cluster Component Recovery

In this step, Serviceguard performs miscellaneous tasks, such as cluster information synchronization
and package determination. If packages are down because of a failure, Serviceguard determines on
which node(s), if any, they should be restarted. (See “Standard Serviceguard implementation: Package
Determination” on page 6.)

The time needed for cluster component recovery depends mainly on how many packages need to be
restarted. At the end of this recovery phase, Serviceguard starts the packages by launching each
package’s control script in turn, using an “exec” process.

The user cannot directly change the time needed for cluster component recovery. In general, it is a
short step (typically less than one second).

Environments using Serviceguard with VERITAS CVM 4.1 or Serviceguard Storage Management Suite
with VERITAS CFS require additional time during cluster component recovery to synchronize cluster
memberships between Serviceguard and VERITAS cluster components prior to package determination.

The time required to synchronize memberships depends largely on the type of failure. There are three
types of failure to consider. After a system panic, machine check, or power failure, cluster
component recovery requires an additional 4 seconds. After a node or service failfast failure, it
requires an additional 8 seconds. Finally, after failures in which the cluster monitor is unable to
run or is killed, such as kernel hangs or reboot(1M), synchronizing the memberships can add up to
one cluster reformation time.
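The three cases above can be summarized in a short lookup. This is only an illustrative sketch: the names FailureType and extra_sync_seconds are hypothetical and not part of Serviceguard, and the cluster reformation time must be supplied by the caller because it depends on the cluster's configuration.

```python
from enum import Enum


class FailureType(Enum):
    """Hypothetical labels for the three failure categories described above."""
    PANIC_MACHINE_CHECK_OR_POWER = 1   # system panic, machine check, power failure
    NODE_OR_SERVICE_FAILFAST = 2       # node or service failfast
    CLUSTER_MONITOR_UNAVAILABLE = 3    # kernel hang, reboot(1M), monitor killed


def extra_sync_seconds(failure, reformation_seconds):
    """Upper bound on the added membership synchronization time, in seconds.

    reformation_seconds is the cluster reformation time for this cluster;
    it is an input here, not something this sketch can derive.
    """
    if failure is FailureType.PANIC_MACHINE_CHECK_OR_POWER:
        return 4.0
    if failure is FailureType.NODE_OR_SERVICE_FAILFAST:
        return 8.0
    # Cluster monitor unable to run or killed: up to one more reformation time.
    return float(reformation_seconds)
```

For example, with an assumed reformation time of 30 seconds, a failfast failure adds 8 seconds, while a kernel hang may add as much as 30 seconds.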

Users with Serviceguard and VERITAS CVM 4.1 or VERITAS CFS configurations can minimize cluster
component recovery time by always running cmhaltnode(1M) before issuing shutdown(1M) or
reboot(1M) when restarting a node in the cluster.
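The ordering recommended above can be sketched as a dry run that only prints the commands rather than executing them. The helper name restart_plan is hypothetical, and the command-line flags shown are assumptions that should be checked against the cmhaltnode(1M) and shutdown(1M) manpages for the installed release.

```python
def restart_plan(reboot=True):
    """Return the commands, in order, for a planned node restart (dry run only)."""
    # Assumption: -f halts the node even if packages are still running on it.
    halt_cluster_node = ["cmhaltnode", "-f"]
    # Assumption: HP-UX style flags; -r reboots, -h halts, -y answers prompts,
    # and 0 is the grace period in seconds.
    restart = ["shutdown", "-r" if reboot else "-h", "-y", "0"]
    # The node leaves the cluster cleanly before the operating system goes down.
    return [halt_cluster_node, restart]


for argv in restart_plan():
    print(" ".join(argv))
```

Halting the node first lets it leave the cluster in an orderly way, so the restart is not treated as one of the failure cases that require extra membership synchronization.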

Standard Serviceguard implementation: Resource Recovery

When Serviceguard starts a package’s control script, the application-dependent part of failover
begins. Package resources are made available, ready for the package’s applications to start.

Package resources include IP addresses, file systems, volume groups, and disk groups needed by the
package. Some resources may require other recovery steps before they can be used.

The time to complete resource recovery is determined largely by the package control script and
depends on the package’s applications, services, and resources.

Standard Serviceguard implementation: Applications Recovery

The commands in the control scripts complete the application-dependent part of failover, recovering
and restarting package applications. The time this takes depends on the applications and how they
are configured.

Serviceguard Extension for RAC: Group Membership Reconfiguration

When Serviceguard Extension for RAC communicates the group membership to Oracle RAC, the
application-dependent part of a RAC failover starts. If the membership has changed, RAC starts
reconfiguration. RAC needs to know which nodes are in the re-formed cluster; if the node holding the
database lock leaves the cluster, another node needs to claim the lock.

The time needed for group membership reconfiguration is determined by RAC, and the user cannot
directly change it.

Serviceguard Extension for RAC: RAC Reconfiguration

After Oracle RAC is notified of a cluster membership change, it starts its own reconfiguration to claim
the database locks that were held on failed nodes. RAC reconfiguration and recovery occur in the RAC
instances running on the other nodes in the cluster.