Optimizing Failover Time in a Serviceguard Environment, June 2007
Cluster Component Recovery
In this step, Serviceguard does miscellaneous tasks, such as cluster information synchronization and
package determination. If packages are down due to a failure, Serviceguard determines on which
node(s), if any, they should be restarted. (See “Standard Serviceguard implementation: Package
Determination” on page 6.)
The time needed for cluster component recovery depends mainly on how many packages need to be
restarted. At the end of this recovery phase, Serviceguard starts the packages by launching their
control scripts one after the other, using an “exec” process.
The user cannot directly change the time needed for cluster component recovery; in general, it is a
short step (typically less than one second).
Environments using Serviceguard with VERITAS CVM 4.1 or Serviceguard Storage Management Suite
with VERITAS CFS require additional time during cluster component recovery to synchronize cluster
memberships between Serviceguard and VERITAS cluster components prior to package determination.
The time required to synchronize memberships depends largely on the type of failure. There are three
types of failures to consider. After a system panic, machine check, or power failure, cluster
component recovery requires an additional 4 seconds. After a node or service failfast failure, it
requires an additional 8 seconds. Finally, after failures in which the cluster monitor is unable to
run or is killed, such as a kernel hang or reboot(1M), synchronizing the memberships can take up to
one additional cluster reformation time.
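The three cases above can be summarized as a small lookup. The following Python sketch is illustrative only: the category names and the function are hypothetical labels rather than Serviceguard identifiers, and the cluster reformation time must be supplied by the reader for their own cluster.

```python
# Additional cluster component recovery time, in seconds, for configurations
# using VERITAS CVM 4.1 or CFS, taken from the figures quoted in the text.
# The category names are illustrative labels, not Serviceguard identifiers.
FIXED_EXTRA_SECONDS = {
    "panic_mca_or_power_failure": 4.0,
    "node_or_service_failfast": 8.0,
}


def extra_sync_seconds(failure_category, cluster_reformation_seconds):
    """Return the upper bound on added membership synchronization time."""
    if failure_category in FIXED_EXTRA_SECONDS:
        return FIXED_EXTRA_SECONDS[failure_category]
    if failure_category == "cluster_monitor_unavailable":
        # Kernel hang or reboot(1M): up to one more cluster reformation time.
        return float(cluster_reformation_seconds)
    raise ValueError(f"unknown failure category: {failure_category!r}")


if __name__ == "__main__":
    # 28 seconds is a made-up reformation time used only for illustration.
    for category in (
        "panic_mca_or_power_failure",
        "node_or_service_failfast",
        "cluster_monitor_unavailable",
    ):
        print(category, extra_sync_seconds(category, 28.0))
```

The third case is an upper bound rather than a fixed cost, which is why the sketch takes the reformation time as an input instead of hard-coding a value.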
Users with Serviceguard and VERITAS CVM 4.1 or VERITAS CFS configurations can minimize cluster
component recovery time by always using cmhaltnode(1M) prior to issuing shutdown(1M) or
reboot(1M) when restarting a node in the cluster.
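As a rough illustration of that ordering, the Python sketch below builds (but does not run) the command sequence for a planned node restart. The function name and the node name are hypothetical, and command options are deliberately omitted; consult cmhaltnode(1M), shutdown(1M), and reboot(1M) for the options appropriate to a given cluster.

```python
def restart_plan(node_name, restart_command):
    """Return the commands to run, in order, for a planned node restart.

    restart_command is the caller's own shutdown(1M) or reboot(1M) invocation,
    given as a list of arguments; no options are assumed by this sketch.
    """
    if not restart_command or restart_command[0] not in ("shutdown", "reboot"):
        raise ValueError("restart_command must start with shutdown or reboot")
    # Halt cluster services on the node first, then restart it.
    return [["cmhaltnode", node_name], list(restart_command)]


if __name__ == "__main__":
    # "node1" is a placeholder node name; the commands are only printed.
    for command in restart_plan("node1", ["shutdown"]):
        print(" ".join(command))
```

Returning the plan instead of executing it keeps the sketch safe to run anywhere; an administrator would review the printed commands before issuing them on the node.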
Standard Serviceguard implementation: Resource Recovery
When Serviceguard starts a package’s control script, the application-dependent part of failover
begins. Package resources are made available, ready for the package’s applications to start.
Package resources include IP addresses, file systems, volume groups, and disk groups needed by the
package. Some resources may require other recovery steps before they can be used.
The time to complete resource recovery is determined largely by the package control script and
depends on the package’s applications, services, and resources.
Standard Serviceguard implementation: Applications Recovery
The control scripts’ commands complete the application-dependent part of failover, recovering and
restarting package applications. The amount of time it takes depends on the applications and how
they are configured.
Serviceguard Extension for RAC: Group Membership Reconfiguration
When Serviceguard Extension for RAC communicates the group membership to Oracle RAC, the
application-dependent part of a RAC failover starts. If there is a change in membership, RAC will start
reconfiguration. RAC needs to know which nodes are in the re-formed cluster; if the node holding the
database lock leaves the cluster, another node needs to claim the lock.
The time needed for group membership reconfiguration is determined by RAC, and the user cannot
directly change it.
Serviceguard Extension for RAC: RAC Reconfiguration
After Oracle RAC is notified of a cluster membership change, it starts its own reconfiguration to claim
the database locks that were on failed nodes. RAC reconfiguration and recovery occur in the RAC
instances running on the other nodes in the cluster.