Optimizing Failover Time in a Serviceguard Environment, June 2007

• RAC reconfiguration and database recovery: After a cluster membership change, RAC reassigns the database locks that were held by the failed nodes and restarts the databases.

Standard Serviceguard implementation: Resource Failure Detection

Serviceguard monitors the configured package services and networks.

A Serviceguard package configuration can include the Event Monitoring Service (EMS), which monitors hardware such as storage. EMS polls a resource monitor to obtain the current status of each monitored resource. The polling interval is set in the package configuration file. When EMS detects a resource failure, it immediately notifies Serviceguard.
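Because EMS only learns of a failure at its next poll, the polling interval bounds how long a failure can go unnoticed. The following is a minimal sketch of that relationship, not Serviceguard code. The function name `detection_delay` is hypothetical, and the sketch assumes that polls occur at fixed multiples of the interval starting at time zero and that the notification itself adds no delay.

```python
import math


def detection_delay(failure_time: float, polling_interval: float) -> float:
    """Return the seconds between a resource failure and the poll that sees it.

    Assumptions (illustrative only): polls occur at t = 0, interval,
    2 * interval, ... and notification after a poll is instantaneous.
    """
    if polling_interval <= 0:
        raise ValueError("polling_interval must be positive")
    if failure_time < 0:
        raise ValueError("failure_time must not be negative")
    # The first poll at or after the failure is the one that observes it.
    next_poll = math.ceil(failure_time / polling_interval) * polling_interval
    return next_poll - failure_time


if __name__ == "__main__":
    # A failure one second after a poll waits almost a full interval.
    print(detection_delay(failure_time=61, polling_interval=60))
    # A failure that coincides with a poll is seen at once.
    print(detection_delay(failure_time=120, polling_interval=60))
```

Under these assumptions the delay is always less than one polling interval, so a shorter interval reduces worst-case detection time at the cost of more frequent polling.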

Generally, when a resource fails, the package fails over to another node. If the package is configured with NODE_FAIL_FAST_ENABLED set to YES, Serviceguard instead causes the node itself to fail. In that case, failover begins at the first step described in the previous section, “The process when failover is caused by a node failure.”

Standard Serviceguard implementation: Package Determination

When a package fails, Serviceguard automatically tries to restart it only if AUTO_RUN is set to YES in the package configuration file. In that case, Serviceguard next determines where to start the package: it creates an ordered list of candidate nodes, prioritized according to the package’s node list and the failover and failback policies in the package’s configuration file.

The user cannot directly change the time needed for package determination.
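The node ordering described above can be illustrated with a short sketch. It is a simplified model, not Serviceguard's actual algorithm. The function names, the `running_packages` counts, and the node names are hypothetical. The policy names `CONFIGURED_NODE` and `MIN_PACKAGE_NODE` are assumed here to be the failover policy values, and the failback policy is not modeled.

```python
from typing import Dict, List, Optional, Set


def candidate_nodes(
    node_list: List[str],
    up_nodes: Set[str],
    running_packages: Dict[str, int],
    failover_policy: str = "CONFIGURED_NODE",
) -> List[str]:
    """Return the available nodes in the order they would be tried."""
    # Only nodes that are currently up can host the package.
    candidates = [node for node in node_list if node in up_nodes]
    if failover_policy == "CONFIGURED_NODE":
        # Keep the order given in the package's node list.
        return candidates
    if failover_policy == "MIN_PACKAGE_NODE":
        # Prefer the node running the fewest packages; sorted() is stable,
        # so ties keep their node-list order.
        return sorted(candidates, key=lambda node: running_packages.get(node, 0))
    raise ValueError(f"unknown failover policy: {failover_policy}")


def choose_node(
    auto_run: bool,
    node_list: List[str],
    up_nodes: Set[str],
    running_packages: Dict[str, int],
    failover_policy: str = "CONFIGURED_NODE",
) -> Optional[str]:
    """Return the first candidate node, or None if no restart is attempted."""
    if not auto_run:
        return None  # AUTO_RUN is not YES: no automatic restart
    ordered = candidate_nodes(node_list, up_nodes, running_packages, failover_policy)
    return ordered[0] if ordered else None


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]
    up = {"node2", "node3"}  # node1 has failed
    load = {"node2": 3, "node3": 1}
    print(choose_node(True, nodes, up, load, "CONFIGURED_NODE"))
    print(choose_node(True, nodes, up, load, "MIN_PACKAGE_NODE"))
```

In this model the two policies can pick different nodes from the same node list, which is why the policy settings affect where a package restarts even though they do not change how long the determination takes.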

Standard Serviceguard implementation: Resource Recovery

Serviceguard starts each package’s control script to begin the application-dependent part of failover. During failover, package resources need to be made available before applications can be started. The package resources include IP addresses, file systems, volume groups, and disk groups. Some resources may require recovery before they can be used. Instructions for this are in each package’s control script.

The time required for resource recovery depends on the number of resources and the instructions in the control script.
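As a rough illustration of why the resource count matters, the sketch below adds up per-resource times. It is only a model under stated assumptions: resources are brought up one after another, the class and function names are hypothetical, and every duration is an invented placeholder rather than a measured value.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ResourceStep:
    """One package resource and the (hypothetical) time it takes to bring up."""

    name: str
    activate_seconds: float
    needs_recovery: bool = False
    recovery_seconds: float = 0.0


def estimate_recovery_time(steps: Iterable[ResourceStep]) -> float:
    """Sum activation and recovery times, assuming strictly serial execution."""
    total = 0.0
    for step in steps:
        total += step.activate_seconds
        if step.needs_recovery:
            # Recovery (for example, a file system check) runs before use.
            total += step.recovery_seconds
    return total


if __name__ == "__main__":
    # Placeholder durations chosen only to show the arithmetic.
    steps = [
        ResourceStep("volume group", activate_seconds=5.0),
        ResourceStep("file system", activate_seconds=2.0,
                     needs_recovery=True, recovery_seconds=30.0),
        ResourceStep("IP address", activate_seconds=1.0),
    ]
    print(estimate_recovery_time(steps))
```

In this model, each added resource and each resource that needs recovery lengthens the total, so reducing either one shortens this phase of failover.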

Standard Serviceguard implementation: Application Startup

The control scripts complete the application-dependent part of failover by running the commands that recover and restart the package applications. Some packages and some applications may also perform their own recovery before startup.

The time required for application startup depends on each application and how it is configured.

Serviceguard Extension for RAC: Group Membership Reconfiguration

RAC group membership reconfiguration is the same whether failover is caused by a node failure or a package failure. To start group membership reconfiguration, Serviceguard Extension for RAC communicates the group membership to Oracle RAC. If the membership has changed, RAC performs the reconfiguration.

The time required for group membership reconfiguration depends on the RAC configuration.

Serviceguard Extension for RAC: RAC Reconfiguration and Database Recovery

After Oracle RAC is notified of a cluster membership change, it starts its own reconfiguration and recovery.

The time required for RAC reconfiguration and database recovery depends on the RAC configuration.