Optimizing Failover Time in a Serviceguard Environment, June 2007
• RAC reconfiguration and database recovery—After a cluster membership change, RAC reassigns
the database locks that were on failed nodes and restarts the databases.
Standard Serviceguard implementation: Resource Failure Detection
Serviceguard monitors the configured package services and networks.
A Serviceguard package configuration can also include Event Monitoring Service (EMS) resources,
which monitor hardware such as storage. EMS polls a resource monitor to obtain the status of each
resource. The polling interval is set in the package configuration file. When EMS detects a resource
failure, it immediately notifies Serviceguard.
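Because EMS learns of a failure only when it polls, the polling interval bounds how long a failure can go unnoticed. The following Python sketch illustrates that arithmetic. It is a simplified model, not EMS code: it assumes polls occur at fixed multiples of the interval starting at time zero and that each poll completes instantly, and the function names are hypothetical.

```python
import math


def detection_time(failure_time: float, polling_interval: float) -> float:
    """Return the time of the first poll at or after failure_time.

    Assumption: polls happen at t = 0, interval, 2 * interval, ...
    """
    if polling_interval <= 0:
        raise ValueError("polling_interval must be positive")
    if failure_time < 0:
        raise ValueError("failure_time must not be negative")
    return math.ceil(failure_time / polling_interval) * polling_interval


def detection_delay(failure_time: float, polling_interval: float) -> float:
    """Return how long the failure goes undetected under this model."""
    return detection_time(failure_time, polling_interval) - failure_time
```

Under these assumptions the delay is always less than one polling interval, so a shorter interval reduces detection time at the cost of more frequent polling.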
Generally, when a resource fails, the package fails over to another node. If the package is
configured with NODE_FAIL_FAST_ENABLED set to YES, Serviceguard instead causes the node to fail.
In that case, failover proceeds from the first step described in the previous section, “The process
when failover is caused by a node failure.”
Standard Serviceguard implementation: Package Determination
When a package fails, Serviceguard automatically tries to restart it if AUTO_RUN is set to YES
in the package configuration file. In that case, Serviceguard next determines where to start the
package. It creates an ordered list of candidate nodes, prioritized according to the package’s node
list and the failover and failback policy settings in the package’s configuration file.
The user cannot directly change the time needed for package determination.
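The ordering step can be pictured with a short Python sketch. It is an illustration under stated assumptions, not Serviceguard's implementation: the function and parameter names are hypothetical, CONFIGURED_NODE and MIN_PACKAGE_NODE are used as the two failover policy values, the node where the package failed is assumed to be excluded, ties under MIN_PACKAGE_NODE are assumed to fall back to node-list order, and the failback policy is not modeled.

```python
from typing import Dict, List, Optional, Set


def candidate_nodes(
    node_list: List[str],
    failover_policy: str,
    up_nodes: Set[str],
    running_packages: Dict[str, int],
    failed_node: Optional[str] = None,
) -> List[str]:
    """Return candidate nodes for a package, best candidate first."""
    # Only nodes that are up, and not the node the package failed on
    # (assumption), can be candidates.
    eligible = [n for n in node_list if n in up_nodes and n != failed_node]

    if failover_policy == "CONFIGURED_NODE":
        # Keep the order given in the package's node list.
        return eligible
    if failover_policy == "MIN_PACKAGE_NODE":
        # Prefer the node running the fewest packages. sorted() is stable,
        # so ties keep their node-list order (assumption).
        return sorted(eligible, key=lambda n: running_packages.get(n, 0))
    raise ValueError(f"unknown failover policy: {failover_policy}")
```

For example, with a node list of node1, node2, node3, a failure on node1, and node3 running fewer packages than node2, the sketch puts node2 first under CONFIGURED_NODE and node3 first under MIN_PACKAGE_NODE.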
Standard Serviceguard implementation: Resource Recovery
Serviceguard starts each package’s control script to begin the application-dependent part of failover.
During failover, package resources need to be made available before applications can be started.
The package resources include IP addresses, file systems, volume groups, and disk groups. Some
resources may require recovery before they can be used. Instructions for this are in each package’s
control script.
The time required for resource recovery depends on the number of resources and the instructions in
the control script.
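The relationship between the control script's steps and recovery time can be sketched as follows. This Python model is only an illustration and does not reproduce a real control script: the function name, the step names in the usage note, and the rule of stopping at the first failed step are assumptions. It shows resource steps running before application steps and records how long each step takes.

```python
import time
from typing import Callable, List, Sequence, Tuple

# A step is a (name, action) pair; the action returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_package_startup(
    resource_steps: Sequence[Step],
    application_steps: Sequence[Step],
) -> Tuple[bool, List[Tuple[str, float]]]:
    """Run resource steps, then application steps, timing each one.

    Returns (success, timings), where timings lists (step name, seconds)
    for every step that was attempted. Stops at the first failed step
    (assumption), so applications never start on missing resources.
    """
    timings: List[Tuple[str, float]] = []
    for name, action in list(resource_steps) + list(application_steps):
        started = time.perf_counter()
        ok = action()
        timings.append((name, time.perf_counter() - started))
        if not ok:
            return False, timings
    return True, timings
```

A caller might pass hypothetical resource steps such as activating volume groups, mounting file systems, and adding IP addresses, followed by an application start step. Summing the recorded times shows how each additional resource lengthens this phase.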
Standard Serviceguard implementation: Application Startup
The control scripts complete the application-dependent part of failover by running the commands
that recover and restart the package’s applications. Some packages and applications may also
perform their own recovery before startup.
The time required for application startup depends on each application and how it is configured.
Serviceguard Extension for RAC: Group Membership Reconfiguration
RAC group membership reconfiguration is the same whether failover is caused by a node failure or
a package failure. To start group membership reconfiguration, Serviceguard Extension for RAC
communicates the group membership to Oracle RAC. If the membership has changed, RAC performs
the reconfiguration.
The time required for group membership reconfiguration depends on the RAC configuration.
Serviceguard Extension for RAC: RAC Reconfiguration and Database Recovery
After Oracle RAC is notified of a cluster membership change, it starts its own reconfiguration and
recovery.
The time required for RAC reconfiguration and database recovery depends on the RAC configuration.
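Taken together, the phases described in this section suggest a simple way to budget failover time after a package failure. The Python sketch below adds up per-phase durations and reports the longest phase. It assumes the phases run strictly one after another; if any phases overlap, the sum is an upper bound. The function names are hypothetical, and no measured values are implied.

```python
from typing import Dict


def total_failover_time(phase_seconds: Dict[str, float]) -> float:
    """Sum the per-phase durations, assuming the phases do not overlap."""
    if any(seconds < 0 for seconds in phase_seconds.values()):
        raise ValueError("phase durations must not be negative")
    return sum(phase_seconds.values())


def slowest_phase(phase_seconds: Dict[str, float]) -> str:
    """Return the name of the phase that contributes the most time."""
    if not phase_seconds:
        raise ValueError("at least one phase is required")
    return max(phase_seconds, key=lambda name: phase_seconds[name])


# Phase names taken from this section; durations must come from
# measurements on the cluster in question.
PHASES = (
    "resource failure detection",
    "package determination",
    "resource recovery",
    "application startup",
    "group membership reconfiguration",
    "RAC reconfiguration and database recovery",
)
```

Filling a dictionary keyed by PHASES with measured durations and calling slowest_phase shows where tuning effort is most likely to pay off.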