Optimizing Failover Time in a Serviceguard Environment, June 2007


• Quiescence—During this quiet waiting time, non-members of the newly formed cluster are rebooted.

• Cluster component recovery—Serviceguard performs miscellaneous tasks, such as cluster information synchronization and package determination, before the cluster resumes work.

During the application-dependent phase of the failover time, Serviceguard starts the package control scripts, which are written by the user. In standard Serviceguard implementations, this phase has two steps, as shown in Figure 1.

• Resource recovery—The package’s resources are made available.

• Application recovery—If applications or processes were moved to a new node, they are restarted.

As shown in Figure 2, the application-dependent steps differ slightly for an Oracle Real Application Clusters (RAC) package in a cluster with Serviceguard Extension for RAC.

Figure 2. Steps in a failover caused by a failed node—Serviceguard Extension for RAC implementation

[Figure: timeline of failover steps. Serviceguard component of failover time: node failure detection; cluster re-formation (election, lock acquisition, quiescence); cluster component recovery. Application-dependent failover time: group membership reconfiguration; RAC reconfiguration and database recovery.]

Note: Diagram is not to scale.

The two application-dependent steps for a RAC implementation are:

• Group membership reconfiguration—If there is a change in membership, RAC starts the reconfiguration.

• RAC reconfiguration and database recovery—After a cluster membership change, RAC reassigns the database locks that were on failed nodes and restarts the databases.
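
To make the split between the two parts of the failover time concrete, the following minimal Python sketch adds up the steps named above for the RAC case. The `Phase` class, the `failover_breakdown` function, and every duration are hypothetical placeholders chosen for illustration; they are not measurements and not part of Serviceguard or RAC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    seconds: float       # placeholder value only, not a measurement
    app_dependent: bool  # True for the application-dependent steps


def failover_breakdown(phases):
    """Return (serviceguard_seconds, application_seconds, total_seconds)."""
    serviceguard = sum(p.seconds for p in phases if not p.app_dependent)
    application = sum(p.seconds for p in phases if p.app_dependent)
    return serviceguard, application, serviceguard + application


# Steps of a RAC failover in the order described in the text.
# All durations are invented for the example.
RAC_FAILOVER = [
    Phase("node failure detection", 2.0, False),
    Phase("election", 1.0, False),
    Phase("lock acquisition", 1.0, False),
    Phase("quiescence", 3.0, False),
    Phase("cluster component recovery", 1.0, False),
    Phase("group membership reconfiguration", 2.0, True),
    Phase("RAC reconfiguration and database recovery", 10.0, True),
]

if __name__ == "__main__":
    sg, app, total = failover_breakdown(RAC_FAILOVER)
    print(f"Serviceguard component: {sg} s")
    print(f"Application-dependent:  {app} s")
    print(f"Total failover time:    {total} s")
```

Keeping each step as a separate entry makes it easy to see which part of the total a given tuning change would affect.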

Node Failure Detection

If a node does not receive a heartbeat message from another node, it declares the other node unreachable. A node may be unreachable for many reasons. There may be a transient interruption that recovers automatically in a short time, such as a spike in network activity, I/O, or CPU load, or a temporary kernel hang. Or there may be a failure that will not recover automatically or quickly, such as a hardware or power supply failure, or a crashed operating system.

A node is considered failed if no heartbeat is received from it during the time specified as NODE_TIMEOUT in the cluster configuration file. After the NODE_TIMEOUT value is reached, Serviceguard begins to re-form the cluster without the failed node.
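
The timeout rule can be sketched in a few lines of Python. This illustrates timeout-based failure detection in general, not Serviceguard's implementation; the function name `nodes_past_timeout`, the use of seconds as the unit, and the sample timestamps are assumptions made for the example.

```python
def nodes_past_timeout(last_heartbeat, now, node_timeout):
    """Return the sorted names of nodes whose last heartbeat is too old.

    last_heartbeat: dict mapping node name -> time the last heartbeat arrived
    now:            current time, in the same unit as the timestamps
    node_timeout:   hypothetical stand-in for the NODE_TIMEOUT setting
    """
    return sorted(
        node
        for node, seen in last_heartbeat.items()
        if now - seen >= node_timeout  # timeout reached: treat node as failed
    )


if __name__ == "__main__":
    # Invented timestamps, in seconds, for three cluster nodes.
    heartbeats = {"node1": 100.0, "node2": 97.5, "node3": 99.0}
    print(nodes_past_timeout(heartbeats, now=100.0, node_timeout=2.0))
```

A short transient interruption delays heartbeats without exceeding the timeout, so the node is not reported; only a silence at least as long as the timeout counts as a failure.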

Cluster Reformation Time

Cluster reformation time includes three components:

• Election of Cluster Membership

• Lock Acquisition

• Quiescence