Managing Serviceguard Nineteenth Edition, Reprinted June 2011


Failure. Only one LAN has been configured for both heartbeat and data traffic. During the course of operations, heavy application traffic monopolizes the bandwidth of the network, preventing heartbeat packets from getting through.

Since SystemA does not receive heartbeat messages from SystemB, SystemA attempts to re-form as a one-node cluster. Likewise, since SystemB does not receive heartbeat messages from SystemA, SystemB also attempts to re-form as a one-node cluster. During the election protocol, each node votes for itself, giving both nodes 50 percent of the vote. Because both nodes have 50 percent of the vote, both nodes now vie for the cluster lock. Only one node will get the lock.
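The tie-break described above can be sketched as follows. This is a minimal illustration, not Serviceguard code: the ClusterLock class and the can_reform function are hypothetical names, and the rule that a strict majority re-forms without the lock while a minority cannot re-form at all is an assumption added to complete the picture; the text above covers only the 50 percent case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterLock:
    """Hypothetical stand-in for the cluster lock: the first claimant keeps it."""
    owner: Optional[str] = None

    def try_acquire(self, node: str) -> bool:
        if self.owner is None:
            self.owner = node
        return self.owner == node


def can_reform(votes: int, prior_members: int, node: str, lock: ClusterLock) -> bool:
    """Decide whether the partition containing `node` may re-form the cluster."""
    if 2 * votes > prior_members:
        # Assumption: a strict majority does not need the lock.
        return True
    if 2 * votes == prior_members:
        # Exactly 50 percent of the vote: the cluster lock breaks the tie.
        return lock.try_acquire(node)
    # Assumption: a minority partition cannot re-form.
    return False


lock = ClusterLock()
print(can_reform(1, 2, "SystemA", lock))  # True: SystemA gets the lock
print(can_reform(1, 2, "SystemB", lock))  # False: the lock is already taken
```

In the scenario above, each node holds one of two votes, so the outcome depends only on which node reaches the lock first.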

Outcome. Assume SystemA gets the cluster lock. SystemA re-forms as a one-node cluster. After re-formation, SystemA will make sure all applications configured to run on an existing cluster node are running. When SystemA discovers Package2 is not running in the cluster, it will try to start Package2 if Package2 is configured to run on SystemA.

SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the cluster. To release all resources related to Package2 (such as exclusive access to volume group vg02 and the Package2 IP address) as quickly as possible, SystemB halts (system reset).

NOTE: If AUTOSTART_CMCLD in /etc/rc.config.d/cmcluster ($SGAUTOSTART) is set to zero, the node will not attempt to join the cluster when it comes back up.

For more information on cluster failover, see the white paper Optimizing Failover Time in a Serviceguard Environment (version A.11.19 and later) at http://www.hp.com/go/hpux-serviceguard-docs. For troubleshooting information, see “Cluster Re-formations Caused by MEMBER_TIMEOUT Being Set too Low” (page 320).

Responses to Hardware Failures

If a serious system problem occurs, such as a system panic or physical disruption of the SPU's circuits, Serviceguard recognizes a node failure and transfers the failover packages currently running on that node to an adoptive node elsewhere in the cluster. (System multi-node and multi-node packages do not fail over.)

The new location for each failover package is determined by that package's configuration file, which lists primary and alternate nodes for the package. Transfer of a package to another node does not transfer the program counter. Processes in a transferred package will restart from the beginning. For an application to be restarted swiftly after a failure, it must be “crash-tolerant”; that is, all processes in the package must be written so that they can detect such a restart. This is the same application design required for restart after a normal system crash.
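As a rough illustration of how a list of primary and alternate nodes can determine the new location, the sketch below picks the first listed node that is still up. The function name and arguments are hypothetical, and choosing strictly in list order is an assumption made for the example; the actual choice is governed by the package configuration.

```python
from typing import Iterable, Optional, Sequence


def choose_adoptive_node(
    configured_nodes: Sequence[str],
    up_nodes: Iterable[str],
    failed_node: str,
) -> Optional[str]:
    """Return the first configured node that is up and is not the failed node.

    Assumption: nodes are tried in the order they are listed (primary first,
    then alternates). Returns None if no eligible node remains.
    """
    available = set(up_nodes)
    for node in configured_nodes:
        if node != failed_node and node in available:
            return node
    return None


# Package2 lists SystemB as primary and SystemA as alternate; SystemB has failed.
print(choose_adoptive_node(["SystemB", "SystemA"], {"SystemA"}, "SystemB"))  # SystemA
```

Whichever node is chosen, the package's processes start from the beginning there, which is why the application itself must be able to detect and recover from the restart.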

In the event of a LAN interface failure, a local switch is done to a standby LAN interface if one exists. If a heartbeat LAN interface fails and no standby or redundant heartbeat is configured, the node fails with a system reset. If a monitored data LAN interface fails without a standby, the node fails with a system reset only if node_fail_fast_enabled (page 224) is set to YES for the package. Otherwise any packages using that LAN interface will be halted and moved to another node if possible (unless the LAN recovers immediately; see “When a Service, Subnet, or Monitored Resource Fails, or a Dependency is Not Met” (page 61)).
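The LAN failure rules above can be summarized as a small decision function. This is a sketch only: the function, its arguments, and the returned strings are hypothetical, it treats an interface as either a heartbeat or a data interface, and the branch for a heartbeat failure with a redundant heartbeat still available is an inference from the text rather than something the text states.

```python
LOCAL_SWITCH = "local switch to standby LAN interface"
SYSTEM_RESET = "node fails with a system reset"
NODE_STAYS_UP = "node stays up on the redundant heartbeat"
HALT_AND_MOVE = "halt packages using the interface and move them if possible"


def lan_failure_response(
    is_heartbeat: bool,
    has_standby: bool,
    has_redundant_heartbeat: bool = False,
    node_fail_fast_enabled: bool = False,
) -> str:
    """Map a LAN interface failure to the response described in the text."""
    if has_standby:
        return LOCAL_SWITCH
    if is_heartbeat:
        if has_redundant_heartbeat:
            # Inferred: the text only covers the case with no redundant heartbeat.
            return NODE_STAYS_UP
        return SYSTEM_RESET
    # Monitored data LAN interface without a standby.
    if node_fail_fast_enabled:
        return SYSTEM_RESET
    return HALT_AND_MOVE


print(lan_failure_response(is_heartbeat=True, has_standby=False))
print(lan_failure_response(is_heartbeat=False, has_standby=False))
```

The sketch ignores the case in which the LAN recovers immediately, which the referenced section on page 61 covers.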

Disk protection is provided by separate products, such as Mirrordisk/UX in LVM or Veritas mirroring in VxVM and related products. In addition, separately available EMS disk monitors allow you to notify operations personnel when a specific failure, such as a lock disk failure, takes place. Refer to the manual Using High Availability Monitors, which you can find at the address given in the preface to this manual.

Serviceguard does not respond directly to power failures, although a loss of power to an individual cluster component may appear to Serviceguard like the failure of that component, and will result in the appropriate switching behavior. Power protection is provided by HP-supported uninterruptible power supplies (UPS).
