Managing Serviceguard 14th Edition, June 2007

Understanding Serviceguard Software Components

Responses to Failures

For more information on cluster failover, see the white paper Optimizing

Failover Time in a Serviceguard Environment at

http://www.docs.hp.com->High

Availability->Serviceguard->White Papers.

Responses to Hardware Failures

If a serious system problem occurs, such as a system panic or physical

disruption of the SPU's circuits, Serviceguard recognizes a node failure

and transfers the failover packages currently running on that node to an

adoptive node elsewhere in the cluster. (System multi-node and

multi-node packages do not fail over.)
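
The transfer rule above can be modeled roughly as follows. This is only an illustrative sketch, not Serviceguard code: the Package class, its fields, and the rule "take the first configured node that is still up" are assumptions made for the example, and the actual choice depends on the package configuration and its failover policy.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Package:
    name: str
    kind: str          # "failover", "multi_node", or "system_multi_node"
    nodes: List[str]   # hypothetical ordered node list, primary first
    running_on: str    # node the package is currently running on


def plan_transfers(packages: List[Package], failed_node: str,
                   up_nodes: List[str]) -> Dict[str, Optional[str]]:
    """Map each failover package that ran on failed_node to an adoptive node.

    A value of None means no configured node is currently available.
    """
    plan: Dict[str, Optional[str]] = {}
    for pkg in packages:
        # Only failover packages running on the failed node are transferred;
        # multi-node and system multi-node packages are skipped.
        if pkg.kind != "failover" or pkg.running_on != failed_node:
            continue
        # Assumed selection rule: first configured node that is still up.
        candidates = [n for n in pkg.nodes
                      if n != failed_node and n in up_nodes]
        plan[pkg.name] = candidates[0] if candidates else None
    return plan
```

For instance, with a failover package configured for nodes n1, n2, n3 and running on n1, a failure of n1 while n2 and n3 are up yields n2 as the adoptive node in this model.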

The new location for each failover package is determined by that

package's configuration file, which lists primary and alternate nodes for

the package. Transfer of a package to another node does not transfer the

program counter. Processes in a transferred package will restart from

the beginning. In order for an application to be swiftly restarted after a

failure, it must be “crash-tolerant”; that is, all processes in the package

must be written so that they can detect such a restart. This is the same

application design required for restart after a normal system crash.
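
One common way for a process to detect that it is being restarted after a failure is to leave a marker while it runs and remove it only on a clean shutdown. The sketch below assumes this marker-file approach; the function names and the marker path are hypothetical, and in a cluster the marker would have to live on storage that the adoptive node can also read.

```python
import os
import tempfile


def start(marker_path: str) -> str:
    """Return "recovered" if the previous run ended uncleanly, else "clean".

    A leftover marker means the last run never reached stop(), so the
    application should run its recovery steps before resuming work.
    """
    state = "recovered" if os.path.exists(marker_path) else "clean"
    # Record that the application is now running.
    with open(marker_path, "w") as f:
        f.write(str(os.getpid()))
    return state


def stop(marker_path: str) -> None:
    """Clean shutdown: remove the marker so the next start is "clean"."""
    if os.path.exists(marker_path):
        os.remove(marker_path)
```

Calling start() twice on the same path without an intervening stop() simulates a crash followed by a restart: the first call returns "clean" and the second returns "recovered".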

If a LAN interface fails, Serviceguard switches locally to a standby

LAN interface if one exists. If a heartbeat LAN interface fails and no

standby or redundant heartbeat is configured, the node fails with a

system reset. If a monitored data LAN interface fails without a standby,

the node fails with a system reset only if NODE_FAILFAST_ENABLED

(described further in “Package Configuration Planning” on page 165) is

set to YES for the package. Otherwise any packages using that LAN

interface will be halted and moved to another node if possible.
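
The LAN failure cases above can be summarized as a small decision function. This is a sketch of the rules as stated in the text, not Serviceguard's implementation; the function name, argument names, and return strings are invented for illustration. The outcome when a redundant heartbeat exists is not spelled out above and is assumed here to be that the node stays up.

```python
def lan_failure_response(role: str, has_standby: bool,
                         redundant_heartbeat: bool = False,
                         node_failfast_enabled: bool = False) -> str:
    """Return the modeled response to a LAN interface failure.

    role is "heartbeat" or "data" (a monitored data LAN interface).
    node_failfast_enabled mirrors NODE_FAILFAST_ENABLED set to YES.
    """
    if has_standby:
        # A standby interface exists: switch to it locally.
        return "local_switch"
    if role == "heartbeat":
        if redundant_heartbeat:
            # Assumption: heartbeat continues on the redundant LAN.
            return "node_stays_up"
        return "node_reset"
    if role == "data":
        if node_failfast_enabled:
            return "node_reset"
        # Packages using the interface are halted and moved if possible.
        return "halt_and_move_packages"
    raise ValueError(f"unknown interface role: {role!r}")
```

In this model, a monitored data LAN failure with no standby returns "halt_and_move_packages" unless node_failfast_enabled is True, in which case it returns "node_reset".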

Disk protection is provided by separate products, such as Mirrordisk/UX

in LVM or Veritas mirroring in VxVM and related products. In addition,

separately available EMS disk monitors allow you to notify operations

personnel when a specific failure, such as a lock disk failure, takes place.

Refer to the manual Using High Availability Monitors (HP part number

B5736-90046) for additional information; you can find it at

http://www.docs.hp.com->11i v2->HP-UX 11i v2 Enterprise

Operating Environment->Event Monitoring Service and HA

Monitors.