Managing Serviceguard 11th Edition, Version A.11.16, Second Printing June 2004

Understanding Serviceguard Software Components

Responses to Failures

Responses to Hardware Failures

If a serious system problem occurs, such as a system panic or physical disruption of the SPU's circuits, Serviceguard recognizes a node failure and transfers the packages currently running on that node to an adoptive node elsewhere in the cluster. The new location for each package is determined by that package's configuration file, which lists primary and alternate nodes for the package. Transfer of a package to another node does not transfer the program counter. Processes in a transferred package will restart from the beginning. In order for an application to be expeditiously restarted after a failure, it must be “crash-tolerant”; that is, all processes in the package must be written so that they can detect such a restart. This is the same application design required for restart after a normal system crash.
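One common way for a process to detect that it is being restarted after an unclean stop is a marker file, sketched below in Python. This is only an illustration of the "crash-tolerant" idea, not part of Serviceguard: the function names, the marker file name, and the state directory are all hypothetical, and the sketch assumes the state directory lives on storage that moves with the package to the adoptive node.

```python
import os

MARKER_NAME = "running.marker"  # hypothetical file name, chosen for this sketch


def start_service(state_dir):
    """Record that the service is running and report how the last run ended.

    Returns "recovered" if a marker from a previous run is still present
    (the previous run never reached stop_service), otherwise "clean".
    """
    marker = os.path.join(state_dir, MARKER_NAME)
    previous_run_crashed = os.path.exists(marker)
    # Create or refresh the marker; it stays in place until a clean stop.
    with open(marker, "w") as f:
        f.write(str(os.getpid()))
    return "recovered" if previous_run_crashed else "clean"


def stop_service(state_dir):
    """Remove the marker so the next start is treated as a clean start."""
    marker = os.path.join(state_dir, MARKER_NAME)
    if os.path.exists(marker):
        os.remove(marker)
```

An application that sees "recovered" would then run whatever recovery it needs (for example, replaying a log) before resuming normal work.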

In the event of a LAN interface failure, Serviceguard performs a local switch to a standby LAN interface if one exists. If a heartbeat LAN interface fails and no standby or redundant heartbeat is configured, the node fails with a TOC (Transfer of Control). If a monitored data LAN interface fails without a standby, the node fails with a TOC only if NODE_FAILFAST_ENABLED (described further in the “Planning” chapter under “Package Configuration Planning”) is set to YES for the package.
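The LAN failure rules above can be summarized as a small decision function. The Python sketch below only models the three cases described in this section; it is not Serviceguard code, and the function name, parameter names, and return strings are hypothetical labels chosen for illustration.

```python
def lan_failure_response(is_heartbeat, has_standby,
                         has_redundant_heartbeat=False,
                         node_failfast_enabled=False):
    """Model the response to a LAN interface failure as described in the text.

    Returns one of the hypothetical labels:
      "local_switch"   - traffic moves to a standby LAN interface
      "toc"            - the node fails with a TOC
      "node_stays_up"  - no node failure results from this rule set
    """
    # A standby interface, if one exists, always takes over first.
    if has_standby:
        return "local_switch"
    if is_heartbeat:
        # A heartbeat LAN with no standby and no redundant heartbeat
        # brings the node down.
        return "node_stays_up" if has_redundant_heartbeat else "toc"
    # Monitored data LAN without a standby: TOC only when the package
    # has NODE_FAILFAST_ENABLED set to YES.
    return "toc" if node_failfast_enabled else "node_stays_up"
```

For example, a heartbeat interface with neither a standby nor a redundant heartbeat yields "toc", while a data interface without a standby yields "toc" only when the fail-fast flag is passed as True.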

Disk protection is provided by separate products, such as MirrorDisk/UX in LVM or VERITAS mirroring in VxVM and CVM. In addition, separately available EMS disk monitors allow you to notify operations personnel when a specific failure, such as a lock disk failure, takes place. Refer to the manual Using High Availability Monitors (HP part number B5736-90042) for additional information.

Serviceguard does not respond directly to power failures, although a loss of power to an individual cluster component may appear to Serviceguard like the failure of that component, and will result in the appropriate switching behavior. Power protection is provided by HP-supported uninterruptible power supplies (UPS), such as HP PowerTrust.

Responses to Package and Service Failures

In the default case, the failure of the package or of a service within a package causes the package to shut down by running the control script with the 'stop' parameter, and then to restart on an alternate node. If the package manager receives a report of an EMS monitor event showing that a configured resource dependency is not met, the package fails and tries to restart on the alternate node.
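As a rough illustration of the default sequence (halt with the control script, then restart elsewhere), the Python sketch below picks an alternate node and lists the resulting steps. It is a simplified model under stated assumptions, not Serviceguard's implementation: the function names and step labels are hypothetical, and it assumes the alternate is simply the first node in the package's configured order that is up and is not the node where the failure occurred.

```python
def choose_alternate_node(configured_nodes, current_node, nodes_up):
    """Return the first configured node that is up and is not the current node.

    configured_nodes is the ordered list from the package configuration
    (primary first, then alternates). Returns None if no node qualifies.
    """
    for node in configured_nodes:
        if node != current_node and node in nodes_up:
            return node
    return None


def plan_package_failover(configured_nodes, current_node, nodes_up):
    """List the steps of the default response to a package or service failure."""
    # The package is first halted by running its control script with 'stop'.
    steps = [("control_script", "stop", current_node)]
    target = choose_alternate_node(configured_nodes, current_node, nodes_up)
    if target is not None:
        # Then the package is started on the chosen alternate node.
        steps.append(("control_script", "start", target))
    return steps
```

With configured nodes node1, node2, and node3, a failure on node1 while node2 is down would plan a stop on node1 followed by a start on node3.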