Managing Serviceguard 11th Edition, Version A.11.16, Second Printing June 2004

Understanding Serviceguard Software Components
Responses to Failures
Chapter 3
Responses to Hardware Failures
If a serious system problem occurs, such as a system panic or physical
disruption of the SPU's circuits, Serviceguard recognizes a node failure
and transfers the packages currently running on that node to an
adoptive node elsewhere in the cluster. The new location for each
package is determined by that package's configuration file, which lists
primary and alternate nodes for the package. Transfer of a package to
another node does not transfer the program counter. Processes in a
transferred package will restart from the beginning. For an
application to restart quickly after a failure, it must be
“crash-tolerant”; that is, all processes in the package must be written so
that they can detect such a restart. This is the same application design
required for restart after a normal system crash.
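
As an illustration of the crash-tolerant design described above, the following is a minimal Python sketch, not Serviceguard code. It assumes the application records its progress in a checkpoint file on storage that moves with the package. The names load_checkpoint, save_checkpoint, run, and the next_index field are hypothetical. On startup the process checks for a checkpoint; finding one means it is restarting after a failure and should resume rather than begin again.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or None if no earlier run left one behind."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def save_checkpoint(path, state):
    """Write the state atomically so a crash cannot leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace swaps the new file in as a single step.
    os.replace(tmp_path, path)


def run(path, items, handle):
    """Process items in order, resuming after the last completed one.

    Returns True if this run detected that it was a restart.
    (Hypothetical helper; assumes path is on storage shared with the
    adoptive node.)
    """
    state = load_checkpoint(path)
    restarted = state is not None
    start = state["next_index"] if restarted else 0
    for i in range(start, len(items)):
        handle(items[i])
        save_checkpoint(path, {"next_index": i + 1})
    return restarted
```

Because the checkpoint is written only after an item has been handled, an item that was in progress at the time of the failure is handled again after the restart, so in this sketch the handler must tolerate being repeated.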
In the event of a LAN interface failure, Serviceguard performs a local
switch to a standby LAN interface if one exists. If a heartbeat LAN
interface fails and no standby or redundant heartbeat is configured, the
node fails with a TOC (Transfer of Control).
If a monitored data LAN interface fails without a standby, the node fails
with a TOC only if NODE_FAILFAST_ENABLED (described further in the
“Planning” chapter under “Package Configuration Planning”) is set to
YES for the package.
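
The LAN failure rules in the preceding paragraph can be summarized as a small decision function. This is a Python sketch of the stated rules only, not of Serviceguard's implementation. The function name, the argument names, and the Response values are hypothetical, and NO_TOC means only that none of the rules above calls for a TOC; the paragraph does not say what else may happen in that case.

```python
from enum import Enum


class Response(Enum):
    LOCAL_SWITCH = "local switch to the standby LAN interface"
    NODE_TOC = "node fails with a TOC"
    NO_TOC = "no TOC required by these rules"


def lan_failure_response(*, carries_heartbeat, monitored_by_package,
                         has_standby, has_redundant_heartbeat,
                         node_failfast_enabled):
    """Map a failed LAN interface to the response described in the text.

    All arguments are booleans describing the failed interface and its
    configuration (hypothetical names, chosen for this sketch).
    """
    # A standby interface, if one exists, takes over locally.
    if has_standby:
        return Response.LOCAL_SWITCH
    # A heartbeat LAN with no standby and no redundant heartbeat.
    if carries_heartbeat and not has_redundant_heartbeat:
        return Response.NODE_TOC
    # A monitored data LAN with no standby fails the node only when
    # NODE_FAILFAST_ENABLED is YES for the package.
    if monitored_by_package and node_failfast_enabled:
        return Response.NODE_TOC
    return Response.NO_TOC
```

For example, a monitored data LAN with no standby and NODE_FAILFAST_ENABLED set to NO yields Response.NO_TOC in this sketch, while the same failure with the parameter set to YES yields Response.NODE_TOC.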
Disk protection is provided by separate products, such as MirrorDisk/UX
in LVM or VERITAS mirroring in VxVM and CVM. In addition,
separately available EMS disk monitors allow you to notify operations
personnel when a specific failure, such as a lock disk failure, takes place.
Refer to the manual Using High Availability Monitors (HP part number
B5736-90042) for additional information.
Serviceguard does not respond directly to power failures, although a loss
of power to an individual cluster component may appear to Serviceguard
like the failure of that component, and will result in the appropriate
switching behavior. Power protection is provided by HP-supported
uninterruptible power supplies (UPS), such as HP PowerTrust.
Responses to Package and Service Failures
In the default case, the failure of a package, or of a service within a
package, causes the package to shut down by running the control script
with the 'stop' parameter and then to restart on an
alternate node. If the package manager receives a report of an EMS
monitor event showing that a configured resource dependency is not met,
the package fails and tries to restart on the alternate node.
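
The default sequence described above (halt the package, then restart it on an alternate node) can be sketched in Python as follows. This is an illustrative model, not Serviceguard code. It assumes the alternate node is the first node in the package's configured list that is still available and is not the node where the failure occurred; a cluster may be configured to choose differently. The function names, the node and package names in the usage note, and the wording of the returned steps are hypothetical.

```python
def choose_adoptive_node(configured_nodes, current_node, available_nodes):
    """Return the first configured node that can adopt the package.

    configured_nodes is the ordered list from the package configuration
    (primary first, then alternates). Returns None if no other configured
    node is available. Assumes a simple "next in list" choice.
    """
    for node in configured_nodes:
        if node != current_node and node in available_nodes:
            return node
    return None


def respond_to_package_failure(package, configured_nodes, current_node,
                               available_nodes):
    """Return (target_node, steps) for a package or service failure."""
    steps = [
        f"{current_node}: run control script for {package} with 'stop'",
    ]
    target = choose_adoptive_node(configured_nodes, current_node,
                                  available_nodes)
    if target is None:
        # Assumption for this sketch: with no alternate node available,
        # the package is left halted.
        steps.append(f"{package}: no alternate node available")
    else:
        steps.append(f"{target}: start {package}")
    return target, steps
```

With a hypothetical package pkg1 configured for nodes node1 and node2, a failure on node1 while node2 is available returns node2 as the target, preceded by the 'stop' step on node1.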