Managing Serviceguard Fifteenth Edition, reprinted May 2008

Understanding Serviceguard Software Components
Responses to Failures
Serviceguard responds to different kinds of failures in specific ways. For
most hardware failures, the response is not user-configurable, but for
package and service failures, you can choose the system’s response,
within limits.
System Reset When a Node Fails
The most dramatic response to a failure in a Serviceguard cluster is an
HP-UX TOC or INIT, which is a system reset without a graceful
shutdown (normally referred to in this manual simply as a system reset).
This allows packages to move quickly to another node, protecting the
integrity of the data.
A system reset occurs if a cluster node cannot communicate with the
majority of cluster members for the predetermined time, or under other
circumstances such as a kernel hang or failure of the cluster daemon
(cmcld).
The node timeout case is covered in more detail under “What Happens
when a Node Times Out” on page 126. See also “Cluster Daemon: cmcld”
on page 60.
A system reset is also initiated by Serviceguard itself under specific
circumstances; see “Responses to Package and Service Failures” on
page 129.
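
The majority condition described above can be illustrated with a short
Python sketch. It is not Serviceguard code: the function name
has_majority and its arguments are hypothetical, and the sketch assumes
that a node counts itself among the members it can reach. It models only
the strict-majority test and does not cover how an exact half-and-half
split is arbitrated.

```python
def has_majority(reachable_nodes, cluster_size):
    """Return True if a node can reach a strict majority of the cluster.

    reachable_nodes: number of members the node can communicate with,
                     counting itself (an assumption of this sketch).
    cluster_size:    total number of configured cluster members.
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    # Strict majority: more than half of the configured members.
    return 2 * reachable_nodes > cluster_size


# Hypothetical 4-node cluster: a node that still sees 3 members keeps
# running; a node that sees only itself is a candidate for a system reset.
print(has_majority(3, 4))
print(has_majority(1, 4))
```

In this sketch a node that reaches exactly half of the members does not
pass the test; that situation needs separate handling, which is outside
the scope of the example.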
What Happens when a Node Times Out
Each node sends a heartbeat message to the cluster coordinator every
HEARTBEAT_INTERVAL microseconds (as specified in the cluster
configuration file). The cluster coordinator looks for this message
from each node, and if it does not receive one within NODE_TIMEOUT
microseconds, the cluster is reformed without the node that is no
longer sending heartbeat messages. (See the HEARTBEAT_INTERVAL and NODE_TIMEOUT
entries under “Cluster Configuration Parameters” on page 156 for advice
about configuring these parameters.)
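
The coordinator's timeout check can be sketched in Python as follows.
This is a simplified illustration under stated assumptions, not the
cmcld implementation: the function names, the node names, and the
timestamp bookkeeping are hypothetical, and the numeric values are
chosen only to make the arithmetic easy to follow. All times are in
microseconds, matching the units of NODE_TIMEOUT.

```python
def timed_out_nodes(last_heartbeat_us, now_us, node_timeout_us):
    """Return the sorted names of nodes whose last heartbeat is too old.

    last_heartbeat_us: dict mapping node name to the time (in
                       microseconds) its last heartbeat was received.
    """
    return sorted(
        node
        for node, seen_us in last_heartbeat_us.items()
        if now_us - seen_us > node_timeout_us
    )


def reform_membership(last_heartbeat_us, now_us, node_timeout_us):
    """Return the membership left after dropping timed-out nodes."""
    expired = set(timed_out_nodes(last_heartbeat_us, now_us, node_timeout_us))
    return sorted(node for node in last_heartbeat_us if node not in expired)


# Illustrative values only (microseconds); not recommended settings.
NODE_TIMEOUT_US = 2_000_000
last_seen = {"node1": 9_500_000, "node2": 7_000_000, "node3": 9_900_000}

# node2 was last heard from 3,000,000 microseconds ago, which exceeds
# the timeout, so it is left out of the reformed membership.
print(reform_membership(last_seen, now_us=10_000_000,
                        node_timeout_us=NODE_TIMEOUT_US))
```

Whether a heartbeat that arrives exactly at the timeout boundary counts
as late is an assumption of this sketch (it does not).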
When a node timeout occurs on a node that is not the cluster
coordinator (that is, no heartbeat message has arrived within
NODE_TIMEOUT microseconds), the following sequence of events occurs:
1. The node tries to reform the cluster.