Managing Serviceguard 13th Edition, February 2007
Understanding Serviceguard Software Components
Responses to Failures
Serviceguard responds to different kinds of failures in specific ways. For
most hardware failures, the response is not user-configurable, but for
package and service failures, you can choose the system’s response,
within limits.
Transfer of Control (TOC) When a Node Fails
The most dramatic response to a failure in a Serviceguard cluster is an
HP-UX TOC (Transfer of Control), which is an immediate system halt
without a graceful shutdown. A TOC allows packages to move quickly to
another node, protecting the integrity of the data.
A TOC occurs if a cluster node cannot communicate with the majority of
cluster members for a predetermined time, or under other
circumstances such as a kernel hang or failure of the cluster daemon
(cmcld).
The node-timeout case is covered in more detail under “What Happens
when a Node Times Out” on page 129. See also “Cluster Daemon: cmcld”
on page 60.
A TOC is also initiated by Serviceguard itself under specific
circumstances; see “Responses to Package and Service Failures” on
page 132.
What Happens when a Node Times Out
Each node sends a heartbeat message to the cluster coordinator every
HEARTBEAT_INTERVAL microseconds (as specified in the cluster
configuration file). The cluster coordinator looks for this message
from each node, and if it does not receive it within NODE_TIMEOUT
microseconds, the cluster is reformed without the node that is no longer
sending heartbeat messages. (See the HEARTBEAT_INTERVAL and NODE_TIMEOUT
entries under “Cluster Configuration Parameters” on page 157 for advice
about configuring these parameters.)
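The timeout check described above can be illustrated with a small model.
This is only a sketch of the idea, not Serviceguard code: the class
HeartbeatMonitor, its methods, the node names, and the 2-second timeout
value are all hypothetical and chosen for illustration. As in the cluster
configuration file, all times are in microseconds.

```python
from dataclasses import dataclass, field

US_PER_SECOND = 1_000_000  # both parameters are expressed in microseconds


@dataclass
class HeartbeatMonitor:
    """Toy model of a coordinator's timeout check (hypothetical, not Serviceguard code)."""

    node_timeout_us: int
    last_seen_us: dict = field(default_factory=dict)

    def record_heartbeat(self, node: str, now_us: int) -> None:
        # Remember when the most recent heartbeat from this node arrived.
        self.last_seen_us[node] = now_us

    def timed_out_nodes(self, now_us: int) -> list:
        # A node has timed out if no heartbeat arrived within node_timeout_us.
        return sorted(
            node
            for node, seen in self.last_seen_us.items()
            if now_us - seen > self.node_timeout_us
        )


# Illustrative values only; they are not recommended settings.
monitor = HeartbeatMonitor(node_timeout_us=2 * US_PER_SECOND)
monitor.record_heartbeat("node1", 0)
monitor.record_heartbeat("node2", 0)
monitor.record_heartbeat("node1", 1 * US_PER_SECOND)
monitor.record_heartbeat("node1", 2 * US_PER_SECOND)

# node2 has been silent for 2.5 seconds, which exceeds the 2-second timeout.
print(monitor.timed_out_nodes(now_us=2_500_000))  # prints ['node2']
```

In this model, the nodes returned by timed_out_nodes are the ones that
would be left out when the cluster is reformed. The real cluster daemon
handles many details that this sketch ignores.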
On a node that is not the cluster coordinator and on which a node
timeout occurs (that is, no heartbeat message has arrived within
NODE_TIMEOUT microseconds), the following sequence of events occurs:
1. The node tries to reform the cluster.