Managing Serviceguard Sixteenth Edition, March 2009

Responses to Failures
Serviceguard responds to different kinds of failures in specific ways. For most hardware
failures, the response is not user-configurable, but for package and service failures,
you can choose the system’s response, within limits.
System Reset When a Node Fails
The most dramatic response to a failure in a Serviceguard cluster is an HP-UX TOC or
INIT, which is a system reset without a graceful shutdown (normally referred to in
this manual simply as a system reset). This allows packages to move quickly to another
node, protecting the integrity of the data.
A system reset occurs if a cluster node cannot communicate with the majority of cluster
members for the predetermined time, or under other circumstances such as a kernel
hang or failure of the cluster daemon (cmcld).
The timeout case is covered in more detail under “What Happens when a Node Times Out”
(page 118). See also “Cluster Daemon: cmcld” (page 55).
A system reset is also initiated by Serviceguard itself under specific circumstances; see
“Responses to Package and Service Failures” (page 120).
What Happens when a Node Times Out
Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth
of the value of the configured MEMBER_TIMEOUT or 1 second, whichever is less. You
configure MEMBER_TIMEOUT in the cluster configuration file (see “Cluster
Configuration Parameters” (page 138)); the heartbeat interval is not directly configurable.
If a node fails to send a heartbeat message within the time set by MEMBER_TIMEOUT,
the cluster is re-formed without the node that is no longer sending heartbeat messages.
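The heartbeat rule above can be illustrated with a short calculation. This is a minimal sketch, not Serviceguard code: the function name heartbeat_interval_us and the sample timeout values are hypothetical, and it assumes MEMBER_TIMEOUT is given in microseconds.

```python
# Illustrative only: models the rule "one-fourth of MEMBER_TIMEOUT
# or 1 second, whichever is less". Not part of Serviceguard.

ONE_SECOND_US = 1_000_000  # 1 second expressed in microseconds


def heartbeat_interval_us(member_timeout_us: int) -> int:
    """Return the heartbeat interval in microseconds for a given MEMBER_TIMEOUT."""
    if member_timeout_us <= 0:
        raise ValueError("MEMBER_TIMEOUT must be a positive number of microseconds")
    # Integer division keeps the result in whole microseconds.
    return min(member_timeout_us // 4, ONE_SECOND_US)


if __name__ == "__main__":
    # Sample values chosen for illustration, not recommended settings.
    for timeout in (2_000_000, 14_000_000):
        print(timeout, "->", heartbeat_interval_us(timeout))
```

With these sample values, a MEMBER_TIMEOUT of 2 seconds gives a heartbeat every 0.5 seconds, while any MEMBER_TIMEOUT of 4 seconds or more gives the 1-second ceiling.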
When a node detects that another node has failed (that is, no heartbeat message has
arrived within MEMBER_TIMEOUT microseconds), the following sequence of events
occurs:
1. The node contacts the other nodes and tries to re-form the cluster without the
failed node.
2. If the remaining nodes are a majority or can obtain the cluster lock, they form a
new cluster without the failed node.
3. If the remaining nodes are not a majority and cannot get the cluster lock, they halt
(system reset).
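The decision in steps 2 and 3 can be sketched as a small function. This is a simplified model under stated assumptions, not the cmcld algorithm: the name survivors_action and its parameters are hypothetical, "majority" is taken to mean more than half of the previous cluster membership, and the outcome of the cluster-lock attempt is passed in as a flag rather than modeled.

```python
# Illustrative only: a simplified model of the re-formation decision.
# The real cluster daemon's behavior is more involved.

def survivors_action(remaining: int, total: int, lock_obtained: bool) -> str:
    """Return "re-form" or "reset" for the group of nodes still in contact.

    remaining     -- nodes in this group that can still communicate
    total         -- nodes in the cluster before the failure
    lock_obtained -- whether this group acquired the cluster lock (assumed input)
    """
    if not 0 < remaining <= total:
        raise ValueError("remaining must be between 1 and total")
    has_majority = remaining * 2 > total  # strictly more than half
    if has_majority or lock_obtained:
        return "re-form"  # step 2: form a new cluster without the failed node
    return "reset"        # step 3: halt (system reset)


if __name__ == "__main__":
    print(survivors_action(3, 4, lock_obtained=False))  # majority survives
    print(survivors_action(1, 2, lock_obtained=True))   # two-node cluster, lock won
    print(survivors_action(1, 2, lock_obtained=False))  # two-node cluster, lock lost
```

In a two-node cluster, neither node alone is a majority, so in this model the outcome depends entirely on which node obtains the cluster lock.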
Example
Situation. Assume a two-node cluster, with Package1 running on SystemA and
Package2 running on SystemB. Volume group vg01 is exclusively activated on
SystemA; volume group vg02 is exclusively activated on SystemB. Package IP
addresses are assigned to SystemA and SystemB respectively.