Managing Serviceguard Sixteenth Edition, March 2009

Responses to Failures

Serviceguard responds to different kinds of failures in specific ways. For most hardware

failures, the response is not user-configurable, but for package and service failures,

you can choose the system’s response, within limits.

System Reset When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is an HP-UX TOC or

INIT, which is a system reset without a graceful shutdown (normally referred to in

this manual simply as a system reset). This allows packages to move quickly to another

node, protecting the integrity of the data.

A system reset occurs if a cluster node cannot communicate with the majority of cluster

members for the predetermined time, or under other circumstances such as a kernel

hang or failure of the cluster daemon (cmcld).

The loss-of-communication case is covered in more detail under “What Happens when a Node Times Out”

(page 118). See also “Cluster Daemon: cmcld” (page 55).

A system reset is also initiated by Serviceguard itself under specific circumstances; see

“Responses to Package and Service Failures” (page 120).

What Happens when a Node Times Out

Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth

of the value of the configured MEMBER_TIMEOUT or 1 second, whichever is less. You

configure MEMBER_TIMEOUT in the cluster configuration file (see “Cluster

Configuration Parameters” (page 138)); the heartbeat interval is not directly configurable.

If a node fails to send a heartbeat message within the time set by MEMBER_TIMEOUT,

the cluster re-forms without the node that is no longer sending heartbeat messages.
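
As a rough illustration of the rule above, the heartbeat interval can be derived from MEMBER_TIMEOUT as shown in the following sketch. This is not Serviceguard code: the function name is hypothetical, and both values are assumed to be expressed in microseconds, with integer division used for the one-fourth calculation.

```python
ONE_SECOND_US = 1_000_000  # 1 second expressed in microseconds


def heartbeat_interval_us(member_timeout_us: int) -> int:
    """Return the heartbeat interval implied by MEMBER_TIMEOUT.

    Hypothetical helper: the interval is one-fourth of MEMBER_TIMEOUT
    or 1 second, whichever is less. Units are assumed to be microseconds.
    """
    if member_timeout_us <= 0:
        raise ValueError("MEMBER_TIMEOUT must be a positive number of microseconds")
    # Integer division is an assumption made to keep the result a whole number.
    return min(member_timeout_us // 4, ONE_SECOND_US)
```

Under these assumptions, a MEMBER_TIMEOUT of 14,000,000 microseconds yields a heartbeat interval of 1 second, while a MEMBER_TIMEOUT of 2,000,000 microseconds yields an interval of 500,000 microseconds.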

When a node detects that another node has failed (that is, no heartbeat message has

arrived within MEMBER_TIMEOUT microseconds), the following sequence of events

occurs:

1. The node contacts the other nodes and tries to re-form the cluster without the

failed node.

2. If the remaining nodes are a majority or can obtain the cluster lock, they form a

new cluster without the failed node.

3. If the remaining nodes are not a majority or cannot get the cluster lock, they halt

(system reset).
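
The decision in steps 2 and 3 can be summarized in a short sketch. It is a simplified model of the rule as stated above, not the actual cmcld logic: the function and parameter names are hypothetical, and "majority" is assumed to mean strictly more than half of the previous cluster membership.

```python
def reformation_outcome(surviving_nodes: int,
                        previous_members: int,
                        lock_obtained: bool) -> str:
    """Model the outcome for a group of nodes that can still communicate.

    surviving_nodes  -- nodes in this group (hypothetical parameter)
    previous_members -- size of the cluster before the failure
    lock_obtained    -- whether this group obtained the cluster lock
    """
    if not 0 < surviving_nodes <= previous_members:
        raise ValueError("surviving_nodes must be between 1 and previous_members")
    # Assumption: a majority is strictly more than half of the previous members.
    has_majority = surviving_nodes * 2 > previous_members
    if has_majority or lock_obtained:
        return "form new cluster"   # step 2
    return "system reset"           # step 3
```

In a two-node cluster, for instance, a single surviving node is not a majority, so in this model it forms a new cluster only if it obtains the cluster lock; otherwise it is reset.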

Example

Situation. Assume a two-node cluster, with Package1 running on SystemA and

Package2 running on SystemB. Volume group vg01 is exclusively activated on

SystemA; volume group vg02 is exclusively activated on SystemB. Package IP

addresses are assigned to SystemA and SystemB respectively.

118 Understanding Serviceguard Software Components