Managing Serviceguard Fifteenth Edition, reprinted May 2008

Understanding Serviceguard Software Components

Responses to Failures

Chapter 3, page 126


Serviceguard responds to different kinds of failures in specific ways. For
most hardware failures, the response is not user-configurable, but for
package and service failures, you can choose the system’s response,
within limits.

System Reset When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is an
HP-UX TOC or INIT, which is a system reset without a graceful
shutdown (normally referred to in this manual simply as a system reset).
This allows packages to move quickly to another node, protecting the
integrity of the data.

A system reset occurs if a cluster node cannot communicate with the
majority of cluster members for the predetermined time, or under other
circumstances such as a kernel hang or failure of the cluster daemon
(cmcld).

The node-timeout case is covered in more detail under “What Happens when
a Node Times Out” on page 126. See also “Cluster Daemon: cmcld” on page 60.

A system reset is also initiated by Serviceguard itself under specific
circumstances; see “Responses to Package and Service Failures” on
page 129.

What Happens when a Node Times Out

Each node sends a heartbeat message to the cluster coordinator every
HEARTBEAT_INTERVAL microseconds (as specified in the cluster
configuration file). The cluster coordinator looks for this message
from each node, and if it does not receive it within NODE_TIMEOUT
microseconds, the cluster is re-formed without the node that is no longer
sending heartbeat messages. (See the HEARTBEAT_INTERVAL and NODE_TIMEOUT
entries under “Cluster Configuration Parameters” on page 156 for advice
about configuring these parameters.)
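
The heartbeat check described above can be sketched as a small model. This is an illustration only, not Serviceguard's implementation: the Coordinator class, its method names, the node names, and the two timing values are hypothetical, and the sketch assumes a node counts as timed out once more than NODE_TIMEOUT microseconds have passed since its last heartbeat.

```python
from dataclasses import dataclass, field

# Illustrative values in microseconds (assumptions, not recommendations).
HEARTBEAT_INTERVAL = 1_000_000
NODE_TIMEOUT = 2_000_000


@dataclass
class Coordinator:
    """Toy model of the coordinator's heartbeat bookkeeping (hypothetical)."""

    node_timeout: int
    # Maps node name -> time of the last heartbeat received, in microseconds.
    last_heartbeat: dict = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: int) -> None:
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_heartbeat[node] = now

    def timed_out_nodes(self, now: int) -> list:
        """Return nodes whose last heartbeat is older than node_timeout."""
        return sorted(
            node
            for node, seen in self.last_heartbeat.items()
            if now - seen > self.node_timeout
        )

    def reform(self, now: int) -> list:
        """Drop timed-out nodes and return the remaining membership."""
        for node in self.timed_out_nodes(now):
            del self.last_heartbeat[node]
        return sorted(self.last_heartbeat)


# node_a keeps sending heartbeats; node_b sends one and then goes silent.
coord = Coordinator(node_timeout=NODE_TIMEOUT)
for t in range(0, 4 * HEARTBEAT_INTERVAL, HEARTBEAT_INTERVAL):
    coord.record_heartbeat("node_a", t)
coord.record_heartbeat("node_b", 0)

now = 3 * HEARTBEAT_INTERVAL
print(coord.timed_out_nodes(now))  # ['node_b']
print(coord.reform(now))           # ['node_a']
```

In this sketch node_b is dropped because 3,000,000 microseconds have elapsed since its only heartbeat, which exceeds the assumed NODE_TIMEOUT of 2,000,000, while node_a remains a member.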

On a node that is not the cluster coordinator, and on which a node
timeout occurs (that is, no heartbeat message has arrived within
NODE_TIMEOUT microseconds), the following sequence of events occurs:

1. The node tries to re-form the cluster.