Managing Serviceguard 13th Edition, February 2007

Understanding Serviceguard Software Components

Responses to Failures

Serviceguard responds to different kinds of failures in specific ways. For most hardware failures, the response is not user-configurable, but for package and service failures, you can choose the system’s response, within limits.

Transfer of Control (TOC) When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is an HP-UX TOC (Transfer of Control), which is an immediate system halt without a graceful shutdown. A TOC allows packages to move quickly to another node, protecting the integrity of the data.

A TOC occurs if a cluster node cannot communicate with a majority of cluster members for a predetermined time, or under other circumstances such as a kernel hang or a failure of the cluster daemon (cmcld).

The node-timeout case is covered in more detail under “What Happens when a Node Times Out” on page 129. See also “Cluster Daemon: cmcld” on page 60.

A TOC is also initiated by Serviceguard itself under specific circumstances; see “Responses to Package and Service Failures” on page 132.
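
The first TOC condition described above (loss of contact with a majority of cluster members for a predetermined time) can be illustrated with a minimal Python sketch. This is not Serviceguard code. The function names, the `grace_us` parameter, and the choice to count the node itself among the reachable members are all assumptions made for illustration, and the sketch treats "majority" as strictly more than half, ignoring any tie-breaking mechanism.

```python
def has_majority(reachable_members: int, cluster_size: int) -> bool:
    """Return True if the reachable members form a strict majority.

    Assumption: reachable_members counts this node itself plus every
    member it can currently communicate with.
    """
    if cluster_size <= 0 or not 0 <= reachable_members <= cluster_size:
        raise ValueError("invalid membership counts")
    return 2 * reachable_members > cluster_size


def should_toc(reachable_members: int, cluster_size: int,
               isolated_for_us: int, grace_us: int) -> bool:
    """Hypothetical decision: halt only after the node has been without
    a majority for at least grace_us microseconds."""
    lost_majority = not has_majority(reachable_members, cluster_size)
    return lost_majority and isolated_for_us >= grace_us
```

For example, in a three-node cluster a node that can reach only itself has lost the majority, while a node that can still reach one other member has not.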

What Happens when a Node Times Out

Each node sends a heartbeat message to the cluster coordinator every HEARTBEAT_INTERVAL microseconds (as specified in the cluster configuration file). The cluster coordinator looks for this message from each node, and if it does not receive it within NODE_TIMEOUT microseconds, the cluster is re-formed without the node that is no longer sending heartbeat messages. (See the HEARTBEAT_INTERVAL and NODE_TIMEOUT entries under “Cluster Configuration Parameters” on page 157 for advice about configuring these parameters.)
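
The coordinator's side of this exchange can be modeled with a short Python sketch. It is a toy illustration of the bookkeeping described above, not the cmcld implementation. The `Coordinator` class, its method names, and the numeric values are assumptions chosen for the example. Whether a heartbeat arriving exactly at the NODE_TIMEOUT boundary counts as late is also an assumption here.

```python
from dataclasses import dataclass, field

# Illustrative values in microseconds; assumptions, not recommendations.
HEARTBEAT_INTERVAL = 1_000_000
NODE_TIMEOUT = 2_000_000


@dataclass
class Coordinator:
    """Toy model of the cluster coordinator's heartbeat bookkeeping."""
    node_timeout: int
    # Maps node name -> time of the last heartbeat received (microseconds).
    last_heartbeat: dict = field(default_factory=dict)

    def record_heartbeat(self, node: str, now_us: int) -> None:
        """Note that a heartbeat from `node` arrived at time now_us."""
        self.last_heartbeat[node] = now_us

    def timed_out_nodes(self, now_us: int) -> list:
        """Nodes whose last heartbeat is older than node_timeout."""
        return sorted(
            node for node, seen in self.last_heartbeat.items()
            if now_us - seen > self.node_timeout
        )

    def reform(self, now_us: int) -> list:
        """Drop timed-out nodes and return the remaining membership."""
        for node in self.timed_out_nodes(now_us):
            del self.last_heartbeat[node]
        return sorted(self.last_heartbeat)
```

If node1 keeps sending heartbeats every HEARTBEAT_INTERVAL while node2 falls silent, a call to `reform` made more than NODE_TIMEOUT after node2's last message returns a membership that contains only node1.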

On a node that is not the cluster coordinator, and on which a node timeout occurs (that is, no heartbeat message has arrived within NODE_TIMEOUT microseconds), the following sequence of events occurs:

1. The node tries to re-form the cluster.