Managing Serviceguard Nineteenth Edition, Reprinted June 2011


Table 5 Pros and Cons of Volume Managers with Serviceguard (continued)

Product: (continued from the previous page)

Advantages:

• Supports shared activation.
• Supports exclusive activation.
• Supports activation in different modes on different nodes at the same time
• RAID 1+0 mirrored stripes
• RAID 0+1 striped mirrors
• CVM versions 4.1 and later support the Veritas Cluster File System (CFS)

Tradeoffs:

• Requires purchase of additional license
• No support for RAID 5
• CVM requires all nodes to have connectivity to the shared disk groups
• Not currently supported on all versions of HP-UX

Responses to Failures

Serviceguard responds to different kinds of failures in specific ways. For most hardware failures, the response is not user-configurable, but for package and service failures, you can choose the system’s response, within limits.

System Reset When a Node Fails

The most dramatic response to a failure in a Serviceguard cluster is an HP-UX TOC or INIT, which is a system reset without a graceful shutdown (normally referred to in this manual simply as a system reset). This allows packages to move quickly to another node, protecting the integrity of the data.

A system reset occurs if a cluster node cannot communicate with the majority of cluster members for the predetermined time, or under other circumstances such as a kernel hang or failure of the cluster daemon (cmcld).

The timeout case is covered in more detail under “What Happens when a Node Times Out” (page 85). See also “Cluster Daemon: cmcld” (page 39).

A system reset is also initiated by Serviceguard itself under specific circumstances; see “Responses to Package and Service Failures” (page 87).

What Happens when a Node Times Out

Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth of the value of the configured MEMBER_TIMEOUT or 1 second, whichever is less. You configure MEMBER_TIMEOUT in the cluster configuration file (see “Cluster Configuration Parameters” (page 105)); the heartbeat interval is not directly configurable. If a node fails to send a heartbeat message within the time set by MEMBER_TIMEOUT, the cluster is re-formed without the node that is no longer sending heartbeat messages.
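The interval rule above can be illustrated with a short calculation. This is only a sketch in Python, not part of Serviceguard: the function name heartbeat_interval_us is hypothetical, and it assumes MEMBER_TIMEOUT is expressed in microseconds, as the text below indicates.

```python
ONE_SECOND_US = 1_000_000  # one second, in microseconds


def heartbeat_interval_us(member_timeout_us: int) -> int:
    """Return the heartbeat interval implied by MEMBER_TIMEOUT.

    Hypothetical helper: applies the rule "one-fourth of MEMBER_TIMEOUT
    or 1 second, whichever is less". Input and result are in microseconds.
    """
    if member_timeout_us <= 0:
        raise ValueError("MEMBER_TIMEOUT must be a positive number of microseconds")
    # Integer division keeps the result in whole microseconds.
    return min(member_timeout_us // 4, ONE_SECOND_US)


if __name__ == "__main__":
    # Illustrative values only; they are not recommendations.
    for timeout in (2_000_000, 4_000_000, 14_000_000):
        print(timeout, "->", heartbeat_interval_us(timeout))
```

With a MEMBER_TIMEOUT of 2,000,000 microseconds the sketch yields 500,000 microseconds (one-fourth of the timeout); at 4,000,000 microseconds and above, the 1-second cap applies.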

When a node detects that another node has failed (that is, no heartbeat message has arrived within MEMBER_TIMEOUT microseconds), the following sequence of events occurs:

1. The node contacts the other nodes and tries to re-form the cluster without the failed node.

2. If the remaining nodes are a majority or can obtain the cluster lock, they form a new cluster without the failed node.

3. If the remaining nodes are not a majority or cannot get the cluster lock, they halt (system reset).
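The decision in steps 2 and 3 can be sketched as a small function. This is an illustration of the rule as stated here, not Serviceguard code: the names reformation_outcome, surviving_nodes, previous_nodes and got_cluster_lock are hypothetical, and "majority" is assumed to mean more than half of the nodes in the cluster before the failure.

```python
def reformation_outcome(surviving_nodes: int, previous_nodes: int,
                        got_cluster_lock: bool) -> str:
    """Return the outcome for the surviving group of nodes.

    Hypothetical model of steps 2 and 3: the group forms a new cluster
    if it is a majority or holds the cluster lock; otherwise it halts.
    """
    if not 0 < surviving_nodes <= previous_nodes:
        raise ValueError("surviving_nodes must be between 1 and previous_nodes")
    # Assumption: a majority is strictly more than half of the old membership.
    is_majority = surviving_nodes * 2 > previous_nodes
    if is_majority or got_cluster_lock:
        return "form new cluster"
    return "system reset"


if __name__ == "__main__":
    # Two-node cluster, one node lost: the survivor needs the cluster lock.
    print(reformation_outcome(1, 2, got_cluster_lock=True))
    print(reformation_outcome(1, 2, got_cluster_lock=False))
    # Three-node cluster, one node lost: two of three is already a majority.
    print(reformation_outcome(2, 3, got_cluster_lock=False))
```

In a two-node cluster neither node alone is a majority, so in this model the cluster lock decides which node continues.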

Example

Situation. Assume a two-node cluster, with Package1 running on SystemA and Package2 running on SystemB. Volume group vg01 is exclusively activated on SystemA; volume group vg02 is exclusively activated on SystemB. Package IP addresses are assigned to SystemA and SystemB respectively.
