Managing Serviceguard Nineteenth Edition, Reprinted June 2011

Table 5 Pros and Cons of Volume Managers with Serviceguard (continued)
Product: (continued from the previous page)
Advantages:
  Supports shared activation.
  Supports exclusive activation.
  Supports activation in different modes on different nodes at the same time
  RAID 1+0 mirrored stripes
  RAID 0+1 striped mirrors
  CVM versions 4.1 and later support the Veritas Cluster File System (CFS)
Tradeoffs:
  Requires purchase of additional license
  No support for RAID 5
  CVM requires all nodes to have connectivity to the shared disk groups
  Not currently supported on all versions of HP-UX
Responses to Failures
Serviceguard responds to different kinds of failures in specific ways. For most hardware failures,
the response is not user-configurable, but for package and service failures, you can choose the
system’s response, within limits.
System Reset When a Node Fails
The most dramatic response to a failure in a Serviceguard cluster is an HP-UX TOC or INIT, which
is a system reset without a graceful shutdown (normally referred to in this manual simply as a
system reset). This allows packages to move quickly to another node, protecting the integrity of
the data.
A system reset occurs if a cluster node cannot communicate with the majority of cluster members
for the predetermined time, or under other circumstances such as a kernel hang or failure of the
cluster daemon (cmcld).
The timeout case is covered in more detail under “What Happens when a Node Times Out” (page 85).
See also “Cluster Daemon: cmcld” (page 39).
A system reset is also initiated by Serviceguard itself under specific circumstances; see “Responses
to Package and Service Failures” (page 87).
What Happens when a Node Times Out
Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth of the
value of the configured MEMBER_TIMEOUT or 1 second, whichever is less. You configure
MEMBER_TIMEOUT in the cluster configuration file (see “Cluster Configuration Parameters”
(page 105)); the heartbeat interval is not directly configurable. If a node fails to send a heartbeat
message within the time set by MEMBER_TIMEOUT, the cluster is re-formed without the node that
is no longer sending heartbeat messages.
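The heartbeat rule can be written out as a short calculation. The following is a minimal sketch in Python, not Serviceguard code: the function name heartbeat_interval_us is hypothetical, MEMBER_TIMEOUT is assumed to be given in microseconds, and the sample values are illustrative only.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def heartbeat_interval_us(member_timeout_us: int) -> int:
    """Return the heartbeat interval in microseconds.

    Hypothetical helper: applies the rule "one-fourth of MEMBER_TIMEOUT
    or 1 second, whichever is less". Integer division is an assumption
    made here to keep the result a whole number of microseconds.
    """
    if member_timeout_us <= 0:
        raise ValueError("member_timeout_us must be positive")
    one_fourth = member_timeout_us // 4
    return min(one_fourth, MICROSECONDS_PER_SECOND)


if __name__ == "__main__":
    # Illustrative MEMBER_TIMEOUT values (assumptions, not recommendations).
    for timeout_us in (2_000_000, 14_000_000):
        print(timeout_us, "->", heartbeat_interval_us(timeout_us))
```

With these sample values, any MEMBER_TIMEOUT of 4 seconds or more yields the 1-second cap, while shorter timeouts yield one-fourth of the timeout.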
When a node detects that another node has failed (that is, no heartbeat message has arrived
within MEMBER_TIMEOUT microseconds), the following sequence of events occurs:
1. The node contacts the other nodes and tries to re-form the cluster without the failed node.
2. If the remaining nodes are a majority or can obtain the cluster lock, they form a new cluster
without the failed node.
3. If the remaining nodes are not a majority and cannot get the cluster lock, they halt (system reset).
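The decision in steps 2 and 3 can be sketched as a small function. This is a simplified illustration in Python that takes the steps above at face value; it does not model how Serviceguard actually arbitrates the cluster lock, and the names Outcome and reformation_outcome are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    FORM_NEW_CLUSTER = "form a new cluster without the failed node"
    SYSTEM_RESET = "halt (system reset)"


def reformation_outcome(
    remaining_nodes: int,
    previous_cluster_size: int,
    can_get_cluster_lock: bool,
) -> Outcome:
    """Decide what a group of remaining nodes does after a node fails.

    Hypothetical model: a strict majority means more than half of the
    nodes that were members before the failure.
    """
    if not 0 < remaining_nodes <= previous_cluster_size:
        raise ValueError("remaining_nodes must be between 1 and the cluster size")
    has_majority = 2 * remaining_nodes > previous_cluster_size
    if has_majority or can_get_cluster_lock:
        return Outcome.FORM_NEW_CLUSTER
    return Outcome.SYSTEM_RESET


if __name__ == "__main__":
    # Two-node cluster, one node remaining: the cluster lock decides.
    print(reformation_outcome(1, 2, can_get_cluster_lock=True).value)
    print(reformation_outcome(1, 2, can_get_cluster_lock=False).value)
```

In a two-node cluster a single remaining node is never a majority, so under this model the outcome depends entirely on whether that node obtains the cluster lock.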
Example
Situation. Assume a two-node cluster, with Package1 running on SystemA and Package2 running
on SystemB. Volume group vg01 is exclusively activated on SystemA; volume group vg02 is
exclusively activated on SystemB. Package IP addresses are assigned to SystemA and SystemB
respectively.