
A reboot is initiated if a cluster node cannot communicate with the majority of cluster members for
the predetermined time, or under other circumstances such as a kernel hang or a failure of the cluster
daemon (cmcld). When this happens, you may see the following message on the console:
DEADMAN: Time expired, initiating system restart.
This case is covered in more detail under “What Happens when a Node Times Out” (below). See also
“Cluster Daemon: cmcld” (page 28).
A reboot is also initiated by Serviceguard itself under specific circumstances; see “Responses to
Package and Service Failures” (page 68).
What Happens when a Node Times Out
Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth of the
value of the configured MEMBER_TIMEOUT or 1 second, whichever is less. You configure
MEMBER_TIMEOUT in the cluster configuration file; see “Cluster Configuration Parameters”
(page 80). The heartbeat interval is not directly configurable. If a node fails to send a heartbeat
message within the time set by MEMBER_TIMEOUT, the cluster is re-formed without the node that
is no longer sending heartbeat messages.
When a node detects that another node has failed (that is, no heartbeat message has arrived
within MEMBER_TIMEOUT microseconds), the following sequence of events occurs:
1. The node contacts the other nodes and tries to re-form the cluster without the failed node.
2. If the remaining nodes are a majority or can obtain the cluster lock, they form a new cluster
without the failed node.
3. If the remaining nodes are not a majority or cannot get the cluster lock, they halt (system reset).
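The decision in steps 2 and 3 can be sketched as follows. This is an illustrative sketch, not Serviceguard code; it assumes the cluster lock is used as a tie-breaker only when exactly half of the configured nodes remain, which is the situation in the two-node example below.

def can_reform(surviving_nodes, configured_nodes, obtains_cluster_lock):
    """Return True if the surviving group of nodes may re-form the cluster."""
    if 2 * surviving_nodes > configured_nodes:
        # Strict majority: the group re-forms without needing the cluster lock.
        return True
    if 2 * surviving_nodes == configured_nodes:
        # Exactly 50 percent of the vote: the cluster lock breaks the tie.
        return obtains_cluster_lock
    # Minority: the node cannot re-form and halts (system reset).
    return False

# Two-node cluster split by a network failure: each node has 50 percent of the
# vote, so only the node that obtains the cluster lock re-forms the cluster.
print(can_reform(1, 2, obtains_cluster_lock=True))    # True  -> re-forms
print(can_reform(1, 2, obtains_cluster_lock=False))   # False -> halts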
Example
Situation. Assume a two-node cluster, with Package1 running on SystemA and Package2 running
on SystemB. Volume group vg01 is exclusively activated on SystemA; volume group vg02 is
exclusively activated on SystemB. Package IP addresses are assigned to SystemA and SystemB
respectively.
Failure. Only one LAN has been configured for both heartbeat and data traffic. During the course
of operations, heavy application traffic monopolizes the bandwidth of the network, preventing
heartbeat packets from getting through.
Since SystemA does not receive heartbeat messages from SystemB, SystemA attempts to re-form
as a one-node cluster. Likewise, since SystemB does not receive heartbeat messages from
SystemA, SystemB also attempts to re-form as a one-node cluster. During the election protocol,
each node votes for itself, giving both nodes 50 percent of the vote. Because both nodes have 50
percent of the vote, both nodes now vie for the cluster lock. Only one node will get the lock.
Outcome. Assume SystemA gets the cluster lock. SystemA re-forms as a one-node cluster. After
re-formation, SystemA will make sure that all applications configured to run on an existing cluster
node are running. When SystemA discovers that Package2 is not running in the cluster, it will try
to start Package2 if Package2 is configured to run on SystemA.
SystemB recognizes that it has failed to get the cluster lock and so cannot re-form the cluster. To
release all resources related to Package2 (such as exclusive access to volume group vg02 and
the Package2 IP address) as quickly as possible, SystemB halts (system reset).
NOTE: If AUTOSTART_CMCLD in /etc/rc.config.d/cmcluster ($SGAUTOSTART) is set
to zero, the node will not attempt to join the cluster when it comes back up.
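For example, the relevant line in that file might read as follows (the value shown is illustrative; 1 allows the node to rejoin the cluster automatically at startup, 0 prevents it):

AUTOSTART_CMCLD=1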
For more information on cluster failover, see the white paper Optimizing Failover Time in a
Serviceguard Environment (version A.11.19 or later) at http://www.hp.com/go/linux-serviceguard-docs
-> White Papers. For troubleshooting information, see “Cluster
Re-formations Caused by MEMBER_TIMEOUT Being Set too Low” (page 234).