Managing HP Serviceguard A.12.00.00 for Linux, June 2014

shutdown, using the sg_persist command. This command is available, and has a manpage,
on Red Hat 5, Red Hat 6, and SUSE 11.
Serviceguard makes a PR of type Write Exclusive Registrants Only (WERO) on the package's LUN
devices. This gives read access to any initiator regardless of whether the initiator is registered or
not, but grants write access only to those initiators who are registered. (WERO is defined in the
SPC-3 standard.)
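You can inspect the PR state of a package LUN with the read-only queries of sg_persist. The following is a minimal sketch; /dev/sdc is a placeholder device path, and the "run" wrapper only prints each command (replace the echo with "$@" to execute them for real):

```shell
# Read-only sg_persist queries against a package LUN (sg3_utils).
# /dev/sdc is a placeholder; substitute a real LUN device path.
# "run" only prints the commands here; replace echo with "$@" to execute.
run() { echo "+ $*"; }

DEV=/dev/sdc   # hypothetical package LUN

run sg_persist --in --read-keys --device="$DEV"        # list registered PR keys
run sg_persist --in --read-reservation --device="$DEV" # show reservation holder and type
# With a WERO reservation in place, --read-reservation reports the
# type "Write Exclusive, registrants only" (SPC-3 PR type 5).
```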
All initiators on each node running the package register with LUN devices using the same PR Key,
known as the node_pr_key. Each node in the cluster has a unique node_pr_key, which you
can see in the output of cmviewcl -f line; for example:
...
node:bla2|node_pr_key=10001
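If you need the per-node keys in a script, the line-oriented cmviewcl output is easy to parse. The sketch below uses a variable standing in for real cmviewcl -f line output, in the format shown above; the bla1 entry and its key are invented for illustration, while bla2's key comes from the example output:

```shell
# Extract node name and node_pr_key pairs from "cmviewcl -f line" output.
# The variable stands in for live command output; bla1/10000 is a
# hypothetical second node added for illustration.
cmviewcl_output='node:bla1|node_pr_key=10000
node:bla2|node_pr_key=10001'

keys=$(printf '%s\n' "$cmviewcl_output" | awk -F'|' '/node_pr_key=/ {
    split($1, n, ":")   # n[2] = node name
    split($2, k, "=")   # k[2] = node_pr_key
    print n[2], k[2]
}')
printf '%s\n' "$keys"
```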
When a failover package starts up, any existing PR keys and reservations are cleared from the
underlying LUN devices first; then the node_pr_key of the node that the package is starting on
is registered with each LUN.
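The clear-then-register sequence can be illustrated in terms of the underlying SPC-3 service actions. This is a sketch of the operations involved, not Serviceguard's internal implementation; the device path and key are placeholders, and the "run" wrapper only prints each command:

```shell
# Illustrative SPC-3 sequence for a failover package start: clear stale
# PR state, then register this node's key and take a WERO reservation.
# NOT Serviceguard's internal code; device and key are placeholders.
# "run" only prints each command; replace echo with "$@" to execute.
run() { echo "+ $*"; }

DEV=/dev/sdc        # hypothetical package LUN
NEW_KEY=0x10001     # this node's node_pr_key

# 1. Register our key (a registration is required before clear/reserve).
run sg_persist --out --register-ignore --param-sark="$NEW_KEY" --device="$DEV"
# 2. Clear all existing registrations and any reservation on the LUN.
run sg_persist --out --clear --param-rk="$NEW_KEY" --device="$DEV"
# 3. Re-register, then take the WERO reservation (SPC-3 PR type 5).
run sg_persist --out --register --param-sark="$NEW_KEY" --device="$DEV"
run sg_persist --out --reserve --param-rk="$NEW_KEY" --prout-type=5 --device="$DEV"
```

Note that the CLEAR service action removes every registration on the LUN, including the one made in step 1, which is why the key is registered again before the reservation is taken.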
In the case of a multi-node package, the PR reservation is made for the underlying LUNs by the
first instance of the package, and the appropriate node_pr_key is registered each time the
package starts on a new node. If a node fails, the instances of the package running on other nodes
will remove the registrations of the failed node.
You can use cmgetpkgenv (1m) to see whether PR is enabled for a given package; for example:
cmgetpkgenv pkg1
...
PKG_PR_MODE="pr_enabled"
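A script can test this flag by parsing the cmgetpkgenv output. In the sketch below, a variable stands in for the output of cmgetpkgenv pkg1 as shown above:

```shell
# Check whether PR is enabled for a package from cmgetpkgenv output.
# The variable stands in for live "cmgetpkgenv pkg1" output.
pkg_env='PKG_PR_MODE="pr_enabled"'

mode=$(printf '%s\n' "$pkg_env" | sed -n 's/^PKG_PR_MODE="\(.*\)"$/\1/p')
if [ "$mode" = "pr_enabled" ]; then
    echo "pkg1: persistent reservations enabled"
else
    echo "pkg1: persistent reservations not enabled (mode: ${mode:-unset})"
fi
```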
3.8 Responses to Failures
Serviceguard responds to different kinds of failures in specific ways. For most hardware failures,
the response is not user-configurable, but for package and service failures, you can choose the
system’s response, within limits.
3.8.1 Reboot When a Node Fails
The most dramatic response to a failure in a Serviceguard cluster is a system reboot. This allows
packages to move quickly to another node, protecting the integrity of the data.
A reboot is done if a cluster node cannot communicate with the majority of cluster members for
a predetermined time, or under other circumstances such as a kernel hang or failure of the cluster
daemon (cmcld). When this happens, you may see the following message on the console:
DEADMAN: Time expired, initiating system restart.
This case is covered in more detail under “What Happens when a Node Times Out” (page 73).
See also “Cluster Daemon: cmcld” (page 32).
A reboot is also initiated by Serviceguard itself under specific circumstances; see “Responses to
Package and Service Failures ” (page 75).
3.8.1.1 What Happens when a Node Times Out
Each node sends a heartbeat message to all other nodes at an interval equal to one-fourth of the
value of the configured MEMBER_TIMEOUT or 1 second, whichever is less. You configure
MEMBER_TIMEOUT in the cluster configuration file; see “Cluster Configuration Parameters”
(page 89). The heartbeat interval is not directly configurable. If a node fails to send a
heartbeat message within the time set by MEMBER_TIMEOUT, the cluster is re-formed without
the node that is no longer sending heartbeat messages.
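The heartbeat-interval rule works out as follows. MEMBER_TIMEOUT is configured in microseconds; the value below is a hypothetical example, not necessarily your cluster's setting:

```shell
# Heartbeat interval = one-fourth of MEMBER_TIMEOUT, capped at 1 second.
# MEMBER_TIMEOUT is in microseconds; 14000000 (14 s) is an example value.
MEMBER_TIMEOUT=14000000

hb=$(( MEMBER_TIMEOUT / 4 ))               # 3500000 microseconds
if [ "$hb" -gt 1000000 ]; then
    hb=1000000                             # cap at 1 second
fi
echo "heartbeat interval: ${hb} microseconds"
```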
When a node detects that another node has failed (that is, no heartbeat message has arrived
within MEMBER_TIMEOUT microseconds), the following sequence of events occurs:
3.8 Responses to Failures 73