HP-UX HB v13.00 Ch-15 - Serviceguard

HP-UX Handbook – Rev 13.00 Page 62 (of 108)

Chapter 15 Serviceguard

October 29, 2013

is able to complete before the safety timer expires, then the TOC will not take place. In either

case, packages are able to move quickly to another node.

The following may cause cmcld to cease to reset the safety timer:

1. cmcld is not given CPU time to reset the timer (system hang)

2. A crucial package such as SG-CFS-pkg has failed. The admin can designate other packages as crucial by setting

failfast=enabled in the cluster binary (via the package configuration file).

Clues such as these appear in OLDsyslog.log:

Oct 20 22:32:40 ndhdbp6 vmunix: Halting ndhdbp6 to preserve data integrity

Oct 20 22:32:40 ndhdbp6 vmunix: Reason: A crucial package failed

3. The node is refused the cluster lock (or quorum server or lock LUN) during an unexpected

cluster reformation. Example in OLDsyslog.log:

Dec 28 14:13:34 kuikka cmcld[3043]: Cluster lock was denied. Lock was obtained by another node.

4. The node unexpectedly finds itself in a minority of nodes able to communicate with one another

(see cluster formation protocol).

5. A shutdown does not halt the cluster daemons, and ‘killall’ kills cmcld before terminating the

relationship with the safety timer. Evidence of this is found in the OLDsyslog.log. Example last

line in OLDsyslog.log:

Dec 8 17:02:39 uifxp42p syslogd: going down on signal 15

The dump will show that cmcld is not on the process list.

Admins should use cmhaltcl (or cmhaltnode -f) to halt Serviceguard daemons before

performing shutdown (per the Managing Serviceguard manual).

6. Veritas (Symantec) cluster file system (available with Serviceguard) uses a different heartbeat

which, if impaired, will also force a system TOC.
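The log clues quoted in the list above can be searched for mechanically. The following is a minimal Python sketch, not a Serviceguard tool: the function name find_toc_clues and the labels are hypothetical, and the patterns are copied from the sample lines in this section, so actual message text may differ between Serviceguard releases.

```python
import re

# Patterns taken from the sample OLDsyslog.log lines quoted in this section.
# Assumption: real messages may be worded differently on other releases.
CLUES = [
    ("halt to preserve data integrity",
     re.compile(r"Halting \S+ to preserve data integrity")),
    ("crucial package failed",
     re.compile(r"Reason: A crucial package failed")),
    ("cluster lock denied",
     re.compile(r"Cluster lock was denied")),
    ("syslogd stopped by signal 15 (possible shutdown with cmcld running)",
     re.compile(r"syslogd: going down on signal 15")),
]


def find_toc_clues(lines):
    """Return (line_number, label, text) for every line matching a clue."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in CLUES:
            if pattern.search(line):
                hits.append((number, label, line.rstrip("\n")))
                break  # one label per line is enough
    return hits


# Sample input: the log lines shown in this section.
SAMPLE = [
    "Oct 20 22:32:40 ndhdbp6 vmunix: Halting ndhdbp6 to preserve data integrity",
    "Oct 20 22:32:40 ndhdbp6 vmunix: Reason: A crucial package failed",
    "Dec 28 14:13:34 kuikka cmcld[3043]: Cluster lock was denied. "
    "Lock was obtained by another node.",
    "Dec 8 17:02:39 uifxp42p syslogd: going down on signal 15",
]

for number, label, text in find_toc_clues(SAMPLE):
    print(f"line {number}: {label}: {text}")
```

To scan a real log, pass an open file object in place of SAMPLE. A hit only narrows the search; the dump still has to confirm the cause.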

As in item 5, inspect the end of the OLDsyslog.log that preceded the TOC, along with the dump itself, to find the

cause of the TOC. The INDEX file of the dump identifies when the TOC was performed. Example:

dumptime 1323381767 Thu Dec 8 17:02:47 EST 2011
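The first dumptime field is a count of seconds since the Unix epoch, so it can be converted and compared with the last OLDsyslog.log timestamp. The sketch below makes two labeled assumptions: the system runs at a fixed UTC-5 offset (suggested by the "EST" in the sample line), and the syslog line, which carries no year, belongs to the same year as the dump. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-5 offset, as the "EST" in the sample INDEX line suggests.
EST = timezone(timedelta(hours=-5), "EST")


def dumptime_to_local(epoch_seconds, tz=EST):
    """Convert the epoch value from the dump INDEX file to a local datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz)


def seconds_before_dump(syslog_stamp, dump_dt):
    """Seconds between a syslog timestamp such as 'Dec 8 17:02:39' and the dump.

    Assumption: the syslog line is from the same year and time zone as the dump.
    """
    normalized = " ".join(syslog_stamp.split())  # collapse padded day fields
    parsed = datetime.strptime(f"{dump_dt.year} {normalized}",
                               "%Y %b %d %H:%M:%S")
    parsed = parsed.replace(tzinfo=dump_dt.tzinfo)
    return (dump_dt - parsed).total_seconds()


dump = dumptime_to_local(1323381767)
print(dump.strftime("%a %b %d %H:%M:%S %Z %Y"))
print(seconds_before_dump("Dec 8 17:02:39", dump))
```

If the sample syslog line in item 5 and this INDEX entry come from the same incident, the gap between them is only a few seconds, which fits a TOC triggered during shutdown.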

System Administration Errors

There are a number of errors you can make when configuring Serviceguard that will not show up

when you start the cluster. Your cluster can be running, and everything appears to be fine, until

there is a hardware or software failure and control of your packages is not transferred to another

node as you would have expected.

These errors stem specifically from mistakes in the cluster configuration file and package

configuration scripts.

Examples of these errors include:

• Volume groups not defined on adoptive node.

• Mount point does not exist on adoptive node.

• Network errors on adoptive node (configuration errors).

• User information not correct on adoptive node.