HP-UX HB v13.00 Ch-15 - Serviceguard

HP-UX Handbook Rev 13.00 Page 62 (of 108)
Chapter 15 Serviceguard
October 29, 2013
is able to complete before the safety timer expires, then the TOC will not take place. In either
case, packages are able to move quickly to another node.
The following may cause cmcld to cease to reset the safety timer:
1. cmcld is not given CPU time to reset the timer (system hang)
2. A crucial package such as SG-CFS-pkg has failed. The admin can mark other packages as crucial by
setting failfast=enabled in the cluster binary (via the package configuration file).
Clues such as these appear in OLDsyslog.log:
Oct 20 22:32:40 ndhdbp6 vmunix: Halting ndhdbp6 to preserve data integrity
Oct 20 22:32:40 ndhdbp6 vmunix: Reason: A crucial package failed
3. The node is refused the cluster lock (or quorum server or lock LUN) during an unexpected
cluster reformation. Example in OLDsyslog.log:
Dec 28 14:13:34 kuikka cmcld[3043]: Cluster lock was denied. Lock was obtained by another node.
4. The node unexpectedly finds itself in a minority of nodes able to communicate with one another
(see cluster formation protocol).
5. A shutdown does not halt the cluster daemons, and ‘killall’ kills cmcld before terminating the
relationship with the safety timer. Evidence of this is found in the OLDsyslog.log. Example last
line in OLDsyslog.log:
Dec 8 17:02:39 uifxp42p syslogd: going down on signal 15
The dump will show that cmcld is not on the process list.
Admins should use cmhaltcl (or cmhaltnode -f) to halt the Serviceguard daemons before
performing a shutdown (per the Managing Serviceguard manual).
6. The Veritas (Symantec) cluster file system (available with Serviceguard) uses a separate
heartbeat which, if impaired, will also force a system TOC.
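The syslog clues quoted in the list above can be searched for mechanically. The following is a minimal sketch, not a Serviceguard tool: the pattern table is built only from the example messages shown in this section, the function name find_toc_clues and the labels are hypothetical, and the exact message wording may differ between Serviceguard releases.

```python
import re

# Patterns taken from the example messages quoted in this section.
# The labels are informal descriptions, not Serviceguard terminology.
TOC_CLUES = [
    (re.compile(r"Halting \S+ to preserve data integrity"),
     "node halted by Serviceguard"),
    (re.compile(r"Reason: A crucial package failed"),
     "crucial package failed (item 2)"),
    (re.compile(r"Cluster lock was denied"),
     "cluster lock refused (item 3)"),
    (re.compile(r"syslogd: going down on signal 15"),
     "shutdown in progress (item 5)"),
]


def find_toc_clues(lines):
    """Return (line, label) pairs for every line matching a known clue."""
    hits = []
    for line in lines:
        for pattern, label in TOC_CLUES:
            if pattern.search(line):
                hits.append((line.rstrip("\n"), label))
                break  # one label per line is enough
    return hits


if __name__ == "__main__":
    sample = [
        "Oct 20 22:32:40 ndhdbp6 vmunix: Halting ndhdbp6 to preserve data integrity",
        "Oct 20 22:32:40 ndhdbp6 vmunix: Reason: A crucial package failed",
        "Dec 8 17:02:39 uifxp42p syslogd: going down on signal 15",
    ]
    for line, label in find_toc_clues(sample):
        print(f"{label}: {line}")
```

In practice the lines would be read from the OLDsyslog.log that preceded the TOC; because the clues sit at the bottom of the file, passing only the last few hundred lines is usually enough.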
As in item 5, inspect the bottom of the OLDsyslog.log that preceded the TOC dump, and the dump itself,
to find the cause of the TOC. The INDEX file of the dump identifies when the TOC was performed. Example:
dumptime 1323381767 Thu Dec 8 17:02:47 EST 2011
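The first field after dumptime is the time in seconds since the UNIX epoch, so it can be compared directly with the last syslog entry. The sketch below is illustrative only: the helper name dumptime_to_local is hypothetical, EST is modelled as a fixed UTC-5 offset, and the syslog timestamp is the example line from item 5, which appears to belong to the same incident.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-5 offset standing in for EST; an assumption made so the
# example does not depend on the time zone database of the local host.
EST = timezone(timedelta(hours=-5), "EST")


def dumptime_to_local(epoch_seconds, tz=EST):
    """Convert the epoch value on the dumptime line to a datetime in tz."""
    return datetime.fromtimestamp(epoch_seconds, tz)


# Value taken from the INDEX example above.
toc_time = dumptime_to_local(1323381767)
print(toc_time.strftime("%a %b %d %H:%M:%S %Z %Y"))

# Last syslog entry from the example in item 5. The year is supplied by
# hand because syslog timestamps do not carry one.
last_syslog = datetime(2011, 12, 8, 17, 2, 39, tzinfo=EST)
gap = (toc_time - last_syslog).total_seconds()
print(f"TOC followed the last syslog entry by {gap:.0f} seconds")
```

A gap of only a few seconds between the final syslog entry and the dump time is consistent with the shutdown scenario described in item 5.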
System Administration Errors
There are a number of errors you can make when configuring Serviceguard that will not show up
when you start the cluster. Your cluster can be running, and everything appears to be fine, until
there is a hardware or software failure and control of your packages is not transferred to another
node as you would have expected.
These problems are caused specifically by mistakes in the cluster configuration file and the package
configuration scripts.
Examples of these errors include:
Volume groups not defined on adoptive node.
Mount point does not exist on adoptive node.
Network errors on adoptive node (configuration errors).
User information not correct on adoptive node.
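Because these mistakes stay hidden until a failover, it helps to compare what a package needs with what the adoptive node actually provides. The following is a minimal sketch of that comparison under stated assumptions: the function name missing_on_adoptive, the resource categories, and the sample inventories are all hypothetical, and gathering the real inventories from each node is left out.

```python
def missing_on_adoptive(required, present):
    """Return, per resource type, what the package needs but the adoptive node lacks."""
    gaps = {}
    for kind, needed in required.items():
        absent = sorted(set(needed) - set(present.get(kind, ())))
        if absent:
            gaps[kind] = absent
    return gaps


if __name__ == "__main__":
    # Hypothetical inventories, for illustration only.
    package_needs = {
        "volume_groups": ["/dev/vg01", "/dev/vg02"],
        "mount_points": ["/app/data", "/app/logs"],
        "users": ["appadm"],
    }
    adoptive_node = {
        "volume_groups": ["/dev/vg00", "/dev/vg01"],
        "mount_points": ["/app/data", "/app/logs"],
        "users": [],
    }
    for kind, items in missing_on_adoptive(package_needs, adoptive_node).items():
        print(f"{kind}: missing {', '.join(items)}")
```

An empty result means only that the listed items exist on the adoptive node. It does not prove that a failover will succeed, so a controlled test move of the package remains the more reliable check.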