HP-UX HB v13.00 Ch-15 - Serviceguard
HP-UX Handbook – Rev 13.00 Page 61 (of 108)
Chapter 15 Serviceguard
October 29, 2013
Heartbeat generation and transmission are delayed by increased kernel activity. The
default NODE_TIMEOUT (pre-A.11.19) is often too small. A sign of this is the
‘sequence #’ value skyrocketing in syslog.log. Increase NODE_TIMEOUT
(pre-A.11.19) from its default of 2 seconds to 8 seconds, or add 10 seconds to
MEMBER_TIMEOUT (A.11.19 and newer), in the cluster ASCII configuration file and
apply the file with cmapplyconf. If the problem persists, look for syslog.log messages
indicating that cmcld has not run for several seconds and treat them according to the SAW documents.
Excessive network traffic on heartbeat LANs. Create a dedicated heartbeat LAN,
change all STATIONARY_IP parameters in the cluster ASCII configuration file to
HEARTBEAT_IP, and apply the file with cmapplyconf. This allows Serviceguard to use all
networks to transmit heartbeats if one NIC becomes problematic.
An overloaded system, with too much total I/O and network traffic. Performance
analysis may be in order.
An improperly configured network, for example, one with a very large routing table.
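The two configuration changes suggested above (raising MEMBER_TIMEOUT and converting STATIONARY_IP entries to HEARTBEAT_IP) can be sketched as small shell helpers that work on a copy of the cluster ASCII file. This is only an illustration: the function names, the cluster name and the file paths are hypothetical, and it assumes MEMBER_TIMEOUT is given in microseconds and that each parameter sits on its own line as keyword followed by value.

```shell
#!/bin/sh
# Hypothetical helpers; they only rewrite text and never touch the running cluster.

# Print the file with MEMBER_TIMEOUT raised by a number of seconds.
# $1 = cluster ASCII file, $2 = seconds to add (value assumed to be microseconds)
raise_member_timeout() {
    awk -v add="$2" '
        $1 == "MEMBER_TIMEOUT" {
            printf "MEMBER_TIMEOUT\t\t%d\n", $2 + add * 1000000
            next
        }
        { print }
    ' "$1"
}

# Print the file with every STATIONARY_IP keyword replaced by HEARTBEAT_IP.
# $1 = cluster ASCII file
stationary_to_heartbeat() {
    sed 's/^\([[:space:]]*\)STATIONARY_IP/\1HEARTBEAT_IP/' "$1"
}

# Possible workflow (cluster name and paths are assumptions):
#   cmgetconf -c cluster1 /tmp/cluster1.ascii
#   raise_member_timeout /tmp/cluster1.ascii 10 > /tmp/cluster1.new
#   cmcheckconf -C /tmp/cluster1.new
#   cmapplyconf -C /tmp/cluster1.new
```

Writing the result to a new file keeps the original configuration available for comparison before anything is applied.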
Serviceguard TOC
Serviceguard can invoke an HP-UX TOC (Transfer of Control) to halt the O/S and transfer
control to the SPU microcode responsible for saving a kernel crash dump. It is not a graceful
shutdown, because Serviceguard must ensure the integrity of disk data. The TOC vector is used when
Serviceguard does not reset a kernel safety countdown timer, which it normally does periodically.
If cmcld does not keep advancing the safety timer, the system clock will eventually overtake
the safety timer. Once the system clock is equal to or beyond the safety timer, we say that the
safety timer has expired. To ensure that the node stops its HA services once the safety timer
expires, the node triggers a Serviceguard TOC to take itself out of the cluster. So in essence,
cmcld does not initiate the TOC; it is cmcld that prevents the TOC from happening.
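The relationship between cmcld, the safety timer and the system clock can be illustrated with a small simulation. This is a conceptual model only, not Serviceguard code: the function names, the tick granularity and the window size are all hypothetical.

```shell
#!/bin/sh
# Conceptual model: the safety timer is an absolute deadline that a healthy
# cmcld keeps pushing into the future. Nothing here reflects real kernel code.

# True when the clock has reached or passed the deadline.
# $1 = current clock tick, $2 = safety timer deadline
safety_timer_expired() {
    [ "$1" -ge "$2" ]
}

# $1 = ticks to simulate
# $2 = tick at which cmcld stops running
# $3 = how far ahead of the clock each reset places the deadline
simulate() {
    ticks=$1
    stall_at=$2
    window=$3
    deadline=$window
    now=0
    while [ "$now" -lt "$ticks" ]; do
        if [ "$now" -lt "$stall_at" ]; then
            deadline=$((now + window))   # cmcld is running: advance the timer
        fi
        if safety_timer_expired "$now" "$deadline"; then
            echo "TOC at tick $now"
            return 0
        fi
        now=$((now + 1))
    done
    echo "no TOC"
}
```

With `simulate 20 5 3`, the last reset happens at tick 4 and sets the deadline to 7, so the model reports a TOC at tick 7. If cmcld never stalls, the deadline always stays ahead of the clock and no TOC occurs.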
Before initiating the TOC the following message is logged to the kernel’s message buffer and to
the system’s console:
Serviceguard: Unable to maintain contact with cmcld daemon.
Performing TOC to ensure data integrity.
Beginning with HP-UX 11.22, this information is also logged to the dump's INDEX file and to
/etc/shutdownlog to make it easier to tell that a TOC was initiated by Serviceguard.
The /etc/shutdownlog entry is written when the system boots and shows the panic message once the
dump has been saved to /var/adm/crash/crash.N (N increases with each savecrash):
18:23 Thu Apr 24 2003. Reboot after panic: SafetyTimer expired, ...
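A quick way to check whether earlier reboots were caused by a Serviceguard TOC is to search the shutdown log for the panic string shown above. The helper below is a minimal sketch; the function name is hypothetical, and it assumes the entry always contains the text "SafetyTimer expired" as in the sample line.

```shell
#!/bin/sh
# Print the reboot entries that record an expired Serviceguard safety timer.
# $1 = log file to search (defaults to /etc/shutdownlog)
list_safety_timer_reboots() {
    grep 'SafetyTimer expired' "${1:-/etc/shutdownlog}"
}
```

Called without arguments, it reads /etc/shutdownlog; passing a file name allows a saved copy of the log to be checked instead.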
In a very few cases, an attempt is first made to reboot the system prior to the TOC. If the reboot