Managing Serviceguard 11th Edition, Version A.11.16, Second Printing June 2004

Understanding Serviceguard Software Components
Responses to Failures
If you wish, you can modify this default behavior by specifying that the
node should crash (TOC) before the transfer takes place. (In a few
cases, Serviceguard will attempt to reboot the system before a TOC
when this behavior is specified.) If there is enough time to flush the
buffers in the buffer cache, the reboot succeeds and a TOC does not
take place. Either way, the system is guaranteed to come down
within a predetermined number of seconds.
In cases where package shutdown might hang, leaving the node in an
unknown state, a failfast option can provide a quick failover,
after which the node is cleaned up on reboot. Remember, however,
that when the node crashes, all packages on the node are halted
abruptly.
The settings of the node and service failfast parameters during package
configuration determine the exact behavior of the package and the
node in the event of failure. The section on “Package Configuration
Parameters” in the “Planning” chapter contains details on how to choose
an appropriate failover behavior.
Service Restarts
You can allow a service to restart locally following a failure. To do this,
you specify the number of restarts for each service in the package control
script. When a service starts, the variable RESTART_COUNT is set in the
service's environment. As it executes, the service can examine this
variable to see whether it has been restarted after a failure and, if so,
take appropriate action such as cleanup.
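As an illustration of a service examining RESTART_COUNT, the following is a
minimal sketch of a service wrapper script. Only the variable name
RESTART_COUNT comes from this manual. The function names, the LOCK_FILE
variable and its default path are hypothetical, and the sketch assumes that
the variable is 0 (or unset) on the first start and nonzero after a restart.

```sh
#!/bin/sh
# Hypothetical service wrapper. Only RESTART_COUNT is taken from the manual;
# every other name here is an assumption made for illustration.

# Print "first-start" when RESTART_COUNT is unset, empty, or 0;
# print "restart" for any other value.
start_mode() {
    case "${RESTART_COUNT:-0}" in
        0) echo "first-start" ;;
        *) echo "restart" ;;
    esac
}

# Hypothetical cleanup: remove a lock file that a failed instance of the
# service may have left behind. LOCK_FILE and its default are assumptions.
cleanup_stale_state() {
    rm -f "${LOCK_FILE:-/var/tmp/my_service.lock}"
}

# Clean up only when the service is being restarted after a failure.
if [ "$(start_mode)" = "restart" ]; then
    cleanup_stale_state
fi

# The real service command would be started here.
```

Keeping the check in a small function makes it easy to extend the cleanup
step without changing how the service itself is launched.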
Network Communication Failure
An important element in the cluster is the health of the network itself.
As it continuously monitors the cluster, each node listens for heartbeat
messages from the other nodes confirming that all nodes are able to
communicate with each other. If a node does not hear these messages
within the configured amount of time, a node timeout occurs. This results
in a cluster re-formation and later, if heartbeat messages are still not
received, a TOC. In a two-node cluster, the use of an RS-232 line prevents
a TOC caused by the momentary loss of heartbeat on the LAN due to network
saturation. The RS-232 line also assists in quickly detecting network
failures when they occur.
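The node timeout check described above can be pictured with the following
minimal sketch. It is not Serviceguard's implementation. The function name,
its arguments, and the use of whole seconds are assumptions chosen only to
show the comparison between the time since the last heartbeat and the
configured timeout.

```sh
#!/bin/sh
# Illustrative only: a simplified model of a heartbeat timeout check.
# Arguments (all hypothetical, in whole seconds):
#   $1  time the last heartbeat was heard
#   $2  current time
#   $3  configured timeout
# Prints "ok" if the heartbeat arrived within the timeout, else "timeout".
heartbeat_status() {
    last_seen=$1
    now=$2
    timeout=$3
    if [ $((now - last_seen)) -gt "$timeout" ]; then
        echo "timeout"
    else
        echo "ok"
    fi
}

heartbeat_status 100 101 2   # prints "ok"
heartbeat_status 100 105 2   # prints "timeout"
```

In this simplified model, a second heartbeat path such as the RS-232 line
would refresh the last-heard time even while the LAN is saturated, which is
why a momentary LAN loss alone would not produce a timeout.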