Managing Serviceguard 14th Edition, June 2007

Understanding Serviceguard Software Components
Responses to Failures
NOTE In very few cases, and only when this behavior is specified,
Serviceguard attempts to reboot the system before resorting to a system
reset. If there is enough time to flush the buffers in the buffer cache,
the reboot succeeds and no system reset takes place. Either way, the
system is guaranteed to come down within a predetermined number of seconds.
“Package Configuration File Parameters” on page 183 provides advice on
choosing appropriate failover behavior.
Service Restarts
You can allow a service to restart locally following a failure. To do this,
you specify the number of restarts for each service in the package control
script. When a service starts, the variable RESTART_COUNT is set in the
service's environment. As it executes, the service can examine this
variable to see whether it has been restarted after a failure and, if so,
take appropriate action such as cleanup.
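
The following is a minimal sketch of how a service's start-up script might
examine RESTART_COUNT. Only the variable name comes from the text above.
The function names is_restart and startup_mode are hypothetical, and the
sketch assumes that RESTART_COUNT is 0 (or unset) on the first start and a
positive integer after a restart.

```shell
#!/bin/sh
# Illustrative helper functions for a service start-up script.
# Assumption: RESTART_COUNT is 0 or unset on first start, and a positive
# integer once the service has been restarted after a failure.

# is_restart: returns 0 (true) only when RESTART_COUNT is a positive integer.
is_restart() {
    case "${RESTART_COUNT:-0}" in
        *[!0-9]*) return 1 ;;   # non-numeric value: treat as a first start
    esac
    [ "${RESTART_COUNT:-0}" -gt 0 ]
}

# startup_mode: prints "restart" or "first-start" for use by the caller.
startup_mode() {
    if is_restart; then
        echo "restart"
    else
        echo "first-start"
    fi
}
```

A service could call startup_mode before launching its main process and, on
"restart", remove stale lock or temporary files left behind by the failed
instance. The cleanup itself is application-specific and is not shown here.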
Network Communication Failure
An important element in the cluster is the health of the network itself.
As it continuously monitors the cluster, each node listens for heartbeat
messages from the other nodes confirming that all nodes are able to
communicate with each other. If a node does not hear these messages
within the configured amount of time, a node timeout occurs. This results
in a cluster re-formation and, if heartbeat messages are still not
received afterward, a system reset.
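
The timeout rule described above can be illustrated with a short sketch.
This is not Serviceguard's implementation; the function heartbeat_status,
its arguments, and the use of whole seconds are assumptions made for
illustration only. It models just the first step: deciding that a node
timeout has occurred because no heartbeat arrived within the configured
interval.

```shell
#!/bin/sh
# heartbeat_status LAST_HEARD NOW TIMEOUT
# Hypothetical helper; all three arguments are whole seconds.
# Prints "ok" if the last heartbeat arrived within TIMEOUT seconds of NOW,
# otherwise "node-timeout" (the condition that triggers cluster re-formation).
# Assumption: a heartbeat exactly TIMEOUT seconds old still counts as "ok".
heartbeat_status() {
    last_heard=$1
    now=$2
    timeout=$3
    elapsed=$((now - last_heard))
    if [ "$elapsed" -gt "$timeout" ]; then
        echo "node-timeout"
    else
        echo "ok"
    fi
}
```

For example, with a 2-second timeout, a heartbeat last heard at second 100
is still current at second 101 but has timed out by second 105.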