Managing Serviceguard 13th Edition, February 2007

Understanding Serviceguard Software Components

Responses to Failures

Chapter 3 133

NOTE In a few cases, Serviceguard will attempt to reboot the system
before a TOC when this behavior is specified. If there is enough time to
flush the buffers in the buffer cache, the reboot succeeds, and a TOC does
not take place. Either way, the system is guaranteed to come down
within a predetermined number of seconds.

“Package Configuration File Parameters” on page 171 provides advice on
choosing appropriate failover behavior.

Service Restarts

You can allow a service to restart locally following a failure. To do this,
you specify the number of restarts for each service in the package control
script. When a service starts, the variable RESTART_COUNT is set in the
service's environment. As it executes, the service can examine this
variable to see whether it has been restarted after a failure and, if so,
take appropriate action such as cleanup.
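
As an illustration of how a service might examine RESTART_COUNT, here is
a minimal sketch in POSIX shell. Only the variable name RESTART_COUNT
comes from the text above. The function names, the LOCK_FILE variable and
its default path are hypothetical, and the sketch assumes that a value
greater than zero indicates a restart and that an unset or zero value
indicates a first start.

```shell
#!/bin/sh
# Hypothetical service start-up logic; only RESTART_COUNT is taken from
# the Serviceguard description. Everything else is illustrative.

# Return 0 if this start follows a failure, 1 otherwise.
# Assumption: RESTART_COUNT is unset or 0 on the first start.
is_restart() {
    [ "${RESTART_COUNT:-0}" -gt 0 ] 2>/dev/null
}

# Hypothetical cleanup: remove a stale lock file left by the failed run.
cleanup_stale_state() {
    rm -f "${LOCK_FILE:-/tmp/myservice.lock}"
}

# Decide what to do before the real service work begins.
prepare_service() {
    if is_restart; then
        echo "restart ${RESTART_COUNT}: cleaning up"
        cleanup_stale_state
    else
        echo "first start"
    fi
}
```

A service script built this way would call prepare_service once, before
starting its main work, so that cleanup runs only on a restart.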

Network Communication Failure

An important element in the cluster is the health of the network itself.
As it continuously monitors the cluster, each node listens for heartbeat
messages from the other nodes confirming that all nodes are able to
communicate with each other. If a node does not hear these messages
within the configured amount of time, a node timeout occurs. This results
in a cluster re-formation and later, if heartbeat messages are still not
received, a TOC.
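
The timeout comparison described above can be sketched in POSIX shell as
follows. This is an illustration of the general idea only, not
Serviceguard's implementation: the function name heartbeat_status, its
arguments, and the use of whole seconds are all assumptions made for the
example.

```shell
#!/bin/sh
# Hypothetical helper, not part of Serviceguard.
# Usage: heartbeat_status LAST_HEARD NOW TIMEOUT   (all in seconds)
# Prints "ok" if the last heartbeat arrived within TIMEOUT seconds of NOW,
# otherwise prints "node_timeout".
heartbeat_status() {
    last_heard=$1
    now=$2
    timeout=$3
    if [ $((now - last_heard)) -le "$timeout" ]; then
        echo ok
    else
        echo node_timeout
    fi
}
```

For example, heartbeat_status 100 101 2 prints ok, while
heartbeat_status 100 105 2 prints node_timeout.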