Managing Serviceguard Nineteenth Edition, Reprinted June 2011

Responses to Package and Service Failures

In the default case, the failure of a failover package, or of a service within the package, causes

Serviceguard to shut the package down by running the control script with the ‘stop’ parameter, and

then to restart the package on an alternate node. A package will also fail if it is configured to have a dependency

on another package, and that package fails. If the package manager receives a report of an EMS

(Event Monitoring Service) event showing that a configured resource dependency is not met, the

package fails and tries to restart on the alternate node.

You can modify this default behavior by specifying that the node should halt (system reset) before

the transfer takes place. You do this by setting failfast parameters in the package configuration

file.

In cases where package shutdown might hang, leaving the node in an unknown state, failfast

options can provide a quick failover, after which the node will be cleaned up on reboot. Remember,

however, that a system reset causes all packages on the node to halt abruptly without a clean

shutdown.

The settings of the failfast parameters in the package configuration file determine the behavior of

the package and the node in the event of a package or resource failure:

• If service_fail_fast_enabled is set to yes in the package configuration file,

Serviceguard will halt the node with a system reset if there is a failure of that specific service.

• If node_fail_fast_enabled is set to yes in the package configuration file, and the

package fails, Serviceguard will halt (system reset) the node on which the package is running.
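
The two failfast settings above can be read as a small decision rule. The following Python sketch is only an illustrative model of that rule, not Serviceguard code: the names PackageConfig, Response, and respond_to_failure are hypothetical, and the sketch treats the two settings independently, which is an assumption about cases the text does not spell out.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Response(Enum):
    # Default behavior: control script runs with 'stop', package restarts elsewhere.
    FAIL_OVER = "halt package, then restart it on an alternate node"
    # Failfast behavior: node is halted (system reset) before the transfer.
    SYSTEM_RESET = "halt the node with a system reset before the transfer"


@dataclass
class PackageConfig:
    """Hypothetical in-memory view of the failfast settings of one package."""
    node_fail_fast_enabled: bool = False
    # Per-service setting, keyed by service name; a missing entry means "no".
    service_fail_fast_enabled: Dict[str, bool] = field(default_factory=dict)


def respond_to_failure(cfg: PackageConfig,
                       failed_service: Optional[str] = None) -> Response:
    """Model the response to a failure.

    failed_service names the service that failed; None means the package
    itself failed (for example, an unmet dependency or resource).
    Assumption: combined cases are not modeled beyond the two bullets.
    """
    if failed_service is not None:
        if cfg.service_fail_fast_enabled.get(failed_service, False):
            return Response.SYSTEM_RESET
        return Response.FAIL_OVER
    if cfg.node_fail_fast_enabled:
        return Response.SYSTEM_RESET
    return Response.FAIL_OVER
```

For example, with a hypothetical service named "db_monitor" whose failfast setting is yes, respond_to_failure(cfg, "db_monitor") yields Response.SYSTEM_RESET, while a failure of any other service in the same package yields Response.FAIL_OVER.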

NOTE: In a very few cases, Serviceguard will attempt to reboot the system before a system reset

when this behavior is specified. If there is enough time to flush the buffers in the buffer cache, the

reboot succeeds, and a system reset does not take place. Either way, the system will be guaranteed

to come down within a predetermined number of seconds.

“Choosing Switching and Failover Behavior” (page 127) provides advice on choosing appropriate

failover behavior.

Service Restarts

You can allow a service to restart locally following a failure. To do this, you specify the number of

restarts for each service in the package control script. When a service starts, the variable

RESTART_COUNT is set in the service’s environment. The service, as it executes, can examine this

variable to see whether it has been restarted after a failure, and if so, it can take appropriate

action such as cleanup.
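
As an illustration of how a service might examine RESTART_COUNT, here is a minimal Python sketch. It assumes the variable holds an integer that is 0 (or unset) on the first start and greater than 0 after a restart; that convention, and the helper names restart_count, cleanup_after_failure, and main, are assumptions made for this example.

```python
import os


def restart_count(environ=os.environ) -> int:
    """Return RESTART_COUNT as an int; treat unset or non-numeric values as 0."""
    raw = environ.get("RESTART_COUNT", "0")
    try:
        return int(raw)
    except ValueError:
        return 0


def cleanup_after_failure() -> None:
    """Hypothetical placeholder for cleanup, such as removing stale lock files."""
    print("cleaning up after a previous failure")


def main(environ=os.environ) -> str:
    """Decide what to do at service startup based on the restart count."""
    count = restart_count(environ)
    if count > 0:
        # The service has been restarted after a failure: clean up first.
        cleanup_after_failure()
        return f"restarted (restart {count})"
    return "first start"


if __name__ == "__main__":
    print(main())
```

Passing the environment as a parameter keeps the logic easy to exercise without changing the real process environment.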

Network Communication Failure

An important element in the cluster is the health of the network itself. As it continuously monitors

the cluster, each node listens for heartbeat messages from the other nodes confirming that all nodes

are able to communicate with each other. If a node does not hear these messages within the

configured amount of time, a node timeout occurs; see “What Happens when a Node Times Out”

(page 85).
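
The heartbeat check described above can be sketched as a simple timeout tracker. This is a generic illustration of heartbeat-based failure detection, not the Serviceguard implementation; the class HeartbeatMonitor, its methods, and the timeout values used with it are all hypothetical.

```python
import time
from typing import Dict, Iterable, List, Optional


class HeartbeatMonitor:
    """Track when each peer node was last heard from (illustrative only)."""

    def __init__(self, peers: Iterable[str], timeout_seconds: float,
                 now: Optional[float] = None) -> None:
        self.timeout = timeout_seconds
        start = time.monotonic() if now is None else now
        # Every peer is considered "just heard from" when monitoring starts.
        self.last_heard: Dict[str, float] = {peer: start for peer in peers}

    def record_heartbeat(self, node: str, now: Optional[float] = None) -> None:
        """Note that a heartbeat message arrived from the given node."""
        self.last_heard[node] = time.monotonic() if now is None else now

    def timed_out_nodes(self, now: Optional[float] = None) -> List[str]:
        """Return the nodes not heard from within the configured time."""
        current = time.monotonic() if now is None else now
        return sorted(node for node, heard in self.last_heard.items()
                      if current - heard > self.timeout)
```

The optional now arguments allow the clock to be supplied explicitly, so the timeout logic can be checked deterministically; in normal use they would be omitted and time.monotonic() used instead.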
