Serviceguard Network Manager: Inbound Failure Detection, March 2007

Figure 1. Typical network configuration for a two-node cluster

Before version A.11.16, the default method (INOUT) was the only method of network failure detection
that Serviceguard provided. With this method, Serviceguard marks a NIC as failed only if both the
inbound and outbound statistics of the NIC stop incrementing for a set amount of time, which guards
against transient failures. It does not mark the NIC as failed if only the inbound statistics, or only
the outbound statistics, stop incrementing. However, if a NIC fails to send and its outbound statistics
stop increasing, Serviceguard Network Manager usually receives an error notification from the driver
and immediately marks the NIC as failed.
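
The default (INOUT) rule described above can be illustrated with a minimal Python sketch. It is not
Serviceguard code: the names NicSample, stalled, and inout_failed are hypothetical, and the list of
samples stands in for the "set amount of time" over which the statistics are observed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NicSample:
    """One poll of a NIC's cumulative packet counters (hypothetical structure)."""
    inbound: int
    outbound: int


def stalled(samples: List[NicSample], counter: str) -> bool:
    """Return True if the named counter never changed across the sample window."""
    values = [getattr(s, counter) for s in samples]
    # At least two samples are needed before a counter can be called stalled.
    return len(values) > 1 and all(v == values[0] for v in values)


def inout_failed(samples: List[NicSample], driver_error: bool = False) -> bool:
    """Sketch of the INOUT rule: fail only when both counters stall.

    driver_error models the error notification that the driver usually
    sends when a NIC cannot transmit; it marks the NIC failed immediately.
    """
    if driver_error:
        return True
    return stalled(samples, "inbound") and stalled(samples, "outbound")


# Inbound traffic has stopped, but outbound counts keep rising.
inbound_only = [NicSample(100, 50), NicSample(100, 60), NicSample(100, 70)]
# Both directions have stopped.
both_stopped = [NicSample(100, 50), NicSample(100, 50), NicSample(100, 50)]

print(inout_failed(inbound_only))  # inbound-only stall is not a failure under INOUT
print(inout_failed(both_stopped))  # both counters stalled for the whole window
```

In this sketch an inbound-only stall leaves the NIC marked as good, which is the gap that the
enhancement described in the next section addresses.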

Inbound failure detection enhancement

A new enhanced method of failure detection was introduced in Serviceguard A.11.16 to manage

inbound-only failures.

Why the enhancement was made

An application could hang when a NIC stopped receiving but continued to send, a situation that
customers expected a local LAN failover to resolve. However, as described in the previous section, the
design of Serviceguard Network Manager before version A.11.16 required both inbound and outbound
message counts to stop incrementing before it would declare a NIC failed. A NIC was not declared
failed if only its inbound message count stopped increasing, because there was no way to tell why the
NIC had stopped receiving. Possible causes included a broken cascaded cable (see Figure 3), a problem
with the logic inside the switch, or a problem with the NIC itself.

In response to customer requests, the Serviceguard failure detection mechanism was enhanced so that a
NIC is declared failed when only its inbound message count stops incrementing. The new setting,
INONLY_OR_INOUT, requires a thorough evaluation of the cluster environment to ensure that no single
point of failure causes unnecessary failovers; such situations are documented later in this paper.
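
The difference between the two settings can be summarized in a short Python sketch. It is only an
illustration of the counter-based rule as described in this paper: the function nic_failed and its
boolean arguments are hypothetical, and any additional checks the product performs before declaring a
failure are not modeled.

```python
def nic_failed(mode: str, inbound_stalled: bool, outbound_stalled: bool) -> bool:
    """Decide whether a NIC is declared failed from its stalled counters.

    mode is one of the two detection settings named in this paper.
    The *_stalled flags say whether each message count stopped
    incrementing for the configured amount of time.
    """
    if mode == "INOUT":
        # Default method: both directions must stop.
        return inbound_stalled and outbound_stalled
    if mode == "INONLY_OR_INOUT":
        # Enhanced method: inbound-only stalls count as failures too.
        inbound_only = inbound_stalled and not outbound_stalled
        both = inbound_stalled and outbound_stalled
        return inbound_only or both
    raise ValueError(f"unknown detection mode: {mode}")


# Compare the two settings for a NIC that sends but no longer receives.
for mode in ("INOUT", "INONLY_OR_INOUT"):
    print(mode, nic_failed(mode, inbound_stalled=True, outbound_stalled=False))
```

Under these assumptions, only the INONLY_OR_INOUT setting declares the send-only NIC failed. An
outbound-only stall is not declared a failure by the counters in either mode, because that case is
normally reported through a driver error notification instead.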