Serviceguard Network Manager: Inbound Failure Detection, March 2007
Figure 1. Typical network configuration for a two-node cluster
Before version A.11.16, the default method (INOUT) was the only method of network failure
detection that Serviceguard provided. With this method, Serviceguard marks a NIC as failed only if
both the inbound and outbound statistics of the NIC stop incrementing for a set amount of time,
which guards against transient failures. Serviceguard does not mark the NIC as failed if only the
inbound statistics stop incrementing, or if only the outbound statistics stop incrementing.
However, if a NIC fails to send and its outbound statistics stop increasing, Serviceguard
Network Manager will usually receive an error notification from the driver and immediately mark the
NIC as failed.
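The INOUT rule described above can be illustrated with a minimal sketch. This is not Serviceguard code: the NicSample structure, the function names, and the idea of a list of counter samples covering the detection window are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NicSample:
    """One poll of a NIC's packet counters (hypothetical structure)."""
    inbound: int
    outbound: int


def stalled(values):
    """Return True if a counter never increased between consecutive samples."""
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


def inout_failed(samples):
    """INOUT rule: fail the NIC only if BOTH counters stalled for the whole window.

    `samples` is assumed to cover the configured detection time, oldest first.
    """
    if len(samples) < 2:
        return False  # not enough history to judge
    inbound_stalled = stalled([s.inbound for s in samples])
    outbound_stalled = stalled([s.outbound for s in samples])
    return inbound_stalled and outbound_stalled
```

Under these assumptions, a NIC whose inbound counter is frozen while its outbound counter keeps rising is not reported as failed, which is exactly the gap the inbound failure detection enhancement addresses.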
Inbound failure detection enhancement
A new enhanced method of failure detection was introduced in Serviceguard A.11.16 to manage
inbound-only failures.
Why the enhancement was made
An application could hang when a NIC stopped receiving but continued to send, even though users
expected a local LAN failover to resolve the situation. However, as described in the previous section, the
design of Serviceguard Network Manager before version A.11.16 required both inbound and
outbound message counts to stop incrementing before it would declare a NIC failed. A NIC was not
declared failed if only its inbound message count stopped increasing, because there was no way to
tell why the NIC stopped receiving. Possible causes included a broken cascaded cable (see
Figure 3), a problem with the logic inside the switch, or a problem with the NIC itself.
In response to customer requests, the failure detection mechanism in Serviceguard was enhanced so
that a NIC is declared failed in situations where only its inbound message count stops
incrementing. The new setting, INONLY_OR_INOUT, requires a thorough evaluation of the cluster
environment to ensure that no single point of failure causes unnecessary failovers. These
situations are documented later in this paper.
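The difference between the two settings can be summarized in a short sketch. The function name and the boolean inputs are hypothetical, and the INONLY_OR_INOUT branch is inferred from the setting's name and the description above (inbound-only, or inbound and outbound together), not taken from Serviceguard source code.

```python
def nic_failed(mode, inbound_stalled, outbound_stalled):
    """Decide whether counter statistics alone mark a NIC as failed.

    Driver error notifications (for example, on a failed send) are a
    separate path and are not modeled here.
    """
    if mode == "INOUT":
        # Default: both directions must have stopped incrementing.
        return inbound_stalled and outbound_stalled
    if mode == "INONLY_OR_INOUT":
        # Enhanced: inbound alone, or inbound together with outbound.
        return inbound_stalled
    raise ValueError(f"unknown failure detection mode: {mode!r}")


# Compare the two modes for every combination of stalled counters.
for inbound in (False, True):
    for outbound in (False, True):
        print(inbound, outbound,
              nic_failed("INOUT", inbound, outbound),
              nic_failed("INONLY_OR_INOUT", inbound, outbound))
```

The only case in which the two modes disagree is a stalled inbound counter with a still-incrementing outbound counter, which is why the enhanced setting calls for a careful review of anything upstream of the NIC that could interrupt inbound traffic.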