Serviceguard Network Manager: Inbound Failure Detection, March 2007

2. Serviceguard Network Manager starts polling from the problem NIC to all other NICs on the same
bridged network in the entire cluster. This is known as “full polling.”
3. If the NIC gets a response from any peer NIC, then the NIC with the initial problem is considered
healthy and there is no need for a failover. This situation can happen in the broken cascaded
cable example shown in Figure 3.
4. At any time, if Serviceguard Network Manager detects that the NIC with the initial problem is
again receiving messages from its original poller, full polling is discontinued.
5. If full polling does not increment the inbound statistics for a NIC for a predetermined period of
time, the NIC is considered unable to communicate with the rest of the network. Several possible failures
could cause this problem, but most can be corrected by a LAN failover. Serviceguard Network
Manager therefore assumes that a local failover is needed, marks the NIC failed, and starts
the local switch procedure. If there is no standby LAN, the packaged application can fail over to
another node if the affected subnet is configured in its package’s SUBNET parameter and if the
other node is configured to run it.
6. The NIC will be identified as fully functioning when it can both send and receive messages again
consistently. This procedure guards against continual failures (for example, every few seconds).
However, if failures happen more intermittently (for example, if several minutes pass between
failures), Serviceguard will fail the connection back and forth, since there is no way to tell if the
problem is transient. This is the rule with both INOUT and INONLY_OR_INOUT. For INONLY_OR_INOUT, it
applies whether the NIC failed due to inbound traffic only or both inbound and outbound traffic.
The switchback procedure is started once the NIC is functioning again if it is the primary NIC.
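The polling steps above can be sketched as a small state machine. This is only an illustrative model, not Serviceguard code: the class name NicMonitor, the full_poll_limit parameter (standing in for the "predetermined period of time"), and the per-interval tick() call are all assumptions made for the example. Recovery and switchback (step 6) are not modeled.

```python
from dataclasses import dataclass
from enum import Enum


class NicState(Enum):
    UP = "up"
    FULL_POLLING = "full polling"
    FAILED = "failed"


@dataclass
class NicMonitor:
    """Hypothetical model of one monitored NIC (names are assumptions)."""
    full_poll_limit: int = 3            # assumed count of silent intervals tolerated
    state: NicState = NicState.UP
    silent_intervals: int = 0

    def tick(self, heard_original_poller: bool, inbound_count_increased: bool) -> NicState:
        """Advance one polling interval and return the resulting state."""
        if self.state is NicState.FAILED:
            # Recovery and switchback (step 6) are outside this sketch.
            return self.state
        if heard_original_poller:
            # Step 4: the original poller is heard again, so full polling stops.
            self.state = NicState.UP
            self.silent_intervals = 0
            return self.state
        if self.state is NicState.UP:
            # Step 2: the original poller went quiet, so start full polling.
            self.state = NicState.FULL_POLLING
            self.silent_intervals = 0
            return self.state
        if inbound_count_increased:
            # Step 3: some peer answered; the NIC is healthy, no failover.
            self.silent_intervals = 0
            return self.state
        # Step 5: no inbound traffic at all during full polling.
        self.silent_intervals += 1
        if self.silent_intervals >= self.full_poll_limit:
            self.state = NicState.FAILED   # local switch would start here
        return self.state
```

In this sketch a NIC that hears any peer during full polling stays in the full-polling state with its counter reset, which is one reading of steps 3 and 4; the actual timing and bookkeeping are internal to Serviceguard.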
The following table summarizes Serviceguard Network Manager behaviors for each type of NIC
failure.
Failure type     INOUT                                INONLY_OR_INOUT
Inbound fails    NIC not marked down                  NIC marked down
Outbound fails   NIC marked down upon driver notice   NIC marked down upon driver notice
Both fail        NIC marked down                      NIC marked down
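The table can also be read as a simple decision rule. The function below is a minimal sketch of that rule; the name nic_marked_down and its boolean arguments are hypothetical, and it assumes that the driver does notice an outbound failure.

```python
def nic_marked_down(mode: str, inbound_failed: bool, outbound_failed: bool) -> bool:
    """Return True if the table says the NIC is marked down (illustrative only)."""
    if mode not in ("INOUT", "INONLY_OR_INOUT"):
        raise ValueError(f"unknown failure detection mode: {mode!r}")
    if outbound_failed:
        # Outbound-only or both directions: marked down in either mode
        # (for outbound-only, assuming the driver has noticed the failure).
        return True
    if inbound_failed:
        # Inbound-only: this is the one case where the two modes differ.
        return mode == "INONLY_OR_INOUT"
    return False
```

For example, nic_marked_down("INOUT", True, False) is False, while the same inbound-only failure under INONLY_OR_INOUT returns True.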
Risks associated with INONLY_OR_INOUT
There are cases when the INONLY_OR_INOUT setting can actually make the situation worse by
failing over when the inbound traffic has stopped, especially when the network configuration is not
highly available.
Examples of network configurations with risks
In all of the following examples, the assumption is that each failure happens during a time when the only
traffic generated and contributing to the statistics is Serviceguard polling messages. Also, assume all
subnets in the cluster are monitored in the packages.
Example 1: Single-point-of-failure created by INONLY_OR_INOUT. Use of the INONLY_OR_INOUT
setting for network failure detection in this environment could actually create a
single-point-of-failure that would not exist if the default setting were used. Figure 4 shows the
INONLY_OR_INOUT setting being used with Switch C connected to the network with no redundancy.
In this case, Switch E becomes a single-point-of-failure. If Switch E fails, the whole subnet will be
identified as failed, and clients will not be able to access the applications running on either cluster
member node. This configuration is better suited for the default setting of INOUT, which would not
identify any of the NICs as failed and applications on Node A would continue to be available to
clients.