Serviceguard Network Manager: Inbound Failure Detection, March 2007

2. Serviceguard Network Manager starts polling from the problem NIC to all other NICs on the same

bridged network in the entire cluster. This is known as “full polling.”

3. If the NIC gets a response from any peer NIC, then the NIC with the initial problem is considered

healthy and there is no need for a failover. This situation can happen in the broken cascaded

cable example shown in Figure 3.

4. At any time, if Serviceguard Network Manager detects that the NIC with the initial problem is

again receiving messages from its original poller, full polling is discontinued.

5. If full polling does not increment the inbound statistics for a NIC for a predetermined period of

time, the NIC is not able to communicate with the rest of the network. Several possible failures

could cause this problem, but most can be corrected by a LAN failover. Serviceguard Network

Manager will make the assumption that a local failover is needed, mark the NIC failed, and start

the local switch procedure. If there is no standby LAN, the packaged application can fail over to

another node if the affected subnet is configured in its package’s SUBNET parameter and if the

other node is configured to run it.

6. The NIC will be identified as fully functioning when it can both send and receive messages again

consistently. This procedure guards against continual failures (for example, every few seconds).

However, if failures happen more intermittently (for example, if several minutes pass between each

failure), Serviceguard will fail the connection back and forth, since there is no way to tell if the

problem is transient. This is the rule with both INOUT and INONLY_OR_INOUT. For INONLY_OR_INOUT, it

applies whether the NIC failed due to inbound traffic only or both inbound and outbound traffic.

The switchback procedure is started once the NIC is functioning again if it is the primary NIC.
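
The decision flow in steps 3 through 5 can be sketched in a few lines of Python. This is only an illustration of the logic described above, not Serviceguard code: the names PollState, Action, decide, and timeout_seconds are hypothetical, and the "predetermined period of time" is passed in as a parameter because its value is not given here.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    STOP_FULL_POLLING = "original poller heard again; stop full polling"
    NIC_HEALTHY = "a peer NIC answered; no failover needed"
    KEEP_POLLING = "timeout not reached; keep full polling"
    LOCAL_SWITCH = "mark NIC failed; start local switch to standby LAN"
    PACKAGE_FAILOVER = "mark NIC failed; package can fail over to another node"
    NIC_FAILED_ONLY = "mark NIC failed; no failover target available"


@dataclass
class PollState:
    # All field names are assumptions made for this sketch.
    heard_original_poller: bool     # step 4
    heard_any_peer: bool            # step 3
    seconds_without_inbound: float  # step 5: inbound statistics not incrementing
    has_standby_lan: bool
    subnet_in_package: bool         # subnet listed in the package's SUBNET parameter
    other_node_can_run: bool        # another node is configured to run the package


def decide(state: PollState, timeout_seconds: float) -> Action:
    """Return the action suggested by steps 3-5 for one NIC under full polling."""
    if state.heard_original_poller:
        return Action.STOP_FULL_POLLING   # step 4 applies at any time
    if state.heard_any_peer:
        return Action.NIC_HEALTHY         # step 3
    if state.seconds_without_inbound < timeout_seconds:
        return Action.KEEP_POLLING
    # Step 5: no inbound traffic for the whole period, so the NIC is marked failed.
    if state.has_standby_lan:
        return Action.LOCAL_SWITCH
    if state.subnet_in_package and state.other_node_can_run:
        return Action.PACKAGE_FAILOVER
    return Action.NIC_FAILED_ONLY


if __name__ == "__main__":
    silent_nic = PollState(False, False, 10.0, True, True, True)
    print(decide(silent_nic, timeout_seconds=5.0).value)
```

The check for the original poller comes first because step 4 says it can end full polling at any time. The last branch, where no standby LAN and no eligible node exist, is not spelled out in the steps above and is included only so the function always returns a value.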

The following table summarizes Serviceguard Network Manager behaviors for each type of NIC

failure.

Failure type     INOUT                                INONLY_OR_INOUT
Inbound fails    NIC not marked down                  NIC marked down
Outbound fails   NIC marked down upon driver notice   NIC marked down upon driver notice
Both fail        NIC marked down                      NIC marked down
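
The table can also be read as a small truth function. The Python sketch below encodes it for illustration only. The function name nic_marked_down and its boolean arguments are hypothetical, and the timing of the outbound case (the NIC is marked down only once the driver notices) is not modeled.

```python
def nic_marked_down(mode: str, inbound_fails: bool, outbound_fails: bool) -> bool:
    """Return True if the table says the NIC is marked down.

    mode is one of the two settings discussed in this paper.
    """
    if mode not in ("INOUT", "INONLY_OR_INOUT"):
        raise ValueError(f"unknown failure detection mode: {mode!r}")
    if outbound_fails:
        # Rows "Outbound fails" and "Both fail": marked down in either mode.
        return True
    if inbound_fails:
        # Row "Inbound fails": only INONLY_OR_INOUT marks the NIC down.
        return mode == "INONLY_OR_INOUT"
    return False  # no failure, nothing to mark


if __name__ == "__main__":
    for mode in ("INOUT", "INONLY_OR_INOUT"):
        print(mode, "inbound-only failure ->", nic_marked_down(mode, True, False))
```

The only cell where the two settings differ is the inbound-only failure, which is the case the risk examples below depend on.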

Risks associated with INONLY_OR_INOUT

There are cases when the INONLY_OR_INOUT setting can actually make the situation worse by

failing over when the inbound traffic has stopped, especially when the network configuration is not

highly available.

Examples of network configurations with risks

In all of the following examples, the assumption is that each happens during a time when the only

traffic generated and contributing to the statistics is Serviceguard polling messages. Also, assume all

subnets in the cluster are monitored in the packages.

Example 1: Single-point-of-failure created by INONLY_OR_INOUT—Use of the INONLY_OR_INOUT

setting for network failure detection functionality in this environment could actually create a

single-point-of-failure that would not exist if the default setting was used. Figure 4 shows the

INONLY_OR_INOUT setting being used with Switch C connected to the network with no redundancy.

In this case, Switch E becomes a single-point-of-failure. If Switch E fails, the whole subnet will be

identified as failed, and clients will not be able to access the applications running on either cluster

member node. This configuration is better suited for the default setting of INOUT, which would not

identify any of the NICs as failed and applications on Node A would continue to be available to

clients.