Troubleshooting guide

Chapter 9 Troubleshooting Active Network Management Fail-over in High Availability Applications

Advanced Technical Reference Guide 4.1 • June 2000

Table 1: HA Cluster machine states

State      Explanation

DEAD

INIT       (In practice this is very similar to DEAD.)

STANDBY    (Possible in HA modes only, not in Load Balancing (LB) mode.)

READY      This is a transient state that should usually not last more than a fraction of a second. This state is used when a machine wants to change its state to ACTIVE. It first changes its state to READY, and when this state is confirmed by all other (not dead) machines in the cluster, the state of the machine is changed to ACTIVE.

ACTIVE     The machine is filtering packets. In HA modes this means all packets. In LB mode every active machine filters some of the connections.

The state of a machine is usually determined by the machine itself (other machines only record the state reported). However, in two cases a machine may determine the state of another machine:

If machine A has not heard from machine B for more than 1 second, machine A changes the state of machine B to DEAD. Before doing so, starting about 0.7 seconds after it last heard from machine B, machine A sends FWHAP_QUERY packets to machine B every 0.1 seconds. This means that even if the timer on machine B is not accurate, or one of the FWHAP_MY_STATE packets it sent did not reach machine A, machine B should not be declared DEAD while it is still alive.
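The timing described above can be sketched as follows. This is a minimal illustration, not the actual HA module logic: the function and constant names are hypothetical, the "about 0.7 seconds" threshold is treated as exact, and times are kept in integer milliseconds to avoid floating-point rounding.

```python
# Thresholds taken from the text; the names are hypothetical.
DEAD_TIMEOUT_MS = 1000    # silence longer than this marks the peer DEAD
QUERY_START_MS = 700      # silence after which FWHAP_QUERY packets begin
QUERY_INTERVAL_MS = 100   # spacing between successive FWHAP_QUERY packets


def peer_action(silence_ms: int) -> str:
    """Return what machine A does after silence_ms without hearing from B."""
    if silence_ms > DEAD_TIMEOUT_MS:
        return "mark_dead"      # more than 1 second of silence
    if silence_ms >= QUERY_START_MS:
        return "send_query"     # probing window before the timeout
    return "wait"               # B was heard from recently enough


def query_offsets_ms() -> list[int]:
    """Offsets (ms after B was last heard) at which queries would be sent,
    assuming the first goes out at exactly 700 ms."""
    return list(range(QUERY_START_MS, DEAD_TIMEOUT_MS, QUERY_INTERVAL_MS))
```

Under these assumptions machine B is probed several times before the 1 second timeout expires, which is why a single lost FWHAP_MY_STATE packet should not cause a false DEAD verdict.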

Machine A may refuse to confirm the state of machine B. This does not prevent machine B from being in that state, but it does prevent machine B from changing to a higher state. This is usually used to block a machine from changing from READY to ACTIVE (by not confirming the READY state).
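The confirmation rule for the READY to ACTIVE transition can be illustrated with a short sketch. The function name and the data structures (a map of recorded peer states and a set of peers that have confirmed) are assumptions made for illustration; the guide does not describe how the HA module stores this information.

```python
def may_become_active(peer_states: dict[str, str], confirmed_by: set[str]) -> bool:
    """Return True when every peer not recorded as DEAD has confirmed READY.

    peer_states  -- hypothetical map of peer name to its recorded state
    confirmed_by -- hypothetical set of peers that confirmed our READY state
    """
    live_peers = {name for name, state in peer_states.items() if state != "DEAD"}
    # A single live peer withholding confirmation blocks the transition.
    return live_peers <= confirmed_by
```

For example, with peers `{"B": "STANDBY", "C": "DEAD"}` only B's confirmation is needed, because C is recorded as DEAD and is ignored.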

In HA mode exactly one machine should be ACTIVE at a time. Two machines may never be ACTIVE at the same time. When one machine goes down and the other comes up, there may be a short period, typically no more than the round-trip time between machines in the cluster, during which one machine is READY but none is ACTIVE.

Apart from outright machine failure, in which the machine can no longer send packets (and is therefore detected as DEAD by the timeout mechanism described above), there may be other situations in which the machine should not remain active and should instead fail over to a standby machine. This is implemented by allowing problems to be reported to the HA module.

Problem Detection Devices

A problem is reported by a "Problem Detection Device" by indicating the "highest" state that the device allows the HA module to be in (i.e. DEAD < INIT < STANDBY < READY < ACTIVE). For example, when an interface problem is detected by the interface active check device (a built-in problem detection device, see Interface Active Check Device below), it blocks the state of the HA module at DEAD. When the interfaces are OK again, the interface active check device reports a blocking state of "ACTIVE" (in effect allowing all states). This does not change the state of the machine to ACTIVE; it only allows it. The machine may still be blocked by other devices, or may remain in the STANDBY state because another machine is active.
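The state ordering and the blocking behaviour can be modelled in a few lines. This is a sketch under one assumption that the text implies but does not state outright: when several devices report, the most restrictive report sets the ceiling. The `State` enum and the `allowed_ceiling` function are hypothetical names.

```python
from enum import IntEnum


class State(IntEnum):
    """Machine states in the order given in the text (lowest to highest)."""
    DEAD = 0
    INIT = 1
    STANDBY = 2
    READY = 3
    ACTIVE = 4


def allowed_ceiling(device_reports: dict[str, State]) -> State:
    """Highest state the HA module may enter, given each device's report.

    Assumption: the lowest reported state wins. With no devices
    reporting, nothing blocks the machine, so the ceiling is ACTIVE.
    """
    return min(device_reports.values(), default=State.ACTIVE)
```

A ceiling of ACTIVE only permits the machine to become ACTIVE. Whether it does so still depends on the other machines in the cluster.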

Interface Active Check Device

The interface active check is a built-in problem detection device that is one of the components of the HA mechanism. The cluster initiates a packet (FWHAP_MY_STATE) that runs through the control interfaces of all the modules and checks the status of the interfaces.

Problem Notification Device (pnot)

The Problem Notification Device (pnot) allows external devices to register and report problems through it to the HA module.
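The register-then-report pattern can be pictured with a small in-memory model. Everything in this sketch is hypothetical: the guide does not describe the pnot interface, so the class, its method names, and the boolean problem flag are illustrative assumptions rather than the real API.

```python
class ProblemNotifier:
    """Hypothetical in-memory stand-in for the pnot device."""

    def __init__(self) -> None:
        # Maps device name to True when that device currently reports a problem.
        self._reports: dict[str, bool] = {}

    def register(self, device: str) -> None:
        """Add an external device; it starts with no problem reported."""
        self._reports[device] = False

    def report(self, device: str, problem: bool) -> None:
        """Record the latest report from a previously registered device."""
        if device not in self._reports:
            raise KeyError(f"device {device!r} is not registered")
        self._reports[device] = problem

    def any_problem(self) -> bool:
        """True if at least one registered device currently reports a problem."""
        return any(self._reports.values())
```

In this model a device must register before it can report, and a later report of no problem clears the earlier one.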