Troubleshooting guide

Chapter 9 Troubleshooting Active Network Management Fail-over in High Availability Applications
Advanced Technical Reference Guide 4.1 June 2000 102
Table 1: HA Cluster machine states

State     Explanation
DEAD
INIT      (In practice this is very similar to DEAD.)
STANDBY   (Possible in HA modes only, not in Load Balancing (LB) mode.)
READY     This is a transient state that should usually not last more than a
          fraction of a second. This state is used when a machine wants to
          change its state to ACTIVE. It first changes its state to READY,
          and when this state is confirmed by all other (not dead) machines
          in the cluster the state of the machine is changed to ACTIVE.
ACTIVE    The machine is filtering packets. In HA modes this means all
          packets. In LB mode every active machine filters some of the
          connections.

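
The READY to ACTIVE transition described in Table 1 can be illustrated with a minimal sketch. It is not the product's implementation: the names State, may_become_active, peer_states and confirmations are hypothetical, and the only rule modelled is the one stated above (every machine that is not DEAD must confirm the READY state).

```python
from enum import Enum


class State(Enum):
    """Cluster machine states from Table 1 (names only; hypothetical type)."""
    DEAD = "DEAD"
    INIT = "INIT"
    STANDBY = "STANDBY"
    READY = "READY"
    ACTIVE = "ACTIVE"


def may_become_active(own_state, peer_states, confirmations):
    """Return True if a READY machine may change its state to ACTIVE.

    own_state     -- this machine's current State
    peer_states   -- dict: peer name -> State as last recorded locally
    confirmations -- set of peer names that confirmed our READY state
    """
    if own_state != State.READY:
        return False
    # Only peers that are not DEAD need to confirm.
    live_peers = {name for name, st in peer_states.items()
                  if st != State.DEAD}
    return live_peers <= confirmations


peers = {"B": State.STANDBY, "C": State.DEAD}
print(may_become_active(State.READY, peers, {"B"}))   # C is DEAD, so B alone suffices
print(may_become_active(State.READY, peers, set()))   # B has not confirmed yet
```

In this sketch a peer that withholds its confirmation keeps the machine in READY, which matches the blocking behaviour the guide describes for unconfirmed states.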
The state of a machine is usually determined by the machine itself (other machines only record the reported
state). However, in two cases a machine may determine the state of another machine:
If machine A has not heard from machine B for more than 1 second, machine A changes the state of machine B
to DEAD. Before doing so, starting about 0.7 seconds after it last heard from machine B, machine A sends
FWHAP_QUERY packets to machine B every 0.1 seconds. This means that even if the timer on machine B is
not accurate, or one of the FWHAP_MY_STATE packets it sent did not reach machine A, machine B should not
be declared DEAD while it is still alive.
Machine A may refuse to confirm the state of machine B. This does not prevent machine B from being in that
state, but it does prevent machine B from changing to a higher state. This is usually used to block a machine
from changing from READY to ACTIVE (by not confirming the READY state).
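
The timing of the first case can be sketched as follows. This is an illustration only: the constant and function names are hypothetical, the 0.7 second figure is approximate in the text, and the sketch assumes that queries are sent only while the 1 second limit has not yet been exceeded.

```python
# All times are in milliseconds of silence, measured from the moment
# machine A last heard from machine B. Integers avoid rounding drift.
DEAD_TIMEOUT_MS = 1000    # silence longer than this marks the peer DEAD
QUERY_START_MS = 700      # approximate point at which FWHAP_QUERY probing begins
QUERY_INTERVAL_MS = 100   # spacing between FWHAP_QUERY packets


def peer_action(silence_ms):
    """Classify what machine A does for a given period of silence from B."""
    if silence_ms > DEAD_TIMEOUT_MS:
        return "mark_dead"
    if silence_ms >= QUERY_START_MS:
        return "probing"      # FWHAP_QUERY packets are being sent
    return "wait"


def query_offsets_ms():
    """Silence offsets at which a query goes out before the timeout is reached."""
    return list(range(QUERY_START_MS, DEAD_TIMEOUT_MS, QUERY_INTERVAL_MS))


print(query_offsets_ms())    # [700, 800, 900]
print(peer_action(500))      # wait
print(peer_action(1001))     # mark_dead
```

Under these assumptions machine B gets several chances to answer before it is marked DEAD, so a single lost FWHAP_MY_STATE packet does not by itself trigger a fail-over.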
In HA mode exactly one machine should be ACTIVE at a time. Two machines may never be ACTIVE at the same
time. When one machine goes down and another takes over, there may be a short period, typically no longer
than the round-trip time between machines in the cluster, during which one machine is READY but none is
ACTIVE.
Besides outright machine failure, in which the machine cannot send any more packets (and is therefore
detected as DEAD by the timeout mechanism described above), there may be other situations in which the
machine should not remain active and should instead fail over to a standby machine. This is implemented by
allowing problems to be reported to the HA module.
Problem Detection Devices
A problem is reported by a "Problem Detection Device" by indicating the "highest" state that the device
allows the HA module to be in, where the states are ordered DEAD < INIT < STANDBY < READY < ACTIVE. For
example, when an interface problem is detected by the interface active check device (a built-in problem
detection device, see Interface Active Check Device below), it blocks the state of the HA module at DEAD.
When the interfaces are OK again, the interface active check device reports a blocking state of "ACTIVE" (in
effect allowing all states). This does not change the state of the machine to ACTIVE; it only allows it. The
machine may still be blocked by other devices, or may remain in the STANDBY state because another machine
is active.
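
The blocking rule can be sketched with an ordered enumeration. This is a hedged illustration: State, highest_allowed_state, clamp_state and the device names are hypothetical, and the sketch assumes that when several devices report, the most restrictive report wins, which is consistent with the statement that a machine may still be blocked by other devices.

```python
from enum import IntEnum


class State(IntEnum):
    """States in the order given in the text: DEAD < INIT < STANDBY < READY < ACTIVE."""
    DEAD = 0
    INIT = 1
    STANDBY = 2
    READY = 3
    ACTIVE = 4


def highest_allowed_state(device_reports):
    """Return the highest state that all devices together allow.

    device_reports -- dict: device name -> highest State that device allows.
    With no devices reporting, nothing is blocked (assumption).
    """
    return min(device_reports.values(), default=State.ACTIVE)


def clamp_state(desired, device_reports):
    """Limit a desired state to what the devices allow; never raise it."""
    return min(desired, highest_allowed_state(device_reports))


reports = {"interface_check": State.DEAD, "external_probe": State.ACTIVE}
print(highest_allowed_state(reports).name)            # DEAD

reports["interface_check"] = State.ACTIVE             # interfaces are OK again
print(highest_allowed_state(reports).name)            # ACTIVE
print(clamp_state(State.STANDBY, reports).name)       # STANDBY
```

The last line shows the point made above: a report of ACTIVE only removes the block and does not itself move a STANDBY machine to ACTIVE.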
Interface Active Check Device
The interface active check is a built-in problem detection device that is one of the components of the HA
mechanism. The cluster initiates a packet (FWHAP_MY_STATE) that runs through the control interfaces of all
the modules and checks the status of the interfaces.
Problem Notification Device (pnot)
The Problem Notification Device (pnot) allows external devices to register and report problems
through it to the HA module.