User manual
Failover Time
Failover typically takes about one second, which means that clients may experience a failover as a
brief burst of packet loss. In the case of TCP, the failover time is well within the range of normal
retransmit timeouts, so TCP will retransmit the lost packets within a very short space of time and
continue communication. UDP provides no retransmission since it is inherently an unreliable
protocol, so any recovery of lost UDP packets is left to the application.
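As a rough illustration of why a one second outage is absorbed by TCP, the sketch below lists the
times at which a sender using exponential backoff would retransmit a lost segment, and picks the
first attempt that falls after failover has completed. The initial timeout of 0.2 seconds, the
doubling factor and the function names are illustrative assumptions. They are not values taken from
NetDefendOS or from any particular TCP stack.

```python
def retransmit_times(initial_rto=0.2, attempts=6):
    """Return the cumulative times (in seconds) of successive retransmissions.

    Assumes a simple exponential backoff: the timeout doubles after each attempt.
    The default initial timeout is an illustrative value only.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(attempts):
        elapsed += rto
        times.append(round(elapsed, 3))
        rto *= 2
    return times


def first_retransmit_after(failover_seconds, initial_rto=0.2, attempts=6):
    """Return the time of the first retransmission sent once failover is complete."""
    for t in retransmit_times(initial_rto, attempts):
        if t >= failover_seconds:
            return t
    return None  # no attempt fell after the failover window


if __name__ == "__main__":
    print(retransmit_times())
    print(first_retransmit_after(1.0))
```

Under these assumptions, a segment lost at the start of a one second failover is resent
successfully on the third attempt, long before a typical connection would be abandoned.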
Shared IP Addresses and ARP
Both master and slave know about the shared IP address. ARP queries for the shared IP address, or
any other IP address published via the ARP configuration section or through Proxy ARP, are
answered by the active system. The hardware address of the shared IP address and of other published
addresses is not related to the actual hardware addresses of the interfaces. Instead, the MAC address
is constructed by NetDefendOS in the form 10-00-00-C1-4A-nn, where nn is
derived by combining the Cluster ID configured in the Advanced Settings section with the hardware
bus/slot/port of the interface. The Cluster ID must be unique for each cluster in a network.
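The address format can be illustrated with a short sketch that builds the shared hardware address
from a given final byte. The manual does not state exactly how nn is computed from the Cluster ID
and the bus/slot/port, so the function below takes nn as an input rather than guessing at the
derivation. The function and constant names are hypothetical.

```python
# Fixed first five bytes of the shared hardware address, as given in the text.
SHARED_MAC_PREFIX = (0x10, 0x00, 0x00, 0xC1, 0x4A)


def shared_mac(nn):
    """Format the shared hardware address for a given final byte nn.

    How nn is derived from the Cluster ID and the interface bus/slot/port is
    not specified here, so the caller must supply it (assumption).
    """
    if not 0 <= nn <= 0xFF:
        raise ValueError("nn must fit in one byte")
    return "-".join(f"{b:02X}" for b in SHARED_MAC_PREFIX + (nn,))


if __name__ == "__main__":
    print(shared_mac(0x2B))
```

Because nn depends on the Cluster ID, two clusters given the same Cluster ID on one network could
end up presenting the same hardware address, which is why the ID must be unique.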
As the shared IP address always has the same hardware address, units attached to the same LAN as
the cluster do not need to update their ARP caches when failover occurs, so no delay arises from this.
When a cluster member discovers that its peer is not operational, it broadcasts gratuitous ARP
queries on all interfaces using the shared hardware address as the sender address. This allows
switches to re-learn within milliseconds where to send packets destined for the shared address. The
only delay in failover, therefore, is the time taken to detect that the active unit is down.
ARP queries are also broadcast periodically to ensure that switches do not forget where to send
packets destined for the shared hardware address.
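To make the gratuitous ARP behavior concrete, the following sketch assembles the bytes of a
broadcast ARP query in which the shared hardware address is the sender and the shared IP address
appears as both sender and target. It follows the standard Ethernet and ARP layouts and only
builds the frame; it does not send it. The example addresses and the function names are
hypothetical and are not taken from NetDefendOS.

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert a MAC string such as 10-00-00-C1-4A-01 into 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.replace(":", "-").split("-"))


def gratuitous_arp_frame(shared_mac, shared_ip):
    """Build an Ethernet frame carrying a gratuitous ARP query for shared_ip."""
    sender_mac = mac_to_bytes(shared_mac)
    ip = socket.inet_aton(shared_ip)
    broadcast = b"\xff" * 6

    # Ethernet header: destination, source, EtherType 0x0806 (ARP).
    ethernet = broadcast + sender_mac + struct.pack("!H", 0x0806)

    # ARP header: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP, then an all-zero target MAC and the same IP as target.
    arp += sender_mac + ip + b"\x00" * 6 + ip

    return ethernet + arp


if __name__ == "__main__":
    # Example values only (assumptions for illustration).
    frame = gratuitous_arp_frame("10-00-00-C1-4A-01", "192.168.1.1")
    print(len(frame), frame.hex())
```

A switch that sees this frame arrive on a port records the source hardware address against that
port, which is how traffic for the shared address is redirected to the newly active unit.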
HA with Anti-Virus and IDP
If a NetDefendOS cluster has the Anti-Virus or IDP subsystems enabled, then updates to the
Anti-Virus signature database or IDP pattern database will routinely occur. These updates involve
downloads from the external D-Link databases, and they require a NetDefendOS reconfiguration
for the new database contents to become active.
A database update causes the following sequence of events to occur in an HA cluster:
1. The active (master) unit downloads the new database files from the D-Link servers. The
download is done via the shared IP address of the cluster.
2. The active (master) unit sends the new database files to the inactive (slave) unit.
3. The inactive (slave) unit reconfigures to activate the new database files.
4. The active (master) unit now reconfigures to activate the new database files causing a failover
to the slave unit. The slave is now the active unit.
5. After reconfiguration of the master is complete, failover occurs again so that the master once
again becomes the active unit.
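The sequence above can be sketched as a small trace that records which unit is active after each
step. This is only a model of the five steps as described, assuming the master starts as the
active unit; the function name and the wording of each event are hypothetical.

```python
def database_update_sequence():
    """Return (step, active_unit_after_step, description) for each update step."""
    active = "master"  # assumption: the master is the active unit at the start
    events = []

    events.append((1, active, "master downloads new database files via the shared IP address"))
    events.append((2, active, "master sends the new database files to the slave"))
    events.append((3, active, "slave reconfigures to activate the new database files"))

    # Step 4: the master reconfigures, which causes a failover to the slave.
    active = "slave"
    events.append((4, active, "master reconfigures; failover makes the slave active"))

    # Step 5: the master finishes reconfiguring and takes over again.
    active = "master"
    events.append((5, active, "second failover returns the active role to the master"))

    return events


if __name__ == "__main__":
    for step, active, description in database_update_sequence():
        print(f"{step}. active={active}: {description}")
```

The trace shows that a single database update involves two failovers, with the slave handling
traffic only while the master is reconfiguring.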
Dealing with Sync Failure
An unusual situation that can occur in an HA cluster is a failure of the sync connection between
the master and slave, with the result that heartbeats and state updates are no longer
received by the inactive unit.
Should such a failure occur then the consequence is that both units will continue to function but they
11.2. HA Mechanisms Chapter 11. High Availability










