Providing Open Architecture High Availability Solutions

As noted in Section 6.0, there is a fine line between isolation and recovery. For this section, fault
isolation includes actions that prevent a fault from propagating, but do NOT make the system
function correctly. Actions that change a system from either an inoperative or degraded state to full
operation are considered fault recovery and are covered in Section 6.4. The recovery and fault
isolation steps may be combined inseparably. For example, recovery by redirection of IP addresses
is both an isolation action and a recovery action.
A system can maintain a level of service availability if the fault resides in a component that is
inactive, or that is a backup and does not interact with the active components. In this case the
detected fault may remain present while the system still delivers an acceptable level of service.
Because this latent fault does not impact system service, the isolation step is considered complete.
6.3.2 Objective
The objective of fault isolation is to keep a fault from propagating to other components of the
system. This is done by removing the faulty device from the system. If the hardware supports
actions such as power removal, those actions are also performed. Removal also includes the proper
interactions with the software components associated with the hardware that is powered off.
6.3.3 Concepts
Physical Isolation. To perform isolation at the physical level, the system must provide mechanisms
that prevent the component from interacting with the rest of the system. This requires hardware
mechanisms such as the slot control mechanisms provided in the PICMG 2.1 specification. This
isolation consists of disconnecting the module from the bus and powering it off.
Data Isolation. In a software case, isolation of data can be accomplished with Memory
Management Unit (MMU) support that prevents reads, writes, or both on a page or memory region.
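As a rough illustration of write protection on a memory region, the sketch below maps a file
read-only using Python's standard mmap module. This is only a user-level stand-in for the MMU
mechanisms described above: the helper names map_read_only and try_write, and the demo data,
are hypothetical, and Python itself rejects the write before any hardware fault would occur.

```python
import mmap
import os
import tempfile

def map_read_only(path):
    """Map a file so that its contents can be read but not written."""
    with open(path, "rb") as f:
        # ACCESS_READ requests a read-only mapping of the whole file.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def try_write(region, offset, value):
    """Attempt a one-byte write; return True only if the write went through."""
    try:
        region[offset:offset + 1] = value
    except TypeError:
        return False  # write refused: the region is protected from modification
    return True

# Demo: protect state that a faulty component must not be able to corrupt.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"critical state")
os.close(fd)

protected = map_read_only(demo_path)
write_succeeded = try_write(protected, 0, b"X")
contents = protected[:]  # reads are still allowed

protected.close()
os.remove(demo_path)
```

A real system would rely on page protection enforced by the operating system and MMU for the
faulty component's address space rather than on a language-level check.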
Logical Isolation. In a hardware sense, logical isolation of a component means removing the
device entries from the I/O subsystem so that no further interactions with the hardware are
possible. This is referred to as fencing. It can also include instructing the device driver to
block interactions with the device, or even removing the device driver. In a software sense,
isolation consists of removing the component in a manner consistent with the system’s
capabilities. For example, the following techniques can be used to remove a component: killing a
process or task and removing it from the process table, unloading a loadable library, or removing
offending files in a file system by renaming them or moving them to an unused area.
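Two of the software techniques listed above, terminating a process and moving an offending file
out of the way, can be sketched with the Python standard library. This is a minimal sketch under
stated assumptions: the function names isolate_process and quarantine_file, the file name
plugin.cfg, and the quarantine directory are all hypothetical.

```python
import os
import shutil
import subprocess
import sys
import tempfile

def isolate_process(proc, timeout=5.0):
    """Terminate a faulty child process and reap it so it leaves the process table."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # escalate if the process ignores the polite request
        proc.wait()
    return proc.returncode

def quarantine_file(path, quarantine_dir):
    """Move an offending file to an unused area so it is no longer picked up."""
    os.makedirs(quarantine_dir, exist_ok=True)
    target = os.path.join(quarantine_dir, os.path.basename(path))
    shutil.move(path, target)
    return target

# Demo: a stand-in "faulty" child process and a stand-in offending file.
work_dir = tempfile.mkdtemp()
bad_file = os.path.join(work_dir, "plugin.cfg")
with open(bad_file, "w") as f:
    f.write("corrupt")

child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = isolate_process(child)
moved_to = quarantine_file(bad_file, os.path.join(work_dir, "quarantine"))
```

Neither step restores service. Both only keep the faulty component from interacting further with
the system, which matches the distinction between isolation and recovery drawn above.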
6.3.4 Approach
Multiple techniques can be used for isolation. For a simple board/driver combination, the device
can be turned off and the driver removed. Further interactions with that component will then fail,
but system operation will continue. Components can be removed at various levels of
granularity and hierarchy, from complete applications and drivers down to simply instructing a
driver to isolate itself.
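The idea of isolating at different levels of a hierarchy can be modeled with a small component
tree. The sketch below is illustrative only: the Component class, its isolate and request methods,
and the component names are hypothetical and do not correspond to any particular HA framework.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    children: list = field(default_factory=list)
    isolated: bool = False

    def isolate(self):
        """Fence this component and everything beneath it; return the names fenced."""
        fenced = []
        if not self.isolated:
            self.isolated = True
            fenced.append(self.name)
        for child in self.children:
            fenced.extend(child.isolate())
        return fenced

    def request(self):
        """Simulate an interaction; isolated components refuse all requests."""
        if self.isolated:
            raise RuntimeError(f"{self.name} is isolated")
        return f"{self.name} ok"

# Demo hierarchy: one application using two board/driver pairs.
board_a = Component("board_a")
driver_a = Component("driver_a", [board_a])
driver_b = Component("driver_b", [Component("board_b")])
app = Component("app", [driver_a, driver_b])

# Narrow action: isolate only the faulty board/driver pair.
fenced = driver_a.isolate()
```

Here requests to driver_a now fail, while app and driver_b continue to operate. Calling
app.isolate() instead would fence the whole subtree, which is the coarser end of the range.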
Pushing the isolation action lower in the hierarchy places it closer to the point of detection and
thus makes it more responsive. The tradeoff is added complexity. In general, quick action is
better for preventing propagation.
Localized, fast-acting isolation methods must be weighed against the global impact to best
maintain service availability. For example, if a power supply indicated that it was below a low
voltage threshold and more than one power supply was available, then a quick reaction would
be to simply isolate that power module to prevent it from impacting the system. A more global
impact could be that the power modules as a whole were being asked to provide beyond their