Providing Open Architecture High Availability Solutions

As noted in Section 6.0, there is a fine line between isolation and recovery. For this section, fault isolation includes actions that prevent a fault from propagating, but do NOT make the system function correctly. Actions that change a system from either an inoperative or degraded state to full operation are considered fault recovery and are covered in Section 6.4. The recovery and fault isolation steps may be combined inseparably. For example, recovery by redirection of IP addresses is both an isolation and recovery action.

A system can maintain a level of service availability if the fault resides in a component that is inactive, or that is a backup and does not interact with the active components. In this case the detected failure may remain active while the system is still delivering an acceptable level of service. Because this latent fault is not impacting system service, the isolation step is considered complete.

6.3.2 Objective

The objective of fault isolation is to keep a fault from propagating to other components of the system. This is done by removing the faulty device from service. If the hardware is capable of actions such as power removal, these actions are also performed. Removal also includes the proper interactions with the software components associated with the hardware that is powered off.

6.3.3 Concepts

Physical Isolation. To perform isolation at the physical level, the system must provide mechanisms that prevent the component from interacting with the rest of the system. This requires hardware mechanisms such as the slot control mechanisms provided in the PICMG 2.1 specification. This isolation consists of disconnecting the module from the bus and powering it off.

Data Isolation. In the software case, the equivalent of physical isolation can be accomplished with Memory Management Unit (MMU) support, which can prevent reads, writes, or both on a page or memory region.

Logical Isolation. In a hardware sense, logical isolation of a component means removing the device entries from the I/O subsystem so that no further interactions with the hardware are possible. This is referred to as fencing. It can also include interacting with the device driver to prevent access to the device, or even removing the device driver. In a software sense, isolation consists of removing the component in a manner consistent with the system’s capabilities. For example, the following techniques can be used to remove a component: killing a process or task and removing it from the process table, unloading a loadable library, or removing offending files in a file system by renaming them or moving them to an unused area.
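
The software techniques listed above can be sketched as follows. This is a minimal illustration only, assuming Python 3 and a child process that the isolating code itself started. The function names `isolate_process` and `quarantine_file`, the `.isolated` suffix, and the quarantine directory are hypothetical and are not defined by this document.

```python
import os
import shutil
import subprocess


def isolate_process(proc: subprocess.Popen, grace_seconds: float = 2.0) -> int:
    """Stop a faulty child process and reap it so it leaves the process table."""
    proc.terminate()                      # polite request first (SIGTERM on POSIX)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                       # force removal if it does not exit in time
        return proc.wait()


def quarantine_file(path: str, quarantine_dir: str) -> str:
    """Rename an offending file and move it to an unused area (hypothetical layout)."""
    os.makedirs(quarantine_dir, exist_ok=True)
    target = os.path.join(quarantine_dir, os.path.basename(path) + ".isolated")
    shutil.move(path, target)
    return target
```

Neither function restores service. Both only stop the faulty component from being used again, which matches the distinction between isolation and recovery drawn in this section.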

6.3.4 Approach

Multiple techniques can be used for isolation. For a simple board/driver combination, the device can be turned off and the driver removed. Further interactions with that component will then fail, but system operation will continue. Components can be removed at various levels of granularity and hierarchy, from complete applications and drivers down to just instructing a driver to isolate itself.
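
One way to picture isolation at different levels of granularity is a small component tree, where isolating a node also isolates everything beneath it. The sketch below is only an illustration under that assumption. The `Component` class and the component names are hypothetical and do not correspond to any interface defined in this document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    children: List["Component"] = field(default_factory=list)
    isolated: bool = False

    def isolate(self) -> List[str]:
        """Isolate this component and its dependents; return the names newly isolated."""
        affected: List[str] = []
        for child in self.children:
            affected.extend(child.isolate())   # dependents are isolated first
        if not self.isolated:
            self.isolated = True
            affected.append(self.name)
        return affected


# Fine-grained: only the driver is told to isolate itself; the board stays in service.
driver = Component("driver")
board = Component("board", [driver])
driver.isolate()

# Coarse-grained: isolating the board takes the driver with it.
driver2 = Component("driver")
board2 = Component("board", [driver2])
board2.isolate()
```

The fine-grained call touches less of the system, while the coarse-grained call needs no knowledge of what lives below the board.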

Pushing the action of isolation lower can put it closer to the point of detection and thus make it more responsive. The tradeoff is in complexity. In general, quick action is better for preventing propagation. Localized, fast-acting isolation methods must be weighed against their global impact to best maintain service availability. For example, if a power supply was indicating that it was below a low voltage threshold and there was more than one power supply available, then a quick reaction would be to simply isolate that power module to prevent it from impacting the system. A more global impact could be that the sum of the power modules was being asked to provide beyond their