Providing Open Architecture High Availability Solutions

On-Line Diagnosis

On-line diagnosis is performed while the system is running its normal tasks. This implies that the
fault that created the need for diagnosis was not fatal to the system and did not require a redundant
component to take over for the faulted component. Once the component suspected of having a fault is
removed from normal system operation, further diagnosis is considered off-line.

For environment-centric faults, such as temperature and voltage, it is possible to adjust various
system parameters and loading to pinpoint the faulty component. This can be done while the
system is performing its normal functions.

Intermittent faults are typically best diagnosed on-line, because on-line diagnosis maintains service
availability. Taking an entire system off-line to diagnose one component can reduce the availability
of the system, because multiple redundant components would be tied up with diagnostics and
unavailable as standby components.

Examples of On-Line Diagnostics are:

• Network and Message Faults. These faults can be diagnosed by sending extra test messages.
In many cases the fault can be diagnosed without saturating the communications channel.

• Memory Faults. These faults can be diagnosed by moving programs in memory to allow
exhaustive tests on a block-by-block basis.
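The memory case can be illustrated with a minimal sketch. It assumes memory is modelled as a Python bytearray and that one spare block is reserved to hold relocated contents; the names check_block and online_memory_test, the test patterns, and the layout are hypothetical and are not taken from any particular platform.

```python
def check_block(memory, start, size, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Write each pattern to every byte of the block and read it back.

    Returns True if every byte read back matches what was written.
    """
    for pattern in patterns:
        for addr in range(start, start + size):
            memory[addr] = pattern
        for addr in range(start, start + size):
            if memory[addr] != pattern:
                return False
    return True


def online_memory_test(memory, block_size, spare_start):
    """Test every block below spare_start, one block at a time.

    Assumes spare_start is a multiple of block_size and that the spare
    block [spare_start, spare_start + block_size) lies inside memory.
    Returns the start addresses of the blocks that failed.
    """
    faulty = []
    for start in range(0, spare_start, block_size):
        end = start + block_size
        # Relocate the live contents so the block can be overwritten safely.
        memory[spare_start:spare_start + block_size] = memory[start:end]
        if not check_block(memory, start, block_size):
            faulty.append(start)
        # Copied back unconditionally in this sketch; a real system would
        # leave the contents in the spare block if the test failed.
        memory[start:end] = memory[spare_start:spare_start + block_size]
    return faulty
```

Only one block is out of normal use at any time, which is what allows the test to run while the rest of memory continues to serve the system.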

It is also possible to run a set of on-line diagnostics continuously. This can be considered a
form of fault detection followed by implicit diagnosis. Running these procedures during normal
operation reduces system performance, so the system architect must determine the appropriate
trade-off.
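One rough way to reason about this trade-off is sketched below. It assumes a single diagnostic pass that takes a fixed time, is started at a fixed period, and only catches faults already present when the pass begins. Under those assumptions the performance cost is the fraction of time spent in diagnostics, and the worst-case detection delay is one period plus one pass. The function names and the model itself are illustrative assumptions, not part of the original text.

```python
def diagnostic_tradeoff(run_time_s, period_s):
    """Return (overhead_fraction, worst_case_latency_s) for a periodic pass.

    run_time_s: time one diagnostic pass takes (assumed constant).
    period_s:   time from the start of one pass to the start of the next.
    """
    if run_time_s <= 0 or period_s < run_time_s:
        raise ValueError("need 0 < run_time_s <= period_s")
    overhead = run_time_s / period_s
    # A fault that appears just after a pass starts is missed by that pass
    # and is reported only when the next pass completes.
    worst_case_latency_s = period_s + run_time_s
    return overhead, worst_case_latency_s


def shortest_period(run_time_s, max_overhead):
    """Smallest period that keeps the overhead at or below max_overhead."""
    if not 0 < max_overhead <= 1:
        raise ValueError("max_overhead must be in (0, 1]")
    return run_time_s / max_overhead
```

For example, under this model a 2-second pass run every 100 seconds costs 2% of the time and bounds detection delay at 102 seconds; shortening the period lowers the delay and raises the cost.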

6.2.5 Dependencies

Proper diagnosis depends on the following items:

• The ability of the system and/or its components to detect faults accurately

• Information that shows the dependencies between components (an understanding of which
components could cause a fault in which others)

• Information about the fault detectors available to catch faults in components that are
part of the dependency tree for a faulted component

• Effective diagnostic routines (best written by those with intimate knowledge of
the component)
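The role of dependency and detector information can be sketched as follows. This is a minimal illustration that assumes the dependencies are available as a simple mapping from each component to the components it depends on, and that each detector is associated with the components it watches. All component names, detector names, and function names here are hypothetical.

```python
def dependency_tree(component, depends_on):
    """Return the component plus everything it depends on, directly or not."""
    seen = set()
    stack = [component]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(depends_on.get(current, []))
    return seen


def narrow_suspects(faulted, depends_on, detectors, tripped):
    """Split the dependency tree of a faulted component into two lists.

    flagged:   components with at least one tripped detector.
    uncovered: components with no detector at all, which therefore
               cannot be ruled out from detector data alone.
    """
    tree = dependency_tree(faulted, depends_on)
    flagged = sorted(
        c for c in tree if any(d in tripped for d in detectors.get(c, []))
    )
    uncovered = sorted(c for c in tree if not detectors.get(c))
    return flagged, uncovered


# Hypothetical system: an application that depends on a CPU board and a network.
depends_on = {
    "app": ["cpu_board", "network"],
    "cpu_board": ["power_supply", "fan"],
    "network": ["power_supply"],
}
detectors = {
    "power_supply": ["voltage_sensor"],
    "fan": ["tach_sensor"],
    "cpu_board": ["temp_sensor"],
}
flagged, uncovered = narrow_suspects("app", depends_on, detectors, {"voltage_sensor"})
```

In this example the tripped voltage sensor points at the power supply, while the application and the network remain candidates only because nothing monitors them, which is where diagnostic routines would be directed next.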

6.3 Fault Isolation

6.3.1 Introduction

The process of isolation takes the defective component of the system out of service. The region that
is isolated must be bounded at a point where it can be removed from all interactions with the
system. Isolation is intended to insulate the system from the fault so that the fault does not cause
secondary failures. Isolating a fault in a running system allows the system to be maintained in an
operating state. Fault isolation can be accomplished at both physical and logical boundaries.
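The idea of a bounded isolation region can be sketched in a few lines. The sketch assumes the system is described by a set of regions (for example slots or partitions), each containing a set of components, and by a set of pairwise interactions between components. These data structures, the region names, and the function name isolate are hypothetical and serve only to illustrate the boundary concept.

```python
def isolate(faulted, regions, links):
    """Isolate the region that contains the faulted component.

    regions: dict mapping a region name to the set of components inside it.
    links:   set of (a, b) pairs, each an interaction between two components.
    Returns (region_name, remaining_links, severed_links).
    """
    for name, members in regions.items():
        if faulted in members:
            break
    else:
        raise KeyError(f"{faulted!r} is not in any known region")

    # Every interaction that touches the isolated region is cut, including
    # links to healthy components that share the region with the fault.
    severed = {(a, b) for (a, b) in links if a in members or b in members}
    return name, links - severed, severed


# Hypothetical two-slot system sharing one switch.
regions = {"slot1": {"cpu1", "mem1"}, "slot2": {"cpu2", "mem2"}}
links = {
    ("cpu1", "mem1"),
    ("cpu1", "switch"),
    ("cpu2", "switch"),
    ("cpu2", "mem2"),
}
region, remaining, severed = isolate("mem1", regions, links)
```

Here a fault in mem1 takes all of slot1 out of service because the slot is the smallest boundary at which every interaction can be cut, while slot2 and its links continue to operate.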