Providing Open Architecture High Availability Solutions
On-Line Diagnosis
On-line diagnosis is performed while the system is running its normal tasks. This implies that the
fault that created the need for diagnosis was not fatal to the system and did not require a redundant
component to take over for the faulted component. Once the component suspected of having a fault is
removed from normal system operation, further diagnosis is considered off-line.
For environment-centric faults, such as temperature and voltage, it is possible to adjust various
system parameters and loading to pinpoint the faulty component. This can be done while the
system is performing its normal functions.
Intermittent faults are typically best diagnosed on-line, because on-line diagnosis maintains service
availability. Taking an entire system off-line to diagnose one component can reduce the availability
of the system, because multiple redundant components would be tied up with diagnostics and
unavailable as standby components.
Examples of On-Line Diagnostics are:
• Network and Message Faults. These faults can be diagnosed by sending extra test messages.
In many cases the fault can be diagnosed without saturating the communications channel.
• Memory Faults. These faults can be diagnosed by relocating programs in memory so that each
freed block can be tested exhaustively, one block at a time.
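The memory example above can be illustrated with a minimal sketch. It assumes a simulated memory rather than real hardware: the class SimulatedMemory, the test patterns, and the single spare block used for relocation are all hypothetical choices made for illustration, not part of any particular platform. Live contents are moved to the spare block, the vacated block is tested destructively, and the contents are moved back only if the block passes.

```python
from typing import Optional, Tuple

PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # assumed test patterns, not from the source


class SimulatedMemory:
    """Toy memory made of fixed-size blocks (hypothetical model).

    `stuck` optionally names one cell as (block, offset, bitmask) whose
    masked bits can never store a 1, to imitate a hardware fault.
    """

    def __init__(self, num_blocks: int, block_size: int,
                 stuck: Optional[Tuple[int, int, int]] = None) -> None:
        self.block_size = block_size
        self.blocks = [bytearray(block_size) for _ in range(num_blocks)]
        self.stuck = stuck

    def write(self, block: int, offset: int, value: int) -> None:
        if self.stuck is not None and self.stuck[:2] == (block, offset):
            value &= ~self.stuck[2] & 0xFF  # stuck-at-0 bits are cleared
        self.blocks[block][offset] = value

    def read(self, block: int, offset: int) -> int:
        return self.blocks[block][offset]


def pattern_test(mem: SimulatedMemory, block: int) -> bool:
    """Destructively test one block; return True if every pattern reads back."""
    for pattern in PATTERNS:
        for offset in range(mem.block_size):
            mem.write(block, offset, pattern)
        for offset in range(mem.block_size):
            if mem.read(block, offset) != pattern:
                return False
    return True


def diagnose_online(mem: SimulatedMemory, spare: int) -> Optional[int]:
    """Test every block except `spare`, one at a time.

    Returns the index of the first faulty block, or None if all pass.
    When a block fails, its former contents remain in the spare block.
    """
    for block in range(len(mem.blocks)):
        if block == spare:
            continue
        mem.blocks[spare][:] = mem.blocks[block]   # relocate live contents
        if not pattern_test(mem, block):
            return block
        mem.blocks[block][:] = mem.blocks[spare]   # move contents back
    return None
```

In a real system the relocation step would have to be coordinated with the memory manager so that running programs follow their data; the sketch leaves that out.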
It is also possible to continuously run a set of on-line diagnostics. Doing this can be considered a
form of fault detection, followed by implicit diagnosis. Running these procedures during normal
operation reduces system performance, so the system architect must determine the appropriate
trade-off.
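One way to make this trade-off explicit is to give the diagnostics a fixed time budget per scheduling cycle. The following is a minimal sketch under that assumption; the Diagnostic record, the pre-measured cost_ms values, and the round-robin policy are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Diagnostic:
    name: str
    cost_ms: float              # assumed, pre-measured cost of one run
    check: Callable[[], bool]   # returns True when the component is healthy


def run_cycle(diags: List[Diagnostic], start: int,
              budget_ms: float) -> Tuple[List[str], int]:
    """Run diagnostics round-robin from `start` without exceeding the budget.

    Returns the names of the diagnostics that failed and the index at which
    the next cycle should resume, so every diagnostic eventually runs.
    """
    failed: List[str] = []
    spent = 0.0
    ran = 0
    i = start
    while ran < len(diags) and spent + diags[i].cost_ms <= budget_ms:
        diag = diags[i]
        if not diag.check():
            failed.append(diag.name)
        spent += diag.cost_ms
        ran += 1
        i = (i + 1) % len(diags)
    return failed, i
```

A larger budget_ms shortens the time to detect a fault but takes more capacity away from normal operation; a smaller one does the opposite.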
6.2.5 Dependencies
Proper diagnosis depends on the following items:
• The ability of the system and/or its components to provide accurate detection of faults
• Having information that shows dependencies between components (an understanding of which
components could cause a fault in which others)
• Having information about the fault detectors available to catch faults in components that are
part of the dependency tree for a faulted component
• Having effective diagnostic routines (best written by those with intimate knowledge of
the component)
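The dependency and detector information listed above can be sketched as two simple tables. The component names, the detector names, and the functions suspects and diagnosis_plan are all hypothetical; the sketch only shows how a diagnosis routine might walk the dependency tree of a faulted component and look up which detectors cover each suspect.

```python
from collections import deque
from typing import Dict, List

# Hypothetical data: each component maps to the components it depends on,
# that is, the components that could cause a fault in it.
DEPENDS_ON: Dict[str, List[str]] = {
    "app_server": ["cpu_board", "network_link"],
    "cpu_board": ["power_supply", "fan_tray"],
    "network_link": ["switch"],
    "switch": ["power_supply"],
}

# Hypothetical data: fault detectors available for each component.
DETECTORS: Dict[str, List[str]] = {
    "power_supply": ["voltage_sensor"],
    "fan_tray": ["tach_sensor", "temp_sensor"],
    "network_link": ["heartbeat"],
}


def suspects(faulted: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Return the faulted component and everything it transitively depends
    on, in breadth-first order (nearest dependencies first)."""
    seen = [faulted]
    queue = deque([faulted])
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, []):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen


def diagnosis_plan(faulted: str, depends_on: Dict[str, List[str]],
                   detectors: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each suspect to the detectors that can be consulted for it.
    An empty list marks a suspect with no detector coverage."""
    return {c: detectors.get(c, []) for c in suspects(faulted, depends_on)}
```

Suspects that map to an empty detector list are the ones for which dedicated diagnostic routines are needed.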
6.3 Fault Isolation
6.3.1 Introduction
The process of isolation takes the defective component of the system out of service. The region that
is isolated must be bounded at a point where it can be removed from all interactions with the
system. The isolation is intended to insulate the system from the fault so it does not cause
secondary failures. Isolating a fault in a running system allows the system to remain in an
operating state. Fault isolation can be accomplished at physical boundaries as well as logical
boundaries.
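Logical isolation of a bounded region can be sketched as follows. The System class, its notion of named regions, and the representation of interactions as links between components are assumptions made for illustration. Isolating a component takes its whole enclosing region out of service and cuts every interaction that crosses or lies inside the region boundary.

```python
from typing import Dict, Iterable, Set, Tuple


class System:
    """Hypothetical model: components grouped into isolation regions,
    with undirected links standing for interactions between components."""

    def __init__(self, regions: Dict[str, Set[str]],
                 links: Iterable[Tuple[str, str]]) -> None:
        self.regions = regions
        self.links = {frozenset(link) for link in links}
        self.out_of_service: Set[str] = set()

    def region_of(self, component: str) -> str:
        for name, members in self.regions.items():
            if component in members:
                return name
        raise KeyError(component)

    def isolate(self, component: str) -> str:
        """Take the region containing `component` out of service and remove
        every link that touches it. Returns the isolated region's name."""
        region = self.region_of(component)
        members = self.regions[region]
        self.out_of_service |= members
        self.links = {link for link in self.links if not (link & members)}
        return region

    def can_interact(self, a: str, b: str) -> bool:
        return frozenset((a, b)) in self.links
```

For example, if a region named "board1" holds a CPU and its memory, isolating the memory also removes the CPU from service, because the board is the smallest boundary at which all interactions with the rest of the system can be cut.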