Providing Open Architecture High Availability Solutions

6.0 System Capabilities — Fault Management

Managing faults in a system is typically a five-stage process.

1. Detection – The fault is found

2. Diagnosis – The cause of the fault is determined

3. Isolation – The rest of the system is protected from the fault

4. Recovery – The system is adjusted or re-started so it functions properly

5. Repair – A faulty system component is replaced

Notification of the fault occurs at many points in this process. Each step notifies the next step
or steps in the process. On fault detection, notification may be sent to the diagnosis, isolation
and perhaps recovery software components simultaneously. There will also be notifications to the
system model and the system operator indicating the status of the components. A general discussion
of notification appears at the end of this section.
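The fan-out described above can be illustrated with a small publish/subscribe sketch. This is only an illustration under stated assumptions: the class `NotificationBus`, the event name `fault_detected`, and the component name `fan_tray_2` are hypothetical and do not come from this document. Handlers run one after another within a single `publish` call, which stands in for simultaneous delivery.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical names throughout; this document does not define an API.
Handler = Callable[[str, str], None]  # (component, event) -> None


class NotificationBus:
    """Fans a fault event out to every subscriber registered for it."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event: str, handler: Handler) -> None:
        self._subscribers[event].append(handler)

    def publish(self, event: str, component: str) -> int:
        """Deliver the event to all subscribers; return how many were notified."""
        handlers = list(self._subscribers[event])
        for handler in handlers:
            handler(component, event)
        return len(handlers)


def make_recorder(name: str, log: List[str]) -> Handler:
    """Build a handler that records which stage or observer saw the event."""
    def handler(component: str, event: str) -> None:
        log.append(f"{name}: {event} on {component}")
    return handler


def demo() -> List[str]:
    log: List[str] = []
    bus = NotificationBus()
    # On detection, several stages and observers are notified in one publish.
    for name in ("diagnosis", "isolation", "recovery", "system_model", "operator"):
        bus.subscribe("fault_detected", make_recorder(name, log))
    bus.publish("fault_detected", "fan_tray_2")
    return log
```

Calling `demo()` returns one log entry per subscriber, so the operator and the system model see the same event as the fault-handling stages.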

It is important to note that there are fine lines between some of the above stages. To avoid
overlap in the discussion, this document uses the most restrictive definition for each stage:

• Detection. A fault is found, but determination of the failed component is not made

• Diagnosis. The determination of which component has failed

• Isolation. Ensuring a fault does not cause a system failure (isolation does not necessarily make the system function correctly)

• Recovery. Restoring the system to expected behavior

• Repair. Restoring a system to full capability including all redundancy
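One way to read the restrictive definitions above is that each stage contributes exactly one piece of information about a fault. The sketch below models that reading; the `FaultRecord` type and its field names are assumptions made for illustration, not structures defined by this document.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional


class Stage(IntEnum):
    DETECTION = 1
    DIAGNOSIS = 2
    ISOLATION = 3
    RECOVERY = 4
    REPAIR = 5


@dataclass
class FaultRecord:
    """Hypothetical record; each field is set by exactly one stage."""
    symptom: str                            # Detection: a fault was found
    failed_component: Optional[str] = None  # Diagnosis: which component failed
    contained: bool = False                 # Isolation: fault cannot cause system failure
    service_restored: bool = False          # Recovery: expected behavior is back
    redundancy_restored: bool = False       # Repair: full capability, including redundancy


def completed_stages(record: FaultRecord) -> List[Stage]:
    """List the stages whose outcome is already recorded."""
    done = [Stage.DETECTION]  # a record only exists once a fault is detected
    if record.failed_component is not None:
        done.append(Stage.DIAGNOSIS)
    if record.contained:
        done.append(Stage.ISOLATION)
    if record.service_restored:
        done.append(Stage.RECOVERY)
    if record.redundancy_restored:
        done.append(Stage.REPAIR)
    return done
```

For example, a record with `contained=True` but `service_restored=False` reflects the note that isolation does not necessarily make the system function correctly.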

A final part of fault management is fault prediction. Fault prediction is an alternate form of
fault detection that includes built-in diagnosis. Based on predicted faults, the system operator
can be given the opportunity to preemptively perform an on-line repair rather than wait for a
fault to occur.
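As a rough illustration of prediction with built-in diagnosis, the sketch below watches a per-component count of correctable errors over a sliding window and flags the component when the total crosses a threshold. The metric, the window size, the threshold, and the name `ErrorRatePredictor` are all assumptions chosen for the example; this document does not prescribe a prediction method.

```python
from collections import deque
from typing import Deque, Optional


class ErrorRatePredictor:
    """Flags a component when its recent correctable-error count gets too high.

    The window size and threshold are illustrative assumptions, not values
    taken from this document.
    """

    def __init__(self, component: str, window: int = 5, threshold: int = 10) -> None:
        self.component = component
        self.threshold = threshold
        # deque with maxlen drops the oldest sample automatically.
        self._samples: Deque[int] = deque(maxlen=window)

    def observe(self, correctable_errors: int) -> Optional[str]:
        """Record one sampling interval; return the suspect component or None."""
        self._samples.append(correctable_errors)
        if sum(self._samples) >= self.threshold:
            # The prediction already names the component: diagnosis is built in.
            return self.component
        return None
```

Because each predictor is bound to one component, a non-`None` result tells the operator which part to replace on-line before it fails outright.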