Providing Open Architecture High Availability Solutions

6.0 System Capabilities — Fault Management
Managing faults in a system is typically a five-stage process.
1. Detection – The fault is found
2. Diagnosis – The cause of the fault is determined
3. Isolation – The rest of the system is protected from the fault
4. Recovery – The system is adjusted or re-started so it functions properly
5. Repair – A faulty system component is replaced
Notification of the fault occurs at many points in this process. Each stage notifies the next stage
or stages of the fault. On fault detection, for example, notification may be sent to the diagnosis,
isolation, and perhaps recovery software components simultaneously. The system model and the system
operator are also notified of the status of the affected components. A general discussion of
notification appears at the end of this section.
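The fan-out of a detection event to several consumers can be pictured as a small publish/subscribe hub. The following is only a sketch under stated assumptions: the `Notifier` class, the `fault_detected` event name, and the component name `fan-tray-2` are hypothetical and do not come from this document, and delivery here is sequential rather than truly simultaneous.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Hypothetical event name; this document does not define a notification API.
FAULT_DETECTED = "fault_detected"


class Notifier:
    """Minimal publish/subscribe hub for fault notifications (illustrative only)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[str], None]) -> None:
        """Register a handler that is called with the affected component's name."""
        self._subscribers[event].append(handler)

    def publish(self, event: str, component: str) -> int:
        """Deliver the event to every subscriber and return how many were notified."""
        handlers = self._subscribers[event]
        for handler in handlers:
            handler(component)
        return len(handlers)


# The consumers named in the text: three later stages, the system model, the operator.
received: List[str] = []
notifier = Notifier()
for consumer in ("diagnosis", "isolation", "recovery", "system_model", "operator"):
    notifier.subscribe(FAULT_DETECTED, lambda comp, c=consumer: received.append(f"{c}:{comp}"))

count = notifier.publish(FAULT_DETECTED, "fan-tray-2")
```

A single detection event reaches all five consumers without the detector knowing who they are, which is one way to keep the stages loosely coupled.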
The boundaries between some of the above stages are narrow. To avoid overlap in the discussion, this
document uses the most restrictive definition for each stage:
Detection. A fault is found, but the failed component is not yet determined.
Diagnosis. The component that has failed is determined.
Isolation. The fault is prevented from causing a system failure (isolation does not necessarily make
the system function correctly).
Recovery. The system is restored to its expected behavior.
Repair. The system is restored to full capability, including all redundancy.
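One way to see how narrow these definitions are is to track what each stage is allowed to change on a fault record. The sketch below is an illustration, not a design from this document: `Stage`, `FaultRecord`, `advance`, and the example symptom and component names are hypothetical, and the strict one-stage-at-a-time ordering is a simplifying assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    DETECTION = 1
    DIAGNOSIS = 2
    ISOLATION = 3
    RECOVERY = 4
    REPAIR = 5


@dataclass
class FaultRecord:
    symptom: str
    failed_component: Optional[str] = None  # unknown until diagnosis
    isolated: bool = False                  # fault can no longer cause a system failure
    service_restored: bool = False          # system shows expected behavior again
    redundancy_restored: bool = False       # full capability, including redundancy
    history: List[Stage] = field(default_factory=list)


def advance(record: FaultRecord, stage: Stage, component: Optional[str] = None) -> FaultRecord:
    """Apply one stage to the record, enforcing the (assumed) strict stage order."""
    if len(record.history) == len(Stage):
        raise ValueError("all five stages are already complete")
    expected = Stage(len(record.history) + 1)
    if stage is not expected:
        raise ValueError(f"expected {expected.name}, got {stage.name}")
    if stage is Stage.DIAGNOSIS:
        if component is None:
            raise ValueError("diagnosis must name the failed component")
        record.failed_component = component
    elif stage is Stage.ISOLATION:
        record.isolated = True
    elif stage is Stage.RECOVERY:
        record.service_restored = True
    elif stage is Stage.REPAIR:
        record.redundancy_restored = True
    record.history.append(stage)
    return record


record = FaultRecord(symptom="heartbeat lost")
advance(record, Stage.DETECTION)          # a fault is known, the component is not
advance(record, Stage.DIAGNOSIS, component="cpu-board-1")
advance(record, Stage.ISOLATION)          # contained, but service is not yet restored
```

After isolation the record is marked `isolated` while `service_restored` is still false, which mirrors the point that isolation does not necessarily make the system function correctly.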
A final part of fault management is fault prediction. Fault prediction is an alternative form of
fault detection that includes built-in diagnosis. Based on predicted faults, the system operator can
be given the opportunity to perform an on-line repair preemptively rather than wait for a fault to occur.
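This document does not say how faults are predicted. As one hedged illustration, a monitored reading that drifts toward a known limit can be extrapolated to estimate how soon the limit will be reached. Everything in the sketch is an assumption: the function name `intervals_until_threshold`, the least-squares trend line, the evenly spaced samples, and the temperature values are hypothetical.

```python
from typing import Optional, Sequence


def intervals_until_threshold(samples: Sequence[float], threshold: float) -> Optional[float]:
    """Fit a least-squares line to evenly spaced samples and extrapolate it.

    Returns the number of sample intervals after the last sample at which the
    fitted line reaches the threshold, or None if the readings are not rising.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or falling trend: no fault predicted by this model
    intercept = mean_y - slope * mean_x
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - (n - 1))


# Hypothetical board temperature readings (degrees C) with a 50-degree limit.
remaining = intervals_until_threshold([40.0, 42.0, 44.0, 46.0], threshold=50.0)
stable = intervals_until_threshold([40.0, 40.0, 40.0], threshold=50.0)
```

Because the reading is tied to a specific sensor, a prediction of this kind already identifies the suspect component, which matches the idea that prediction includes built-in diagnosis. The operator could then schedule an on-line repair within the estimated number of intervals.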