Providing Open Architecture High Availability Solutions

A fault occurs when a system component is not performing as expected. The severity of the fault can be evaluated by its effect on the service availability level of the system as a whole. If a backup component is available and is able to assume at least some of the failed component’s responsibilities, a level of service availability is maintained. If the faulted component is required to provide service and has no backup or load sharing capabilities, then a service interruption occurs.

The rate at which faults can be detected directly affects the time it takes for a system to recover to its full capabilities. System designs that invest heavily in component reliability and/or fault detection capabilities may require less investment in fault recovery capabilities. In the end, all facets of the development and employment of a fault management system must be weighed against each other to provide the best solution for the required level of service availability.
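As a rough illustration of how detection time feeds into availability, the sketch below uses the common steady-state approximation availability = MTBF / (MTBF + downtime per fault). It assumes, for illustration only, that downtime per fault is the sum of detection time and recovery time. The function name, parameters, and numbers are hypothetical and do not come from this document.

```python
def availability(mtbf_hours: float, detect_hours: float, recover_hours: float) -> float:
    """Steady-state availability under the assumption that the downtime
    caused by each fault is detection time plus recovery time."""
    downtime_hours = detect_hours + recover_hours
    return mtbf_hours / (mtbf_hours + downtime_hours)


# Hypothetical figures: same component reliability and recovery time,
# but two different fault detection latencies.
slow_detection = availability(mtbf_hours=10_000, detect_hours=1.0, recover_hours=0.5)
fast_detection = availability(mtbf_hours=10_000, detect_hours=0.01, recover_hours=0.5)

print(f"slow detection: {slow_detection:.6f}")
print(f"fast detection: {fast_detection:.6f}")
```

Under these assumptions, shortening detection time raises availability without any change to component reliability or recovery mechanisms, which is one way to weigh the investments against each other.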

6.1 Detection

6.1.1 Introduction

Detection is the process of identifying an undesirable condition (fault or symptom) that may lead to the loss of service from the system or device. Fault detection may occur by direct observation, by correlating multiple events in location or time, or by inference from other observed behavior of the system.
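Detection by inference can be sketched with a heartbeat timeout: the fault is not observed directly, but is inferred when a component has been silent for too long. The example below is a minimal sketch under that assumption. The class, method names, component names, and timeout value are hypothetical and are not taken from this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatDetector:
    """Infers a fault when a component has been silent longer than timeout_s."""
    timeout_s: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the time of the most recent sign of life from the component.
        self.last_seen[component] = now

    def suspected_faults(self, now: float) -> List[str]:
        # A component is suspected when its last heartbeat is older than the timeout.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Hypothetical timeline, in seconds.
detector = HeartbeatDetector(timeout_s=3.0)
detector.heartbeat("disk-controller", now=0.0)
detector.heartbeat("line-card", now=0.0)
detector.heartbeat("line-card", now=4.0)   # line-card keeps reporting
suspects = detector.suspected_faults(now=5.0)
print(suspects)
```

A shorter timeout detects faults sooner but raises the chance of suspecting a component that is merely slow, so the value is a design choice rather than a fixed rule.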

Predictable levels of service availability can only be obtained through a methodical means of fault detection. Systems intended for very high levels of availability must include highly responsive fault detection capabilities to ensure that events that may lead to faults, events that indicate developing faults, active faults, and latent faults are quickly captured. In all but rare cases, fault detection capabilities are a required precursor to fault recovery.

Figure 12. Fault Management Flow Chart

[Flow chart. Stages shown: Detection, (On-Line) Diagnosis, Isolation, Recovery, Repair, Prediction, Notification, (Off-Line) Diagnosis.]