Providing Open Architecture High Availability Solutions

A fault occurs when a system component is not performing as expected. The severity of the fault
can be evaluated by its effect on the service availability level of the system as a whole. If a backup
component is available and is able to assume at least some of the failed component’s
responsibilities, a level of service availability is maintained. If the faulted component is required to
provide service and has no backup or load sharing capabilities, then a service interruption occurs.
The speed with which faults can be detected directly affects the time it takes for a system to
recover to its full capabilities. System designs that invest heavily in component reliability, in
fault detection capabilities, or in both may require less investment in fault recovery capabilities. In the end, all
facets of the development and employment of a fault management system must be weighed against
each other to provide the best solution for the required level of service availability.
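The two ideas in this paragraph can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not a method defined by this document: the names Component, Severity, fault_severity, and availability are hypothetical, the "no effect" outcome for a component that is not required for service is an assumption, and the availability estimate uses the common steady-state formula MTBF / (MTBF + MTTR) with the further assumption that repair time is simply detection time plus recovery time.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    NO_EFFECT = "no effect on service"            # assumed case, not in the text
    MAINTAINED = "a level of service is maintained"
    INTERRUPTED = "service interruption"


@dataclass
class Component:
    name: str
    required_for_service: bool
    has_backup: bool = False     # a backup can assume some responsibilities
    load_shared: bool = False    # peers can absorb the failed component's load


def fault_severity(c: Component) -> Severity:
    """Rate a fault by its effect on system-level service availability."""
    if not c.required_for_service:
        return Severity.NO_EFFECT
    if c.has_backup or c.load_shared:
        return Severity.MAINTAINED
    return Severity.INTERRUPTED


def availability(mtbf_s: float, detect_s: float, recover_s: float) -> float:
    """Steady-state availability, assuming downtime = detection + recovery."""
    downtime_s = detect_s + recover_s
    return mtbf_s / (mtbf_s + downtime_s)


if __name__ == "__main__":
    print(fault_severity(Component("power-supply", True, has_backup=True)))
    print(fault_severity(Component("backplane", True)))
    # Faster detection shortens each outage, so availability rises.
    print(availability(1_000_000.0, detect_s=10.0, recover_s=90.0))
    print(availability(1_000_000.0, detect_s=1.0, recover_s=90.0))
```

Under these assumptions, detection time and recovery time contribute to downtime in the same way, which is one way to see why investment in one may be traded against investment in the other.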
6.1 Detection
6.1.1 Introduction
Detection is the process of identifying an undesirable condition (fault or symptom) that may lead to
the loss of service from the system or device. Faults may be detected by direct observation, by
correlating multiple events in location or time, or by inference from other observed behavior of the
system.
Predictable levels of service availability can be obtained only through methodical fault
detection. Systems intended for very high levels of availability must include highly responsive
fault detection capabilities so that events that may lead to faults, events that indicate developing
faults, active faults, and latent faults are all captured quickly. In all but rare cases, fault
detection is a required precursor to fault recovery.
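As a rough illustration of two of the detection styles described above, the following Python sketch records faults that a component reports itself (direct observation) and infers faults from missed heartbeats (inference). The class FaultDetector, its methods, and the timeout value are hypothetical names chosen for this example; heartbeat timeouts are one common inference technique, not one prescribed by this document. Time is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class FaultDetector:
    timeout_s: float                                   # silence longer than this is suspicious
    last_seen: Dict[str, float] = field(default_factory=dict)
    reported: Set[str] = field(default_factory=set)

    def heartbeat(self, component: str, now: float) -> None:
        """Record that a component was alive at time `now`."""
        self.last_seen[component] = now

    def report_error(self, component: str) -> None:
        """Direct observation: the component (or its driver) signals a fault."""
        self.reported.add(component)

    def detected(self, now: float) -> Dict[str, str]:
        """Return faulted components and how each fault was detected."""
        faults = {name: "direct" for name in self.reported}
        for name, seen in self.last_seen.items():
            # Inference: no heartbeat within the timeout window.
            if name not in faults and now - seen > self.timeout_s:
                faults[name] = "inferred"
        return faults


if __name__ == "__main__":
    d = FaultDetector(timeout_s=3.0)
    d.heartbeat("cpu-a", now=0.0)
    d.heartbeat("cpu-b", now=0.0)
    d.heartbeat("cpu-a", now=4.0)   # cpu-a keeps reporting, cpu-b goes silent
    d.report_error("fan-1")
    print(d.detected(now=5.0))
```

A shorter timeout makes detection more responsive at the cost of more false suspicions, which is the kind of trade-off a fault management design has to weigh.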
Figure 12. Fault Management Flow Chart
[Flow chart stages: Detection, (On-Line) Diagnosis, Isolation, Recovery, Repair, Prediction, Notification, (Off-Line) Diagnosis]