Providing Open Architecture High Availability Solutions

6.1.2 Objective

The objective of fault detection is to detect when a fault occurs, and pass information on the fault to
the components responsible for diagnosis, isolation and recovery. This information would include
the location and type of fault, time of occurrence, and perhaps the most likely next affected
component. For example, if a fault occurs in a multiplication subroutine, it is useful to also know
which routine is expecting the result.
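
As a minimal sketch of the information such a report might carry, the record below collects the fields named above. The class name `FaultReport`, its field names, and the example values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultReport:
    """Hypothetical record passed from a fault detector to diagnosis, isolation and recovery."""
    location: str                        # component in which the fault was detected
    fault_type: str                      # kind of fault (free-form label in this sketch)
    occurred_at: float                   # time of occurrence, seconds since the epoch
    next_affected: Optional[str] = None  # most likely next affected component, if known


# Example from the text: a fault in a multiplication subroutine, reported
# together with the routine that is expecting the result.
report = FaultReport(
    location="multiply()",
    fault_type="arithmetic-overflow",
    occurred_at=1_700_000_000.0,
    next_affected="compute_checksum()",
)
```

Leaving `next_affected` optional reflects the "perhaps" in the text: a detector may not always know which component is downstream of the fault.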

6.1.3 Concepts

Active Faults. Active faults are faults that have been detected, but have yet to be isolated and/or
recovered from. Active faults may or may not degrade service availability. A system can maintain a
level of service availability if the fault resides in a component that is inactive, has a backup, or has
load-balancing capability. In this case the detected fault may remain active while the system is still
delivering an acceptable level of service. An example of an active fault is the crash of an active host
system master board. If this CPU board is contained in a system that has a redundant
backup system master, then the backup host may detect that the active host is not functioning,
at which point isolation and recovery activities may begin. As this example shows, once
an active fault is detected, the fault management state transitions so that the faulty component
can be isolated in an attempt to preserve service availability.
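
The text does not say how the backup host notices that the system master has stopped functioning. One common approach is a heartbeat timeout, sketched below under that assumption. The class `HeartbeatMonitor`, its methods, and the 3-second timeout are hypothetical.

```python
class HeartbeatMonitor:
    """Run by the backup host to watch the active system master (illustrative only)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s  # assumed silence threshold, in seconds
        self.last_seen = None       # time of the most recent heartbeat

    def heartbeat(self, now: float) -> None:
        """Record that a heartbeat arrived from the active host at time `now`."""
        self.last_seen = now

    def active_host_failed(self, now: float) -> bool:
        """Return True once the active host has been silent longer than the timeout."""
        if self.last_seen is None:
            return False  # nothing received yet; no basis for a decision
        return (now - self.last_seen) > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.heartbeat(now=10.0)
ok_at_12 = monitor.active_host_failed(now=12.0)      # 2 s of silence: within the timeout
failed_at_14 = monitor.active_host_failed(now=14.0)  # 4 s of silence: fault detected
```

Once `active_host_failed` returns true, the fault is active in the sense defined above: it has been detected, but isolation and recovery have not yet taken place.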

Latent Faults. A latent fault is a fault that eludes the detection scheme and remains undetected for
a period of time. Sources of latency include inactive components within the system, and uncovered
fault scenarios. A latent fault, for example, may occur in an N+1 configuration with limited fault
detection if the backup component fails. In this case the failed standby component will not be
detected until another component failure occurs and the backup component is brought online. It
should be noted that a latent fault does not affect system serviceability while the faulty component
or subsystem is not being exercised.
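
The N+1 scenario can be illustrated with a small simulation. It is a sketch only: the `Unit` class and the `detect_faults` function are hypothetical, and "limited fault detection" is modelled as a detector that checks active units only.

```python
class Unit:
    """A component in the N+1 configuration (hypothetical model)."""

    def __init__(self, name: str, active: bool):
        self.name = name
        self.active = active
        self.failed = False


def detect_faults(units):
    """Limited detection: only units that are currently active are checked."""
    return [u.name for u in units if u.active and u.failed]


# N+1 configuration with N = 2 active units and one standby.
a, b, standby = Unit("A", True), Unit("B", True), Unit("standby", False)
units = [a, b, standby]

standby.failed = True
before_failover = detect_faults(units)  # standby fault is latent: nothing reported

a.failed = True
after_a_fails = detect_faults(units)    # only A is reported

# Recovery brings the standby online in place of A; its fault is now exposed.
a.active, standby.active = False, True
after_failover = detect_faults(units)
```

In this model the standby's fault surfaces only at failover, which is the moment the system depends on it.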

Fault Detector. A fault detector is a hardware or software component that checks for faults. A fault
detector is triggered when that detector recognizes a fault. The term “audit” is also used to refer to
this type of component.

Direct and Indirect Detection. In direct detection, a fault detector finds an error in the output of a
component. The component with the erroneous output is typically treated as the faulty component.
Indirect detection looks for more system-centric errors, such as high temperature, excessive memory
use, or excessive CPU time. In this case, it is only known that a problem exists. The cause may lie in
one or more components, and diagnosis is needed before the cause can be determined.
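
The difference between the two kinds of detection can be sketched as follows. The function names, the result dictionaries, and the component names are hypothetical. The point is only that a direct check names a suspect, while an indirect check reports a symptom that still needs diagnosis.

```python
def direct_check(component: str, output: int, expected: int):
    """Direct detection: compare a component's output with a known-good value."""
    if output != expected:
        return {"suspects": [component], "needs_diagnosis": False}
    return None  # no error observed


def indirect_check(temperature_c: float, limit_c: float, components: list):
    """Indirect detection: a system-level reading exceeds a limit."""
    if temperature_c > limit_c:
        # Any of the components may be the cause; diagnosis must narrow this down.
        return {"suspects": list(components), "needs_diagnosis": True}
    return None  # reading is within limits


direct = direct_check("multiplier", output=41, expected=42)
indirect = indirect_check(85.0, limit_c=70.0, components=["cpu", "fan", "psu"])
```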

Detection Frequency. Faults can be detected either synchronously or asynchronously. As the
frequency of synchronous detection increases, the system increasingly exhibits degraded
performance.
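
A rough way to see the trade-off is to assume that each synchronous check blocks normal processing for a fixed amount of time. Under that assumption, which the text does not state, the share of processing time lost to detection grows in proportion to the checking frequency. The function name and the 2 ms cost per check are hypothetical.

```python
def detection_overhead(checks_per_second: float, cost_per_check_s: float) -> float:
    """Fraction of processing time consumed by synchronous checks (capped at 1.0)."""
    return min(1.0, checks_per_second * cost_per_check_s)


# Assumed cost of 2 ms per check; only the checking frequency varies.
low_rate = detection_overhead(checks_per_second=1, cost_per_check_s=0.002)
high_rate = detection_overhead(checks_per_second=100, cost_per_check_s=0.002)
```

With these assumed figures, one check per second costs about 0.2% of processing time, while one hundred checks per second cost about 20%.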

6.1.4 Approach

Fault detection, in its most basic form, is simply the ability to detect that abnormal conditions exist
within the system. Detection may be by direct observation (correlating multiple events in location
or time) or by inference (observing other behavior of the system). The same events may lead to the
determination of a fault in some circumstances, while in other circumstances they may be treated
as normal system behavior.
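
Correlation in time can be sketched with a sliding window: an event on its own is treated as normal, while several events close together are taken as a fault. The class `TimeCorrelator`, the threshold of three events, and the 10-second window are hypothetical policy choices, not values from the text.

```python
from collections import deque


class TimeCorrelator:
    """Declares a fault only when enough events fall within one time window."""

    def __init__(self, threshold: int, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self.times = deque()

    def observe(self, t: float) -> bool:
        """Record an event at time t; return True if a fault is determined."""
        self.times.append(t)
        # Discard events that are too old to correlate with the newest one.
        while t - self.times[0] > self.window_s:
            self.times.popleft()
        return len(self.times) >= self.threshold


# Hypothetical policy: three events of the same kind within 10 s is a fault.
isolated = TimeCorrelator(threshold=3, window_s=10.0)
spread_out = [isolated.observe(t) for t in (0.0, 60.0, 120.0)]

burst = TimeCorrelator(threshold=3, window_s=10.0)
clustered = [burst.observe(t) for t in (0.0, 2.0, 4.0)]
```

The three events in `spread_out` never trigger a fault, whereas the third event in `clustered` does, even though the individual events are identical.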