Providing Open Architecture High Availability Solutions
6.1.2 Objective
The objective of fault detection is to detect when a fault occurs and to pass information about the
fault to the components responsible for diagnosis, isolation, and recovery. This information
includes the location and type of the fault, its time of occurrence, and, where it can be determined,
the component most likely to be affected next. For example, if a fault occurs in a multiplication
subroutine, it is useful to also know which routine is expecting the result.
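The information listed above could be carried in a small record handed from the detector to the fault management components. The following is a minimal sketch only: the class name FaultReport, its field names, the describe() helper, and the example values are all hypothetical and are not defined by this document.

```python
from dataclasses import dataclass, field
from typing import Optional
import time


@dataclass(frozen=True)
class FaultReport:
    """Hypothetical record a detector passes to diagnosis, isolation and recovery."""
    location: str                        # component in which the fault was detected
    fault_type: str                      # free-form label chosen for this sketch
    detected_at: float = field(default_factory=time.time)  # time of occurrence
    next_affected: Optional[str] = None  # most likely next affected component, if known


def describe(report: FaultReport) -> str:
    """Render a one-line summary of a report (illustrative helper)."""
    summary = f"{report.fault_type} in {report.location}"
    if report.next_affected is not None:
        summary += f"; likely to affect {report.next_affected}"
    return summary


# The multiplication example from the text, with assumed names.
report = FaultReport(
    location="multiply()",
    fault_type="arithmetic-overflow",
    detected_at=0.0,
    next_affected="compute_checksum()",
)
print(describe(report))
```

Keeping next_affected optional reflects the text's "perhaps": a detector may not always know which component consumes the faulty result.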
6.1.3 Concepts
Active Faults. Active faults are faults that have been detected but have not yet been isolated or
recovered from. Active faults may or may not degrade service availability. A system can maintain a
level of service availability if the fault resides in a component that is inactive, has a backup, or has
load-balancing capability. In this case the detected fault may remain active while the system is still
delivering an acceptable level of service. An example of an active fault is the crash of an active host
system master board. If this CPU board is contained in a system that has a redundant backup
system master, then the backup host may detect that the active host is not functioning, at which
point isolation and recovery activities may begin. As this example shows, once an active fault is
detected, the fault management state transitions so that the faulty component can be isolated in an
attempt to preserve service availability.
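One way a backup host might notice that its peer has stopped functioning is a heartbeat timeout. The sketch below assumes that mechanism purely for illustration; the document does not specify how the backup detects the failure, and the class name HeartbeatMonitor, its methods, and the 3-second timeout are hypothetical. Timestamps are passed in explicitly so the example is deterministic.

```python
from typing import Optional


class HeartbeatMonitor:
    """Run by the backup host to watch its peer (illustrative assumption)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen: Optional[float] = None  # time of the most recent heartbeat

    def heartbeat(self, now: float) -> None:
        """Record that the peer was heard from at time `now`."""
        self.last_seen = now

    def peer_failed(self, now: float) -> bool:
        """Return True when the last heartbeat is older than the timeout."""
        if self.last_seen is None:
            return False  # nothing heard yet, so there is no basis for a verdict
        return (now - self.last_seen) > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.heartbeat(now=10.0)
print(monitor.peer_failed(now=12.0))  # 2.0 s since last heartbeat: within timeout
print(monitor.peer_failed(now=14.5))  # 4.5 s since last heartbeat: overdue
```

A True result here corresponds to the point at which the fault becomes active and isolation and recovery may begin.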
Latent Faults. A latent fault is a fault that eludes the detection scheme and remains undetected for
a period of time. Sources of latency include inactive components within the system and uncovered
fault scenarios. A latent fault may occur, for example, in an N+1 configuration with limited fault
detection if the backup component fails. In this case the failed standby component will not be
detected until another component fails and the backup component is brought online. Note that a
latent fault does not affect system serviceability while the faulty component or subsystem is not
being exercised.
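The N+1 scenario above suggests why a system might exercise its standby components deliberately rather than wait for a failover to reveal them. The following sketch assumes each standby exposes a self-test callable; that interface, the function name audit_standbys, and the component names are hypothetical and not taken from this document.

```python
from typing import Callable, Dict, List


def audit_standbys(self_tests: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each standby's self-test and return the names of those that fail."""
    failed = []
    for name, self_test in self_tests.items():
        try:
            healthy = self_test()
        except Exception:
            healthy = False  # a self-test that raises is treated as a failure
        if not healthy:
            failed.append(name)
    return failed


def broken_self_test() -> bool:
    """Stand-in for a standby that has silently failed."""
    raise RuntimeError("standby did not respond")


standbys = {
    "standby-a": lambda: True,
    "standby-b": broken_self_test,
}
print(audit_standbys(standbys))
```

Any name returned by such an audit is a fault that would otherwise have stayed latent until the standby was brought online.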
Fault Detector. A fault detector is a hardware or software component that checks for faults. A fault
detector is triggered when that detector recognizes a fault. The term “audit” is also used to refer to
this type of component.
Direct and Indirect Detection. In direct detection, a fault detector finds an error in the output of a
component. The component with the erroneous output is typically treated as the faulty one. Indirect
detection looks for more system-centric errors, such as high temperature or excessive memory or
CPU use. In this case, it is known only that a problem exists. One or more components may be
responsible, and diagnosis is needed before the cause can be determined.
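The contrast between the two kinds of detection can be sketched as two small checks. This is an illustration under stated assumptions: the function names, the known-good comparison used for the direct check, and the temperature limit are all hypothetical.

```python
from typing import Callable, Optional


def direct_check(component: str, func: Callable[[int, int], int],
                 a: int, b: int, expected: int) -> Optional[str]:
    """Direct detection: compare a component's output with a known-good value."""
    if func(a, b) != expected:
        return component  # the component with the erroneous output is suspected
    return None


def indirect_check(temperature_c: float, limit_c: float) -> bool:
    """Indirect detection: flags a system-level symptom without naming a cause."""
    return temperature_c > limit_c


def faulty_multiply(a: int, b: int) -> int:
    return a * b + 1  # deliberately wrong, standing in for a faulty component


print(direct_check("multiplier", faulty_multiply, 6, 7, expected=42))
print(indirect_check(temperature_c=91.0, limit_c=85.0))
```

The direct check returns the name of a suspect component, whereas the indirect check returns only a yes or no, which is why diagnosis has to follow it.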
Detection Frequency. Faults can be detected either synchronously or asynchronously. As the
frequency of synchronous detection increases, system performance degrades.
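The cost of more frequent detection can be made concrete with a simple estimate. The sketch below assumes that each check consumes a fixed amount of processor time and that checks run at a steady rate; both assumptions, the function name check_overhead, and the 0.5 ms cost per check are hypothetical figures chosen for illustration.

```python
def check_overhead(check_cost_s: float, checks_per_second: float) -> float:
    """Fraction of one processor's time spent running checks (capped at 1.0)."""
    return min(1.0, check_cost_s * checks_per_second)


# Assumed cost of 0.5 ms per check, at increasing check rates.
for rate in (1, 10, 100, 1000):
    overhead = check_overhead(check_cost_s=0.0005, checks_per_second=rate)
    print(f"{rate:>5} checks/s -> {overhead:.2%} of processor time")
```

Under these assumptions the overhead grows in direct proportion to the check rate, so the choice of rate is a trade between detection delay and the processing capacity left for service.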
6.1.4 Approach
Fault detection, in its most basic form, is simply the ability to detect that abnormal conditions exist
within the system. Detection may be by direct observation (correlating multiple events in location
or time) or by inference (observing other behavior of the system). In some circumstances events
may lead to the determination of a fault, while in other circumstances they may be treated as
normal system behavior.
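Correlating events in time can be sketched with a sliding window: an isolated event is treated as normal behavior, while several events close together are treated as a fault. The class name EventCorrelator, the threshold of three events, and the 10-second window are hypothetical choices for this illustration and do not come from this document.

```python
from collections import deque


class EventCorrelator:
    """Declare a fault when `threshold` events fall within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self.times = deque()  # timestamps of events still inside the window

    def record(self, now: float) -> bool:
        """Record an event at time `now`; return True if a fault is declared."""
        self.times.append(now)
        # Discard events that have aged out of the window.
        while now - self.times[0] > self.window_s:
            self.times.popleft()
        return len(self.times) >= self.threshold


correlator = EventCorrelator(threshold=3, window_s=10.0)
results = [correlator.record(t) for t in (0.0, 30.0, 31.0, 32.0)]
print(results)  # [False, False, False, True]
```

The event at time 0.0 has aged out by the time the later events arrive, so only the burst at 30.0 to 32.0 seconds crosses the threshold.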