Providing Open Architecture High Availability Solutions

6.1.6 Dependencies

Fault detection is heavily dependent on facilities designed into the system infrastructure. If a

system is not designed to provide additional information or redundancy for detection of faults,

many faults may go undetected.

6.2 Diagnosis

In HA systems, the term diagnosis is applied to two sets of operations. The first set consists of

the immediate actions taken after a fault is detected in order to isolate the fault and recover from it.

The second set consists of the operations used as part of the debug and repair process. This section

discusses only the first set of operations. The second set is covered in Section 6.5.

6.2.1 Introduction

Once a fault is detected, the problem must be diagnosed to determine the proper isolation and

recovery actions. Diagnosis analyzes one or more events and system parameters to determine the

nature and location of a fault. This step can be automatic or invoked separately as a user diagnostic.

The diagnosis may be automatically acted upon or reported to an operator.

The mechanism used to diagnose a problem depends on the type of system component being diagnosed

(e.g., hardware, operating system, peripheral, or application) as well as on the system component

responsible for performing the diagnosis. For example, a component such as an intelligent I/O card

may be able to diagnose its own problems, while a fan may be diagnosed elsewhere within the system.

In some systems, a single fault may lead to multiple errors being detected. Identifying the

originating fault (root cause analysis) is also part of diagnosis. This analysis is typically performed

after the faults have been isolated and recovery has occurred. It is discussed further in Section 6.5.
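
As an illustration of how several detected errors might be traced back to one originating fault, the following is a minimal sketch, not a method prescribed by this document. It assumes a hypothetical dependency map (`DEPENDS_ON`) that records which components each component relies on, and it treats an erroring component as a root cause candidate only if none of the components it depends on, directly or indirectly, also reported an error. All component names are invented for the example.

```python
# Hypothetical dependency map: component -> components it relies on.
DEPENDS_ON = {
    "app": ["os"],
    "os": ["disk", "cpu"],
    "disk": ["power"],
    "fan": ["power"],
}


def root_causes(errors, depends_on=DEPENDS_ON):
    """Return the erroring components that have no erroring dependency."""
    errors = set(errors)

    def has_faulty_dependency(component, seen):
        # Walk the dependencies, skipping any component already visited.
        for dep in depends_on.get(component, ()):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in errors or has_faulty_dependency(dep, seen):
                return True
        return False

    return sorted(c for c in errors if not has_faulty_dependency(c, {c}))


# Errors were detected in the application, the OS, and the disk.
# Only the disk has no erroring dependency, so it is the candidate.
print(root_causes(["app", "os", "disk"]))  # ['disk']
```

A real system would also have to weigh timing and event severity; the sketch only shows the correlation step.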

6.2.2 Objective

The primary objective of diagnosis is to determine the location of a fault so that it is possible to

isolate the fault and recover from it. Section 3.4 and Figure 3 show that a fault in one component

could be caused by faults in other components.

On-line vs. Off-line Diagnosis. Faults can be diagnosed either while the system is application

ready (on-line) or while the system is not available for running applications (off-line). An on-line

system can run off-line diagnostics on a given device by making the target device unavailable to

applications. It is therefore possible to run some limited off-line diagnostics while the system is

on-line, but at reduced availability. Off-line diagnostics can also refer to pre-boot and post-boot

diagnostics that make the whole system unavailable while they run, thus reducing overall service

availability. Many off-line diagnostics have availability costs that are justifiable only for

debug, repair, or qualification functions, which are covered in Section 6.5.
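
The idea of running an off-line diagnostic on one device while the rest of the system stays on-line can be sketched as follows. This is only an illustration under stated assumptions: the `Device` class, its `available` flag, and `run_offline_diagnostic` are hypothetical names, and the diagnostic is any callable that returns True when the device passes.

```python
class Device:
    """Hypothetical device record; `available` means applications may use it."""

    def __init__(self, name):
        self.name = name
        self.available = True


def run_offline_diagnostic(device, diagnostic):
    """Run `diagnostic` on `device` while it is withheld from applications.

    Returns True if the device passed. A device that fails stays
    unavailable so that isolation and recovery can proceed.
    """
    device.available = False      # restrict the device from applications
    passed = diagnostic(device)   # the rest of the system keeps running
    if passed:
        device.available = True   # return the device to service
    return passed
```

The availability cost here is limited to the one device for the duration of the test, which is the reduced availability described above.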

Single vs. Multiple Failure Modes. A system resource may have a single failure mode or it may

have multiple failure modes. When a resource has a single failure mode, the diagnosis is implicit in

the detection event. When a resource has multiple failure modes, diagnosis is required to identify

which failure occurred.
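
The distinction can be sketched in a few lines. The table of failure modes and the `probe` callable below are assumptions made for the example, not part of any defined interface: a resource with one listed mode is diagnosed directly from the detection event, while a resource with several modes needs an extra probing step.

```python
# Hypothetical failure-mode table; the names are illustrative only.
FAILURE_MODES = {
    "fan": ["stopped"],
    "disk": ["media_error", "controller_timeout"],
}


def diagnose(resource, probe=None):
    """Return the failure mode of `resource` after a fault was detected."""
    modes = FAILURE_MODES[resource]
    if len(modes) == 1:
        # Single failure mode: the detection event is the diagnosis.
        return modes[0]
    if probe is None:
        raise ValueError(f"{resource} has several failure modes; a probe is required")
    mode = probe(resource)  # further tests decide which failure occurred
    if mode not in modes:
        raise ValueError(f"unknown failure mode {mode!r} for {resource}")
    return mode
```

For example, `diagnose("fan")` needs no probe, whereas `diagnose("disk", probe)` relies on the supplied probe to tell a media error from a controller timeout.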

Local vs. Global. From the standpoint of the system, a diagnosis may be performed locally within

a system resource, or it may be performed globally by another resource within the system.