Providing Open Architecture High Availability Solutions

6.1.6 Dependencies
Fault detection is heavily dependent on facilities designed into the system infrastructure. If a
system is not designed to provide additional information or redundancy for detection of faults,
many faults may go undetected.
6.2 Diagnosis
Two sets of operations in HA systems are referred to as diagnosis. The first set comprises the
immediate actions taken after a fault is found in order to isolate the fault and recover from it. The
second set comprises the operations used as part of the debug and repair process. This section
discusses only the first set of operations. The second set is covered in Section 6.5.
6.2.1 Introduction
Once a fault is detected, the problem must be diagnosed to determine the proper isolation and
recovery actions. Diagnosis analyzes one or more events and system parameters to determine the
nature and location of a fault. This step can be automatic or invoked separately as a user diagnostic.
The diagnosis may be automatically acted upon or reported to an operator. The mechanism used to
diagnose a problem depends on the type of system component (e.g., hardware, operating system,
peripheral, or application) being diagnosed as well as the system component responsible for
performing the diagnosis. For example, a component such as an intelligent I/O card may be able to
self-diagnose its own problems, while a fan may be diagnosed elsewhere within the system.
In some systems, a single fault may lead to multiple errors being detected. Identifying the
originating fault (root cause analysis) is also part of diagnosis. This function is typically done after
the faults have been isolated and recovery has occurred. Further discussion on this is in Section 6.5.
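As an illustration of root cause analysis, the following minimal sketch separates originating faults from their symptoms when several components report errors at once. The dependency map, the component names, and the function root_causes are hypothetical and are not taken from this document. The heuristic is also an assumption: an errored component is treated as a symptom if anything it depends on, directly or transitively, has also reported an error.

```python
# Hypothetical dependency map: component -> components it depends on.
DEPENDS_ON = {
    "app": ["os"],
    "os": ["cpu_board"],
    "io_card": ["power_supply"],
    "cpu_board": ["power_supply"],
    "power_supply": [],
}


def root_causes(errored, depends_on):
    """Return the errored components that have no errored dependency.

    Assumption: an errored component that depends, directly or
    transitively, on another errored component is a symptom, not an origin.
    """
    errored = set(errored)

    def has_errored_dependency(component, seen):
        # Depth-first walk over the dependencies of `component`.
        for dep in depends_on.get(component, []):
            if dep in seen:
                continue
            seen.add(dep)
            if dep in errored or has_errored_dependency(dep, seen):
                return True
        return False

    return sorted(c for c in errored if not has_errored_dependency(c, {c}))


# Three errors are detected, but only one component explains the others.
print(root_causes(["app", "io_card", "power_supply"], DEPENDS_ON))
```

A real system would likely also weigh event timestamps and fault history; this sketch uses only the static dependency structure.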
6.2.2 Objective
The primary objective of diagnosis is to determine the location of a fault so that it is possible to
isolate the fault and recover from it. Section 3.4 and Figure 3 show that a fault in one component
could be caused by faults in other components.
On-line vs. Off-line Diagnosis. Faults can be diagnosed either while the system is application
ready (on-line) or while the system is not available for running applications (off-line). An on-line
system can run off-line diagnostics on a given device by making the target device unavailable to
applications. It is therefore possible to run some limited off-line diagnostics while the system is
on-line, but at reduced availability. Off-line diagnostics can also refer to pre- and post-boot
diagnostics that consume the availability of the whole system, thus reducing overall service
availability. Many off-line diagnostics have availability costs that are justifiable only for
debug, repair, or qualification functions, which are covered in Section 6.5.
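The idea of running an off-line diagnostic on one device while the rest of the system stays on-line can be sketched as follows. The Device class, its available flag, and run_offline_diagnostic are hypothetical names introduced for illustration. Leaving a device out of service when its diagnostic fails is an assumption of this sketch, not a rule stated in this document.

```python
class Device:
    """Hypothetical device record; `available` means visible to applications."""

    def __init__(self, name):
        self.name = name
        self.available = True


def run_offline_diagnostic(device, diagnostic):
    """Run `diagnostic` on `device` while it is withheld from applications.

    The device is returned to service only if the diagnostic passes.
    """
    device.available = False  # restrict the target device
    try:
        passed = bool(diagnostic(device))
    except Exception:
        passed = False  # a crashing diagnostic counts as a failure
    if passed:
        device.available = True  # restore service
    return passed


disk = Device("disk0")
print(run_offline_diagnostic(disk, lambda dev: True), disk.available)
```

While the diagnostic runs, the system as a whole remains on-line but at reduced availability, because applications cannot use the restricted device.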
Single vs. Multiple Failure Modes. A system resource may have a single failure mode or it may
have multiple failure modes. When a resource has a single failure mode, the diagnosis is implicit in
the detection event. When a resource has multiple failure modes, diagnosis is required to identify
which failure occurred.
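The distinction can be sketched in a few lines. The resource names, failure mode names, and probe functions below are hypothetical. The sketch assumes that each failure mode of a multi-mode resource has a probe that returns True when that mode is present.

```python
# Hypothetical table of known failure modes per resource type.
FAILURE_MODES = {
    "fan": ["stopped"],
    "disk": ["media_error", "controller_timeout"],
}


def diagnose(resource, probes=None):
    """Map a detection event on `resource` to a failure mode."""
    modes = FAILURE_MODES[resource]
    if len(modes) == 1:
        # Single failure mode: the detection event is the diagnosis.
        return modes[0]
    # Multiple failure modes: probe until one mode is confirmed.
    probes = probes or {}
    for mode in modes:
        probe = probes.get(mode)
        if probe is not None and probe():
            return mode
    return "undiagnosed"


print(diagnose("fan"))
print(diagnose("disk", {"media_error": lambda: False,
                        "controller_timeout": lambda: True}))
```

An "undiagnosed" result would typically be reported to an operator or handed to the debug and repair diagnostics covered in Section 6.5.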
Local vs. Global. From the standpoint of the system, a diagnosis may be performed locally within
a system resource, or it may be performed globally by another resource within the system.
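A minimal sketch of this split, reusing the intelligent I/O card and fan examples from Section 6.2.1: a resource that can diagnose itself is asked locally, while a passive resource is diagnosed globally from readings taken elsewhere in the system. The class names, the self_diagnose method, the fault strings, and the 1000 rpm threshold are all hypothetical.

```python
class IntelligentIOCard:
    """Hypothetical resource that can diagnose itself."""

    def __init__(self, fault=None):
        self._fault = fault

    def self_diagnose(self):
        return self._fault  # None means no fault found


class Fan:
    """Hypothetical passive resource; only its speed can be observed."""

    def __init__(self, rpm):
        self.rpm = rpm


def diagnose_resource(resource, min_fan_rpm=1000):
    """Return (scope, fault), where scope is "local" or "global"."""
    self_diagnose = getattr(resource, "self_diagnose", None)
    if callable(self_diagnose):
        # Local: the resource performs its own diagnosis.
        return ("local", self_diagnose())
    if isinstance(resource, Fan):
        # Global: another part of the system interprets the reading.
        fault = "fan_below_min_rpm" if resource.rpm < min_fan_rpm else None
        return ("global", fault)
    raise LookupError("no diagnostic known for this resource")


print(diagnose_resource(IntelligentIOCard("dma_error")))
print(diagnose_resource(Fan(200)))
```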