Providing Open Architecture High Availability Solutions

Granularity of Diagnosis. The objectives for diagnostics in a particular system determine the
granularity. From the perspective of service availability in a system with redundant components,
the diagnosis at a minimum must be able to identify which system component failed. More
granularity may be required to support the recovery and notification actions. There is a trade-off
between granularity and performance, as diagnosis to a very fine granularity can use a large
amount of processing power.
6.2.3 Approach
Systems comprise a broad range of resources that can fail, including hardware, operating
systems, applications and peripherals. In systems designed for service availability, many of these
resources are redundant. To enable rapid response in the event of failure, the first priority of
diagnosis is to identify which component failed and needs to be isolated and repaired or replaced.
The detection of a fault in a component that has redundancy, even if it has not yet been diagnosed
to a granular level, is in many cases enough of a reason to switch operation over to the redundant
component.
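The "switch over first, diagnose later" priority described above can be sketched as follows. This is a minimal illustration, not an interface defined by this document: the `RedundantPair` class, its fields, and the component names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RedundantPair:
    """Hypothetical active/standby pair (names are illustrative only)."""
    active: str
    standby: Optional[str]
    # Components whose fault was detected but not yet diagnosed in detail.
    pending_diagnosis: List[str] = field(default_factory=list)

    def on_fault_detected(self, component: str) -> str:
        """Record the fault, switch over if possible, return the active unit."""
        # Detection alone is enough to act; fine-grained diagnosis is deferred.
        self.pending_diagnosis.append(component)
        if component == self.active and self.standby is not None:
            # Fail over to the redundant component immediately.
            self.active, self.standby = self.standby, None
        elif component == self.standby:
            # The standby failed: service continues, but redundancy is lost.
            self.standby = None
        # If the active unit failed with no standby, it is returned unchanged.
        return self.active


pair = RedundantPair(active="board-A", standby="board-B")
pair.on_fault_detected("board-A")  # service moves to board-B
```

The failed component stays in `pending_diagnosis` so that detailed diagnosis, isolation and repair can proceed after service has been restored on the redundant unit.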
Faults can be diagnosed in several areas of the system. A component may have the ability to
perform local self-diagnosis. A failed component may also be diagnosed by a peer, another
component, the operating system or the management middleware. Regardless of where the
diagnosis is performed, it needs to identify the failed component and provide the results of the
diagnosis to the decision-making entity within the system to allow proper recovery.
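One way to picture this is a common report format that every diagnosing party uses, whatever its location. The sketch below is an assumption made for illustration: `Source`, `DiagnosisReport` and `DecisionMaker` are hypothetical names, not part of any specification referenced here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Set


class Source(Enum):
    """Where the diagnosis was performed (the locations listed above)."""
    SELF = "self"
    PEER = "peer"
    OTHER_COMPONENT = "other component"
    OPERATING_SYSTEM = "operating system"
    MIDDLEWARE = "management middleware"


@dataclass(frozen=True)
class DiagnosisReport:
    """Hypothetical report: always names the failed component."""
    failed_component: str
    source: Source
    detail: str = ""


class DecisionMaker:
    """Hypothetical decision-making entity that collects reports."""

    def __init__(self) -> None:
        self.reports: List[DiagnosisReport] = []

    def submit(self, report: DiagnosisReport) -> None:
        self.reports.append(report)

    def failed_components(self) -> Set[str]:
        # Several sources may report the same component; deduplicate.
        return {r.failed_component for r in self.reports}
```

Because the report carries the failed component explicitly, the decision-making entity can choose a recovery action without knowing how or where the diagnosis was made.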
If the detection is local and there is only one failure mode, the diagnosis is implicit in the detection
event. If there are multiple failure modes, additional work is required to further query the resource
or to evaluate information that has already been collected for that component.
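The distinction between one failure mode and several can be sketched as below. The function and the `probe` callback are hypothetical; `probe` stands in for whatever query or evaluation of already collected data a real system would use.

```python
from typing import Callable, List


def diagnose(component: str,
             failure_modes: List[str],
             probe: Callable[[str, str], bool]) -> List[str]:
    """Return the failure modes that apply to a component after detection.

    `probe(component, mode)` is an assumed callback that returns True
    when further querying confirms the given failure mode.
    """
    if len(failure_modes) == 1:
        # Only one way to fail: the detection event is the diagnosis.
        return list(failure_modes)
    # Several possible modes: additional work is needed to tell them apart.
    return [mode for mode in failure_modes if probe(component, mode)]
```

With a single failure mode the probe is never called, which reflects the point that no extra diagnostic work is required in that case.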
6.2.4 Techniques
Diagnostic procedures can be run by the component itself, a peer to the component, or by another
program that is focused on diagnostics. Some diagnostic procedures may need OS kernel
privileges, so they may have to be implemented at least partially within the OS.
Implicit Diagnosis and Self-Diagnosis
Implicit diagnosis is the simplest form of diagnosis. It is used when a component reports itself as
failing, or a fault detector that looks for single modes of failure is activated. The fault directly
implies the component that has failed or is out of specification.
Failures such as low fan speed or low disk-drive speed are typically handled by implicit diagnosis,
as only one component could cause that failure. Using indirect detectors usually makes implicit
diagnosis impossible. For example, high CPU temperature could be caused by low fan speed,
blocked air vents, or a physical failure of the CPU or its heatsink. In this case, further diagnosis
is needed to determine the cause and, from it, the appropriate isolation and recovery process.
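The difference between direct and indirect detectors can be illustrated with a small lookup table. The detector names and the table itself are assumptions made for this sketch; only the candidate causes come from the examples in the text.

```python
from typing import Dict, List, Optional

# Hypothetical mapping from detector to the components that could cause it.
CANDIDATE_CAUSES: Dict[str, List[str]] = {
    "fan_speed_low": ["fan"],
    "disk_speed_low": ["disk drive"],
    # Indirect detector: several components could be responsible.
    "cpu_temp_high": ["fan", "air vents", "CPU", "heatsink"],
}


def implicit_diagnosis(detector: str) -> Optional[str]:
    """Return the failed component if the detector implies exactly one.

    Returns None when the detector is indirect and further diagnosis
    is needed to choose among the candidate causes.
    """
    causes = CANDIDATE_CAUSES[detector]
    return causes[0] if len(causes) == 1 else None
```

A `None` result signals that the fault cannot be attributed from the detection event alone, so isolation and recovery have to wait for further diagnosis.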
Self-diagnosis within a component is a subset of implicit diagnosis. When a component can find
faults within its own operation, and then report those faults to the system, it is implicit that the
component reporting the failure is the one that is failing.
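A minimal sketch of self-diagnosis follows, assuming a component that runs its own checks and reports any failures under its own name. The class, the check labels and the report layout are hypothetical.

```python
from typing import Callable, Dict, List, Optional


class SelfDiagnosingComponent:
    """Hypothetical component that checks its own operation."""

    def __init__(self, name: str, checks: Dict[str, Callable[[], bool]]) -> None:
        self.name = name
        # Each check returns True when that aspect of the component is healthy.
        self.checks = checks

    def self_diagnose(self) -> Optional[Dict[str, object]]:
        """Run all checks; return a fault report, or None if healthy."""
        failed: List[str] = [label for label, check in self.checks.items()
                             if not check()]
        if not failed:
            return None
        # The reporter is implicitly the failed component.
        return {"failed_component": self.name, "failed_checks": failed}
```

The report always names the reporting component as the failed one, which is what makes self-diagnosis a form of implicit diagnosis.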