Providing Open Architecture High Availability Solutions
Detection of faults can occur through various avenues within a system. A fault may be detected at
the source of the fault itself. A variety of components are designed to trap or report error or
out-of-tolerance conditions on their own. These types of detected faults
can range anywhere from slight threshold incursions to complete component or resource failures.
An example of a resource that detects its own faults is an intelligent power supply. It is
common practice to design power supplies to report distinct failure states (e.g., degrade, inhibit,
failure). Based on the failure state reported, the severity of the fault can be determined along with the
fault’s impact upon system serviceability.
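As an illustration only, the mapping from a reported failure state to a severity and a service impact might be sketched as follows. The state names come from the text above; the severity levels, the policy table, and the redundancy rule are assumptions made for this sketch and are not taken from any power supply specification.

```python
from enum import Enum, IntEnum


class SupplyState(Enum):
    """Failure states named in the text; OK is added for completeness."""
    OK = "ok"
    DEGRADE = "degrade"
    INHIBIT = "inhibit"
    FAILURE = "failure"


class Severity(IntEnum):
    """Hypothetical severity scale, ordered from least to most severe."""
    NONE = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


# Hypothetical policy table: this mapping is an assumption for illustration.
SEVERITY_BY_STATE = {
    SupplyState.OK: Severity.NONE,
    SupplyState.DEGRADE: Severity.MINOR,
    SupplyState.INHIBIT: Severity.MAJOR,
    SupplyState.FAILURE: Severity.CRITICAL,
}


def classify(state: SupplyState, redundant_supply_ok: bool) -> tuple[Severity, bool]:
    """Return (severity, service_affected) for a reported supply state."""
    severity = SEVERITY_BY_STATE[state]
    # Assumption: service is affected only when the fault is MAJOR or worse
    # and no healthy redundant supply is available to carry the load.
    service_affected = severity >= Severity.MAJOR and not redundant_supply_ok
    return severity, service_affected
```

In this sketch a failed supply is always reported as critical, but it affects service only when no redundant supply is healthy, which keeps the severity of the fault separate from its impact on serviceability.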
Faults may also be detected by system resources external to the faulted subsystem. One example is
a fan failure. A non-intelligent fan cannot itself report that it is no longer working, but
temperature sensor readings would indicate there is a fault in the cooling subsystem that needs
attention.
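A minimal sketch of this kind of external detection is shown below. It infers a suspected cooling fault from temperature readings alone. The function name, the 70 degree limit, and the three-sample window are hypothetical values chosen for illustration; a real system would take them from its thermal design.

```python
from statistics import mean


def cooling_fault_suspected(readings_c, limit_c=70.0, window=3):
    """Return True when the mean of the last `window` readings exceeds limit_c.

    Averaging over a short window is an assumption made here so that a
    single noisy sample does not raise a fault on its own.
    """
    if len(readings_c) < window:
        # Not enough history yet to make a judgment.
        return False
    return mean(readings_c[-window:]) > limit_c
```

The sensor never observes the fan directly. It can only indicate that the cooling subsystem as a whole needs attention, so isolating the failed fan remains a separate step.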
In a system constructed to provide a high level of service availability, it is desirable that each
component or sub-system within the system contribute to the fault detection process. This
contribution can be as simple as reporting the current status of the failed or failing fault
domain. In a system with more sophisticated management, the level or type of responses from several
components within the system may be combined to detect a single fault. The level of complexity of the
fault detection capabilities of the system has a strong impact upon service availability.
6.1.5 Techniques
There are many ways of detecting faults. The following are some general methods that can be
applied throughout a high availability system. A system does not have to employ all of these
methods to be considered highly available, nor is this an all-inclusive list of fault detection
methods.
There is a balance between detection and performance that must be established for a given system
in a given application. It is possible to do so much checking that the main tasks progress at an
unacceptable rate. It is also possible to reduce checking to the point that faults are not detected. The
worst case is to have excessive fault detection on one data path, but little to none on another.
Another point to consider when including large numbers of fault detectors in a design is that the
fault detectors themselves can fail. For these reasons it is critical that fault detection be designed as
an integral part of the system, by those who fully understand the system.
Value Range Checking
In most applications the result of an operation must fall within a certain range. Testing against
these boundary conditions verifies that the data is as expected. This concept is applied when
setting limits for temperature or airflow and when performing address calculations, particularly
for stack and I/O operations.
In some cases it may be necessary to pass an expected output value range in addition to the data
normally passed between components for this type of testing to be possible.
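One way to pass the expected range along with the data is sketched below. The names Ranged, RangeFault, and checked are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


class RangeFault(Exception):
    """Raised when a value falls outside the range its producer declared."""


@dataclass(frozen=True)
class Ranged:
    """A value that travels together with its expected range."""
    value: float
    low: float
    high: float

    def in_range(self) -> bool:
        # Comparisons with NaN are always False, so NaN is rejected too.
        return self.low <= self.value <= self.high


def checked(item: Ranged) -> float:
    """Return the value if it is within range, otherwise raise RangeFault."""
    if not item.in_range():
        raise RangeFault(f"{item.value} outside [{item.low}, {item.high}]")
    return item.value
```

Because the range accompanies the value, the receiving component can perform the check without knowing in advance what limits the producer considers valid.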
Data Integrity Checking
Whenever data is transferred from one component to another, it can be corrupted. This is
particularly true when data is passed between hardware components. However, since software
layers can hide the difference between local memory transfers and transfers across remote links, it
may be useful to check integrity at multiple points.
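A common way to check integrity is to append a checksum at the sender and verify it at the receiver. The sketch below uses CRC-32 from Python's standard zlib module. The choice of CRC-32, the 4-byte big-endian trailer, and the function names frame and unframe are assumptions made for illustration, not a format defined by this document.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload."""
    crc = zlib.crc32(payload)
    return payload + crc.to_bytes(4, "big")


def unframe(message: bytes) -> bytes:
    """Verify the trailing CRC-32 and return the payload, or raise ValueError."""
    if len(message) < 4:
        raise ValueError("message too short to contain a checksum")
    payload, trailer = message[:-4], message[-4:]
    if zlib.crc32(payload) != int.from_bytes(trailer, "big"):
        raise ValueError("checksum mismatch: data corrupted in transit")
    return payload
```

The same pair of functions can be applied at each layer boundary where a transfer might cross a remote link, which is how integrity can be checked at multiple points.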