Providing Open Architecture High Availability Solutions

Detection of faults can occur through various avenues within a system. A fault may be detected at

the source of the fault itself. Many components are designed so that they can trap or report their

own error or out-of-tolerance conditions. These types of detected faults

can range anywhere from slight threshold incursions to complete component or resource failures.

An example of a fault detected within a resource is one reported by an intelligent power supply. It is

common practice to design power supplies to report various failure states (e.g., degrade, inhibit,

failure). Based on the failure state reported, the severity of the fault can be determined along with the

fault’s impact upon system serviceability.
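
As an illustration of the paragraph above, the following minimal sketch maps a reported power supply state to a severity and a service impact. The state names degrade, inhibit, and failure come from the text; the OK state, the numeric severity ranking, the function name assess, and the rule that service is affected only when no healthy redundant supply remains are assumptions made for this example, not part of any particular product.

```python
from enum import Enum


class SupplyState(Enum):
    OK = "ok"            # assumed "no fault" state, not named in the text
    DEGRADE = "degrade"
    INHIBIT = "inhibit"
    FAILURE = "failure"


# Hypothetical severity ranking: higher means more severe.
SEVERITY = {
    SupplyState.OK: 0,
    SupplyState.DEGRADE: 1,
    SupplyState.INHIBIT: 2,
    SupplyState.FAILURE: 3,
}


def assess(state, healthy_redundant_supplies):
    """Return (severity, service_affected) for a reported supply state.

    Assumption: service is affected only when the supply can no longer
    deliver power (inhibit or failure) and no healthy redundant supply
    is available to carry the load.
    """
    severity = SEVERITY[state]
    service_affected = severity >= 2 and healthy_redundant_supplies == 0
    return severity, service_affected
```

Under these assumptions, a failed supply in a system with one healthy redundant supply is severe but does not affect service, while the same failure with no redundancy does.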

Faults may also be detected by system resources external to the faulted subsystem. One case of this

might be a fan failure. A non-intelligent fan itself will not indicate that it is no longer working, but

temperature sensor readings would indicate there is a fault in the cooling subsystem that needs

attention.
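
The fan case can be sketched as follows. The fan reports nothing, so the fault is inferred from temperature readings taken elsewhere in the system. The 55 degree limit, the requirement of three consecutive high readings, and the function name are all hypothetical values chosen for illustration; a real system would take them from its thermal design.

```python
def cooling_fault_suspected(readings_c, limit_c=55.0, min_consecutive=3):
    """Return True if the most recent readings all exceed the limit.

    readings_c is a list of temperatures in degrees Celsius, oldest
    first. Requiring several consecutive high readings avoids raising
    a fault on a single transient spike.
    """
    if len(readings_c) < min_consecutive:
        return False
    recent = readings_c[-min_consecutive:]
    return all(reading > limit_c for reading in recent)
```

Note that this detects a fault in the cooling subsystem as a whole; it cannot by itself say whether the fan, a blocked filter, or the sensor is at fault.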

In a system constructed to provide a high level of service availability, it is desirable that each

component or sub-system within the system contribute to the fault detection process. This

contribution can take the form of simply reporting the current status of the failed or failing fault

domain. In a system with more complex management, the level or type of responses from various

components within the system may be used to detect a single fault. The level of complexity of the

fault detection capabilities of the system has a strong impact upon service availability.
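
One way to use responses from several components to detect a single fault is to correlate their reports. The sketch below assumes each report is a (reporter, suspect) pair and treats a suspect as faulted once a minimum number of distinct reporters name it. The report format, the threshold, and the component names in the usage note are hypothetical.

```python
from collections import Counter


def correlate(reports, min_reporters=2):
    """Return suspects named by at least min_reporters distinct reporters.

    reports is an iterable of (reporter, suspect) pairs. Duplicate
    reports from the same reporter about the same suspect count once.
    """
    distinct = set(reports)
    counts = Counter(suspect for _reporter, suspect in distinct)
    return sorted(s for s, n in counts.items() if n >= min_reporters)
```

For example, if two boards both report losing their link to the same switch, the switch is a more likely fault location than either board, whereas a single board's report alone is not conclusive.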

6.1.5 Techniques

There are many ways of detecting faults. The following are some general methods that can be

applied throughout a high availability system. A system does not have to employ all of these

methods to be considered highly available, nor is this an all-inclusive list of fault detection

methods.

There is a balance between detection and performance that must be established for a given system

in a given application. It is possible to do so much checking that the main tasks progress at an

unacceptable rate. It is also possible to reduce checking to the point that faults are not detected. The

worst case is to have excessive fault detection on one data path, but little to none on another.

Another point to consider when including large numbers of fault detectors in a design is that the

fault detectors themselves can fail. For these reasons, it is critical that fault detection be designed as

an integral part of the system, by those who fully understand the system.

Value Range Checking

In most applications the result of an operation must fall within a certain range. Tests can be done

for these boundary conditions to verify that the data is as expected. This concept is applied when

setting limits for temperature or airflow and when doing address calculations and operations,

particularly when it comes to stack and I/O operations.

In some cases it may be necessary to pass an expected output value range in addition to the data

normally passed between components for this type of testing to be possible.
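
A minimal sketch of value range checking applied to an address calculation is shown below. The caller passes the expected output range along with the normal arguments, as described above. The names RangeFault, check_range, and scaled_offset, and the addresses in the usage note, are hypothetical.

```python
class RangeFault(Exception):
    """Raised when a computed value falls outside its expected range."""


def check_range(value, low, high, name="value"):
    """Return value unchanged if low <= value <= high, else raise."""
    if not (low <= value <= high):
        raise RangeFault(f"{name}={value} outside [{low}, {high}]")
    return value


def scaled_offset(base, index, element_size, expected):
    """Compute base + index * element_size and range-check the result.

    expected is a (low, high) pair supplied by the caller, which knows
    the bounds of the buffer being addressed.
    """
    address = base + index * element_size
    return check_range(address, expected[0], expected[1], "address")
```

With a buffer occupying 0x1000 through 0x10FF, scaled_offset(0x1000, 4, 8, (0x1000, 0x10FF)) returns 0x1020, while an index of 64 produces an address beyond the buffer and raises RangeFault instead of silently corrupting adjacent memory.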

Data Integrity Checking

Whenever data is transferred from one component to another, corruption is possible. This is

particularly true when data is passed between hardware components. However, since software

layers can hide the difference between local memory transfers and transfers across remote links, it

may be useful to check integrity at multiple points.
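
The following sketch shows one common form of integrity check: the sender appends a checksum to the data, and the receiver recomputes and compares it. CRC-32 from the Python standard library's zlib module is used here purely as an example; the text does not prescribe a particular algorithm, and the frame layout and function names are assumptions.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(message: bytes) -> bytes:
    """Verify the trailing CRC-32 and return the payload.

    Raises ValueError if the message is too short or the CRC does not
    match, which indicates corruption somewhere between the two points.
    """
    if len(message) < 4:
        raise ValueError("message too short to hold a CRC")
    payload = message[:-4]
    received = int.from_bytes(message[-4:], "big")
    if zlib.crc32(payload) != received:
        raise ValueError("CRC mismatch: data corrupted in transit")
    return payload
```

Because the same pair of functions works regardless of whether the transfer is a local memory copy or crosses a remote link, the check can be repeated at each point where integrity matters.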