Providing Open Architecture High Availability Solutions

precaution is taken to build reliable systems because they cannot tolerate the repair intervals of

failures. For most other applications, however, the ability to provide near continuous service by

repairing faults and preventing their propagation is more economical and still acceptable to their

users.

Even when building to the toughest requirements for extra-high reliability, one has to consider the

presence of faults, their containment, and the restoration of service after a failure. Most engineers

share the opinion that fault-free systems do not exist; there are only systems that have not yet failed.

3.3.2 What Compromises a System’s Reliability

Faults, errors, and failures are the conditions that ultimately compromise a system’s reliability and

in turn its availability. Recall that a failure is an unacceptable or incorrect

result delivered by a system, as perceived by its user(s). Failures can be classified from three

viewpoints [Lapr92], as shown in Figure 1.

These failure classes are a generalization of the ways in which a system may fail. Failures that are

related to value and timing are regarded as halting failures. They reflect an absence of

activity from the system and are often caused by, for example, a failed component. Other failures,

often the result of development faults, manifest themselves as either consistent failures or

Byzantine failures. That is, they are either easily reproducible (consistent) or inconsistent

(Byzantine), the latter typically being caused by latent faults. A system that generally exhibits

only benign failures is sometimes referred to as a fail-safe system.
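
The failure classes described above can be sketched as a small data model. This is only an illustration under stated assumptions, not part of [Lapr92]: the viewpoint names (Domain, Perception, Consequence), the CATASTROPHIC counterpart to benign failures, the placement of halting failures alongside value and timing failures, and the helper is_fail_safe are all hypothetical names chosen here, and the fail-safe check is stricter than the "generally" used in the text.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    """How the failure shows up (assumed viewpoint name)."""
    VALUE = "value"      # an incorrect result is delivered
    TIMING = "timing"    # a result is delivered too early or too late
    HALTING = "halting"  # absence of activity, e.g. a failed component


class Perception(Enum):
    """How users perceive the failure (assumed viewpoint name)."""
    CONSISTENT = "consistent"  # easily reproducible
    BYZANTINE = "byzantine"    # inconsistent, typically from latent faults


class Consequence(Enum):
    """Severity of the failure (assumed viewpoint name)."""
    BENIGN = "benign"
    CATASTROPHIC = "catastrophic"  # assumed counterpart to benign


@dataclass(frozen=True)
class Failure:
    description: str
    domain: Domain
    perception: Perception
    consequence: Consequence


def is_fail_safe(history):
    """Return True if every recorded failure was benign.

    Simplification: the text says a fail-safe system "generally" exhibits
    only benign failures; this check requires all of them to be benign.
    An empty history also returns True.
    """
    return all(f.consequence is Consequence.BENIGN for f in history)


# Hypothetical failure log for one system.
history = [
    Failure("power supply failed, node went silent",
            Domain.HALTING, Perception.CONSISTENT, Consequence.BENIGN),
    Failure("stale value returned to one client only",
            Domain.VALUE, Perception.BYZANTINE, Consequence.BENIGN),
]

print(is_fail_safe(history))
```

Keeping the three viewpoints as separate fields reflects the idea that they are independent ways of looking at the same failure, so a single failure record carries one value from each.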

Faults and their sources are diverse. In ‘Dependability: Basic Concepts and

Terminology, Dependable Computing and Fault Tolerant Systems’ [Lapr92], Laprie classifies them

according to five main viewpoints, shown in Figure 2.

Figure 1. Failure Classes