Providing Open Architecture High Availability Solutions

precaution is taken to build reliable systems because they cannot tolerate the repair intervals of
failures. For most other applications, however, the ability to provide near-continuous service by
repairing faults and preventing their propagation is more economical and still acceptable to their
users.
Even when building to the toughest requirements for extra-high reliability, one has to consider the
presence of faults, their containment, and the restoration of service after a failure. Most engineers
share the opinion that non-faulty systems do not exist; there are only systems that have not yet failed.
3.3.2 What Compromises a System's Reliability
Faults, errors, and failures are the conditions that ultimately compromise a system's reliability and,
in turn, its availability. Recall that a failure is a reflection of an unacceptable or incorrect
result delivered by a system as perceived by its user(s). Failures can be classified from three
viewpoints [Lapr92], as shown in Figure 1.
These failure classes are a generalization of the ways in which a system may fail. Failures that
combine value and timing failures are regarded as halting failures. They reflect the absence of
activity from the system and are often caused, for example, by a failed component. Other failures,
often the result of development faults, manifest themselves as either consistent failures or
Byzantine failures; that is, they are either easily reproducible or inconsistent, the latter typically
being caused by latent faults. A system that generally exhibits only benign failures is sometimes
referred to as a fail-safe system.
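As an illustration only, the failure classes described above can be modeled as a small data
structure. The following Python sketch is not taken from [Lapr92]. The names Domain, Perception,
Consequence, Failure, and is_fail_safe are hypothetical, the grouping into three viewpoints is an
assumed reading of Figure 1, and SEVERE is a placeholder for any non-benign consequence, which the
text does not name.

```python
from dataclasses import dataclass
from enum import Enum


class Domain(Enum):
    """How the failure shows up in the delivered service (assumed viewpoint name)."""
    VALUE = "value"
    TIMING = "timing"
    HALTING = "halting"  # absence of activity, e.g. after a component fails


class Perception(Enum):
    """How users perceive the failure (assumed viewpoint name)."""
    CONSISTENT = "consistent"  # easily reproducible
    BYZANTINE = "byzantine"    # inconsistent, typically from latent faults


class Consequence(Enum):
    """Effect of the failure on the system's environment (assumed viewpoint name)."""
    BENIGN = "benign"
    SEVERE = "severe"  # placeholder: the text names only the benign class


@dataclass(frozen=True)
class Failure:
    """One observed failure, classified from each of the three viewpoints."""
    domain: Domain
    perception: Perception
    consequence: Consequence


def is_fail_safe(observed):
    """Return True if every observed failure is benign.

    This is a strict reading of "generally only exhibits benign failures";
    a real assessment would likely tolerate rare exceptions.
    """
    return all(f.consequence is Consequence.BENIGN for f in observed)


# Example: a system whose only recorded failure is a benign halt.
history = [Failure(Domain.HALTING, Perception.CONSISTENT, Consequence.BENIGN)]
print(is_fail_safe(history))
```

Keeping the three viewpoints as separate fields reflects the idea that a single failure is described
from each viewpoint at once, rather than belonging to exactly one class.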
Faults and their sources are varied and diverse. In ‘Dependability: Basic Concepts and
Terminology, Dependable Computing and Fault Tolerant Systems,’ [Lapr92] Laprie classifies them
according to five main viewpoints shown in Figure 2.
Figure 1. Failure Classes