Providing Open Architecture High Availability Solutions
To provide this type of availability, a system must be designed for reliability, must have
processes in place to ensure rapid recovery from faults, and must be used only within its design
parameters. Additionally, the system must be well managed and secure from unauthorized use and
activities.
3.2 Terminology
Reliability engineering uses many closely related terms, and presenting them effectively requires
some definitions. The following narration defines each term so that it can be used in a way that
is consistent with the goals and positioning of this paper.
A system is composed of a collection of interacting components. A component may itself be a
system, or it may be a single component. Components result from system decomposition, which is
chiefly motivated by the need to partition complex systems for technical or, very often,
organizational or business reasons. Decomposition of systems into components is a recursive
exercise. A component that is not decomposed further is called an atomic component. Components
are typically delineated by the careful specification of their inputs and outputs. A system provides
one or more services to its consumers. A service is the output of a system that meets the
specification for which the system was devised, or which agrees with what system users have
perceived the correct values to be [Lyu96].
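
The recursive decomposition described above can be illustrated with a minimal sketch. The
Component type, its field names, and the example server are hypothetical and are not taken from
this paper; they only show that a component is either atomic or is itself a system of components.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Component:
    """Hypothetical illustration type: a component delineated by its inputs and outputs."""
    name: str
    inputs: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    children: List["Component"] = field(default_factory=list)

    def is_atomic(self) -> bool:
        # An atomic component is one that is not decomposed further.
        return not self.children

    def atomic_components(self) -> Iterator["Component"]:
        # Recursive walk: yield this component if atomic, otherwise descend into its parts.
        if self.is_atomic():
            yield self
        else:
            for child in self.children:
                yield from child.atomic_components()


# Hypothetical example system (names chosen only for illustration).
disk = Component("disk", inputs=["block request"], outputs=["block data"])
controller = Component("controller", inputs=["I/O command"], outputs=["block request"])
storage = Component("storage", children=[disk, controller])
cpu = Component("cpu")
server = Component("server", children=[storage, cpu])

atomic_names = [c.name for c in server.atomic_components()]
```

Here storage is both a component of server and a system in its own right, while disk, controller,
and cpu are atomic because the sketch does not decompose them further.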
A failure in a system occurs when the consumer (human or non-human) of a service is affected by
the fact that the system has not delivered the expected service. Failures are incorrect results with
respect to a specification or unexpected behavior perceived by the consumer or user of a service.
The cause of a failure is said to be a fault. Faults are identified or detected in some manner either
by the system or by its users. Finally, an error is a discrepancy between the computed, measured,
or observed value or condition and the correct, or specified value or condition. Errors are often the
result of exceptional conditions or unexpected interference. If an error occurs, then a failure of
some form has occurred. Similarly, if a failure has occurred, then an error of some form has
occurred. Since the difference between an error and a failure is very subtle, the remainder of this
document will treat the terms synonymously.
Note: Faults are active when they produce an error. Faults may also be latent. A latent fault is one that
has not yet been detected.
Since faults, when activated, cause errors, it is important to detect not only active faults, but also
the latent faults. Finding and removing these faults before they become active leads to less
downtime and higher availability.
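
The benefit of removing latent faults can be shown with a small sketch. It assumes, purely for
illustration, that each fault becomes active at a known time step and that a scan finds every
latent fault; the Fault type, the function names, and the example faults are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fault:
    name: str
    activates_at: int  # assumed time step at which the fault starts producing errors

    def is_active(self, now: int) -> bool:
        # A fault is active once it produces errors; before that it is latent.
        return now >= self.activates_at


def scan_and_remove_latent(faults: List[Fault], now: int) -> List[Fault]:
    """Return the faults left after every fault still latent at `now` is found and removed."""
    return [f for f in faults if f.is_active(now)]


def count_error_steps(faults: List[Fault], horizon: int) -> int:
    """Count the time steps in [0, horizon) during which at least one fault is active."""
    return sum(1 for t in range(horizon) if any(f.is_active(t) for f in faults))


# Hypothetical faults, both latent at time 0.
faults = [Fault("worn fan bearing", activates_at=3), Fault("memory leak", activates_at=6)]

errors_without_scan = count_error_steps(faults, horizon=10)
errors_with_scan = count_error_steps(scan_and_remove_latent(faults, now=0), horizon=10)
```

Under these assumptions, leaving the faults in place yields errors from time step 3 onward, while
a scan at time 0 removes both faults before either becomes active. Real detection is rarely
perfect, so the sketch shows the direction of the effect rather than its size.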
Table 2. Service Expectations

Service Type        Service Expectation
------------------  ------------------------------------------------------------------------
Telephone System    Dial tone within seconds of lifting receiver
                    Call completion (ring or busy) within a second of dialing the last digit
                    Call does not get dropped
Web Site            First visual frame within a few seconds
Financial Backend   No lost transactions