Providing Open Architecture High Availability Solutions

System designers often build reliability into their platforms by building in correction mechanisms
for the latent faults that concern them. These faults, when correctable, do not produce errors or
failures, since they fall within the design margins built into the system. They should still be
monitored to measure their occurrence relative to the designers' anticipated frequency, since excessive
occurrence of some correctable faults is often an indicator of a more catastrophic underlying latent
fault; for example, voltage fluctuations that induce correctable data path errors.
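The monitoring idea above can be sketched as a sliding-window counter that compares the observed rate of correctable faults with the rate the designers anticipated. This is only an illustration: the class name, the one-hour window, and the tolerance factor of 2 are hypothetical choices, not values taken from this document.

```python
from collections import deque


class CorrectableFaultMonitor:
    """Flags correctable faults that occur more often than anticipated.

    All names and default thresholds are hypothetical, for illustration only.
    """

    def __init__(self, expected_per_hour: float, tolerance: float = 2.0,
                 window_seconds: float = 3600.0) -> None:
        self.expected_per_hour = expected_per_hour  # designers' anticipated rate
        self.tolerance = tolerance                  # allowed multiple of that rate
        self.window_seconds = window_seconds        # length of the sliding window
        self._timestamps = deque()                  # times of recent corrected faults

    def record(self, timestamp: float) -> None:
        """Record one corrected fault observed at `timestamp` (seconds)."""
        self._timestamps.append(timestamp)
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        """Drop faults that have fallen out of the sliding window."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()

    def observed_per_hour(self, now: float) -> float:
        """Return the fault rate over the current window, scaled to one hour."""
        self._expire(now)
        return len(self._timestamps) * 3600.0 / self.window_seconds

    def is_excessive(self, now: float) -> bool:
        """True when the observed rate exceeds tolerance * expected rate."""
        return self.observed_per_hour(now) > self.tolerance * self.expected_per_hour
```

A caller would invoke `record()` each time the hardware reports a corrected fault and poll `is_excessive()` to decide whether the underlying cause deserves investigation before it produces an uncorrectable error.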
Dependability is defined as the trustworthiness of a system such that reliance can justifiably be
placed on the service it delivers [Lapr92]. Dependability has several attributes. Availability is
defined as the readiness for usage. The continuation of service, in the absence of failure, is called
reliability. The nonoccurrence of catastrophic consequences or injury to the environment or its
users is called safety. The nonoccurrence of unauthorized disclosure of information results in
confidentiality. The nonoccurrence of improper alteration of information results in integrity. The
ability to undergo repairs and evolution provides maintainability. Finally, the association of
integrity and confidentiality results in security.
Reliability is typically quantified by its failure rate or, equivalently, by its mean time to
failure (MTTF). The MTTF is the statistical mean of the projected or observed elapsed time during
which the system or element provides service without failure; the failure rate is its
reciprocal. Another attribute related to reliability is the mean time to repair (MTTR). This
attribute represents the time it takes to resume service after a failure has been
experienced. The availability is then expressed by:
Equation 1.
$$\mathrm{Availability} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$$
As you can derive from this very simple formula, the secret to high availability is either creating
very reliable elements (very high MTTFs) or creating elements that can recover from failure very
rapidly (very low MTTRs).
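To make the two routes to high availability concrete, the following sketch evaluates availability as MTTF / (MTTF + MTTR) for a few cases. The function names, the MTTF and MTTR figures, and the 8760-hour year are assumptions chosen for illustration, not measurements of any real system.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * hours_per_year


# Hypothetical figures, chosen only to compare the two strategies.
baseline = availability(mttf_hours=1000.0, mttr_hours=10.0)
more_reliable = availability(mttf_hours=10000.0, mttr_hours=10.0)  # 10x higher MTTF
faster_repair = availability(mttf_hours=1000.0, mttr_hours=1.0)    # 10x lower MTTR

for label, a in [("baseline", baseline),
                 ("10x higher MTTF", more_reliable),
                 ("10x lower MTTR", faster_repair)]:
    print(f"{label}: availability={a:.6f}, "
          f"downtime/year={annual_downtime_hours(a):.2f} h")
```

In this example, raising the MTTF tenfold and cutting the MTTR tenfold yield the same availability, because only the ratio of the two terms matters in the formula.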
3.3 System Reliability and Availability
3.3.1 Reliability vs. Availability
As introduced earlier, reliability is a dependability attribute. It is a measure of the continuous
delivery of a service in the absence of failure. Reliability is most often represented as a
probabilistic number or formula that estimates the average time until failure, or MTTF. Because
this measure is probabilistic, it can be relied on only with limited confidence.
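One way to see why an MTTF figure carries limited confidence is to assume a constant failure rate (the exponential model). That assumption is introduced here for illustration and is not stated in this document. Under it, the probability of running without failure for a time t is exp(-t / MTTF), as the sketch below shows; the function name is hypothetical.

```python
import math


def reliability(t_hours: float, mttf_hours: float) -> float:
    """Probability of failure-free operation up to t_hours.

    Assumes a constant failure rate of 1 / MTTF (exponential model),
    which is an illustrative assumption, not a claim from the text.
    """
    if mttf_hours <= 0 or t_hours < 0:
        raise ValueError("MTTF must be positive and t non-negative")
    failure_rate = 1.0 / mttf_hours
    return math.exp(-failure_rate * t_hours)


# Chance that a unit with a 1000-hour MTTF is still running at 1000 hours.
print(f"{reliability(1000.0, 1000.0):.3f}")
```

Under this model a unit has only about a 37% chance (1/e) of surviving as long as its MTTF, so the MTTF should be read as a statistical average rather than a guaranteed service interval.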
Availability, another attribute of dependability, is a quite different measure. It is the measure of the
probability that a service is available for use at any given instant (and potentially in turn for some
interesting interval thereafter). Availability allows for service failure, with the presumption that
service restoration is imminent. The key to high availability is to minimize the restoration intervals.
In Equation 1, as the MTTR approaches zero the availability approaches 1, or 100%.
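The limiting behavior can be written out explicitly. The symbol U for unavailability is introduced here for convenience, and the final approximation assumes the MTTR is much smaller than the MTTF:

```latex
\lim_{\mathrm{MTTR} \to 0} \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
  = \frac{\mathrm{MTTF}}{\mathrm{MTTF}} = 1,
\qquad
U = 1 - \mathrm{Availability}
  = \frac{\mathrm{MTTR}}{\mathrm{MTTF} + \mathrm{MTTR}}
  \approx \frac{\mathrm{MTTR}}{\mathrm{MTTF}}
  \quad (\mathrm{MTTR} \ll \mathrm{MTTF}).
```

In that regime, halving the restoration interval roughly halves the unavailability, which is why short restoration intervals matter so much.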
The design and development of highly reliable systems is very challenging. Because the techniques
used to design, develop, and test systems with high reliability goals are typically very expensive,
building highly reliable systems is often limited to special industries and applications. Such is the
case in many avionics, life-support, military, and aerospace programs. In many of these
environments, the presence of a failure is often deemed potentially life threatening; hence, every