Providing Open Architecture High Availability Solutions

System designers often build reliability into their platforms by building in correction mechanisms

for latent faults that concern them. These faults, when correctable, do not produce errors or

failures since they are part of the design margins built into the system. They should still be

monitored to measure their occurrence relative to the designers' anticipated frequency, since excessive

occurrence of some correctable faults is often an indicator of a more catastrophic underlying latent

fault, for example, voltage fluctuations that are inducing correctable data path errors.
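
The monitoring idea above can be sketched in a few lines of Python. This is only an illustration, not a method taken from this paper: the function name, the fixed observation window, and the tolerance factor of 2 are all assumptions chosen for the example.

```python
def correctable_fault_alarm(observed_count, window_hours,
                            expected_per_hour, tolerance=2.0):
    """Return True when correctable faults occur more often than anticipated.

    All parameters are hypothetical: `expected_per_hour` stands for the
    designers' anticipated frequency, and `tolerance` is an assumed margin
    (here 2x) before the excess is treated as a warning sign.
    """
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    observed_rate = observed_count / window_hours
    return observed_rate > tolerance * expected_per_hour


# Made-up numbers: 0.5 correctable errors per hour were anticipated.
print(correctable_fault_alarm(10, 24, 0.5))  # about 0.42/hour, prints False
print(correctable_fault_alarm(60, 24, 0.5))  # 2.5/hour, prints True
```

A real platform would likely use sliding windows and per-component counters; the fixed window here only keeps the comparison against the anticipated frequency easy to see.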

Dependability is defined as the trustworthiness of a system such that reliance can justifiably be

placed on the service it delivers [Lapr92]. Dependability has several attributes. Availability is

defined as the readiness for usage. The continuation of service, in the absence of failure, is called

reliability. The nonoccurrence of catastrophic consequences or injury to the environment or its

users is called safety. The nonoccurrence of unauthorized disclosure of information results in

confidentiality. The nonoccurrence of improper alteration of information results in integrity. The

ability to undergo repairs and evolution provides maintainability. And finally, the association of

integrity and confidentiality results in security.

Reliability is typically quantified by the mean time to failure, or MTTF, or equivalently by the

failure rate. MTTF is the statistical mean of the elapsed time for which the system or element

provides service before its projected or observed failure; the failure rate is represented as its

reciprocal. Another attribute related to reliability is the mean time to repair, or MTTR. This

attribute represents the time it takes to resume service after a failure has been

experienced. The availability is then expressed by:

Equation 1.

\[ \text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \]

As you can derive from this very simple formula, the secret to high availability is either creating

very reliable elements (very high MTTFs) or creating elements that can recover from failure very

rapidly (very low MTTRs).
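
To make the trade-off concrete, the following Python sketch evaluates availability for a few combinations of MTTF and MTTR. It assumes Equation 1 has its usual form, Availability = MTTF / (MTTF + MTTR); the function names and the sample figures are hypothetical and are not taken from any measured system.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / (MTTF + MTTR), the assumed form of Equation 1."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(mttf_hours, mttr_hours):
    """Expected hours of unavailability per year at the given availability."""
    return (1.0 - availability(mttf_hours, mttr_hours)) * HOURS_PER_YEAR


# Illustrative values only: a baseline, a 10x faster repair, a 10x longer MTTF.
for mttf, mttr in [(10_000, 10), (10_000, 1), (100_000, 10)]:
    a = availability(mttf, mttr)
    d = annual_downtime_hours(mttf, mttr)
    print(f"MTTF={mttf} h, MTTR={mttr} h: availability={a:.5f}, "
          f"downtime={d:.2f} h/year")
```

With these sample figures, a tenfold reduction in MTTR and a tenfold increase in MTTF each lift availability from roughly 0.9990 to roughly 0.9999, which is the point the text makes: either lever works.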

3.3 System Reliability and Availability

3.3.1 Reliability vs. Availability

As introduced earlier, reliability is a dependability attribute. It is a measure of the continuous

delivery of a service in the absence of failure. Reliability is most often represented as a

probabilistic number or formula that estimates the average time until failure, or MTTF. Because

this measure is probabilistic, it can be relied on only with limited confidence.

Availability, another attribute of dependability, is quite a different measure. It is the

probability that a service is available for use at any given instant (and, potentially, for some

interval of interest thereafter). Availability allows for service failure, with the presumption that

service restoration is imminent. The key to high availability is to minimize the restoration intervals.

In Equation 1, as the MTTR approaches zero the availability approaches 1, or 100%.
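
The limiting behavior can be written out explicitly. The step below assumes Equation 1 has its usual form, Availability = MTTF / (MTTF + MTTR), with MTTF held fixed and greater than zero:

```latex
\[
\lim_{\text{MTTR} \to 0} \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}
  = \frac{\text{MTTF}}{\text{MTTF}}
  = 1
\]
```

The result does not depend on the value of MTTF, which is why even a modestly reliable element can, in principle, yield a highly available service if it is restored quickly enough.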

The design and development of highly reliable systems is very challenging. Because the techniques

used to design, develop, and test systems with high reliability goals are typically very expensive,

building highly reliable systems is often limited to special industries and applications. Such is the

case in many avionics, life-support, military, and aerospace programs. In many of these

environments, the presence of a failure is often deemed potentially life threatening; hence, every
