Providing Open Architecture High Availability Solutions

Fault and Failure Forecasting

The ability to manage reliability futures is an instrumental part of the complete lifecycle of system

availability. Understanding the operational environment, gathering field failure data, applying

reliability models, and analyzing and interpreting the results are all important to

managing availability successfully.

Fault and failure forecasting also depends on a related concept called reliability growth.

Reliability growth is defined as the continued improvement of reliability in systems. It is generally

observed and measured as a successive increase in the intervals between failures. By using and

applying the methods described in this section, reliability is expected to improve over time, and

improve between system versions and variants when these methods are reused and matured.
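
As an illustration of measuring reliability growth through the intervals between failures, the following is a minimal Python sketch. It is not a method prescribed by this document. The failure times are hypothetical, the function names are invented for the example, and the Laplace trend statistic is one commonly used trend test that is swapped in here as an assumption. A clearly negative value suggests that failures are concentrated early in the observation window, which is consistent with growing intervals between failures.

```python
import math


def interfailure_intervals(failure_times):
    """Return the gaps between successive failures.

    failure_times are cumulative times (for example, operating hours) in
    ascending order; the first gap is measured from time 0.
    """
    gaps = []
    previous = 0.0
    for t in failure_times:
        gaps.append(t - previous)
        previous = t
    return gaps


def laplace_trend(failure_times, observation_end):
    """Laplace trend statistic for a time-truncated observation window.

    Negative values suggest reliability growth, positive values suggest
    reliability decrease, and values near zero suggest no trend.
    """
    n = len(failure_times)
    mean_time = sum(failure_times) / n
    spread = observation_end * math.sqrt(1.0 / (12.0 * n))
    return (mean_time - observation_end / 2.0) / spread


# Hypothetical field data: cumulative operating hours at each failure.
failure_hours = [10.0, 25.0, 50.0, 90.0, 150.0, 240.0]

gaps = interfailure_intervals(failure_hours)
u = laplace_trend(failure_hours, observation_end=300.0)

print("intervals between failures:", gaps)
print("Laplace trend statistic: %.3f" % u)
```

With this sample data the intervals increase from one failure to the next and the statistic is negative. With so few data points, such a result should be read as an indication rather than as proof of growth.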

3.3.4 The Challenges of Making Highly Reliable and Highly Available Systems

When designing a system, requirements and a great deal of engineering typically go into

optimizing three primary dimensions: cost, performance, and dependability. The maturity,

accuracy, and repeatability of the cost and performance dimensions are well understood and

demonstrated in most industry areas. However, the ability to understand and demonstrate the

dependability of systems is generally lagging in most industries.

All of the methods described to this point are goals to strive for when designing systems for high

availability. Unfortunately, each of these goals, techniques, and methods is subject to

imperfections. Contamination occurs most often through human error: whether in the design,

implementation, or use of systems, faults are inevitably created.

The complexity of modern systems has increased so significantly in recent years that it is rare that

a single company can provide all components of the system. Therefore, Commercial Off The Shelf

(COTS) components are used to help integrate functions that are not the core competency of the

company developing the system. These components, when stitched together, ultimately yield a web

of complexity in their dependencies. Often, the low level of characterization that accompanies

COTS components aggravates issues associated with building dependable systems. Hardware

components generally have a much stronger discipline in their characterization than do software

components. The unavoidable use of these components, together with their coupled errors, requires

fault removal in systems in order to maintain high levels of availability.

The fact that software reliability engineering has not advanced as quickly as hardware reliability

engineering is further aggravated by the fact that software is often used to mask hardware failures. In

many systems, user interfaces are either controlled by or presented by software; hence, they are the

barrier between the user and a system exposing failures via the loss of service or incorrect service

from those interfaces. Oftentimes, these software systems control and manage the loose coupling

between user and system, and when possible, provide failure avoidance measures.

Software reliability is similar to hardware reliability in that both are stochastic processes and can

be described by probability distributions. However, software reliability is different from hardware

reliability in that software does not wear out, burn out, or deteriorate (i.e., its reliability does not

decrease with time). Additionally, software generally benefits from reliability growth during

testing and operation since software faults can be detected and removed when software failures

occur. On the other hand, software may experience a decrease in reliability due to abrupt changes in its

operational usage or incorrect modifications to the software. Software is also continuously modified

throughout its life cycle. This malleability of software only increases the risk of introducing new

errors.
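
To make the idea of describing software reliability by a probability distribution more concrete, the following is a minimal Python sketch of reliability growth during testing. This document does not name a specific model. The sketch assumes the Goel-Okumoto model, one widely used software reliability growth model, and the parameter values and function names are hypothetical. In that model the expected number of failures by time t is m(t) = a(1 - e^(-bt)), so the failure intensity falls as faults are detected and removed.

```python
import math


def expected_failures(t, total_faults, detection_rate):
    """Goel-Okumoto mean value function m(t) = a * (1 - exp(-b * t))."""
    return total_faults * (1.0 - math.exp(-detection_rate * t))


def failure_intensity(t, total_faults, detection_rate):
    """Failure intensity, the derivative of m(t): a * b * exp(-b * t)."""
    return total_faults * detection_rate * math.exp(-detection_rate * t)


def mission_reliability(t, mission, total_faults, detection_rate):
    """Probability of no failure in the interval (t, t + mission]."""
    expected = (expected_failures(t + mission, total_faults, detection_rate)
                - expected_failures(t, total_faults, detection_rate))
    return math.exp(-expected)


# Hypothetical parameters: 50 latent faults, detection rate 0.01 per hour.
A_FAULTS = 50.0
B_RATE = 0.01

early = mission_reliability(0.0, 10.0, A_FAULTS, B_RATE)
late = mission_reliability(500.0, 10.0, A_FAULTS, B_RATE)

print("failure intensity at t=0:   %.4f" % failure_intensity(0.0, A_FAULTS, B_RATE))
print("failure intensity at t=500: %.4f" % failure_intensity(500.0, A_FAULTS, B_RATE))
print("10-hour reliability at t=0:   %.4f" % early)
print("10-hour reliability at t=500: %.4f" % late)
```

The sketch covers only the growth case. It does not model the decrease in reliability that can follow abrupt changes in usage or incorrect modifications, both of which would violate the model's assumption that faults are removed without introducing new ones.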