Providing Open Architecture High Availability Solutions
Fault and Failure Forecasting
The ability to forecast and manage future reliability is an instrumental part of the complete
lifecycle of system availability. Understanding the operational environment, gathering field failure
data, applying reliability models, and analyzing and interpreting the results are all important to
managing availability successfully.
Fault and failure forecasting requires an understanding of a related concept, reliability growth.
Reliability growth is defined as the continued improvement of reliability in systems. It is generally
measured by, and achieved through, a successive increase in the intervals between failures. When the
methods described in this section are applied, reliability is expected to improve over time and, as
the methods are reused and matured, between system versions and variants.
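The idea of measuring reliability growth through increasing intervals between failures can be
sketched in a few lines of Python. The example below is illustrative only: the source does not
prescribe a particular test, so the Laplace trend statistic is used here as one common choice, and
the failure times, the observation window, and the function names are all hypothetical.

```python
import math


def interfailure_intervals(failure_times):
    """Return the gaps between successive failures (the first gap starts at time 0)."""
    previous = 0.0
    gaps = []
    for t in failure_times:
        gaps.append(t - previous)
        previous = t
    return gaps


def laplace_trend(failure_times, observation_end):
    """Laplace trend statistic for failures observed in (0, observation_end].

    Clearly negative values suggest failures are becoming less frequent
    (reliability growth); clearly positive values suggest the opposite.
    """
    n = len(failure_times)
    if n == 0 or observation_end <= 0:
        raise ValueError("need at least one failure and a positive window")
    mean_time = sum(failure_times) / n
    spread = observation_end * math.sqrt(1.0 / (12.0 * n))
    return (mean_time - observation_end / 2.0) / spread


# Hypothetical cumulative failure times, in operating hours (not field data).
times = [10.0, 25.0, 45.0, 80.0, 130.0, 200.0]
gaps = interfailure_intervals(times)
u = laplace_trend(times, observation_end=240.0)
print("intervals between failures:", gaps)
print("Laplace statistic:", round(u, 3))
```

With these made-up numbers the intervals lengthen from one failure to the next and the statistic is
negative, which is the pattern reliability growth is expected to produce. How negative the value must
be before the trend is treated as significant is a judgment that depends on the amount of data.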
3.3.4 The Challenges of Making Highly Reliable and Highly Available Systems
When designing a system, requirements and a great deal of engineering typically go into
optimizing three primary dimensions: cost, performance, and dependability. In most industries,
the cost and performance dimensions are well understood, and the methods for demonstrating them
are mature, accurate, and repeatable. However, the ability to understand and demonstrate the
dependability of systems generally lags behind in most industries.
All of the methods described to this point are goals to strive for when designing systems for high
availability. Unfortunately, each of these goals, techniques, and methods is subject to
imperfections. Contamination occurs most often through human error: whether in the design,
implementation, or use of systems, faults are inevitably created.
The complexity of modern systems has increased so significantly in recent years that it is rare that
a single company can provide all components of the system. Therefore, Commercial Off The Shelf
(COTS) components are used to help integrate functions that are not the core competency of the
company developing the system. These components, when stitched together, ultimately yield a web
of complexity in their dependencies. Often, the low level of characterization that accompanies
COTS components aggravates issues associated with building dependable systems. Hardware
components generally have a much stronger discipline in their characterization than do software
components. The unavoidable use of these components, together with their coupled errors, is what
makes fault removal necessary in order to maintain high levels of availability.
The fact that software reliability engineering has not advanced as quickly as hardware reliability
engineering is further aggravated by the fact that software is often used to mask hardware failures.
In many systems, user interfaces are either controlled or presented by software; hence, software is
the barrier between the user and a system that would otherwise expose failures through loss of
service or incorrect service at those interfaces. Often, these software systems control and manage
the loose coupling between user and system and, when possible, provide failure avoidance measures.
Software reliability is similar to hardware reliability in that both are stochastic processes and can
be described by probability distributions. However, software reliability is different from hardware
reliability in that software does not wear out, burn out, or deteriorate (i.e., its reliability does not
decrease with time). Additionally, software generally benefits from reliability growth during
testing and operation since software faults can be detected and removed when software failures
occur. On the other hand, software reliability may decrease because of abrupt changes in its
operational usage or incorrect modifications to the software. Software is also continuously modified
throughout its life cycle, and this malleability only increases the risk of introducing new
errors.
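The contrast between components that wear out and software that does not can be illustrated with a
probability distribution. The sketch below is an assumption made for illustration, not a model taken
from the source: it uses a Weibull distribution, where a shape parameter of 1 gives a constant
failure rate (no wear-out) and a shape greater than 1 gives a failure rate that rises with age. The
parameter values and function names are hypothetical, and the sketch deliberately ignores the
reliability growth or decrease that fault removal and modification introduce.

```python
import math


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


def weibull_reliability(t, shape, scale):
    """Probability of operating without failure up to time t."""
    return math.exp(-((t / scale) ** shape))


# Assumed characteristic life of 1000 hours for both cases (illustrative only).
SCALE = 1000.0

for t in (100.0, 500.0, 1000.0):
    no_wearout = weibull_hazard(t, shape=1.0, scale=SCALE)  # constant rate
    wearout = weibull_hazard(t, shape=2.0, scale=SCALE)     # rate grows with age
    print(f"t={t:6.0f} h  constant rate={no_wearout:.4f}  wear-out rate={wearout:.4f}")
```

In this sketch the constant-rate case stays at the same failure rate at every age, while the wear-out
case rises steadily. Modeling the effect of fault removal or of faulty modifications would require a
rate that changes with each release or repair, which this example does not attempt.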