Providing Open Architecture High Availability Solutions

Software reuse can help reduce the significant influence of software failures in large systems.
Software that is maintained and reused tends to improve continually in reliability, a benefit
known as reliability growth. Object-oriented software engineering practices further encourage
software reuse and have also demonstrated significant reliability growth when the software is
maintained and reused.
Fault Removal
Fault removal during development comprises three steps: verification, diagnosis, and
correction. Verification is the process of checking whether the system fulfills its intended function. If
it does not, diagnosis and subsequent correction are required to remove
the fault and failure condition. Verification techniques vary but generally may be characterized as
being either static or dynamic.
Static verification involves the static analysis of a system. Examples include design
inspections, data flow analysis, quantitative analysis, and proof-of-correctness analysis. In each of these cases
the system is verified (at least to some extent) without its execution or delivery of service. Dynamic
verification requires the system to be executing its function(s) with the intent to verify the mapping
of input stimuli to output response. In most large complex systems, verifying all possible
inputs and their corresponding outputs is typically an intractable problem.
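
To give a rough sense of why exhaustive dynamic verification is intractable, the following minimal
sketch estimates how long it would take to exercise every input of a function once. The function
name, the 64-bit input width, and the rate of one billion tests per second are illustrative
assumptions, not figures from any measured system.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_test_years(input_bits: int, tests_per_second: float) -> float:
    """Years needed to exercise every input of the given total bit width once."""
    total_inputs = 2 ** input_bits
    return total_inputs / tests_per_second / SECONDS_PER_YEAR


# Hypothetical case: two 32-bit inputs (64 bits in total), one billion tests per second.
years = exhaustive_test_years(input_bits=64, tests_per_second=1e9)
print(f"{years:.0f} years")
```

Under these assumptions the estimate is on the order of several hundred years for a single
function with two 32-bit inputs, before any internal state or timing is considered.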
The emphasis of fault removal is different for hardware and software systems. In hardware systems
the emphasis is on the removal (more commonly called the repair) of potential
production faults. In software, the emphasis is on the removal of design faults.
When fault removal is considered after a system has been commissioned, it is called corrective
maintenance [Lapr92]. Corrective maintenance is either preventative, warding off faults before they
produce errors, or curative, aimed at removing faults reported
against the system.
Fault Tolerance
Fault tolerance is the attribute of a computing system that continues to provide
service in the presence of faults. How a system circumvents or otherwise
compensates for activated faults varies among fault tolerance techniques. In the case of
hardware, fault tolerance is often achieved through the use of redundant elements in the system. In
the case of software, fault tolerance techniques generally are realized through design diversity.
Multiple implementations of the software serve the same purpose. They are often
constructed by different development teams working from the same specification but arriving at
different designs. Two popular approaches are N-versioning and the recovery block method.
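
The two approaches can be sketched as follows. This is a simplified illustration rather than a
production design: the function names, the integer square root example, and the deliberately
faulty version are all hypothetical. N-versioning is shown as majority voting over the results of
all versions, and the recovery block as trying alternates in order until one passes an acceptance
test. A full recovery block would also restore a checkpoint of system state before each retry; the
alternates here are pure functions, so there is no state to roll back.

```python
import math
from collections import Counter


def n_version(versions, x):
    """Run every version and return the majority result (results must be hashable)."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a version that crashes simply casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value
    raise RuntimeError("no majority among versions")


def recovery_block(alternates, acceptance_test, x):
    """Try alternates in order; return the first result that passes the acceptance test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # treat a crash like a failed acceptance test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


# Three hypothetical, independently written integer square root routines.
def isqrt_lib(n):
    return math.isqrt(n)


def isqrt_search(n):
    lo, hi = 0, n
    while lo < hi:  # binary search for the largest r with r*r <= n
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def isqrt_faulty(n):
    return int(n ** 0.5) + 1  # seeded design fault: off by one


def accept_isqrt(n, r):
    """Acceptance test: r is the integer square root of n."""
    return r * r <= n < (r + 1) * (r + 1)


print(n_version([isqrt_lib, isqrt_search, isqrt_faulty], 17))
print(recovery_block([isqrt_faulty, isqrt_lib], accept_isqrt, 17))
```

In both calls the faulty version is masked: it is outvoted in the first case and rejected by the
acceptance test in the second. Note the trade-off: voting needs no acceptance test but runs every
version on every request, while the recovery block runs alternates only on failure but depends on
the quality of its acceptance test.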
Redundancy may also be applied to both software and hardware systems, but it is typically less
effective in software systems. Redundancy in software can help maintain the continuity of services that
exhibit long degradation intervals. Faults that degrade a service this slowly often pass through
verification testing and make their way to the field. In these cases, the faults usually manifest themselves only under
certain timing and loading situations (that inevitably occur at key customer sites) that are too
stochastic to be reproducible. These bugs are often referred to as Heisenbugs [Gray92]. The causes
of Heisenbugs are seldom determined. In these cases, software redundancy can help because the
long-term timing of events in the software modules will seldom be exactly the same. The
differences in timing can be increased by periodically causing a switchover between the modules
and restarting the module that is no longer active.
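
The periodic switchover described above might be sketched as follows. The class names, the
request-count trigger, and the `age` counter (a stand-in for whatever internal state and timing
history a real module accumulates) are hypothetical and chosen only for illustration. A real
system would more likely trigger the switchover on a timer and would need to hand over any live
session state.

```python
class Module:
    """Hypothetical service module; `age` stands in for accumulated internal state."""

    def __init__(self, name):
        self.name = name
        self.age = 0
        self.restarts = 0

    def handle(self, request):
        self.age += 1
        return f"{self.name}:{request}"

    def restart(self):
        self.age = 0  # a restart discards the accumulated state
        self.restarts += 1


class RedundantPair:
    """Active/standby pair that switches over after every `switch_every` requests."""

    def __init__(self, first, second, switch_every):
        self.active, self.standby = first, second
        self.switch_every = switch_every
        self.handled = 0

    def handle(self, request):
        response = self.active.handle(request)
        self.handled += 1
        if self.handled % self.switch_every == 0:
            self.switchover()
        return response

    def switchover(self):
        # Swap roles, then restart the module that has just left service.
        self.active, self.standby = self.standby, self.active
        self.standby.restart()


pair = RedundantPair(Module("A"), Module("B"), switch_every=3)
print([pair.handle(i) for i in range(7)])
```

Because each module is restarted as soon as it leaves service, neither module in this sketch ever
runs for more than `switch_every` requests without a restart, and the two modules never share the
same history of events.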