Providing Open Architecture High Availability Solutions

Software reuse can be applied to help combat the significant influence of software failures in
large systems. Reused software has demonstrated continual improvement in reliability, a concept
known as reliability growth. Object-oriented software engineering practices further encourage
software reuse, and object-oriented software has also demonstrated significant reliability growth
when maintained and reused.

Fault Removal

Fault removal during development comprises three steps: verification, diagnosis, and correction.
Verification is the process of checking whether the system fulfills its intended function. If it
does not, diagnosis and subsequent correction are required to remove the fault and the failure
condition. Verification techniques vary but can generally be characterized as either static or
dynamic.

Static verification analyzes a system without executing it. This may be the application of design
inspections, data flow analysis, quantitative analysis, or proof-of-correctness analysis. In each of
these cases the system is verified (at least to some extent) without its execution or delivery of
service. Dynamic verification requires the system to execute its function(s), with the intent of
verifying the mapping of input stimuli to output responses. In most large, complex systems,
verifying all possible inputs and their corresponding outputs is an intractable problem.
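
The scale of the problem can be illustrated with a rough calculation. The sketch below is not
taken from any cited source: the function name, the 64-bit input width, and the rate of one
billion test cases per second are all assumptions chosen only to show how quickly exhaustive
dynamic verification becomes impractical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def exhaustive_test_years(input_bits: int, tests_per_second: float) -> float:
    """Years needed to apply every possible input pattern exactly once.

    Hypothetical helper for illustration; it ignores internal state,
    timing, and load, all of which enlarge the real input space.
    """
    total_inputs = 2 ** input_bits
    return total_inputs / tests_per_second / SECONDS_PER_YEAR


# Assumed example: a function of two 32-bit arguments (64 input bits),
# exercised at an assumed rate of one billion test cases per second.
years = exhaustive_test_years(input_bits=64, tests_per_second=1e9)
print(f"{years:.0f} years")  # roughly 585 years
```

Even under these generous assumptions a single small function cannot be tested exhaustively,
which is why dynamic verification in practice samples the input space.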

The emphasis of fault removal differs between hardware and software systems. In hardware systems
the emphasis is on the removal (more commonly called the repair) of potential production faults.
In software, the emphasis is on the removal of design faults.

When fault removal takes place after a system has been commissioned, it is called corrective
maintenance [Lapr92]. It takes one of two forms: preventative maintenance, which wards off faults
before they produce errors, or curative maintenance, which is aimed at removing faults reported
against the system.

Fault Tolerance

Fault tolerance is the attribute of a computing system that provides continuous service in the
presence of faults. How actively a system circumvents or otherwise compensates for activated
faults varies among fault tolerance techniques. In hardware, fault tolerance is often achieved
through the use of redundant elements in the system. In software, fault tolerance is generally
realized through design diversity: multiple implementations of the software serve the same
purpose. The implementations are often constructed by different development teams, designing to
the same specification but with alternate implementations. Two popular approaches are
N-versioning and the recovery block method.
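
The two approaches named above can be sketched in a few lines. This is a minimal illustration
under stated assumptions, not a reference implementation: the function names, the integer square
root example, and the deliberately faulty version are all hypothetical. The versions are pure
functions, so the state checkpoint and rollback that a recovery block normally performs before
trying the next alternate are omitted.

```python
import math
from collections import Counter


def n_version(versions, x):
    """N-versioning: run every version and return the majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))  # results must be hashable to be counted
        except Exception:
            pass  # a crashing version simply casts no vote
    tally = Counter(results).most_common(1)
    if not tally or tally[0][1] * 2 <= len(versions):
        raise RuntimeError("no majority among versions")
    return tally[0][0]


def recovery_block(alternates, acceptance_test, x):
    """Recovery block: try alternates in order until one passes the test."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # treat a crash like a failed acceptance test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


# Hypothetical diverse implementations of an integer square root.
def isqrt_a(n):
    return math.isqrt(n)


def isqrt_b(n):
    return int(n ** 0.5)


def isqrt_faulty(n):
    return n // 2  # stands in for a version with a design fault


def is_isqrt(n, r):
    return r * r <= n < (r + 1) * (r + 1)


voted = n_version([isqrt_a, isqrt_b, isqrt_faulty], 49)
recovered = recovery_block([isqrt_faulty, isqrt_a], is_isqrt, 49)
print(voted, recovered)  # 7 7
```

The design difference is visible in the sketch: N-versioning runs every version and relies on
agreement, while the recovery block runs alternates one at a time and relies on the quality of
the acceptance test.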

Redundancy may also be applied to both software and hardware systems, but it is typically less
effective in software systems. Redundancy in software can help preserve the continuity of services
that exhibit long degradation intervals. Because the degradation develops slowly, the underlying
bugs often pass through verification testing and make their way to the field. In these cases, the
faults usually manifest themselves only under certain timing and loading situations (which
inevitably occur at key customer sites) that are too stochastic to be reproducible. These bugs are
often referred to as Heisenbugs [Gray92]. The causes of Heisenbugs are seldom determined. Software
redundancy can help here because the long-term timing of events in the redundant software modules
will seldom be exactly the same. The differences in timing can be increased by periodically causing
a switchover between the modules and restarting the no-longer-active module.
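
The periodic switchover described above can be sketched as follows. The class names, the
request-count trigger, and the in-process "restart" are assumptions made for illustration; a
real system would more likely switch on a timer and restart a separate process or processor.

```python
class Module:
    """Hypothetical software module that accumulates state as it runs."""

    def __init__(self, name):
        self.name = name
        self.restarts = 0
        self.handled = 0  # stands in for state built up since the last restart

    def handle(self, request):
        self.handled += 1
        return f"{self.name}:{request}"

    def restart(self):
        self.restarts += 1
        self.handled = 0  # discard accumulated state


class RedundantPair:
    """Active/standby pair that switches over after a fixed number of requests."""

    def __init__(self, first, second, switch_every):
        self.active, self.standby = first, second
        self.switch_every = switch_every
        self._since_switch = 0

    def handle(self, request):
        reply = self.active.handle(request)
        self._since_switch += 1
        if self._since_switch >= self.switch_every:
            self.switchover()
        return reply

    def switchover(self):
        # Swap roles, then restart the module that is no longer active.
        self.active, self.standby = self.standby, self.active
        self.standby.restart()
        self._since_switch = 0


module_a, module_b = Module("A"), Module("B")
pair = RedundantPair(module_a, module_b, switch_every=3)
replies = [pair.handle(i) for i in range(7)]
print(replies)  # ['A:0', 'A:1', 'A:2', 'B:3', 'B:4', 'B:5', 'A:6']
```

Because each module restarts at a different point in the request stream, the two never hold the
same accumulated state, which is the timing divergence the text relies on.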