The trend of the ‘a’ line depicts the results of a hyper-exponential model [Lapr92] that is also
backed by typical field data. It shows that systems typically peak in their unavailability shortly
after deployment, then, through reliability growth (defect removal), eventually stabilize near their
expected reliability level (line ‘c’). The ‘b’ line represents the pessimistic reliability prior to defect
removal and reliability growth. The ‘c’ line represents the optimistic reliability after reliability
growth has taken place.
Modeling the reliability and availability of a system while taking reliability growth into account
is a relatively complex topic beyond the scope of this paper; however, it can be approximated by a
statistical estimation. Details can be found in Laprie’s ‘Dependability: Basic Concepts and
Terminology’ [Lapr92].
In that example, the availability of a system is given by the ratio of non-failed components at time t
to the total number of components in the set. Averaging this ratio over time gives an approximation
of the availability of the system from its time origin. The intervals, or quantization of t, over which
the availability is measured set the granularity of the observed availability. This estimation is a
passive appraisal of availability; that is, it is a simple way of expressing the observed availability
of a system once it is commissioned. An active model, like the one suggested by Laprie, offers the
ability to analytically estimate system availability prior to its commissioning [Lapr92]. While this
is only a model, it is one of the responsible ways in which system engineers can increase their
confidence and reduce the risks in designing and deploying highly-available systems.
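As a rough illustration of the passive estimate described above, the following sketch (written in C; the component counts, sample data, and names are hypothetical and not drawn from this paper) samples the fraction of non-failed components at fixed intervals and averages those ratios to approximate the observed availability.

#include <stdio.h>

#define NUM_COMPONENTS 8
#define NUM_SAMPLES    6

int main(void)
{
    /* Hypothetical counts of non-failed components at each sample time t. */
    int non_failed[NUM_SAMPLES] = { 8, 8, 7, 8, 8, 8 };

    double sum = 0.0;
    for (int t = 0; t < NUM_SAMPLES; t++)
        sum += (double)non_failed[t] / NUM_COMPONENTS;

    /* The time-averaged ratio approximates the observed availability. */
    printf("Estimated availability: %.4f\n", sum / NUM_SAMPLES);
    return 0;
}

The choice of sampling interval in such an estimate corresponds to the quantization of t discussed above: coarser intervals hide short outages, finer intervals capture them.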
3.5 Redundancy
Almost all hardware fault management techniques utilize redundancy, so it is helpful to examine it
further. Redundancy is generally classified into three basic types:
1. Spatial, which consumes space. This involves provisioning more resources than would otherwise
be needed to provide a service; the extra resources are used when a primary resource fails.
2. Temporal, which consumes time. An example of temporal redundancy is the ACK/NAK
method used by many protocols. A failure of reception causes the message to be repeated,
which is the redundancy, while the penalty is that of time.
3. Structural, sometimes called contextual. Here the redundancy requires somewhat less than full
duplication by making use of properties of the data. Examples are the Hamming codes used in
error correction, or the exclusive-OR commutative property used by RAID subsystems (a small
parity sketch follows this list).
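The following sketch illustrates the RAID-style use of exclusive-OR parity mentioned in item 3. The block sizes and data values are hypothetical; it is intended only to show how a lost block can be rebuilt from the parity block and the surviving blocks.

#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4
#define NUM_BLOCKS 3

/* XOR one block into another, the basic operation behind RAID parity. */
static void xor_into(unsigned char *dst, const unsigned char *src)
{
    for (int i = 0; i < BLOCK_SIZE; i++)
        dst[i] ^= src[i];
}

int main(void)
{
    unsigned char data[NUM_BLOCKS][BLOCK_SIZE] = {
        { 0x11, 0x22, 0x33, 0x44 },
        { 0xAA, 0xBB, 0xCC, 0xDD },
        { 0x01, 0x02, 0x03, 0x04 },
    };
    unsigned char parity[BLOCK_SIZE] = { 0 };

    /* Build the parity block as the XOR of all data blocks. */
    for (int b = 0; b < NUM_BLOCKS; b++)
        xor_into(parity, data[b]);

    /* Simulate losing block 1, then rebuild it from parity and survivors. */
    unsigned char rebuilt[BLOCK_SIZE];
    memcpy(rebuilt, parity, BLOCK_SIZE);
    xor_into(rebuilt, data[0]);
    xor_into(rebuilt, data[2]);

    printf("Block 1 rebuilt correctly: %s\n",
           memcmp(rebuilt, data[1], BLOCK_SIZE) == 0 ? "yes" : "no");
    return 0;
}

The parity block is the structural redundancy: it occupies one block rather than a full copy of the data, yet any single lost block can be recovered from it.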
To be effective, redundancy must be applied in such a way that a single fault cannot cause both
copies of a component to fail. This brings in the notion of a fault domain, which can be defined as
the locus or scope of a fault. Recognizing that faults have a defined locality or domain leads to a
necessary condition for fault tolerance: redundant components must be provisioned in different
fault domains.
To illustrate this, consider a system having a bus, in which a peripheral adaptor provides a service.
Simply replicating the adaptor on the bus is not enough to ensure continuity of service, since a bus
fault can cause both adaptors to be unreachable. The adaptors are replicated within the same fault
domain – that of the single bus. This is why bus-based highly-available systems contain at least two
buses, with redundant components allocated to each bus.
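As a minimal sketch of this rule (the types, names, and bus identifiers below are assumptions used only for illustration), a configuration check can verify that two redundant adaptors do not share a fault domain, here modeled as the bus each adaptor sits on.

#include <stdio.h>

struct adaptor {
    const char *name;
    int bus_id;   /* the fault domain this adaptor belongs to */
};

/* Redundancy is only effective if the pair spans different fault domains:
 * a single bus fault would take out both adaptors on the same bus. */
static int redundancy_is_effective(const struct adaptor *a,
                                   const struct adaptor *b)
{
    return a->bus_id != b->bus_id;
}

int main(void)
{
    struct adaptor primary   = { "adaptor-A", 0 };
    struct adaptor same_bus  = { "adaptor-B", 0 };  /* same fault domain */
    struct adaptor other_bus = { "adaptor-C", 1 };  /* separate fault domain */

    printf("A + B fault tolerant? %s\n",
           redundancy_is_effective(&primary, &same_bus) ? "yes" : "no");
    printf("A + C fault tolerant? %s\n",
           redundancy_is_effective(&primary, &other_bus) ? "yes" : "no");
    return 0;
}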