Providing Open Architecture High Availability Solutions
One or more primary components, with their redundant counterparts, act together to
provide a reliable service. A group of such components is defined as a service group. An example
is the set of power supplies in a chassis, provided that they are configured so that at
least one is redundant.
Spatial redundancy can be applied in a number of different ways as described in the next four
sections.
3.5.1 Classical Fault Tolerance
Redundant components process identical data, and a voting process compares their results to
confirm that they agree. Since the voter itself could be faulty, it must also be redundant, and
various schemes for this are possible.
Two such service-providing components, along with their associated voters, form a dual
modular redundant, or DMR, system. While DMR systems provide basic integrity checking,
in the event of a voting disagreement it can be difficult to determine the unit at fault. Accordingly,
it is more common to employ such components in groups of three, which makes possible the “odd
man out” process of fault diagnosis. Such triplicated components and their voters are known as
triple modular redundant, or TMR. Providing the hardware and voters in triplicate, however, has a
significant adverse impact on cost.
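The difference between DMR and TMR diagnosis can be illustrated with a minimal sketch of a majority voter. The function name `vote` and its return convention are assumptions made for this illustration only. A real voter would normally be implemented in hardware and would itself be replicated.

```python
from collections import Counter


def vote(outputs):
    """Compare replica outputs and return (agreed_value, suspects).

    agreed_value is None when no strict majority exists.
    suspects holds the indices of replicas that disagree with the majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        # No strict majority: the fault is detected but cannot be located.
        return None, []
    suspects = [i for i, out in enumerate(outputs) if out != value]
    return value, suspects


# DMR: the two replicas disagree, so neither can be singled out.
dmr_result = vote([42, 41])

# TMR: replica 1 is the odd man out, and the majority value remains usable.
tmr_result = vote([42, 41, 42])
```

With two replicas a disagreement yields no usable result and no suspect. With three, the same disagreement yields both the majority value and the index of the suspect unit.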
3.5.2 Standby, or Hot Sparing
A more basic method of providing redundancy is simply to add a further component to the
system. This acts as a standby component and is not used until the primary service provider fails. If
the primary service provider is disabled, the standby component becomes active, restoring the
service. Such simple duplication is frequently known as active/passive, or 1+1 redundancy.
Determining which components are paired, and which are active or standby, is known as role
assignment, and is a necessary task for such systems.
An extension of this is where a service is provided by a number of identical components such as
line adaptors. Here it is possible to over-provision the line adaptors and designate the excess as
standbys. The general case is referred to as N+M redundancy, where N is the number of primary
service providers, and M is the number of standbys.
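Role assignment and failover in an N+M arrangement might be sketched as follows. The class `ServiceGroup`, the function `assign_roles`, and the policy of promoting the first available standby are all hypothetical choices made for this illustration. They are not taken from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceGroup:
    active: list                                  # the N primary service providers
    standby: list = field(default_factory=list)   # the M idle standbys

    def fail(self, name):
        """Record the failure of a component.

        Returns the name of the standby promoted to active, or None if no
        promotion took place.
        """
        if name in self.standby:
            # A failed standby only reduces the remaining spare capacity.
            self.standby.remove(name)
            return None
        self.active.remove(name)
        if not self.standby:
            return None  # spares exhausted: the service runs below N providers
        promoted = self.standby.pop(0)
        self.active.append(promoted)
        return promoted


def assign_roles(components, n):
    """Role assignment: the first n components are active, the rest standby."""
    return ServiceGroup(active=list(components[:n]), standby=list(components[n:]))


# 2+1 redundancy: two active line adaptors and one standby.
group = assign_roles(["adaptor0", "adaptor1", "adaptor2"], 2)
promoted = group.fail("adaptor0")
```

In this example the failure of `adaptor0` promotes `adaptor2`, so two providers remain active. A second failure would leave the group with a single provider, since M = 1 standbys can cover only one fault.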
3.5.3 Load Sharing
Rather than having the redundant hardware be idle, it is more cost-effective to allow the back-up
hardware to share in providing service. This is usually more difficult to manage, since a function to
distribute the tasks between the service providers must be provided, and furthermore the service
performance may be seen to degrade when one of the providing components fails. Typically, a
minimum number K of providers is defined that will satisfy the service requirement, and the load
is then spread over the full set of N providers. This is designated K:N redundancy
(pronounced “K out of N”); the smallest such configuration is 1:2.
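The effect of a failure on a load-sharing group can be shown with a short sketch. The function `share_load` and its even-split distribution policy are assumptions for illustration. Real distribution functions are usually weighted or session-aware.

```python
def share_load(total_load, providers, failed, k):
    """Spread total_load evenly over the surviving providers.

    Returns a dict mapping each surviving provider to its share of the load.
    Raises RuntimeError when fewer than k providers survive, since the
    service requirement can then no longer be met.
    """
    survivors = [p for p in providers if p not in failed]
    if len(survivors) < k:
        raise RuntimeError("fewer than K providers remain")
    share = total_load / len(survivors)
    return {p: share for p in survivors}


providers = ["p0", "p1", "p2", "p3"]

# 2:4 redundancy with all providers healthy: each carries a quarter of the load.
healthy = share_load(120, providers, failed=set(), k=2)

# After one failure the same load is spread over three providers.
degraded = share_load(120, providers, failed={"p0"}, k=2)
```

Each survivor's share rises from 30 to 40 units after the failure, which is the performance degradation noted above. The service requirement is still met as long as at least K providers remain.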
3.5.4 Clustering
The above sections refer to components that provide service, and can be applied no matter what the
size, scope, or nature of the component. There is, however, a special case where the
service-providing component in question is itself a complete computer system, containing hardware,
operating system, and communications capabilities. Such a component is known as a node, and a
system constructed using such nodes is known as a cluster.