Providing Open Architecture High Availability Solutions

One or more primary components, together with their redundant counterparts, act together to provide a reliable service. A group of such components is defined as a service group, an example of which might be the power supplies in a chassis, assuming that they were configured such that at least one was redundant.

Spatial redundancy can be applied in a number of different ways, as described in the next four sections.

3.5.1 Classical Fault Tolerance

Redundant components process identical data, and a voting process compares their results to confirm that they agree. Since the voter itself could be faulty, it must also be redundant, and various schemes are possible.

Two such service-providing components, along with their associated voters, are known as a dual modular redundant system, or DMR. While DMR systems provide basic integrity checking, in the event of a voting disagreement it can be difficult to determine which unit is at fault. Accordingly, it is more common to employ such components in groups of three, which makes possible the “odd man out” process of fault diagnosis. Such triplicated components and their voters are known as triple modular redundant, or TMR. Providing the hardware and voters in triplicate, however, greatly increases cost.
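
The difference between DMR checking and TMR "odd man out" diagnosis can be illustrated with a minimal sketch. The function names `dmr_check` and `tmr_vote` are hypothetical and not taken from any specification, and the sketch assumes that component outputs can be compared for exact equality. It also models only a single voter, whereas a real system would make the voter redundant as well.

```python
from collections import Counter


def dmr_check(a, b):
    """DMR: detect a disagreement, without identifying the faulty unit."""
    return a == b


def tmr_vote(outputs):
    """TMR: return (majority value, index of the odd unit out, or None).

    Assumes exactly three hashable outputs, one per redundant component.
    """
    if len(outputs) != 3:
        raise ValueError("TMR expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None  # all units agree
    if count == 2:
        # The single unit that disagrees with the majority is the suspect.
        odd = next(i for i, out in enumerate(outputs) if out != value)
        return value, odd
    raise RuntimeError("no majority: all three units disagree")
```

With two units, `dmr_check(5, 7)` only reports that something is wrong. With three, `tmr_vote([5, 7, 5])` yields the agreed value and points at the second unit as the likely fault.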

3.5.2 Standby, or Hot Sparing

A more basic method of providing redundancy is simply to add a spare component to the system. This acts as a standby component and is not used until the primary service provider fails. If the primary service provider is disabled, the standby component becomes active, restoring the service. Such simple duplication is frequently known as active/passive, or 1+1 redundancy.

Determining which components are paired, and which are active or standby, is known as role assignment, and is a necessary task for such systems.

An extension of this is where a service is provided by a number of identical components such as line adaptors. Here it is possible to over-provision the line adaptors and designate the excess as standbys. The general case is referred to as N+M redundancy, where N is the number of primary service providers, and M is the number of standbys.
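
As a rough illustration of N+M redundancy and role assignment, the sketch below tracks which components are active and which are standby, and promotes a standby when an active component fails. The class `ServiceGroup` and its method `fail` are hypothetical names chosen for this example, and the policy of promoting the first available standby is an assumption. A real system would also need failure detection and state transfer.

```python
class ServiceGroup:
    """Tracks roles for N primary components and M standbys."""

    def __init__(self, primaries, standbys):
        self.active = list(primaries)   # the N service providers
        self.standby = list(standbys)   # the M idle spares
        self.failed = []

    def fail(self, component):
        """Record the failure of an active component.

        Returns the standby promoted in its place, or None if no
        standby remains and the service is now under-provisioned.
        """
        self.active.remove(component)
        self.failed.append(component)
        if not self.standby:
            return None
        replacement = self.standby.pop(0)
        self.active.append(replacement)
        return replacement
```

For a 2+1 group built as `ServiceGroup(["a1", "a2"], ["s1"])`, the first failure promotes `s1`; a second failure finds no standby left, so the group continues with a single active component.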

3.5.3 Load Sharing

Rather than leaving the redundant hardware idle, it is more cost-effective to allow the back-up hardware to share in providing service. This is usually more difficult to manage, since a function to distribute the tasks between the service providers must be provided, and furthermore the service performance may be seen to degrade when one of the providing components fails. Typically, a minimum set K of providers is defined which will satisfy the service requirement, and the load is then spread over the full set N of providers. This is designated as K:N redundancy (pronounced K out of N), and its smallest configuration is 1:2.
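
The distribution function mentioned above can be sketched as follows. This is only an illustration: the name `share_load` is hypothetical, round-robin is one assumed distribution policy among many, and the check against K simply reports that the service requirement can no longer be met.

```python
def share_load(tasks, providers, k):
    """Spread tasks round-robin over the healthy providers.

    `providers` lists the components currently in service (at most N);
    `k` is the minimum number needed to satisfy the service requirement.
    """
    if len(providers) < k:
        raise RuntimeError("fewer than K providers: service requirement not met")
    assignment = {p: [] for p in providers}
    for i, task in enumerate(tasks):
        # Each task goes to the next provider in turn.
        assignment[providers[i % len(providers)]].append(task)
    return assignment
```

In a 2:3 arrangement, six tasks spread over three providers give each provider two tasks. If one provider fails, the same six tasks fall on the remaining two, three apiece, which is the performance degradation described above.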

3.5.4 Clustering

The above sections refer to components that provide service, and can be applied no matter what the size, scope, or nature of the component. There is, however, a special case where the service-providing component in question is itself a complete computer system, containing hardware, operating system, and communications capabilities. Such a component is known as a node, and a system constructed using such nodes is known as a cluster.