Providing Open Architecture High Availability Solutions

8.0 Layer-Specific Capabilities – Hardware

High availability hardware system architectures are created by combining fault domains into service groups in such a way that the system can continue to operate even when any particular fault domain is out of service. A wide variety of fault domain configurations are possible. Roughly speaking, high availability system architectures fall on a spectrum based on the granularity and complexity of the fault domain model. At the two ends of this spectrum are:

Clustering. A fault domain consists of an entire computer, complete with CPU, memory, I/O controllers, I/O devices, power conversion and distribution systems, cooling systems, etc. Multiples of these computers (often called nodes) are then used as redundant fault domains.

Hardware Fault Tolerance. A single computer is made up of multiple, redundant fault domains. The hardware design is such that the computer continues to provide full service even if any one of its constituent fault domains fails.

Today, many high availability computer systems fall between these end points. These systems contain some fault domains that look and operate much like nodes in a clustering system, and other fault domains that are managed in a fault-tolerant mode. An example of such a system would be two processing units, each complete with CPU, I/O controllers, and backplane, housed in a single enclosure with common power supplies and fans, and connected to a single, shared RAID disk subsystem.
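
The idea of combining fault domains into service groups can be sketched in a few lines of Python. This is an illustrative model only: the `ServiceGroup` class, the `single_points_of_failure` function, and every fault domain name and count below are assumptions made for the example, loosely patterned on the hybrid system just described, and are not taken from any particular product or specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceGroup:
    """Hypothetical model: a set of fault domains that back each other up."""
    name: str
    members: frozenset      # names of the fault domains in this group
    min_in_service: int     # members that must be working to deliver service


def single_points_of_failure(groups):
    """Return fault domains whose loss alone drops a group below its minimum."""
    domains = sorted(set().union(*(g.members for g in groups)))
    return [
        d for d in domains
        if any(d in g.members and len(g.members) - 1 < g.min_in_service
               for g in groups)
    ]


# Assumed layout: two processing units, two power supplies, two fans,
# and a three-disk RAID set that needs two disks to stay in service.
hybrid = [
    ServiceGroup("processing", frozenset({"cpu-unit-a", "cpu-unit-b"}), 1),
    ServiceGroup("power", frozenset({"psu-1", "psu-2"}), 1),
    ServiceGroup("cooling", frozenset({"fan-1", "fan-2"}), 1),
    ServiceGroup("storage", frozenset({"disk-1", "disk-2", "disk-3"}), 2),
]

print(single_points_of_failure(hybrid))  # prints []
```

An empty result means that, under the assumed counts, the system keeps every service group above its minimum when any one fault domain is out of service. Adding a group with a single member would cause that member to be reported.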

The required capabilities of hardware components depend on how the overall system is partitioned into fault domains. At the highest level, hardware capabilities of fault-managed systems can be categorized into three sets:

• Capabilities to allow continued processing after failure of a fault domain using redundant fault domains

• Capabilities to allow highly reliable (often redundant) communication among fault domains

• Capabilities to allow management of the fault domains, including fault detection, diagnosis, isolation, recovery, and repair, at least to the fault domain level of granularity
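
The management activities named in the last item can be pictured as stages that a failed fault domain passes through. The sketch below is an assumption-laden simplification: the `FaultStage` enum and `next_stage` helper are hypothetical names, and the strictly linear ordering is chosen only for illustration, since a real system may skip, repeat, or overlap stages.

```python
from enum import Enum


class FaultStage(Enum):
    """Hypothetical stages, in the order the text lists them."""
    DETECTION = 1
    DIAGNOSIS = 2
    ISOLATION = 3
    RECOVERY = 4
    REPAIR = 5


def next_stage(stage):
    """Return the stage that follows `stage`, or None after REPAIR."""
    stages = list(FaultStage)          # Enum iterates in definition order
    index = stages.index(stage)
    return stages[index + 1] if index + 1 < len(stages) else None


# Walk one fault through every stage.
stage = FaultStage.DETECTION
while stage is not None:
    print(stage.name)
    stage = next_stage(stage)
```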

8.1 Redundancy

The most fundamental capability provided by the hardware of high availability fault-managed systems is the provision of redundant fault domains. In any high availability system, the fault domains must first be identified. Then, for each fault domain, the redundancy required to keep providing that fault domain's services when it is not functional must also be identified.

As an example, in clustered systems, fault domains are complete computer nodes. Redundancy is provided by including at least enough nodes to support the minimum required performance of the system when any one of the nodes is out of service.
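
The sizing rule for a cluster can be made concrete with a short calculation. The following is a minimal sketch, assuming that performance can be expressed as a single capacity number per node; the function names `nodes_required` and `survives_any_single_failure` and the example figures are hypothetical.

```python
import math


def nodes_required(required_capacity, node_capacity, spares=1):
    """Nodes to provision so the requirement holds with `spares` nodes down.

    Assumes identical nodes whose capacities simply add up.
    """
    if required_capacity <= 0 or node_capacity <= 0:
        raise ValueError("capacities must be positive")
    if spares < 0:
        raise ValueError("spares must not be negative")
    working = math.ceil(required_capacity / node_capacity)
    return working + spares


def survives_any_single_failure(node_capacities, required_capacity):
    """Check a (possibly mixed) cluster against loss of its largest node."""
    if not node_capacities:
        return False
    remaining = sum(node_capacities) - max(node_capacities)
    return remaining >= required_capacity


# Example: 1000 units of work, 300 units per node.
print(nodes_required(1000, 300))                          # prints 5
print(survives_any_single_failure([300] * 5, 1000))       # prints True
print(survives_any_single_failure([300] * 4, 1000))       # prints False
```

Checking against the loss of the largest node is the worst case for a single failure, which is why the second function subtracts `max(node_capacities)` rather than an average.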

For systems that contain fault domains smaller than a complete computer node, each fault domain must be made redundant, at least in an N+1 mode, so that whatever services a particular fault domain provides to the system will still be provided when that fault domain is out of service. Typical subsystems that make up fault domains, and thus are provisioned redundantly and organized into service groups, include:

• Processing subsystems

• I/O controllers