Providing Open Architecture High Availability Solutions
8.0 Layer-Specific Capabilities – Hardware
High availability hardware system architectures are created by combining fault domains into
service groups in such a way that the system can continue to operate even when any particular fault
domain is out of service. A wide variety of fault domain configurations are possible. Roughly
speaking, high availability system architectures fall on a spectrum based on the granularity and
complexity of the fault domain model. At the two ends of this spectrum are:
• Clustering. A fault domain consists of an entire computer, complete with CPU, memory, I/O
controllers, I/O devices, power conversion and distribution systems, cooling systems, etc.
Multiples of these computers (often called nodes) are then used as redundant fault domains.
• Hardware Fault Tolerance. A single computer is made up of multiple, redundant fault domains.
The hardware design is such that the computer continues to provide full service even if any of its
constituent fault domains fail.
Today, many high availability computer systems fall between these end points. These systems
contain some fault domains which look and operate much like nodes in a clustering system, but
have other fault domains that are managed in a fault-tolerant mode. An example of such a system
would be two processing units, complete with CPUs, I/O controllers, and backplanes, housed
in a single enclosure with common power supplies and fans and connected to a single,
shared RAID disk subsystem.
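The relationship between fault domains and service groups described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any specification: the ServiceGroup type, the single_points_of_failure function, the component names, and the member counts are all assumptions chosen to resemble the hybrid example (two processing units, shared power supplies and fans, and a RAID subsystem modeled here as a mirrored pair of disks).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceGroup:
    """A set of redundant fault domains that together provide one service."""
    name: str
    fault_domains: tuple   # names of the member fault domains (hypothetical)
    min_in_service: int    # members needed for the group to keep providing service


def single_points_of_failure(groups):
    """Return the fault domains whose loss alone would take a service group down."""
    weak = set()
    for group in groups:
        # With any one member out of service, one fewer member remains.
        remaining = len(group.fault_domains) - 1
        if remaining < group.min_in_service:
            weak.update(group.fault_domains)
    return sorted(weak)


# Hypothetical hybrid system; every name and count below is an assumption.
hybrid = [
    ServiceGroup("processing", ("cpu-unit-a", "cpu-unit-b"), 1),
    ServiceGroup("power", ("psu-1", "psu-2"), 1),
    ServiceGroup("cooling", ("fan-1", "fan-2", "fan-3"), 2),
    ServiceGroup("storage", ("raid-disk-1", "raid-disk-2"), 1),
]

# The same system with a single, non-redundant fan.
one_fan = hybrid[:2] + [ServiceGroup("cooling", ("fan-1",), 1)] + hybrid[3:]

print(single_points_of_failure(hybrid))
print(single_points_of_failure(one_fan))
```

An empty result means the system can continue to operate with any one fault domain out of service. A non-empty result names the fault domains that still need a redundant partner.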
The required capabilities of hardware components are dependent on the partitioning of the overall
system into fault domains. At the highest level, hardware capabilities of fault-managed systems can
be categorized into three sets:
• Capabilities to allow continued processing after failure of a fault domain using redundant fault
domains
• Capabilities to allow highly reliable (often redundant) communication among fault domains
• Capabilities to allow management of the fault domains, including fault detection, diagnosis,
isolation, recovery, and repair, at least to the fault domain level of granularity
8.1 Redundancy
The most fundamental capability of the hardware in high availability fault-managed systems is
the provision of redundant fault domains. In any high availability system, the fault domains
must first be identified. Then, for each fault domain, the redundancy must be identified that
will allow the services of that fault domain to continue when it is not functional.
As an example, in clustered systems, fault domains are complete computer nodes. Redundancy is
provided by including at least enough nodes to support the minimum required performance of the
system when any one of the nodes is out of service.
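The cluster sizing rule above reduces to simple arithmetic. The sketch below is illustrative only: the function name nodes_required, its parameters, and the capacity figures are assumptions, and it treats all nodes as identical with capacity that adds linearly, which real clusters only approximate.

```python
import math


def nodes_required(min_required_capacity, capacity_per_node, tolerated_failures=1):
    """Nodes needed to meet the minimum capacity with some nodes out of service.

    Assumes identical nodes whose capacities add linearly (a simplification).
    """
    if capacity_per_node <= 0:
        raise ValueError("capacity_per_node must be positive")
    if min_required_capacity < 0 or tolerated_failures < 0:
        raise ValueError("capacity and failure count must be non-negative")
    # Nodes needed to carry the load when everything is healthy.
    working = math.ceil(min_required_capacity / capacity_per_node)
    # Add one spare per node failure the system must ride through.
    return working + tolerated_failures


# Hypothetical figures: 1000 transactions/s required, 300 per node.
print(nodes_required(1000, 300))
```

With these figures, four nodes carry the load and a fifth covers the loss of any one node. Setting tolerated_failures to 2 would size the cluster to survive two simultaneous node outages.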
For systems that contain fault domains smaller than a complete computer node, each fault domain
must be made redundant, at least in an N+1 mode, so that whatever services a particular fault
domain provides to the system are still provided when that fault domain is out of service.
Typical subsystems that make up fault domains, and thus are provisioned redundantly and
organized into service groups, include:
• Processing subsystems
• I/O controllers