Providing Open Architecture High Availability Solutions
Because some communication links are difficult to terminate at multiple points in a system, and
because redundancy in the external communication paths is desirable for its own sake, high
availability systems are often designed with redundant external communication links, each of
which is logically part of the fault domain that includes its termination point in the system.
8.3 Platform Management
As described above, high availability systems include redundant hardware, and this redundancy
can be described as a set of service groups, with each service group consisting of redundant fault
domains. The hardware of high availability systems can be implemented in a wide variety of ways,
depending on system requirements. Regardless of how fault domains and service groups are
designed, a common requirement of any system that includes redundant hardware is some level of
manageability of the hardware platform. This capability is called platform management. The
function of platform management is to provide for monitoring and control of the hardware
components. The minimum goal of platform management in a high availability system is to
provide monitoring and control required for the detection, diagnosis, isolation, recovery, and repair
of fault domains. Beyond that, additional platform management capabilities aimed at predicting
and preventing fault domain failures may be included.
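As an illustration only, the relationship between service groups and fault domains described above can be sketched as a small data model. The class names, the state values, and the rule that a service group remains available while at least one of its fault domains is operational are assumptions made for this sketch, not definitions taken from this document.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    # Hypothetical states; a real platform may track more of them.
    OPERATIONAL = "operational"
    FAILED = "failed"


@dataclass
class FaultDomain:
    name: str
    state: State = State.OPERATIONAL


@dataclass
class ServiceGroup:
    name: str
    domains: List[FaultDomain] = field(default_factory=list)

    def available(self) -> bool:
        # Assumption: the group delivers its service while any member works.
        return any(d.state is State.OPERATIONAL for d in self.domains)

    def needs_repair(self) -> List[str]:
        # Names of the fault domains that platform management should flag.
        return [d.name for d in self.domains if d.state is State.FAILED]


power = ServiceGroup("power", [FaultDomain("psu-a"), FaultDomain("psu-b")])
power.domains[0].state = State.FAILED  # simulate one supply failing
```

In this sketch the group stays available after the single failure, while needs_repair() reports which fault domain requires attention.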
Platform management activity generally requires communication of management data between
management software and hardware components, or among hardware components themselves.
This communication can be provided in-band (i.e., using the same data paths used for the primary
operation of the system), or out-of-band (i.e., using dedicated platform management data paths).
Often, out-of-band communication is used in high availability systems, because the management
data path can exist in a separate fault domain from the primary system data paths. This is important
because the management function is often most critically needed after a fault has occurred, in order to
initiate system recovery. When a fault occurs, primary data paths may well be unusable, while separate
management data paths can still function.
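The value of placing the management path in its own fault domain can be shown with a minimal sketch. The path names and fault domain names below ("primary-fabric", "mgmt-bus") are hypothetical and chosen only for illustration.

```python
def reachable_paths(paths, failed_domains):
    """Return the management paths whose fault domain has not failed."""
    return [name for name, domain in paths.items()
            if domain not in failed_domains]


# Assumed layout: in-band management traffic shares the fault domain of the
# primary data path, while out-of-band traffic uses a dedicated one.
paths = {"in-band": "primary-fabric", "out-of-band": "mgmt-bus"}

# A fault takes down the primary data path's fault domain.
after_fault = reachable_paths(paths, failed_domains={"primary-fabric"})
```

After the simulated fault, only the out-of-band path remains usable, which is the situation in which recovery actions most need a working management channel.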
The minimum required platform management functionality may be delivered in a variety of ways;
however, a subsidiary goal of defining standard interfaces leads to additional desirable capabilities
of the platform management subsystem.
8.3.1 Fault Domain Failure Detection
However fault domains are constructed, a critical hardware capability is the detection of fault
domain failures and the communication of those failures. A failure may be communicated out-of-band
through a management data channel or in-band via unambiguous observable behavior (or
non-behavior). A primary example of the latter is fail-safe behavior, where a fault domain contains a
self-checking capability which causes it to promptly shut down when a fault is detected. The
resulting shutdown is then observed by other parts of the system.
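One possible way for other parts of the system to observe a fail-safe shutdown is to watch for silence. The sketch below assumes that each fault domain emits a periodic heartbeat and that a fixed timeout separates "alive" from "shut down"; both the heartbeat mechanism and the names used are assumptions for illustration, not something this document prescribes.

```python
class SilenceDetector:
    """Treat a fault domain as failed once it has been silent too long."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # fault domain name -> time of last heartbeat

    def heartbeat(self, domain, now):
        # Record that the domain showed observable behavior at time `now`.
        self.last_seen[domain] = now

    def failed(self, now):
        # Non-behavior: no heartbeat within the timeout window.
        return sorted(d for d, t in self.last_seen.items()
                      if now - t > self.timeout_s)


detector = SilenceDetector(timeout_s=2.0)
detector.heartbeat("cpu-a", now=0.0)
detector.heartbeat("cpu-b", now=0.0)
detector.heartbeat("cpu-b", now=2.5)  # cpu-a has shut itself down and is silent
silent = detector.failed(now=3.0)
```

Timestamps are passed in explicitly so the example is deterministic; a real monitor would read a clock instead.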
Beyond the immediate communication required for fault diagnosis and isolation, hardware fault
domain failures must also be communicated to appropriate subsystems (or people) in order to
trigger recovery and repair actions. For example, consider a failed current-sharing power supply:
the immediate fault detection and isolation are carried out by the current-sharing circuitry and output
diodes. This action is effectively transparent to the rest of the system, so another means of
communicating the failure must exist to trigger repair of the failed supply. This may be done
by de-asserting a ‘power supply OK’ signal or by generating a management event.
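The power supply example can be sketched as edge detection on a sampled ‘power supply OK’ signal, with each asserted-to-de-asserted transition turned into a management event. The sample format, the event fields, and the function name are hypothetical and serve only to illustrate the idea.

```python
def power_ok_events(samples):
    """Return one event per asserted -> de-asserted transition of the signal.

    `samples` is a sequence of (supply_name, ok) pairs in time order.
    """
    previous = {}  # supply name -> last sampled value of the OK signal
    events = []
    for supply, ok in samples:
        # Assumption: a supply is treated as OK before its first sample,
        # so an initial de-asserted reading also raises an event.
        if previous.get(supply, True) and not ok:
            events.append({"source": supply,
                           "event": "power_supply_failed",
                           "action": "repair"})
        previous[supply] = ok
    return events


samples = [("psu-a", True), ("psu-b", True), ("psu-a", False), ("psu-a", False)]
events = power_ok_events(samples)
```

Reporting only the transition, rather than every de-asserted sample, keeps a persistently failed supply from flooding the management channel with duplicate events.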