Providing Open Architecture High Availability Solutions

Because some communication links are difficult to terminate at multiple points in a system, and

because redundancy in the external communication paths is desirable for its own sake, high

availability systems are often designed with redundant external communication links, each of

which is logically part of the fault domain that includes its termination point in the system.

8.3 Platform Management

As described above, high availability systems include redundant hardware, and this redundancy

can be described as a set of service groups, with each service group consisting of redundant fault

domains. The hardware of high availability systems can be implemented in a wide variety of ways,

depending on system requirements. No matter how fault domains and service groups are

designed, a common requirement of any system that includes redundant hardware is some level of

manageability of the hardware platform. This capability is called platform management. The

function of platform management is to provide for monitoring and control of the hardware

components. The minimum goal of platform management in a high availability system is to

provide monitoring and control required for the detection, diagnosis, isolation, recovery, and repair

of fault domains. Beyond that, additional platform management capabilities aimed at predicting

and preventing fault domain failures may be included.

Platform management activity generally requires communication of management data between

management software and hardware components, or among hardware components themselves.

This communication can be provided in-band (i.e., using the same data paths used for the primary

operation of the system), or out-of-band (i.e., using dedicated platform management data paths).

Often, out-of-band communication is used in high availability systems, because the management

data path can exist in a separate fault domain from the primary system data paths. This separation

matters because the management function is often most critical after a fault has occurred, when it

must drive system recovery. After a fault, primary data paths may well be unusable, while separate

management data paths can still function.
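
The path-selection reasoning above can be illustrated with a minimal Python sketch. It is not taken from any platform management standard: the names DataPath and pick_management_path, and the fault domain labels, are hypothetical and chosen only for illustration. The sketch assumes each data path is tagged with the fault domain it belongs to, and that management traffic uses the first path whose fault domain has not failed.

```python
from dataclasses import dataclass


@dataclass
class DataPath:
    """A communication path tagged with the fault domain it belongs to (hypothetical model)."""
    name: str
    fault_domain: str


def pick_management_path(paths, failed_domains):
    """Return the first path whose fault domain has not failed, or None."""
    for path in paths:
        if path.fault_domain not in failed_domains:
            return path
    return None


# Illustrative labels only: the in-band path shares the primary fabric's
# fault domain, while the out-of-band path sits in its own fault domain.
in_band = DataPath("in-band", fault_domain="primary-fabric")
out_of_band = DataPath("out-of-band", fault_domain="mgmt-bus")

# After a fault in the primary fabric, only the out-of-band path remains usable.
chosen = pick_management_path([in_band, out_of_band], {"primary-fabric"})
print(chosen.name)  # out-of-band
```

If only the in-band path existed, the same call would return None, which is the situation a dedicated management data path is meant to avoid.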

The minimum required platform management functionality may be delivered in a variety of ways;

however, a subsidiary goal of defining standard interfaces leads to additional desirable capabilities

of the platform management subsystem.

8.3.1 Fault Domain Failure Detection

However fault domains are constructed, a critical hardware capability is the detection of failures of

fault domains and communication of those failures. A failure may be communicated out-of-band

through a management data channel or in-band via unambiguous observable behavior (or non-

behavior). A primary example of the latter is fail-safe behavior, where a fault domain contains a

self-checking capability which causes it to promptly shut down when a fault is detected. The

resulting shutdown is then observed by other parts of the system.
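
One way other parts of the system might observe such a shutdown is by noticing that the fault domain has gone silent. The Python sketch below is a hedged illustration under that assumption: it supposes that each fault domain reports a periodic heartbeat and that silence beyond a timeout is treated as a failure. The class SilenceDetector, its methods, and the domain names are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
class SilenceDetector:
    """Treats a fault domain as failed when it has been silent too long (hypothetical sketch)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # fault domain name -> time of last heartbeat

    def heartbeat(self, domain, now):
        """Record that a domain showed signs of life at time `now`."""
        self.last_seen[domain] = now

    def failed_domains(self, now):
        """Return the domains whose last heartbeat is older than the timeout."""
        return sorted(
            domain
            for domain, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


detector = SilenceDetector(timeout_s=2.0)
detector.heartbeat("cpu-a", now=0.0)
detector.heartbeat("cpu-b", now=0.0)

# cpu-b detects an internal fault and shuts down; only cpu-a keeps reporting.
detector.heartbeat("cpu-a", now=2.0)

print(detector.failed_domains(now=3.0))  # ['cpu-b']
```

The detection here relies on the fail-safe property itself: a domain that keeps running after a fault and keeps reporting would not be caught by this kind of observer.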

Beyond the immediate communication required for fault diagnosis and isolation, hardware fault

domain failures must also be communicated to appropriate subsystems (or people) in order to

trigger recovery and repair actions. For example, consider a failed current-sharing power supply:

the immediate fault detection and isolation are carried out by current sharing circuitry and output

diodes. This action is effectively transparent to the rest of the system. Thus, another means of

communicating the failure must exist to trigger an action to repair the failed supply. This may be

done by de-asserting a ‘power supply OK’ signal or by generating a management event.
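
As a rough illustration of the power supply example, the Python sketch below turns a de-asserted ‘power supply OK’ signal into a management event. It assumes the signal is sampled periodically as a boolean and that one event per OK to not-OK transition is enough to trigger repair. The names ManagementEvent and PowerSupplyMonitor, and the supply identifier, are hypothetical and do not come from any real platform management interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManagementEvent:
    """A notification sent to management software or an operator (hypothetical)."""
    source: str
    description: str


class PowerSupplyMonitor:
    """Watches a sampled 'power supply OK' signal and reports when it is de-asserted."""

    def __init__(self, supply_id):
        self.supply_id = supply_id
        self.last_ok = True  # assume the supply is healthy at start

    def sample(self, ok_signal):
        """Return an event on an OK -> not-OK transition, otherwise None."""
        event = None
        if self.last_ok and not ok_signal:
            event = ManagementEvent(
                self.supply_id, "power supply OK de-asserted; repair needed"
            )
        self.last_ok = ok_signal
        return event


monitor = PowerSupplyMonitor("psu-1")
# The signal drops on the second sample and stays low on the third.
events = [monitor.sample(ok) for ok in (True, False, False, True)]
reported = [e for e in events if e is not None]
print(len(reported))  # 1
```

Reporting only on the transition keeps a supply that stays failed from generating a stream of duplicate repair requests.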