Providing Open Architecture High Availability Solutions

When a system contains fault domains that are effectively in a standby mode, there is a need for

detection of latent faults in these domains. That is, if the primary failure detection mechanism is

observation of normal operating behavior, the hardware may need to provide a separate mechanism

for detection of faults in fault domains which are not normally operating.

8.3.2 Fault Domain Diagnosis

In cases where a failure of a fault domain is detectable, but it is not immediately evident which of

several redundant fault domains has failed, hardware must support a diagnostic function which will

determine where the actual failure exists. For example, consider a system with multiple fans, each

of which makes up a separate fault domain, because the failure of any one fan can be compensated

by increasing the speed of the others. If a fan failure is detected indirectly (e.g., by a temperature or

airflow detector), additional diagnosis may be required to determine which fan has failed so that the

proper isolation, recovery, and repair actions may be initiated.
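
The fan example can be sketched as a diagnosis pass that runs once an indirect detector has raised an alarm. This is a minimal illustration only: the per-fan tachometer readers, the fan names, and the minimum-speed threshold are assumptions made for the sketch and are not part of any particular platform interface.

```python
from typing import Callable, Dict, List


def diagnose_failed_fans(
    tach_readers: Dict[str, Callable[[], float]],
    min_rpm: float,
) -> List[str]:
    """Return the IDs of fans whose measured speed is below min_rpm.

    tach_readers maps a fan ID to a function returning its speed in RPM
    (a hypothetical per-fan tachometer interface).
    """
    failed = []
    for fan_id, read_rpm in tach_readers.items():
        if read_rpm() < min_rpm:
            failed.append(fan_id)
    return failed


# Hypothetical readings: fan "B" has stalled.
readers = {"A": lambda: 4200.0, "B": lambda: 0.0, "C": lambda: 4150.0}
suspects = diagnose_failed_fans(readers, min_rpm=1000.0)
```

Here the temperature or airflow alarm only establishes that some fan has failed; the per-fan query is what maps the failure to a specific fault domain.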

Diagnosis of failures may occur through diagnosis capabilities of the hardware components

themselves, or through capabilities of management software that can analyze a variety of

indications and/or initiate various actions in an attempt to determine which hardware component

has failed. A diagnosis capability in the hardware itself basically means that it is able to answer the

question, “Are you okay?” If the hardware cannot directly answer this question, then it must at

least provide the capabilities that management software needs to be able to diagnose the failure.

In a system with nested fault domains (e.g., Figure 14), diagnosis may aid in converting a detected

failure of the larger fault domain to a failure of the nested fault domain. For example, in Figure 14,

if one of the I/O controllers on a shared bus failed in a way which caused the bus to hang, the fault

domain associated with the bus segment would fail, which is what would be detected. It would be

highly desirable in this case to be able to do a diagnosis of the individual I/O controllers, determine

which one is failing, and isolate it from the bus. This would result in removing the fault from the

larger fault domain, and restoring the functionality of all the other I/O controllers on the bus

segment.
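
One way management software might perform this kind of diagnosis is to isolate the controllers one at a time and check whether the bus recovers. The sketch below assumes that a single controller is at fault and that the platform offers per-controller isolate and reconnect operations plus a bus health check; all of these names are hypothetical.

```python
from typing import Callable, Iterable, Optional


def find_hanging_controller(
    controllers: Iterable[str],
    isolate: Callable[[str], None],
    reconnect: Callable[[str], None],
    bus_is_healthy: Callable[[], bool],
) -> Optional[str]:
    """Isolate controllers one at a time until the bus recovers.

    Returns the controller whose removal cleared the hang (it is left
    isolated), or None if no single removal restored the bus.
    """
    for ctrl in controllers:
        isolate(ctrl)
        if bus_is_healthy():
            return ctrl
        reconnect(ctrl)  # not the culprit; put it back on the bus
    return None


# Simulated bus segment (hypothetical): "io2" hangs the bus while attached.
attached = {"io1", "io2", "io3"}
culprit = find_hanging_controller(
    ["io1", "io2", "io3"],
    isolate=attached.discard,
    reconnect=attached.add,
    bus_is_healthy=lambda: "io2" not in attached,
)
```

After the call, the failing controller stays isolated and the remaining controllers are back on the bus, so the failure has been narrowed from the bus segment to the nested fault domain.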

Additional diagnosis capabilities can also help to repair a fault domain more

quickly. For example, if an I/O controller fails, after the system isolates and recovers from the

failure, it would be desirable to have the platform management system order the controller to run a

self-test to determine if the failure was transient or permanent. If transient, then the I/O controller

could be immediately reintegrated. If permanent, then a technician would have to be dispatched to

replace the board.
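
The transient-versus-permanent decision can be sketched as follows. The self-test callback and the verdict names are hypothetical, and repeating the self-test several times before declaring the failure transient is an assumption of this sketch, not something the text above requires.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    TRANSIENT = "reintegrate"  # controller can return to service
    PERMANENT = "replace"      # technician must replace the board


def classify_failure(
    run_self_test: Callable[[], bool],
    attempts: int = 3,
) -> Verdict:
    """Classify a failure as transient only if every self-test run passes."""
    for _ in range(attempts):
        if not run_self_test():
            return Verdict.PERMANENT
    return Verdict.TRANSIENT


# Hypothetical controllers: one passes its self-test, one does not.
healthy_again = classify_failure(lambda: True)
still_broken = classify_failure(lambda: False)
```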

8.3.3 Fault Domain Isolation

The ability to isolate fault domains from the system is critical to the design of a fault-managed

system. Isolation means taking whatever action is needed to prevent a fault from affecting other

fault domains.

Often, this capability is an integral part of the design of a hardware component. For example,

power supplies include circuits that monitor output voltages being produced, and quickly shut the

supply off if voltages venture out of specifications. In other cases, another part of the system may

need to initiate an action to isolate a fault domain. Even when isolation is automatic by design, it

can be highly desirable, as a further safety feature, for any fault domain to be able to isolate itself

from the rest of the system upon request from a platform management system.
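
The two isolation paths described above, automatic and on request, can be illustrated with a small model of a power supply. The class, its method names, and the tolerance value are assumptions chosen for the sketch; real supplies implement the automatic path in hardware.

```python
class PowerSupply:
    """Toy model of a supply that can isolate itself or be isolated on request."""

    def __init__(self, nominal_v: float, tolerance: float) -> None:
        self.nominal_v = nominal_v
        self.tolerance = tolerance  # allowed deviation as a fraction of nominal
        self.enabled = True

    def sample(self, measured_v: float) -> bool:
        """Automatic path: shut off if the output is out of specification."""
        if abs(measured_v - self.nominal_v) > self.tolerance * self.nominal_v:
            self.enabled = False
        return self.enabled

    def isolate(self) -> None:
        """Requested path: the platform management system orders a shutdown."""
        self.enabled = False


psu = PowerSupply(nominal_v=12.0, tolerance=0.05)
in_spec = psu.sample(12.2)    # within 5% of nominal, stays on
after_fault = psu.sample(13.0)  # more than 5% high, shuts itself off

standby = PowerSupply(nominal_v=12.0, tolerance=0.05)
standby.isolate()             # isolated by request, no fault observed
```

Once either path has disabled the supply, it stays off in this model; reintegration would be a separate, deliberate action.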

Because of the importance of domain isolation, and because of the difficulty in guaranteeing the

behavior of a fault domain which is already misbehaving, there may be multiple levels of fault

isolation capability in systems. For example, if a host processor is not responding, it may be