Providing Open Architecture High Availability Solutions
When a system contains fault domains that are effectively in a standby mode, there is a need for
detection of latent faults in these domains. That is, if the primary failure detection mechanism is
observation of normal operating behavior, the hardware may need to provide a separate mechanism
for detection of faults in fault domains which are not normally operating.
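As an illustration only, a separate mechanism for finding latent faults could be a periodic sweep that actively probes each standby fault domain. The sketch below assumes each domain exposes some probe callable. The names `scan_standby_domains` and `probes` are hypothetical and do not come from any particular platform.

```python
from typing import Callable, Dict, List


def scan_standby_domains(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run each standby domain's probe and return the domains that look faulty.

    `probes` maps a (hypothetical) domain name to a callable that returns
    True when the standby domain responds correctly.
    """
    faulty = []
    for name, probe in probes.items():
        try:
            healthy = probe()
        except Exception:
            # A probe that cannot even run is treated as a latent fault.
            healthy = False
        if not healthy:
            faulty.append(name)
    return faulty
```

A management task might call such a function on a timer so that a failed standby unit is found before it is needed for recovery.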
8.3.2 Fault Domain Diagnosis
In cases where a failure of a fault domain is detectable, but it is not immediately evident which of
several redundant fault domains has failed, hardware must support a diagnostic function which will
determine where the actual failure exists. For example, consider a system with multiple fans, each
of which makes up a separate fault domain, because the failure of any one fan can be compensated
by increasing the speed of the others. If a failure of a fan is detected indirectly (e.g., by a temperature or
airflow detector), additional diagnosis may be required to determine which fan has failed so that the
proper isolation, recovery, and repair actions may be initiated.
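A minimal sketch of the fan example, assuming each fan provides a tachometer reading that management software can poll once a temperature or airflow alarm has been raised. The function name and the `min_rpm` threshold are illustrative assumptions, not values from any specification.

```python
from typing import Dict, List


def diagnose_fans(tach_rpm: Dict[str, float], min_rpm: float = 1000.0) -> List[str]:
    """Return the names of fans whose measured speed is below `min_rpm`.

    `tach_rpm` maps a fan name to its tachometer reading in RPM.
    The default threshold is an arbitrary placeholder.
    """
    return sorted(name for name, rpm in tach_rpm.items() if rpm < min_rpm)
```

The indirect alarm says only that some fan has failed. The per-fan readings narrow the failure to a specific fault domain so that the right unit can be isolated and replaced.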
Diagnosis of failures may occur through diagnosis capabilities of the hardware components
themselves, or through capabilities of management software that can analyze a variety of
indications and/or initiate various actions in an attempt to determine which hardware component
has failed. A diagnosis capability in the hardware itself basically means that it is able to answer the
question, “Are you okay?” If the hardware cannot directly answer this question, then it must at
least provide the capabilities that management software needs to diagnose the failure.
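The two diagnosis paths can be sketched as a single check: ask the hardware "Are you okay?" when it can answer, and otherwise fall back to indications gathered by management software. All names here are hypothetical, and the rule that every external indication must pass is an assumption made for the example.

```python
from typing import Callable, List, Optional


def is_okay(
    self_test: Optional[Callable[[], bool]],
    external_checks: List[Callable[[], bool]],
) -> bool:
    """Diagnose one component.

    If the component can answer "Are you okay?" itself, trust that answer.
    Otherwise, treat it as healthy only if every externally observable
    indication (supplied by management software) looks normal.
    """
    if self_test is not None:
        return self_test()
    # Note: with no external checks at all, this reports the component as healthy.
    return all(check() for check in external_checks)
```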
In a system with nested fault domains (e.g., Figure 14), diagnosis may aid in converting a detected
failure of the larger fault domain to a failure of the nested fault domain. For example, in Figure 14,
if one of the I/O controllers on a shared bus failed in a way that caused the bus to hang, the fault
domain associated with the bus segment would fail, and that is the failure that would be detected. It would be
highly desirable in this case to diagnose the individual I/O controllers, determine
which one is failing, and isolate it from the bus. This would result in removing the fault from the
larger fault domain, and restoring the functionality of all the other I/O controllers on the bus
segment.
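One possible way to convert the bus-level failure into a controller-level failure is to isolate the controllers one at a time and check whether the bus recovers. The sketch below assumes the platform offers per-controller isolate and reconnect operations and a bus health check. These callables are hypothetical stand-ins, and the search assumes that only one controller is at fault.

```python
from typing import Callable, Iterable, Optional


def find_bus_hanger(
    controllers: Iterable[str],
    isolate: Callable[[str], None],
    reconnect: Callable[[str], None],
    bus_ok: Callable[[], bool],
) -> Optional[str]:
    """Isolate controllers one by one until the bus works again.

    Returns the controller whose isolation restored the bus (it is left
    isolated), or None if no single isolation helped.
    """
    for ctrl in controllers:
        isolate(ctrl)
        if bus_ok():
            return ctrl  # fault narrowed to the nested fault domain
        reconnect(ctrl)  # not the culprit; put it back on the bus
    return None
```

When a culprit is found, the other controllers on the segment are back in service and only the nested fault domain remains failed.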
Additional diagnosis capabilities can also help repair a fault domain more
quickly. For example, if an I/O controller fails, then after the system isolates and recovers from the
failure, it would be desirable to have the platform management system order the controller to run a
self-test to determine if the failure was transient or permanent. If transient, then the I/O controller
could be immediately reintegrated. If permanent, then a technician would have to be dispatched to
replace the board.
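The decision described above can be sketched as a small policy function. It assumes the controller exposes a self-test that the platform management system can invoke after isolation and recovery. The `Action` names and `post_recovery_action` are illustrative, not part of any defined interface.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    REINTEGRATE = "reintegrate"                  # failure judged transient
    DISPATCH_TECHNICIAN = "dispatch_technician"  # failure judged permanent


def post_recovery_action(run_self_test: Callable[[], bool]) -> Action:
    """Choose the repair action from the result of the controller self-test.

    A passing self-test is taken to mean the failure was transient.
    A failing self-test is taken to mean it is permanent.
    """
    if run_self_test():
        return Action.REINTEGRATE
    return Action.DISPATCH_TECHNICIAN
```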
8.3.3 Fault Domain Isolation
The ability to isolate fault domains from the system is critical to the design of a fault-managed
system. Isolation means taking whatever action is needed to prevent a fault from affecting other
fault domains.
Often, this capability is an integral part of the design of a hardware component. For example,
power supplies include circuits that monitor the output voltages being produced and quickly shut the
supply off if the voltages go out of specification. In other cases, another part of the system may
need to initiate an action to isolate a fault domain. Even when isolation is automatic by design, it
can be highly desirable, as a further safety feature, for any fault domain to be able to isolate itself
from the rest of the system upon request from a platform management system.
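The combination of automatic isolation and isolation on request can be modeled roughly as follows. This is a toy model of the power supply example, not real supply firmware. The class name, the 5% tolerance, and the method names are all assumptions made for illustration.

```python
class PowerSupply:
    """Toy model of a supply that isolates itself, automatically or on request."""

    def __init__(self, nominal_v: float, tolerance: float = 0.05) -> None:
        self.nominal_v = nominal_v
        self.tolerance = tolerance  # allowed fractional deviation (assumed value)
        self.enabled = True

    def sample(self, measured_v: float) -> bool:
        """Built-in monitor: shut the supply off if the output is out of spec."""
        limit = self.tolerance * self.nominal_v
        if self.enabled and abs(measured_v - self.nominal_v) > limit:
            self.enabled = False
        return self.enabled

    def isolate(self) -> None:
        """Isolation on request from the platform management system."""
        self.enabled = False
```

In this model, both paths end in the same isolated state. The explicit `isolate()` call is the additional safety feature for cases where the built-in monitor does not trip on its own.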
Because of the importance of domain isolation, and because of the difficulty in guaranteeing the
behavior of a fault domain which is already misbehaving, there may be multiple levels of fault
isolation capability in systems. For example, if a host processor is not responding, it may be