Providing Open Architecture High Availability Solutions

maximum capacity. Removing the first power module that indicated it was not able to keep up would
then cause the remaining power modules to become even more overloaded, producing the reverse of the
intended effect: further power modules would shut down, causing the entire system to fail.
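The cascade described above can be illustrated with a minimal sketch. It assumes that the load is
shared equally among the remaining power modules and that any module loaded beyond its capacity is
removed; the function name and the wattage figures are hypothetical and are not taken from any real
system.

```python
def surviving_modules(total_load, capacity_per_module, module_count):
    """Remove overloaded modules one at a time; return how many stay in service."""
    remaining = module_count
    while remaining > 0:
        load_per_module = total_load / remaining  # assumption: load is shared equally
        if load_per_module <= capacity_per_module:
            return remaining  # the remaining modules can carry the load
        remaining -= 1  # isolate one overloaded module; the rest absorb its share
    return 0  # every module was removed: the entire system has failed


# Four 100 W modules asked to supply 420 W: each carries 105 W and is overloaded.
print(surviving_modules(420, 100, 4))  # 0
# The same modules at 380 W carry 95 W each, so none is removed.
print(surviving_modules(380, 100, 4))  # 4
```

In the first case, isolating the module that reported the overload leaves the others worse off at
every step, which is why the effect of removing a component must be understood before isolating it.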
To perform isolation adequately in a system constructed to provide a high level of service
availability, a clear understanding of the cause and effect of each component's presence in the
system is required.
A software example of isolation is a continuous spurious interrupt. If no component is indicating
the need for service and the interrupts continue to be fielded, service could be disrupted.
Masking the interrupt may be appropriate to isolate the problem. Again, it is important to
understand the ramifications that masking that particular type of interrupt may have on the
system.
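A minimal sketch of this idea follows. It models an interrupt line in plain Python rather than
using any real driver interface; the class name, the threshold of consecutive spurious interrupts,
and the returned status strings are all assumptions made for illustration.

```python
class InterruptLine:
    """Hypothetical model of one interrupt line; not a real driver API."""

    def __init__(self, spurious_limit=3):
        self.spurious_limit = spurious_limit  # assumed threshold before masking
        self.spurious_count = 0
        self.masked = False

    def handle(self, device_requests_service):
        """Field one interrupt and report what was done with it."""
        if self.masked:
            return "ignored"
        if device_requests_service:
            self.spurious_count = 0  # a legitimate interrupt resets the count
            return "serviced"
        self.spurious_count += 1
        if self.spurious_count >= self.spurious_limit:
            self.masked = True  # isolate the fault by masking the line
            return "masked"
        return "spurious"


line = InterruptLine(spurious_limit=3)
results = [line.handle(False) for _ in range(4)]
print(results)  # ['spurious', 'spurious', 'masked', 'ignored']
```

Note that once the line is masked, a legitimate request on the same line is ignored as well. This
is the kind of ramification that must be understood before masking a particular interrupt.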
6.3.5 Techniques
Component Isolation
If a component is causing a detrimental effect on a bus or communication channel, that device can
be isolated physically (using hardware) or logically (by requesting the system to stop
communicating with it). If logical isolation is not successful, physical isolation may be needed.
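The escalation from logical to physical isolation can be sketched as follows. The three callables
passed to `isolate` are hypothetical hooks that a platform would have to supply; the simulated bus
and the name `card-3` exist only to make the example runnable.

```python
def isolate(component, logical_isolate, physical_isolate, still_disruptive):
    """Try logical isolation first; escalate to physical isolation if it fails.

    All three callables are hypothetical hooks supplied by the platform.
    """
    logical_isolate(component)  # ask the system to stop communicating with it
    if not still_disruptive(component):
        return "logical"
    physical_isolate(component)  # e.g. hardware disconnects it from the bus
    return "physical"


# Simulated case: a device that keeps driving the bus even when it is ignored.
bus_drivers = {"card-3"}
ignored = set()

method = isolate(
    "card-3",
    logical_isolate=ignored.add,
    physical_isolate=bus_drivers.discard,
    still_disruptive=lambda c: c in bus_drivers,
)
print(method)  # physical
```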
Quiescing Components
Even if a component is operating in a degraded state or causing a detrimental effect on a bus or
communication channel, it may be desirable to have that component complete tasks it has in
progress before isolating it. This can reduce the number of tasks lost on switchover.
Quiescing is usually done by isolating the inputs, and then waiting for all output to stop before
isolating the outputs.
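The quiescing sequence above can be sketched as follows. The `Component` class is a hypothetical
stand-in for a real component, and the `max_steps` bound is an added assumption: a degraded
component may never finish draining, so the wait should not be unbounded.

```python
from collections import deque


class Component:
    """Hypothetical component holding a queue of tasks in progress."""

    def __init__(self):
        self.accepting_input = True
        self.output_enabled = True
        self.in_progress = deque()
        self.completed = []

    def submit(self, task):
        """Accept a task only while the inputs are not isolated."""
        if not self.accepting_input:
            return False
        self.in_progress.append(task)
        return True

    def step(self):
        """Complete one in-progress task, if any, while outputs are enabled."""
        if self.in_progress and self.output_enabled:
            self.completed.append(self.in_progress.popleft())


def quiesce(component, max_steps=100):
    """Isolate inputs, drain in-progress work, then isolate outputs."""
    component.accepting_input = False  # 1. isolate the inputs
    steps = 0
    while component.in_progress and steps < max_steps:
        component.step()  # 2. let tasks in progress complete
        steps += 1
    component.output_enabled = False  # 3. isolate the outputs
    return len(component.in_progress)  # number of tasks lost (0 if fully drained)


c = Component()
c.submit("a")
c.submit("b")
lost = quiesce(c)
print(lost, c.completed, c.submit("c"))  # 0 ['a', 'b'] False
```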
Safe Value Output
If a component is supposed to output a value in a certain range and fails to do so, one could simply
block or turn off that component. Unfortunately, this may cause an entire system failure. A second
approach would be to have the component output a safe value that will allow the system to function
(perhaps in a degraded manner) until a recovery can be implemented. Value coasting, or
maintaining the last valid value, is one method of providing a safe value output. Another method
would be to use a pre-programmed value table.
It is important to note that if this method is used, the management system and all components using
the output of this component must be notified of the failure. If this is not done, the failure could
go undetected, and it would not be known that the system was using false data. This is critical in
HA systems, as silent failures are not acceptable.
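A minimal sketch of value coasting with notification follows. The class, its parameters, and the
`notify` callback are hypothetical; the `fallback` argument plays the role of a pre-programmed
value that is used until a first valid reading is available.

```python
class SafeValueOutput:
    """Hypothetical wrapper that substitutes a safe value for out-of-range output."""

    def __init__(self, low, high, fallback, notify):
        self.low, self.high = low, high
        self.last_valid = fallback  # used until a first valid reading arrives
        self.notify = notify  # callback to the management system and consumers

    def filter(self, raw):
        """Return (value, is_valid); a substituted value is always reported."""
        if raw is not None and self.low <= raw <= self.high:
            self.last_valid = raw
            return raw, True
        self.notify(
            f"output {raw!r} outside [{self.low}, {self.high}]; "
            f"coasting on {self.last_valid!r}"
        )
        return self.last_valid, False


alerts = []
sensor = SafeValueOutput(low=0.0, high=100.0, fallback=50.0, notify=alerts.append)
readings = [sensor.filter(v) for v in (42.0, 250.0, None)]
print(readings)  # [(42.0, True), (42.0, False), (42.0, False)]
print(len(alerts))  # 2
```

Returning the validity flag together with the value, and calling `notify` on every substitution,
is one way to ensure that consumers never use a coasted value without knowing it.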
Routing Change
If a component provides a service that more than one device can provide, then another method of
isolation is to reassign the service to another device and/or remove the failed device from the
service list. In the case of a networking device, this could mean updating the routing table so
that traffic is no longer routed through the failed component.
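Isolation by removal from a service list can be sketched as follows. The dictionary layout, the
function names, and the node and service names are assumptions made for illustration; a real
routing table would be updated through the interfaces of the platform in use.

```python
def remove_from_service(service_list, failed):
    """Return a new service list with the failed provider removed from every service."""
    return {
        service: [p for p in providers if p != failed]
        for service, providers in service_list.items()
    }


def pick_provider(service_list, service):
    """Return the first remaining provider, or None if the service has none left."""
    providers = service_list.get(service, [])
    return providers[0] if providers else None


services = {"dns": ["node-a", "node-b"], "ntp": ["node-a"]}
services = remove_from_service(services, "node-a")
print(pick_provider(services, "dns"))  # node-b
print(pick_provider(services, "ntp"))  # None
```

The second lookup returns no provider, which shows why this technique applies only where more
than one device can provide the service.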