Providing Open Architecture High Availability Solutions

6.3.6 Dependencies
Fault isolation depends on the results of the diagnosis as well as the definition of the system
dependency tree. The dependencies of software modules or hardware components are the active
results of the mapping of the system (system model defined in Section 5.3) and the results of
reliability modeling (system modeling for reliability in Section 3.3). The depth and thoroughness of
the detection infrastructure is quite often proportional to the level of fault awareness.
6.4 Recovery
6.4.1 Introduction
Recovery is the process of reassigning the necessary resources to restore the system to an operating
state, so that some level of service is provided again. Recovery also requires restoring any portions
of the system that were adversely affected by the failing component. Service can be restored by
activating a redundant component or, if the isolated component does not have a redundant
component, in a reduced capacity.
6.4.2 Objective
The objective of the recovery process is to restore the system to an operating state, even if it is in a
reduced capacity.
6.4.3 Concepts
Rebalance/Re-route. In an N+1 or N+M system, there is a hierarchy of component dependencies.
When a component is determined to be faulty, any other system components that depend on this
resource must be recovered as well. This aspect of topology management becomes part of the
critical path to restoring service.
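
The dependency walk described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the `depends_on` map, the component names, and the function `affected_components` are all hypothetical, and a real system would derive the map from its own system model.

```python
from collections import deque


def affected_components(depends_on, failed):
    """Return every component that depends, directly or indirectly, on `failed`."""
    # Invert the map: resource -> components that depend directly on it.
    dependents = {}
    for component, resources in depends_on.items():
        for resource in resources:
            dependents.setdefault(resource, []).append(component)

    # Breadth-first walk up the dependency hierarchy from the failed component.
    affected = []
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, []):
            if component not in seen:
                seen.add(component)
                affected.append(component)
                queue.append(component)
    return affected


# Hypothetical topology: each component lists the resources it depends on.
DEPENDS_ON = {
    "call_app": ["protocol_stack"],
    "protocol_stack": ["line_card_1"],
    "stats_app": ["line_card_1"],
    "admin_app": ["line_card_2"],
}

to_recover = affected_components(DEPENDS_ON, "line_card_1")
```

Here `to_recover` holds the components that must be recovered along with `line_card_1`; `admin_app` is untouched because it depends only on `line_card_2`.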
Active/Standby. When redundant components are used, one is commonly in use (active) while the
other is in standby, waiting to take over in case of a failure. The standby component typically
receives either the same input as the active component or information about what the active
component is processing and what data it last processed.
Checkpointing. The technique used to keep a standby component aware of where it should start
processing when it takes over is referred to as checkpointing.
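
A minimal sketch of checkpointing between an active and a standby component follows. It assumes, purely for illustration, that work items are numbered sequentially and that the active component sends a checkpoint after every item; the class `Unit` and the function `run_with_failover` are hypothetical names, not part of any particular framework.

```python
class Unit:
    """Hypothetical processing unit that tracks the next item it should handle."""

    def __init__(self, name):
        self.name = name
        self.processed = []
        self.next_seq = 0

    def process(self, seq, item):
        self.processed.append(item)
        self.next_seq = seq + 1

    def apply_checkpoint(self, next_seq):
        # The standby records where processing should resume; it does no work yet.
        self.next_seq = next_seq


def run_with_failover(items, fail_at):
    """Process `items` on the active unit until it fails just before item `fail_at`."""
    active, standby = Unit("active"), Unit("standby")
    for seq, item in enumerate(items):
        if seq == fail_at:
            break  # simulated failure of the active unit
        active.process(seq, item)
        standby.apply_checkpoint(active.next_seq)  # checkpoint after each item

    # The standby takes over from the last checkpoint it received.
    for seq in range(standby.next_seq, len(items)):
        standby.process(seq, items[seq])
    return active.processed, standby.processed


before, after = run_with_failover(["a", "b", "c", "d"], fail_at=2)
```

In this run the active unit handles `a` and `b`, and the standby resumes at `c`. Checkpointing after every item is the simplest policy; checkpointing less often would reduce overhead at the cost of the standby having to redo or lose some work.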
Reset/restart. A technique that can be used to recover a failed component is the process of resetting
and/or restarting that component. The process would be performed on a standby component if the
system has redundant components, or on the active component if it can be determined that the
failure is transient rather than hard.
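
One way to separate transient from hard failures is to restart the component a bounded number of times and check its health after each attempt. The sketch below assumes that policy for illustration only; the text does not specify how the determination is made, and `FlakyComponent`, `recover_by_restart`, and the limit of three attempts are hypothetical.

```python
class FlakyComponent:
    """Hypothetical component that becomes healthy after a set number of restarts."""

    def __init__(self, restarts_needed):
        self.restarts_needed = restarts_needed
        self.restarts = 0

    def restart(self):
        self.restarts += 1

    def is_healthy(self):
        return self.restarts >= self.restarts_needed


def recover_by_restart(component, max_attempts=3):
    """Restart until healthy; classify the failure as 'transient' or 'hard'."""
    for attempt in range(1, max_attempts + 1):
        component.restart()
        if component.is_healthy():
            return ("transient", attempt)
    # Still unhealthy after the allowed attempts: treat as a hard failure,
    # which would call for a different recovery action such as failover.
    return ("hard", max_attempts)


outcome = recover_by_restart(FlakyComponent(restarts_needed=2))
```

A component that never recovers within the limit is reported as a hard failure, at which point the system would fall back to another technique such as activating a standby.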
6.4.4 Approach
In Section 3.3.3, the recovery action was briefly mentioned with respect to the ability to tolerate a
fault. Several common techniques are used for the recovery action depending on the specific needs
of the system and its application. For a fault management infrastructure to work, each technique
must be implementable either directly or with the pieces provided.