Providing Open Architecture High Availability Solutions

6.3.6 Dependencies

Fault isolation is dependent on the results of the diagnosis as well as the definition of the system

dependency tree. The dependencies of software modules or hardware components are the active

results of the mapping of the system (system model defined in Section 5.3) and the results of

reliability modeling (system modeling for reliability in Section 3.3). The depth and thoroughness of

the detection infrastructure are quite often proportional to the level of fault awareness.
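
As an illustration of how a dependency tree supports fault isolation, the following minimal sketch walks a dependency map to find every component affected by a diagnosed fault. The component names, the DEPENDS_ON map, and the affected_by function are hypothetical and are not taken from the system model in Section 5.3.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON = {
    "call_processing": ["database", "network_card_0"],
    "database": ["disk_0"],
    "network_card_0": ["chassis_power"],
    "disk_0": ["chassis_power"],
    "chassis_power": [],
}


def affected_by(failed, depends_on):
    """Return every component that depends, directly or transitively, on `failed`."""
    # Invert the map so we can look up "who depends on X".
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk up the tree from the failed component.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected
```

In this sketch, a fault diagnosed in disk_0 marks database and call_processing as affected, while network_card_0 is left alone. A real infrastructure would derive the map from its system model rather than from a hard-coded table.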

6.4 Recovery

6.4.1 Introduction

Recovery is the process of reassigning the necessary resources to restore the system to an operating

state. Recovery also requires restoring any portions of the system that were adversely affected by

the failing component. The goal is to return some level of service to the system. This may be at

reduced capacity if the isolated component does not have a redundant component, or it may be

achieved by activating a redundant component.

6.4.2 Objective

The objective of the recovery process is to restore the system to an operating state, even if it is in a

reduced capacity.

6.4.3 Concepts

Rebalance/Re-route. In an N+1 or N+M system, there is a hierarchy of component dependencies.

When a component is determined to be faulty, any other system components that depend on it

must be recovered as well. This aspect of topology management becomes part of the

critical path to restoring service.
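
One way to picture re-routing is as a reassignment of the dependents of a failed provider to the remaining healthy providers, including any spare. The sketch below is an assumption-laden illustration: the reroute function, the client/provider naming, and the least-loaded placement policy are all hypothetical choices, not something this document prescribes.

```python
def reroute(assignments, failed, healthy):
    """Move every client of `failed` to the least-loaded healthy provider.

    assignments: dict mapping client name -> provider name
    failed:      name of the provider that has been isolated
    healthy:     list of providers still able to take work (may include a spare)
    Returns a new assignment dict; the input is not modified.
    """
    if not healthy:
        raise RuntimeError("no healthy provider left to take over")

    # Count the current load on each healthy provider.
    load = {provider: 0 for provider in healthy}
    for provider in assignments.values():
        if provider in load:
            load[provider] += 1

    new_assignments = dict(assignments)
    orphans = sorted(c for c, p in assignments.items() if p == failed)
    for client in orphans:
        # Hypothetical policy: least-loaded first, name as tie-breaker.
        target = min(healthy, key=lambda p: (load[p], p))
        new_assignments[client] = target
        load[target] += 1
    return new_assignments
```

For example, with clients a and b on p1, c on p2, and d on p3, a failure of p1 with healthy providers p2, p3, and spare moves a to spare and b to p2. Other policies, such as moving everything to the spare, fit the same structure.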

Active/Standby. When redundant components are used, one is commonly in use (active) while the

other is in standby, waiting to take over in case of a failure. The standby component typically

receives either the same input as the active component or information about what the active

component is processing and what data it last processed.
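
The first variant described above, in which the standby receives the same input as the active component, can be sketched as follows. The Unit and ActiveStandbyPair classes are hypothetical, and a failure is simulated simply by clearing a healthy flag.

```python
class Unit:
    """Hypothetical redundant component that records the input it receives."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.seen = []

    def receive(self, item):
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        self.seen.append(item)
        return f"{self.name} handled {item}"


class ActiveStandbyPair:
    """Mirror every input to the standby; use only the active unit's output."""

    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def submit(self, item):
        standby_result = None
        if self.standby.healthy:
            standby_result = self.standby.receive(item)
        try:
            return self.active.receive(item)
        except RuntimeError:
            if standby_result is None:
                raise  # no healthy standby to take over
            # Failover: the standby already holds the same input, so swap roles.
            self.active, self.standby = self.standby, self.active
            return standby_result
```

Because the standby has seen every input, it can answer the request on which the active unit failed without any replay. The cost is that both units do the same work during normal operation.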

Checkpointing. The technique used to keep a standby component aware of where it should start

processing when it takes over is referred to as checkpointing.
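
A minimal sketch of checkpointing, assuming the checkpoint is simply the index of the last item that was fully processed. CheckpointStore, process_from_checkpoint, and the fail_at parameter used to simulate a crash are hypothetical names introduced for illustration.

```python
class CheckpointStore:
    """Hypothetical shared store holding the index of the last completed item."""

    def __init__(self):
        self.last_done = -1  # -1 means nothing has been processed yet

    def save(self, index):
        self.last_done = index

    def load(self):
        return self.last_done


def process_from_checkpoint(name, items, store, log, fail_at=None):
    """Process `items` starting just after the stored checkpoint.

    `fail_at` simulates a crash immediately before the item at that index.
    Completed work is appended to `log` as (worker name, item) pairs.
    """
    for index in range(store.load() + 1, len(items)):
        if index == fail_at:
            raise RuntimeError(f"{name} failed before item {index}")
        log.append((name, items[index]))
        store.save(index)  # checkpoint only after the item is complete
```

If the active worker fails before the third of four items, a standby that calls the same function with the same store resumes at the third item, so no item is skipped or processed twice. In a real system the store would have to survive the failure of the active component, for example by living on the standby or on shared storage.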

Reset/restart. A technique that can be used to recover a failed component is the process of resetting

and/or restarting that component. The process would be performed on a standby component if the

system has redundant components, or on the active component if the failure can be determined to be

transient rather than hard.
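
One simple way to tell a transient failure from a hard one is to bound the number of restart attempts. The sketch below assumes that a component which comes back healthy within the limit suffered a transient failure; recover_by_restart, FlakyComponent, and the default limit of three attempts are hypothetical.

```python
def recover_by_restart(restart, is_healthy, max_attempts=3):
    """Restart a component up to `max_attempts` times.

    Returns ("transient", attempts_used) if the component became healthy,
    or ("hard", max_attempts) if it never recovered.
    """
    for attempt in range(1, max_attempts + 1):
        restart()
        if is_healthy():
            return ("transient", attempt)
    return ("hard", max_attempts)


class FlakyComponent:
    """Hypothetical stand-in that becomes healthy after a set number of restarts."""

    def __init__(self, restarts_needed):
        self.restarts_needed = restarts_needed
        self.restarts = 0

    def restart(self):
        self.restarts += 1

    def is_healthy(self):
        return self.restarts >= self.restarts_needed
```

A result of "hard" would typically be escalated, for instance by leaving the component isolated and relying on its redundant counterpart.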

6.4.4 Approach

In Section 3.3.3, the recovery action was briefly mentioned with respect to the ability to tolerate a

fault. Several common techniques are used for the recovery action depending on the specific needs

of the system and its application. For a fault management infrastructure to be effective, it must allow

each technique to be implemented either directly or with the pieces it provides.