Providing Open Architecture High Availability Solutions

To provide service availability, the current state of transactions often must be maintained in a hot

standby redundant component. This means that ongoing transaction data and application state data

must be continuously delivered (checkpointed) to a hot standby location. Real-time checkpointing

that provides transaction state data for graceful switchover requires a high-speed data storage and

retrieval system and a fully optimized method of messaging and communication.
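
As a rough illustration of checkpointing to a hot standby, the following minimal sketch delivers each state change to a standby replica under a sequence number. The class and method names (StandbyReplica, ActiveComponent, switchover) are hypothetical, and a direct in-process call stands in for the high-speed messaging layer a real system would need.

```python
import copy


class StandbyReplica:
    """Hypothetical hot standby holding the latest checkpoint per transaction."""

    def __init__(self):
        self.checkpoints = {}
        self.last_seq = 0

    def receive(self, seq, txn_id, state):
        # Reject stale or duplicate checkpoints so replays are harmless.
        if seq <= self.last_seq:
            return False
        # Copy the state so later changes on the active side do not leak in.
        self.checkpoints[txn_id] = copy.deepcopy(state)
        self.last_seq = seq
        return True


class ActiveComponent:
    """Hypothetical active component that checkpoints every state change."""

    def __init__(self, standby):
        self.standby = standby
        self.seq = 0
        self.transactions = {}

    def update(self, txn_id, state):
        self.transactions[txn_id] = state
        self.seq += 1
        # Assumption: a direct call stands in for the real transport.
        self.standby.receive(self.seq, txn_id, state)


def switchover(standby):
    """Promote the standby: resume from the checkpointed transaction state."""
    return dict(standby.checkpoints)
```

After a switchover, the promoted standby resumes each transaction from its most recently checkpointed state; anything not yet checkpointed is lost, which is why checkpoint frequency matters.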

To ensure that the heartbeats and checkpointed information are properly received by the standby

component, it is recommended that the paths used for this communication be redundant. This

minimizes the impact of a failed communication path due to hardware (e.g., failed interconnect) or

software (e.g., failed TCP/IP stack).
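
The benefit of redundant paths can be sketched as follows. This is an illustrative example only: the HeartbeatMonitor class, the path names, and the timeout value are assumptions. The peer is treated as failed only when heartbeats have stopped on every path, while a single silent path is reported as a path fault.

```python
class HeartbeatMonitor:
    """Hypothetical monitor tracking heartbeats over several redundant paths."""

    def __init__(self, paths, timeout):
        self.timeout = timeout
        # Time of the last heartbeat seen on each path (None = never seen).
        self.last_seen = {path: None for path in paths}

    def record(self, path, now):
        self.last_seen[path] = now

    def failed_paths(self, now):
        # A path has failed if it never delivered a heartbeat or went silent.
        return sorted(
            path
            for path, seen in self.last_seen.items()
            if seen is None or now - seen > self.timeout
        )

    def peer_alive(self, now):
        # The peer is presumed alive while at least one path still delivers.
        return len(self.failed_paths(now)) < len(self.last_seen)
```

With two paths, the loss of one interconnect shows up in failed_paths() without triggering a switchover, because peer_alive() remains true.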

10.4 Detecting, Diagnosing and Isolating Faults

At the most basic level, fault management must detect faults and initiate notification and an

appropriate recovery action. The fault management required to achieve service availability is

necessarily more complex. It must detect not only active faults, but also latent faults, in order to

anticipate and avoid critical system errors. It must perform sophisticated diagnosis and root cause

analysis for quick and appropriate containment of faults with minimum system impact. And it must

intelligently determine and initiate policy-based recovery actions.

Detection is the discovery of a fault or symptom. It is the identification of an undesirable condition

that may lead to the loss of service from the system or device. Detection may be based on error

detection (through direct observation or correlation of multiple events in location or time) or

inference (by observing other behavior of the system). Detection is accomplished by creating fault

detectors that are associated with data collectors in managed components.
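
One possible shape for a detector attached to a data collector is sketched below. The names DataCollector and ThresholdDetector, and the threshold rule itself, are assumptions chosen for illustration; a real detector could equally correlate several events or infer a fault from other behavior.

```python
class DataCollector:
    """Hypothetical collector wrapping a function that samples a component."""

    def __init__(self, name, read):
        self.name = name
        self.read = read  # callable returning the current measurement


class ThresholdDetector:
    """Hypothetical detector that flags a fault when a reading exceeds a limit."""

    def __init__(self, collector, limit):
        self.collector = collector
        self.limit = limit

    def check(self):
        value = self.collector.read()
        if value > self.limit:
            # Return a simple fault event describing the undesirable condition.
            return {
                "source": self.collector.name,
                "value": value,
                "limit": self.limit,
            }
        return None
```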

The management middleware should provide a variety of methods for detecting faults. It should

actively monitor the system model and its managed components for state changes that will impact

the availability of a service. It should also monitor the system components directly via methods

contained in the managed components to receive fault events that are sent directly from the

component. Ideally, the management middleware should also provide the ability to analyze trends

in system performance as well as transient and recoverable events in order to predict faults.
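
Trend analysis for fault prediction can be as simple as fitting a line to recent samples and estimating when it will cross a limit. The sketch below uses an ordinary least-squares fit; the function name predict_crossing and the choice of a linear model are assumptions, not a method prescribed by this paper.

```python
def predict_crossing(samples, threshold):
    """Estimate when a metric will reach threshold, from (time, value) samples.

    Returns the estimated crossing time, or None if there are too few
    samples or the fitted trend is flat or decreasing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples taken at the same instant
    # Least-squares slope and intercept of value against time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    return (threshold - intercept) / slope
```

For example, disk usage samples of 50, 60 and 70 percent at times 0, 1 and 2 give a fitted slope of 10 percent per time unit, so the trend reaches 100 percent at time 5.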

Diagnosis analyzes one or more events and system parameters to determine the nature and location

of a fault reported by a detector. Identifying the originating fault (root cause analysis) is part of

diagnosis; this can be difficult, since errors tend to multiply quickly and a single fault may lead to

multiple errors. In complex systems, both diagnosis (including root cause analysis) and the

subsequent isolation of the fault depend on information about the current configuration topology

and dependency relationships represented in the system model.
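
To show how dependency relationships can support root cause analysis, the sketch below treats a component as a likely root cause if it reports an error and none of its direct or indirect dependencies do. The dictionary-based system model and the function names are assumptions, and the dependency graph is assumed to be acyclic.

```python
def transitive_deps(depends_on, component):
    """Return every component that the given component depends on, directly or not."""
    seen = set()
    stack = list(depends_on.get(component, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def root_causes(depends_on, reporting):
    """Pick the reporting components whose dependencies are all quiet.

    depends_on maps a component name to the components it depends on.
    reporting is the collection of components currently reporting errors.
    """
    reporting = set(reporting)
    return sorted(
        comp
        for comp in reporting
        if not (transitive_deps(depends_on, comp) & reporting)
    )
```

Errors reported by components that sit above a faulty dependency are thereby treated as symptoms rather than as separate faults.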

Typically, a fault diagnosis may automatically trigger local actions within the component, forward

data to management middleware for action, and may also notify administrative

personnel. The management middleware should handle each of these cases.

Isolation contains a fault to keep it from spreading throughout a system. Configuration and

dependency relationships must be understood in order to partition appropriate fault containment

regions.
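
One way to derive a containment region from the dependency relationships is sketched below, under the assumption that a region consists of the faulty component plus everything that depends on it, directly or indirectly. The function name and the dictionary-based model are hypothetical.

```python
from collections import deque


def containment_region(depends_on, faulty):
    """Return the faulty component and all components that depend on it.

    depends_on maps a component name to the components it depends on.
    """
    # Invert the model: for each component, who depends on it?
    dependents = {}
    for comp, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(comp)

    # Breadth-first walk outward from the faulty component.
    region = {faulty}
    queue = deque([faulty])
    while queue:
        current = queue.popleft()
        for comp in dependents.get(current, ()):
            if comp not in region:
                region.add(comp)
                queue.append(comp)
    return region
```

Components outside the returned set do not rely on the faulty component and can continue to provide service while the region is isolated.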