Providing Open Architecture High Availability Solutions

To provide service availability, the current state of transactions often must be maintained in a hot
standby redundant component. This means that ongoing transaction data and application state data
must be continuously delivered (checkpointed) to a hot standby location. Real-time checkpointing
that provides transaction state data for graceful switchover requires a high-speed data storage and
retrieval system and a fully optimized method of messaging and communication.
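The checkpointing flow described above can be illustrated with a minimal sketch. It assumes an in-process call stands in for the messaging layer, and the names Active, Standby and apply_checkpoint are hypothetical, chosen only for this illustration. Sequence numbers let the standby discard stale or replayed checkpoints.

```python
import copy


class Standby:
    """Hot standby holding the most recently checkpointed transaction state."""

    def __init__(self):
        self.state = {}
        self.last_seq = 0

    def apply_checkpoint(self, seq, txn_id, data):
        # Reject stale or duplicate checkpoints so replays are harmless.
        if seq <= self.last_seq:
            return False
        self.state[txn_id] = copy.deepcopy(data)
        self.last_seq = seq
        return True


class Active:
    """Active component that checkpoints every state change to the standby."""

    def __init__(self, standby):
        self.standby = standby
        self.state = {}
        self.seq = 0

    def update(self, txn_id, data):
        self.state[txn_id] = data
        self.seq += 1
        # Assumption: a real system would send this over a fast interconnect.
        self.standby.apply_checkpoint(self.seq, txn_id, data)


standby = Standby()
active = Active(standby)
active.update("call-17", {"phase": "ringing"})
active.update("call-17", {"phase": "connected"})
```

After these two updates the standby holds the same transaction state as the active component, so it could take over the call without losing its current phase.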
To ensure that the heartbeats and checkpointed information are properly received by the standby
component, it is recommended that the paths used for this communication be redundant. This
minimizes the impact of a failed communication path due to hardware (e.g., a failed interconnect) or
software (e.g., a failed TCP/IP stack).
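As a rough sketch of the redundant-path recommendation, the example below sends each heartbeat or checkpoint message on every configured path and treats delivery as successful if at least one path accepts it. The Path class and the send_redundant function are hypothetical stand-ins for real transport code.

```python
class Path:
    """Hypothetical communication path between active and standby."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.delivered = []

    def send(self, message):
        if not self.healthy:
            raise ConnectionError(f"path {self.name} is down")
        self.delivered.append(message)


def send_redundant(paths, message):
    """Send on every path; return the names of the paths that accepted it."""
    succeeded = []
    for path in paths:
        try:
            path.send(message)
            succeeded.append(path.name)
        except ConnectionError:
            continue  # A single failed path must not block the others.
    if not succeeded:
        raise ConnectionError("all paths failed")
    return succeeded


paths = [Path("interconnect-a", healthy=False), Path("interconnect-b")]
result = send_redundant(paths, {"seq": 1, "type": "heartbeat"})
```

Because the message carries a sequence number, a receiver that gets the same message on two healthy paths can discard the duplicate.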
10.4 Detecting, Diagnosing and Isolating Faults
At the most basic level, fault management must detect faults and initiate notification and an
appropriate recovery action. The fault management required to achieve service availability is
necessarily more complex. It must detect not only active faults, but also latent faults, in order to
anticipate and avoid critical system errors. It must perform sophisticated diagnosis and root cause
analysis for quick and appropriate containment of faults with minimum system impact. And it must
intelligently determine and initiate policy-based recovery actions.
Detection is the discovery of a fault or symptom. It is the identification of an undesirable condition
that may lead to the loss of service from the system or device. Detection may be based on error
detection (through direct observation or correlation of multiple events in location or time) or
inference (by observing other behavior of the system). Detection is accomplished by creating fault
detectors that are associated with data collectors in managed components.
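One way to picture the pairing of a fault detector with a data collector is sketched below. The FaultDetector class, the temperature readings and the 85.0 threshold are illustrative assumptions, not part of any particular middleware.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FaultDetector:
    """Pairs a data collector with a predicate that flags a fault."""

    name: str
    collect: Callable[[], float]       # data collector in the managed component
    is_fault: Callable[[float], bool]  # detection rule applied to each sample

    def poll(self) -> Optional[dict]:
        value = self.collect()
        if self.is_fault(value):
            return {"detector": self.name, "value": value}
        return None


# Hypothetical collector: two successive board temperature readings.
readings = iter([41.0, 97.5])
detector = FaultDetector("board-temp", lambda: next(readings), lambda c: c > 85.0)
first = detector.poll()
second = detector.poll()
```

The first poll returns nothing because the reading is within limits; the second returns a fault event that names the detector and carries the offending value.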
The management middleware should provide a variety of methods for detecting faults. It should
actively monitor the system model and its managed components for state changes that will impact
the availability of a service. It should also monitor the system components directly via methods
contained in the managed components to receive fault events that are sent directly from the
component. Ideally, the management middleware should also provide the ability to analyze trends
in system performance as well as transient and recoverable events in order to predict faults.
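Trend analysis of recoverable events can be approximated with a simple linear fit. The sketch below assumes equally spaced samples of a hypothetical corrected-error counter and estimates how many sampling intervals remain before a chosen limit is reached; a production system would likely use a more robust model.

```python
def slope(samples):
    """Least-squares slope of equally spaced samples (units per interval)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def intervals_until(samples, limit):
    """Estimate the intervals left before the trend reaches the limit."""
    rate = slope(samples)
    if rate <= 0:
        return None  # Flat or improving trend: no fault predicted.
    return max(0.0, (limit - samples[-1]) / rate)


# Hypothetical counts of corrected memory errors per sampling interval.
corrected_errors = [2, 4, 6, 8, 10]
eta = intervals_until(corrected_errors, limit=20)
```

With these samples the count rises by 2 per interval, so the limit of 20 is projected to be reached in 5 intervals, which gives the middleware time to act before the fault becomes active.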
Diagnosis analyzes one or more events and system parameters to determine the nature and location
of a fault reported by a detector. Identifying the originating fault (root cause analysis) is part of
diagnosis; this can be difficult, since errors tend to multiply quickly and a single fault may lead to
multiple errors. In complex systems, both diagnosis (including root cause analysis) and the
subsequent isolation of the fault depend on information about the current configuration topology
and dependency relationships represented in the system model.
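The role of dependency relationships in root cause analysis can be sketched as follows. The example assumes the system model is available as a simple mapping from each component to the components it depends on; the component names are invented. A reporting component is treated as a root cause candidate only if nothing it depends on, directly or indirectly, is also reporting errors.

```python
def transitive_deps(component, depends_on):
    """Return every component that `component` depends on, at any depth."""
    seen = set()
    stack = list(depends_on.get(component, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def root_causes(reporting, depends_on):
    """Filter out components whose errors are explained by a dependency."""
    reporting = set(reporting)
    return sorted(
        c for c in reporting
        if not (transitive_deps(c, depends_on) & reporting)
    )


# Hypothetical system model: component -> components it depends on.
depends_on = {
    "billing-app": ["database"],
    "web-app": ["database"],
    "database": ["disk-array"],
    "disk-array": [],
}
suspects = root_causes({"billing-app", "web-app", "disk-array"}, depends_on)
```

Here three components report errors, but only the disk array is kept as a suspect, since both applications depend on it through the database.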
Typically, a fault diagnosis may automatically trigger local actions within the component, forward
data to the management middleware for action, and notify administrative personnel. The
management middleware should handle each of these cases.
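A minimal sketch of routing a diagnosis to these three outcomes is shown below. The diagnosis fields (locally_recoverable, severity) and the routing rules are assumptions made for the example; in practice they would come from the recovery policy.

```python
def handle_diagnosis(diagnosis, local_action, middleware_queue, notify):
    """Route one diagnosis to local action, middleware, and/or personnel."""
    taken = []
    if diagnosis.get("locally_recoverable"):
        local_action(diagnosis)          # handled inside the component
        taken.append("local")
    middleware_queue.append(diagnosis)   # always forwarded for policy action
    taken.append("forwarded")
    if diagnosis.get("severity") == "critical":
        notify(diagnosis)                # alert administrative personnel
        taken.append("notified")
    return taken


restarted, queue, pages = [], [], []
diagnosis = {
    "component": "nic-0",
    "severity": "critical",
    "locally_recoverable": True,
}
taken = handle_diagnosis(diagnosis, restarted.append, queue, pages.append)
```

In this run all three routes fire, because the hypothetical diagnosis is both locally recoverable and critical.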
Isolation contains a fault to keep it from spreading throughout a system. Configuration and
dependency relationships must be understood in order to partition appropriate fault containment
regions.
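To illustrate how dependency relationships bound a containment region, the sketch below inverts a hypothetical dependency map and collects every component that relies, directly or indirectly, on the faulty one. Treating that set as the region to isolate is an assumption of this example; real policies may draw the boundary differently.

```python
def containment_region(faulty, depends_on):
    """Return the faulty component plus everything that depends on it."""
    # Invert the map: for each component, who depends on it?
    dependents = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(component)

    region = {faulty}
    stack = [faulty]
    while stack:
        for dependent in dependents.get(stack.pop(), ()):
            if dependent not in region:
                region.add(dependent)
                stack.append(dependent)
    return region


# Hypothetical system model: component -> components it depends on.
depends_on = {
    "web-app": ["database"],
    "database": ["disk-array"],
    "logger": ["spare-disk"],
}
region = containment_region("disk-array", depends_on)
```

The logger and its spare disk fall outside the region, so they can keep running while the disk array, the database and the web application are isolated.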