Providing Open Architecture High Availability Solutions

10.0 Layer-Specific Capabilities – Management Middleware

In availability management, hardware and software faults are not avoided but are expected to occur,

and the system is designed to anticipate and work around faults before they become system

failures. Thus, instead of counting on the hardware to avoid faults, availability design, and especially

high-availability design, relies heavily on management software to mask and manage faults that

are expected to occur. Fault management takes practical precedence over designing for fault

avoidance; the goal is to anticipate faults and execute fault recovery as quickly as possible.

High availability depends on each part of a system working reliably and in concert to deliver

services without interruption. Hardware and software redundancy enable management software to

replace failed components with the appropriate standby components so that services can remain

available; accomplishing this with minimal downtime or loss of client state requires a unified

approach to collecting, analyzing and acting upon system state information.
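
The switchover described above can be illustrated with a minimal Python sketch. It is not taken from any particular product: the Component and ServiceGroup classes, the component names and the checkpointed key are all hypothetical, and a real implementation would also have to handle partial failures and state transfer over a network.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    name: str
    healthy: bool = True
    state: dict = field(default_factory=dict)  # last checkpointed client state


@dataclass
class ServiceGroup:
    """One active component backed by an ordered list of standbys (hypothetical model)."""
    active: Component
    standbys: list

    def checkpoint(self, key, value):
        # Mirror client state to the active component and every standby so
        # that a later switchover does not lose it.
        for comp in [self.active, *self.standbys]:
            comp.state[key] = value

    def fail_over(self):
        # Promote the first healthy standby; raise if redundancy is exhausted.
        for i, candidate in enumerate(self.standbys):
            if candidate.healthy:
                failed = self.active
                self.active = self.standbys.pop(i)
                return failed
        raise RuntimeError("no healthy standby available")


group = ServiceGroup(Component("cpu-a"), [Component("cpu-b")])
group.checkpoint("session-42", "open")
group.active.healthy = False      # simulated fault on the active component
group.fail_over()
print(group.active.name, group.active.state)
```

Because the state was checkpointed before the fault, the promoted standby already holds the client session, which is the property the paragraph above asks for.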

HA also requires a high-performance solution. Given both the architectural complexity of service

availability systems and the performance standards required to actually achieve continuous

availability, it is crucial that the management software itself carry low overhead. The

management software must be optimized for speed, efficiency and footprint, so as not to impede

performance and undermine overall availability. Additionally, to account for the fact that the

management middleware itself is subject to failure, the management middleware should have the

capability to monitor its own health as well as the health of management middleware on other

nodes within the system.
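
One common way to let middleware instances watch one another is a heartbeat with a timeout. The sketch below assumes that technique; the text does not prescribe it. The HeartbeatMonitor class, the node names and the three-second timeout are illustrative assumptions, and timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node's management middleware."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def beat(self, node, now):
        # Record that `node` reported in at time `now` (seconds).
        self.last_seen[node] = now

    def suspects(self, now):
        # Nodes whose last heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.beat("node-1", now=0.0)   # local middleware instance
monitor.beat("node-2", now=0.0)   # peer middleware instance
monitor.beat("node-1", now=2.5)   # node-2 stays silent
late = monitor.suspects(now=4.0)
print(late)
```

A node that appears in the suspect list would then be handed to the recovery logic. Choosing the timeout is a trade-off between detection speed and false alarms.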

The entire availability management cycle must operate automatically in real time, without the need for

human intervention. Information about the system must be collected and assessed so the system

can be managed. System components must be represented and their status, topology and

dependencies must be modeled and monitored. System anomalies or faults must be quickly

detected and diagnosed. Fault data must be provided to an intelligent availability management

service so that it can quickly and appropriately respond by initiating actions that reconfigure the

status and functioning of the components as needed to maintain service. In other words, the system

must be self-managing and self-reliant.
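
The cycle just described (collect, detect, diagnose, respond) can be sketched in a few lines of Python. Everything named here is a hypothetical illustration: the DEPENDS_ON table, the component names and the recovery action are assumptions, and the diagnosis step is a deliberately simple rule that treats a faulty component as a root cause only if none of its direct dependencies is also faulty.

```python
# Hypothetical component model: each component lists what it depends on.
DEPENDS_ON = {
    "call-service": ["database", "net-card"],
    "database": ["disk"],
    "net-card": [],
    "disk": [],
}


def root_causes(faulty, depends_on):
    """Diagnose: keep only faulty components whose dependencies are all fault-free."""
    return sorted(
        c for c in faulty
        if not any(dep in faulty for dep in depends_on.get(c, []))
    )


def manage_cycle(status, depends_on, recover):
    """One pass of detect -> diagnose -> respond over collected status data."""
    faulty = {name for name, ok in status.items() if not ok}   # detect
    causes = root_causes(faulty, depends_on)                    # diagnose
    return [recover(c) for c in causes]                         # respond


# Collected status: True means the component reports healthy.
status = {"call-service": False, "database": False, "net-card": True, "disk": False}
actions = manage_cycle(status, DEPENDS_ON, lambda c: f"switch {c} to standby")
print(actions)
```

Modeling dependencies lets the cycle act on the disk fault alone instead of restarting every component that merely suffers from it.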

Thus, implementation of service availability requires management software that can do all or at

least some of the following:

• Collect system data in real time

• Configure and maintain a state-aware model of the total system

• Checkpoint data to redundant components

• Detect, diagnose and isolate faults

• Perform rapid, policy-based recovery

• Dynamically manage configuration and dependencies of the components

• Provide administrative access and control

These requirements are described in greater detail below.