Providing Open Architecture High Availability Solutions
10.0 Layer-Specific Capabilities – Management Middleware
In availability management, hardware and software faults are not avoided, but are expected to occur,
and the system is designed to anticipate and work around faults before they become system
failures. Thus, instead of counting on the hardware to avoid faults, availability design, and especially
high availability design, relies heavily on management software to mask and manage faults that
are expected to occur. Fault management takes practical precedence over designing for fault
avoidance; the goal is to anticipate faults and execute fault recovery as quickly as possible.
High availability depends on each part of a system working reliably and in concert to deliver
services without interruption. Hardware and software redundancy enables management software to
replace failed components with the appropriate standby components so that services can remain
available; accomplishing this with minimal downtime or loss of client state requires a unified
approach to collecting, analyzing and acting upon system state information.
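As a minimal sketch of replacing a failed component with a standby, the following assumes one active component backed by an ordered list of standbys. The names Component, ServiceGroup and fail_over are hypothetical and are not taken from any particular middleware product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    healthy: bool = True


@dataclass
class ServiceGroup:
    """One active component backed by an ordered list of standbys (assumed layout)."""
    active: Component
    standbys: List[Component] = field(default_factory=list)

    def fail_over(self) -> Component:
        """Promote the first healthy standby to active; raise if none is left."""
        for index, candidate in enumerate(self.standbys):
            if candidate.healthy:
                # Remove the standby from the pool and make it the active component.
                self.active = self.standbys.pop(index)
                return self.active
        raise RuntimeError("no healthy standby available")
```

A real system would also transfer checkpointed client state to the promoted standby; that step is omitted here.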
HA also requires a high-performance solution. Given both the architectural complexity of service
availability systems and the performance standards required to actually achieve continuous
availability, it is crucial that the management software itself carry a low overhead. The
management software must be optimized for speed, efficiency and footprint, so as not to impede
performance and undermine overall availability. Additionally, because the management
middleware is itself subject to failure, it should be able to monitor its own health as well as
the health of management middleware on other nodes within the system.
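One common way to monitor the health of middleware on other nodes is a heartbeat with a timeout. The sketch below is illustrative only: the class name HeartbeatMonitor, its methods and the timeout value are assumptions, and the clock is injectable so the logic can be exercised without waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and reports nodes that have gone silent."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s   # silence longer than this marks a node suspect
        self.clock = clock           # injectable time source (assumption, eases testing)
        self.last_seen = {}          # node name -> time of last heartbeat

    def beat(self, node):
        """Record a heartbeat received from the given node."""
        self.last_seen[node] = self.clock()

    def suspects(self):
        """Return the sorted names of nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)
```

Each node would run such a monitor for its peers and also send its own heartbeats, so that a failed management instance is noticed by the others.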
The entire availability management cycle must operate automatically in real time, without the need for
human intervention. Information about the system must be collected and assessed so the system
can be managed. System components must be represented and their status, topology and
dependencies must be modeled and monitored. System anomalies or faults must be quickly
detected and diagnosed. Fault data must be provided to an intelligent availability management
service so that it can quickly and appropriately respond by initiating actions that reconfigure the
status and functioning of the components as needed to maintain service. In other words, the system
must be self-managing and self-reliant.
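The cycle described above (collect, detect, respond) can be sketched as a single pass of a control loop. This is a simplified illustration under stated assumptions: run_cycle, the "ok" status label, the policy table and the default "isolate" action are all hypothetical names, not part of any specific product.

```python
def run_cycle(collect, policy, act):
    """Run one pass of the management cycle and return the actions taken.

    collect: callable returning {component: status}, where "ok" means no fault
    policy:  dict mapping a fault status to a recovery action name
    act:     callable(component, action) that carries out the action
    """
    status = collect()
    actions = []
    for component, state in status.items():
        if state == "ok":
            continue                            # no fault detected for this component
        action = policy.get(state, "isolate")   # assumed default for unknown faults
        act(component, action)
        actions.append((component, action))
    return actions
```

In a running system this pass would repeat continuously, for example on a timer or in response to events, with no operator in the loop.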
Thus, implementation of service availability requires management software that can do all or at
least some of the following:
• Collect system data in real time
• Configure and maintain a state-aware model of the total system
• Checkpoint data to redundant components
• Detect, diagnose and isolate faults
• Perform rapid, policy-based recovery
• Dynamically manage configuration and dependencies of the components
• Provide administrative access and control
These requirements are described in greater detail below.