Providing Open Architecture High Availability Solutions
4. Recovery – The system is adjusted or restarted so it functions properly
5. Repair – A faulty system component is replaced
Capabilities of Major Building Blocks (or Layers)
In order to create open architecture systems, interoperable building blocks must be available. These
blocks can then be combined as needed to create a system. Section 7.0 provides an overview of
how a system is divided into building blocks. The blocks are then discussed in terms of hardware
blocks, operating system blocks, management middleware blocks and how applications use the
blocks.
The capabilities of hardware building blocks are outlined in Section 8.0. Hardware capabilities of
fault-managed systems can be generally categorized into three sets: redundancy to allow continued
processing after failure, highly reliable (often redundant) communication among components, and
management of the components, including fault detection, diagnosis, isolation, recovery, and
repair.
Operating system capabilities and qualities needed for HA environments are discussed in
Section 9.0. The OS capabilities include functions to isolate, prevent the propagation of, or mask
the impact of, hardware and software faults. These functions help prevent errant applications or
faulted hardware from bringing the entire system down. Another area of capability includes
dynamic reconfiguration and enhanced device drivers to allow graceful replacement of failed
hardware. Additionally, the OS provides services for autonomous fault-management, reporting
faults externally, and interfacing with HA capabilities in middleware and application layers.
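The isolation idea above can be illustrated with a minimal sketch that relies on ordinary OS process boundaries. It assumes a hypothetical helper, `run_isolated`, that runs a snippet in a separate process so that a fault in the child cannot take down the caller. This is only an illustration of the principle, not a mechanism described in this paper.

```python
import subprocess
import sys


def run_isolated(code, timeout_s=5.0):
    """Run a Python snippet in a separate OS process and report how it ended.

    Hypothetical helper for illustration: a crash or non-zero exit in the
    child is contained by the process boundary, so the caller only sees
    a classification of the outcome.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,   # keep the child's output out of the caller's streams
            timeout=timeout_s,     # a hung child is killed rather than waited on forever
        )
    except subprocess.TimeoutExpired:
        return "hung"
    return "ok" if result.returncode == 0 else "faulted"
```

A supervisor could call `run_isolated("raise RuntimeError('boom')")`, receive `"faulted"`, and keep running, then report the fault externally or hand it to a middleware layer.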
Management Middleware, covered in Section 10.0, is the software component that oversees all of
the configuration and fault management services in the system. The availability management
component operates automatically in “real time” and without human intervention. A state-aware
system model represents and monitors the status, topology, and dependencies of
components. System information is collected and assessed, and system anomalies or faults are
detected and diagnosed. Faults are acted upon by dynamically reconfiguring the status,
configuration, and dependencies of the components to rapidly recover and maintain service. A final
capability of management middleware is that it checkpoints (periodically transfers) data between a
component and its redundant units in order to maintain operation during system reconfiguration.
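The middleware capabilities described above (a state-aware model, checkpointing to a redundant unit, and reconfiguration on a fault) can be sketched roughly as follows. All names here (`Status`, `Component`, `SystemModel`, `checkpoint`, `report_fault`) are hypothetical and chosen for illustration; the sketch assumes a simple one-active, one-standby pairing and is not an interface defined by this paper.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    FAILED = "failed"


@dataclass
class Component:
    name: str
    status: Status
    depends_on: list = field(default_factory=list)  # names of components this one needs
    state: dict = field(default_factory=dict)       # data protected by checkpointing


class SystemModel:
    """Hypothetical state-aware model: status, dependencies, redundancy pairs."""

    def __init__(self):
        self.components = {}   # name -> Component
        self.standby_for = {}  # active component name -> standby component name

    def add(self, component, standby=None):
        self.components[component.name] = component
        if standby is not None:
            self.components[standby.name] = standby
            self.standby_for[component.name] = standby.name

    def checkpoint(self, name):
        """Copy the active component's data to its redundant unit."""
        standby = self.components[self.standby_for[name]]
        standby.state = dict(self.components[name].state)

    def report_fault(self, name):
        """Mark a component failed and promote its standby, if one exists.

        Returns the name of the promoted unit, or None if there is no standby.
        """
        self.components[name].status = Status.FAILED
        standby_name = self.standby_for.pop(name, None)
        if standby_name is None:
            return None
        self.components[standby_name].status = Status.ACTIVE
        # Reconfigure dependencies so dependents use the newly active unit.
        for other in self.components.values():
            other.depends_on = [
                standby_name if dep == name else dep for dep in other.depends_on
            ]
        return standby_name
```

In this sketch, a fault reported after a checkpoint leaves the standby active with the last checkpointed data, and any component that depended on the failed unit now depends on its replacement. Data written after the last checkpoint is lost, which is why the checkpoint interval matters in practice.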
Applications used in HA systems can either depend on the HA system to simply restart them if a
failure occurs, or they can be “HA Aware,” as discussed in Section 10.0. There are many ways for
HA Aware applications to participate in, control, or operate within a highly available system. The
system should provide a management interface through which an application can monitor
operations and send status, heartbeat and checkpoint information. An application may also need to
receive this type of information from other applications. Finally, an application may need to initiate
a fail-over or other recovery action and be able to be unloaded, loaded and restarted while the
system is operational.
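As a rough illustration of the management interface described above, the following sketch shows how an application might send heartbeat and checkpoint information, and how the system might detect an application that has gone silent. The class `HeartbeatMonitor` and its methods are hypothetical names used only for this example; the timeout policy is an assumption, not something specified in this paper.

```python
import time


class HeartbeatMonitor:
    """Hypothetical management interface for HA Aware applications."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock       # injectable so the logic can be exercised without sleeping
        self.last_seen = {}      # application name -> time of its last heartbeat
        self.checkpoints = {}    # application name -> most recent checkpoint data

    def heartbeat(self, app):
        """Record that an application is alive right now."""
        self.last_seen[app] = self.clock()

    def checkpoint(self, app, data):
        """Store a copy of the application's recoverable state."""
        self.checkpoints[app] = dict(data)

    def last_checkpoint(self, app):
        """Return the state a restarted application could resume from, if any."""
        return self.checkpoints.get(app)

    def overdue(self):
        """List applications whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            app for app, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

An application would call `heartbeat` periodically and `checkpoint` whenever its recoverable state changes. The system would poll `overdue` and, for each silent application, restart it or initiate a fail-over using the data from `last_checkpoint`.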