Providing Open Architecture High Availability Solutions

4. Recovery – The system is adjusted or restarted so it functions properly

5. Repair – A faulty system component is replaced

Capabilities of Major Building Blocks (or Layers)

In order to create open architecture systems, interoperable building blocks must be available. These
blocks can then be combined as needed to create a system. Section 7.0 provides an overview of
how a system is divided into building blocks. The blocks are then discussed in terms of hardware
blocks, operating system blocks, management middleware blocks and how applications use the
blocks.

The capabilities of hardware building blocks are outlined in Section 8.0. Hardware capabilities of
fault-managed systems can be generally categorized into three sets: redundancy to allow continued
processing after failure, highly reliable (often redundant) communication among components, and
management of the components, including fault detection, diagnosis, isolation, recovery, and
repair.

Operating system capabilities and qualities needed for HA environments are discussed in
Section 9.0. The OS capabilities include functions to isolate, prevent the propagation of, or mask
the impact of, hardware and software faults. These functions help prevent errant applications or
faulted hardware from bringing the entire system down. Another area of capability includes
dynamic reconfiguration and enhanced device drivers to allow graceful replacement of failed
hardware. Additionally, the OS provides services for autonomous fault management, reporting
faults externally, and interfacing with HA capabilities in middleware and application layers.

Management middleware, covered in Section 10.0, is the software component that oversees all of
the configuration and fault management services in the system. The availability management
component operates automatically in “real time” and without human intervention. A state-aware
system model represents and monitors the status, topology and dependencies of
components. System information is collected and assessed, and system anomalies or faults are
detected and diagnosed. Faults are acted upon by dynamically reconfiguring the status,
configuration, and dependencies of the components to rapidly recover and maintain service. A final
capability of management middleware is that it checkpoints (periodically transfers) data between a
component and its redundant units in order to maintain operation during system reconfiguration.
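The state-aware model, reconfiguration and checkpointing described above can be illustrated with a minimal sketch. It is not taken from any particular middleware product: the names `State`, `Component`, `SystemModel`, `save_checkpoint`, `affected_by` and `report_fault` are all hypothetical, and the model assumes a simple one-active, one-standby pairing held in memory.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    FAILED = "failed"


@dataclass
class Component:
    name: str
    state: State
    depends_on: list = field(default_factory=list)  # names this component needs
    checkpoint: dict = field(default_factory=dict)  # last data copied from the active unit


class SystemModel:
    """Hypothetical in-memory model of component status and dependencies."""

    def __init__(self):
        self.components = {}
        self.standby_for = {}  # active component name -> standby component name

    def add(self, component, standby=None):
        self.components[component.name] = component
        if standby is not None:
            self.components[standby.name] = standby
            self.standby_for[component.name] = standby.name

    def save_checkpoint(self, name, data):
        """Copy data from an active component to its standby, if it has one."""
        standby_name = self.standby_for.get(name)
        if standby_name is not None:
            self.components[standby_name].checkpoint = dict(data)

    def affected_by(self, name):
        """Return names of components that depend, directly or indirectly, on `name`."""
        affected, frontier = set(), [name]
        while frontier:
            current = frontier.pop()
            for comp in self.components.values():
                if current in comp.depends_on and comp.name not in affected:
                    affected.add(comp.name)
                    frontier.append(comp.name)
        return affected

    def report_fault(self, name):
        """Mark a component failed and promote its standby, if one exists."""
        self.components[name].state = State.FAILED
        standby_name = self.standby_for.pop(name, None)
        if standby_name is None:
            return None
        standby = self.components[standby_name]
        standby.state = State.ACTIVE
        # Re-point dependents at the promoted unit so service continues.
        for comp in self.components.values():
            comp.depends_on = [standby_name if d == name else d for d in comp.depends_on]
        return standby


# Example: a database pair with one dependent application.
model = SystemModel()
model.add(Component("db-a", State.ACTIVE), standby=Component("db-b", State.STANDBY))
model.add(Component("app", State.ACTIVE, depends_on=["db-a"]))
model.save_checkpoint("db-a", {"last_txn": 41})
promoted = model.report_fault("db-a")  # db-b becomes active with the checkpointed data
```

Keeping dependencies in the model lets the middleware work out which components a fault affects before it reconfigures them. A real implementation would also have to handle concurrent faults, persistence of the model itself, and checkpoint consistency, none of which this sketch attempts.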

Applications used in HA systems can either depend on the HA system to simply restart them if a
failure occurs, or they can be “HA-aware,” as discussed in Section 10.0. There are many ways for
HA-aware applications to participate in, control, or operate within a highly available system. The
system should provide a management interface through which an application can monitor
operations and send status, heartbeat and checkpoint information. An application may also need to
receive this type of information from other applications. Finally, an application may need to initiate
a fail-over or other recovery action, and it should be possible to unload, load and restart the
application while the system is operational.
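As an illustration of the management interface described above, the following sketch shows how an HA-aware application might send heartbeat and checkpoint information and request a fail-over. The class `ManagementInterface` and all of its methods are hypothetical and do not correspond to a specific product or standard API; the three-second heartbeat timeout is likewise an arbitrary assumption.

```python
import time


class ManagementInterface:
    """Hypothetical interface that an HA-aware application reports to."""

    def __init__(self, heartbeat_timeout=3.0, clock=time.monotonic):
        self.heartbeat_timeout = heartbeat_timeout
        self.clock = clock  # injectable so the timeout can be tested without sleeping
        self.last_heartbeat = {}
        self.checkpoints = {}
        self.failover_requests = []

    def heartbeat(self, app):
        """Record that `app` is alive at the current time."""
        self.last_heartbeat[app] = self.clock()

    def checkpoint(self, app, data):
        """Store a copy of the application's recoverable state."""
        self.checkpoints[app] = dict(data)

    def restore(self, app):
        """Return the last checkpoint for `app`, or an empty dict if none exists."""
        return dict(self.checkpoints.get(app, {}))

    def request_failover(self, app, reason):
        """Let an application ask for recovery rather than wait to be detected."""
        self.failover_requests.append((app, reason))

    def missed_heartbeats(self):
        """Return the apps whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(app for app, seen in self.last_heartbeat.items()
                      if now - seen > self.heartbeat_timeout)
```

An application would call `heartbeat` periodically and `checkpoint` whenever its recoverable state changes; after a restart or fail-over, the replacement instance would call `restore` to resume from the last saved state. Passing the clock in as a parameter is a design choice made here only so the timeout logic can be exercised deterministically.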