Providing Open Architecture High Availability Solutions

10.5 Performing Rapid Recovery

In general, completion of the fault management cycle includes recovery from the fault, as well as

reporting to administration and repair and reintegration as needed. In complex systems with both

parent/child dependencies and multi-layered service dependencies, managing fault

recovery actions means accounting for hierarchies of cluster-wide dependency and

availability issues, so it must be handled by the higher-level availability management functions. In

order to build a high availability system, redundancy is often configured across multiple nodes.

Fault management on a single node can handle local recovery, but unified, cluster-wide management

is required for fault management that crosses node boundaries. Notification and repair are part of

interfacing outside the self-management structure.

Recovery includes any actions taken to restore the system to service from fault or failure. These

actions can cover a wide range of activities from restart of a failed application to failover to a

standby hardware card. The recovery process is often multi-step in that several actions must be

taken in a prescribed order. In some cases, the recovery process is multi-tiered in that, if a particular

action does not recover the system, some alternative action must be taken. The management

middleware should contain the knowledge of the appropriate recovery action for the failure of each

managed component in the system, whether it involves restoring that component to full operation

or switching over to a redundant component. While not technically part of the recovery process,

similar actions may be taken proactively to detect and resolve latent faults.
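
The multi-step, multi-tiered recovery described above can be sketched as an escalation ladder: each tier is an ordered list of steps, and the next tier is tried only if the current one fails. This is a minimal illustration under stated assumptions, not the behavior of any particular management middleware; the names RecoveryTier, recover and act, and the example actions, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A recovery step is any callable that returns True on success.
Step = Callable[[], bool]


@dataclass
class RecoveryTier:
    """One tier of recovery: ordered steps that must all succeed."""
    name: str
    steps: List[Step]


def recover(tiers: List[RecoveryTier]) -> Optional[str]:
    """Try each tier in order. Return the name of the first tier whose
    steps all succeed, or None if every tier fails."""
    for tier in tiers:
        # all() short-circuits, so steps run in the prescribed order
        # and the tier is abandoned at the first failing step.
        if all(step() for step in tier.steps):
            return tier.name
    return None


# Hypothetical actions, recorded in a log so the order is visible.
log: List[str] = []


def act(label: str, ok: bool) -> Step:
    def run() -> bool:
        log.append(label)
        return ok
    return run


tiers = [
    # Tier 1: restart the failed application (assumed to fail here).
    RecoveryTier("restart", [act("stop app", True), act("start app", False)]),
    # Tier 2: fail over to the standby hardware card.
    RecoveryTier("failover", [act("deactivate failed card", True),
                              act("activate standby card", True)]),
]
result = recover(tiers)
```

In this sketch the restart tier fails at its second step, so recovery escalates to the failover tier. Returning None when every tier fails leaves room for the caller to report the fault for manual repair.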

Reporting is the notification and logging to systems or people of the diagnosis made as well as any

actions that were taken automatically. For example, if an application crashes, it might be

restarted via pre-set protocols and then reported to a system administrator via e-mail

or page. Notification to the outside world that an event has taken place is the first step in the

repair process. Management middleware should provide a variety of notification methods, such as

e-mail, SNMP traps, event logs or other messaging paths.
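
One way to offer several notification paths is to register each path as an independent channel and send every report on all of them. The sketch below assumes this design; FaultReport, Notifier and the three channels are hypothetical names, and the e-mail and SNMP channels are stand-ins that do not contact any real mail server or trap receiver.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FaultReport:
    """The diagnosis made and the action taken automatically."""
    component: str
    diagnosis: str
    action_taken: str

    def text(self) -> str:
        return (f"{self.component}: {self.diagnosis}; "
                f"action taken: {self.action_taken}")


Channel = Callable[[FaultReport], None]


class Notifier:
    def __init__(self) -> None:
        self._channels: Dict[str, Channel] = {}

    def register(self, name: str, channel: Channel) -> None:
        self._channels[name] = channel

    def notify(self, report: FaultReport) -> List[str]:
        """Send the report on every channel and return the names of the
        channels that failed."""
        failed: List[str] = []
        for name, channel in self._channels.items():
            try:
                channel(report)
            except Exception:
                # One broken path must not block the others.
                failed.append(name)
        return failed


# Stand-in channels (assumptions for illustration only).
outbox: List[str] = []


def email_channel(report: FaultReport) -> None:
    outbox.append(report.text())  # a real channel would send mail here


def event_log_channel(report: FaultReport) -> None:
    logging.getLogger("ha.events").warning(report.text())


def snmp_trap_channel(report: FaultReport) -> None:
    raise ConnectionError("trap receiver unreachable")  # simulated outage


notifier = Notifier()
notifier.register("email", email_channel)
notifier.register("event_log", event_log_channel)
notifier.register("snmp_trap", snmp_trap_channel)
failed = notifier.notify(FaultReport("app-1", "process crashed", "restarted"))
```

Isolating each channel behind a try block means that an unreachable trap receiver does not prevent the e-mail or the event log entry from being produced.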

Repair involves the repair or replacement of hardware and software components as necessary.

Reintegration returns the working component into the system. Assuming a system is designed with

redundancy, the functions of a failed or degraded component are switched over to the standby

component as a recovery action. Once the failed component is inactive, repair can begin. Software

repair could involve patches or installation of an upgraded version of the software; either can be

accomplished automatically via remote upgrading. Hardware components can be

decommissioned automatically, but are usually repaired or replaced manually. Reintegration

support by availability management, such as automatic detection and role assignment, avoids

downtime or loss of service.
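
The switchover and reintegration sequence above can be sketched as a small redundancy group that assigns roles automatically. This is an illustrative sketch only; Role, RedundancyGroup and the card names are hypothetical, and a real availability manager would also handle state synchronization, which is omitted here.

```python
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    OUT_OF_SERVICE = "out_of_service"


class RedundancyGroup:
    """Tracks roles for components that provide the same service."""

    def __init__(self) -> None:
        self.roles: Dict[str, Role] = {}

    def active(self) -> Optional[str]:
        for name, role in self.roles.items():
            if role is Role.ACTIVE:
                return name
        return None

    def join(self, name: str) -> Role:
        """Reintegration: a new or repaired component is detected and
        given a role without interrupting the current active one."""
        role = Role.STANDBY if self.active() is not None else Role.ACTIVE
        self.roles[name] = role
        return role

    def fail(self, name: str) -> Optional[str]:
        """Recovery: take the component out of service and, if it was
        active, promote the first standby. Returns the new active
        component, or None if no component is available."""
        was_active = self.roles.get(name) is Role.ACTIVE
        self.roles[name] = Role.OUT_OF_SERVICE
        if was_active:
            for other, role in self.roles.items():
                if role is Role.STANDBY:
                    self.roles[other] = Role.ACTIVE
                    break
        return self.active()


group = RedundancyGroup()
first = group.join("card-A")         # no active yet, so card-A becomes active
second = group.join("card-B")        # card-B becomes standby
new_active = group.fail("card-A")    # switchover to card-B
rejoined = group.join("card-A")      # repaired card returns as standby
```

Because the repaired card rejoins as standby rather than reclaiming the active role, service stays on the component that is already running.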

Quick and accurate detection, diagnosis and isolation of faults (and symptoms that could lead to

faults) enable fault management to prescribe and initiate appropriate recovery actions. Recovery

actions must take into account the dependency consequences of any reconfiguration.
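
To illustrate what taking dependency consequences into account can mean in practice, the following sketch computes every component that would be affected, directly or indirectly, if a given component were taken out of service. The function name affected_by, the dictionary format and the component names are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List, Set


def affected_by(component: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every component that directly or indirectly depends on
    `component`. `depends_on` maps a component to its providers."""
    # Invert the edges: provider -> components that depend on it.
    dependents: Dict[str, List[str]] = {}
    for user, providers in depends_on.items():
        for provider in providers:
            dependents.setdefault(provider, []).append(user)

    # Breadth-first walk over the inverted graph.
    affected: Set[str] = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for user in dependents.get(current, []):
            if user not in affected:
                affected.add(user)
                queue.append(user)
    return affected


# Hypothetical dependency configuration.
depends_on = {
    "call-control": ["database", "signaling"],
    "database": ["disk-card"],
    "signaling": ["net-card"],
    "billing": ["database"],
}
impact = affected_by("disk-card", depends_on)
```

Here a fault in the disk card reaches the database and, through it, call control and billing, so a recovery action on the card should be planned with all three in mind.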

10.6 Dynamically Managing Configuration and Dependencies of All Components

The management middleware should provide cluster-aware availability management that is

responsible for initiating the actions and orchestrating the role assignments that maintain service

availability. This function should not only register and track component membership in the system,

but it should also understand and manage the overall system dependency and redundancy

configurations. It thus relies on a dynamically populated system model.
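
A dynamically populated system model can be pictured as a registry that is updated by membership events and checked against the configured redundancy. The sketch below is one possible shape for such a model, not a description of a specific product; SystemModel, its methods and the policy format (a minimum member count per service) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class SystemModel:
    """Cluster model populated at run time from membership events."""

    def __init__(self, required: Dict[str, int]) -> None:
        # required[service] = minimum number of registered members
        # (an assumed policy format for this example).
        self.required = required
        self.members: Dict[str, Set[str]] = defaultdict(set)

    def register(self, service: str, component: str) -> None:
        self.members[service].add(component)

    def deregister(self, service: str, component: str) -> None:
        self.members[service].discard(component)

    def under_provisioned(self) -> List[str]:
        """Services whose membership is below the configured redundancy."""
        return sorted(
            service
            for service, minimum in self.required.items()
            if len(self.members[service]) < minimum
        )


model = SystemModel({"database": 2, "signaling": 2})
model.register("database", "node1/db")
model.register("database", "node2/db")
model.register("signaling", "node1/sig")
before = model.under_provisioned()

model.deregister("database", "node2/db")  # node2 leaves the cluster
after = model.under_provisioned()
```

Availability management could consult a check like under_provisioned after each membership change to decide whether a role reassignment or an administrator notification is needed.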