Providing Open Architecture High Availability Solutions
10.5 Performing Rapid Recovery
In general, completing the fault management cycle includes recovery from the fault, reporting to
administration, and repair and reintegration as needed. In complex systems with both
parent/child dependencies and multi-layered service dependencies, managing fault recovery
actions requires weighing several hierarchies of cluster-wide dependency and availability
concerns, so it needs to be handled by the higher-level availability management functions. To
build a high availability system, redundancy is often configured across multiple nodes.
Fault management on a single node can handle local recovery, but unified cluster-wide management
is required for fault management that crosses node boundaries. Notification and repair involve
interfacing with people and systems outside the self-management structure.
Recovery includes any actions taken to restore the system to service from fault or failure. These
actions can cover a wide range of activities from restart of a failed application to failover to a
standby hardware card. The recovery process is often multi-step in that several actions must be
taken in a prescribed order. In some cases, the recovery process is multi-tiered in that, if a particular
action does not recover the system, some alternative action must be taken. The management
middleware should contain the knowledge of the appropriate recovery action for the failure of each
managed component in the system, whether it involves restoring that component to full operation
or switching over to a redundant component. Although not technically part of the recovery process,
similar actions may be taken proactively to detect and resolve latent faults.
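The multi-step, multi-tiered recovery described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the names RecoveryTier and recover, the tier names, and the stand-in steps are all assumptions introduced here. Each tier is an ordered list of steps, and the next tier is tried only if the current one fails.

```python
from typing import Callable, List, Optional, Sequence

# A step is any callable that returns True on success (illustrative convention).
Step = Callable[[], bool]


class RecoveryTier:
    """One recovery action made up of steps that must run in a prescribed order."""

    def __init__(self, name: str, steps: Sequence[Step]) -> None:
        self.name = name
        self.steps = list(steps)

    def run(self) -> bool:
        # all() short-circuits, so the tier stops at the first failed step.
        return all(step() for step in self.steps)


def recover(tiers: Sequence[RecoveryTier], log: List[str]) -> Optional[str]:
    """Try each tier in order; return the name of the first one that succeeds."""
    for tier in tiers:
        log.append(f"trying {tier.name}")
        if tier.run():
            log.append(f"recovered by {tier.name}")
            return tier.name
    log.append("all tiers exhausted")
    return None


def demo_recovery() -> List[str]:
    # Hypothetical policy for one application: restart first, then fail over.
    # The lambdas stand in for real actions; here the restart tier's second
    # step (a health check) fails, so recovery escalates to failover.
    restart = RecoveryTier("restart", [lambda: True, lambda: False])
    failover = RecoveryTier("failover", [lambda: True, lambda: True])
    log: List[str] = []
    recover([restart, failover], log)
    return log
```

Keeping the policy as data (a list of tiers per managed component) rather than as branching code is one way for the middleware to hold the knowledge of the appropriate recovery action for each component.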
Reporting is the notification of systems or people, together with logging, of the diagnosis made
and of any actions that were taken automatically. For example, if an application crashes, it might
be recovered by restarting it via pre-set protocols and then reported to a system administrator via
e-mail or page. Notification to the outside world that an event has taken place is the first step in
the repair process. Management middleware should provide a variety of notification methods, such as
e-mail, SNMP traps, event logs or other messaging paths.
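As a rough sketch of reporting over several notification paths, the example below sends one fault report to every registered channel. The names FaultReport, Reporter, and the channel names are hypothetical, and the channels are plain callables standing in for real e-mail, SNMP trap, or pager transports, which are outside the scope of this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A channel is any callable that accepts the message text (illustrative convention).
Channel = Callable[[str], None]


@dataclass
class FaultReport:
    component: str
    diagnosis: str
    actions_taken: List[str] = field(default_factory=list)

    def summary(self) -> str:
        actions = ", ".join(self.actions_taken) or "none"
        return f"{self.component}: {self.diagnosis} (automatic actions: {actions})"


class Reporter:
    def __init__(self) -> None:
        self._channels: Dict[str, Channel] = {}

    def add_channel(self, name: str, send: Channel) -> None:
        self._channels[name] = send

    def report(self, fault: FaultReport) -> List[str]:
        """Send the report on every channel; return the names of channels that failed."""
        failed: List[str] = []
        message = fault.summary()
        for name, send in self._channels.items():
            try:
                send(message)
            except Exception:
                # One unreachable channel must not block the remaining ones.
                failed.append(name)
        return failed


def demo_reporting() -> Tuple[List[str], List[str]]:
    event_log: List[str] = []

    def broken_pager(message: str) -> None:
        raise ConnectionError("pager gateway unreachable")  # simulated outage

    reporter = Reporter()
    reporter.add_channel("event_log", event_log.append)
    reporter.add_channel("pager", broken_pager)
    fault = FaultReport("billing-app", "process crashed", ["restart"])
    return event_log, reporter.report(fault)
```

The report carries both the diagnosis and the automatic actions already taken, so an administrator receiving it knows what remains to be repaired.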
Repair involves the repair or replacement of hardware and software components as necessary.
Reintegration returns the working component into the system. Assuming a system is designed with
redundancy, the functions of a failed or degraded component are switched over to the standby
component as a recovery action. Once the failed component is inactive, repair can begin. Software
repair could involve patches or installation of an upgraded version of the software; either can be
accomplished automatically via remote upgrading. Hardware components can be
decommissioned automatically, but are usually repaired or replaced manually. Reintegration
support from availability management, such as automatic detection and role assignment, avoids
downtime or loss of service.
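The switchover and reintegration sequence above can be illustrated with a small role-assignment sketch. It assumes a simple redundancy group with one active member and the rest on standby; the class RedundancyGroup, the Role values, and the component names are assumptions made for illustration and do not describe any particular middleware.

```python
from enum import Enum
from typing import Dict, Optional, Sequence


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    OUT_OF_SERVICE = "out_of_service"


class RedundancyGroup:
    """Hypothetical group: the first member starts active, the others standby."""

    def __init__(self, members: Sequence[str]) -> None:
        self.roles: Dict[str, Role] = {
            name: (Role.ACTIVE if index == 0 else Role.STANDBY)
            for index, name in enumerate(members)
        }

    def active(self) -> Optional[str]:
        for name, role in self.roles.items():
            if role is Role.ACTIVE:
                return name
        return None

    def fail(self, name: str) -> None:
        """Take a member out of service; promote a standby if it was active."""
        was_active = self.roles[name] is Role.ACTIVE
        self.roles[name] = Role.OUT_OF_SERVICE
        if was_active:
            for other, role in self.roles.items():
                if role is Role.STANDBY:
                    self.roles[other] = Role.ACTIVE
                    break

    def reintegrate(self, name: str) -> None:
        """Return a repaired member as standby, or as active if none is active."""
        if self.active() is None:
            self.roles[name] = Role.ACTIVE
        else:
            self.roles[name] = Role.STANDBY
```

For example, after `group = RedundancyGroup(["card-a", "card-b"])` and `group.fail("card-a")`, the active member is "card-b". A later `group.reintegrate("card-a")` brings the repaired card back as standby, so service is not interrupted by the reintegration itself.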
Quick and accurate detection, diagnosis and isolation of faults (and of symptoms that could lead to
faults) enable fault management to prescribe and initiate appropriate recovery actions. Recovery
actions must take into account the dependency consequences of any reconfiguration.
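One way to reason about the dependency consequences of a reconfiguration is to walk the dependency graph from the component being taken out of service. The sketch below is a simplified illustration under stated assumptions: the function name affected_by, the dictionary layout, and the component names are hypothetical, and redundancy is ignored, so the result is the worst-case set of components that could be affected.

```python
from collections import deque
from typing import Dict, List, Set


def affected_by(component: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every component that depends, directly or transitively, on `component`.

    `depends_on` maps each component to the components it requires.
    """
    # Invert the edges: for each component, who depends on it?
    dependents: Dict[str, List[str]] = {}
    for comp, requirements in depends_on.items():
        for required in requirements:
            dependents.setdefault(required, []).append(comp)

    # Breadth-first walk over the inverted graph.
    affected: Set[str] = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)

    # In a cyclic graph the starting component could reach itself; exclude it.
    affected.discard(component)
    return affected


# Hypothetical dependency data for illustration.
EXAMPLE_DEPENDENCIES: Dict[str, List[str]] = {
    "call-control": ["database", "line-card-1"],
    "database": ["disk-array"],
    "billing": ["database"],
}
```

With the example data, taking "disk-array" out of service affects "database" and, through it, "call-control" and "billing", while taking "line-card-1" out of service affects only "call-control".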
10.6 Dynamically Managing Configuration and Dependencies of All Components
The management middleware should provide cluster-aware availability management that is
responsible for initiating the actions and orchestrating the role assignments that maintain service
availability. This function should not only register and track component membership in the system,
but it should also understand and manage the overall system dependency and redundancy
configurations. It thus relies on a dynamically populated system model.
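A dynamically populated system model of this kind might be sketched as below. This is an assumption-laden illustration rather than a description of any specific product: the class SystemModel, its method names, and the component and group names are invented here. Components register and leave at run time, services declare which redundancy groups they need, and availability is derived from the current membership.

```python
from typing import Dict, Set


class SystemModel:
    """Hypothetical run-time model of membership, redundancy and dependencies."""

    def __init__(self) -> None:
        self.members: Dict[str, str] = {}        # component -> redundancy group
        self.requires: Dict[str, Set[str]] = {}  # service -> groups it depends on

    def join(self, component: str, group: str) -> None:
        """Register a component as a member of a redundancy group."""
        self.members[component] = group

    def leave(self, component: str) -> None:
        """Remove a component, for example after a failure or decommissioning."""
        self.members.pop(component, None)

    def require(self, service: str, group: str) -> None:
        """Record that a service depends on at least one member of a group."""
        self.requires.setdefault(service, set()).add(group)

    def group_members(self, group: str) -> Set[str]:
        return {comp for comp, grp in self.members.items() if grp == group}

    def is_available(self, service: str) -> bool:
        # A service is available while every group it needs has a member present.
        return all(self.group_members(g) for g in self.requires.get(service, set()))


def demo_model() -> SystemModel:
    model = SystemModel()
    model.require("routing", "cpu")
    model.require("routing", "switch-fabric")
    model.join("cpu-1", "cpu")
    model.join("cpu-2", "cpu")
    model.join("fabric-1", "switch-fabric")
    return model
```

In this example, losing "cpu-1" leaves "routing" available because "cpu-2" remains in the same group, whereas losing "fabric-1" makes it unavailable because the switch-fabric group has no redundant member.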