Providing Open Architecture High Availability Solutions
6.5 Repair
6.5.1 Introduction
Repair in a live system requires some form of hot replacement. Again, the system must be designed
to support this activity. To repair a failed component, a replacement is hot-inserted, powered on,
connected to the bus, validated through off-line diagnostics, and configured.
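The sequence above can be sketched as a small ordered state machine. This is only an illustration
of the ordering described in the text, not an interface defined by this document; the names
RepairStep and Replacement are hypothetical.

```python
from enum import Enum, auto


class RepairStep(Enum):
    """Steps a replacement goes through, in the order given in the text."""
    INSERTED = auto()
    POWERED_ON = auto()
    CONNECTED = auto()      # connected to the bus
    VALIDATED = auto()      # passed off-line diagnostics
    CONFIGURED = auto()


# Enum members iterate in definition order, which is the required order here.
REPAIR_SEQUENCE = list(RepairStep)


class Replacement:
    """Tracks how far a hot-inserted replacement has progressed (hypothetical)."""

    def __init__(self, slot):
        self.slot = slot
        self.completed = []

    @property
    def ready(self):
        """True once every step in the sequence has been completed."""
        return len(self.completed) == len(REPAIR_SEQUENCE)

    def advance(self, step):
        """Record a step, rejecting any step that is out of order."""
        if self.ready:
            raise ValueError("repair sequence already complete")
        expected = REPAIR_SEQUENCE[len(self.completed)]
        if step is not expected:
            raise ValueError(f"expected {expected.name}, got {step.name}")
        self.completed.append(step)
```

Rejecting out-of-order steps means a replacement cannot be configured before it has been
validated, which is the point of running diagnostics first.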
6.5.2 Objective
The objective of this process is to return the system to its original capabilities, including its
original levels of redundancy.
6.5.3 Concepts
Physical Replacement. Repair is the replacement of the defective component. This phase is
generally the operator-assisted (human) portion of the process, and it is usually the most
time-consuming portion.
Download. Once the failed component has been replaced and validated as operational, the
replacement must be configured. This typically consists of loading software into the replaced
module and activating a driver.
Dynamic Reconfiguration. This is a way to reconfigure the system by adding new components or
removing old ones without impacting services running on the system that do not use those
components. This is addressed in Section 5.0.
Off-line Diagnostics. In order to repair a component it may be necessary to run additional
diagnostics which could not be run while the component was active. Additionally, these diagnostics
are run before putting a new or repaired component back into service to help ensure that the
component is functioning properly.
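As an illustration of gating a component on its off-line diagnostics, the following is a minimal
sketch. It assumes each diagnostic can be modeled as a callable returning True on success; the
function names run_offline_diagnostics and may_enter_service are hypothetical and are not part
of any framework described here.

```python
def run_offline_diagnostics(tests):
    """Run every diagnostic and return the names of those that failed.

    `tests` maps a diagnostic name to a zero-argument callable that
    returns True on success (an assumption made for this sketch).
    """
    failures = []
    for name, test in tests.items():
        try:
            passed = bool(test())
        except Exception:
            # A diagnostic that crashes is treated as a failed diagnostic.
            passed = False
        if not passed:
            failures.append(name)
    return failures


def may_enter_service(tests):
    """A component may enter service only if no diagnostic failed."""
    return not run_offline_diagnostics(tests)
```

All diagnostics run even after the first failure, so the operator sees the complete list of
failed tests rather than only the first one.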
6.5.4 Approach
How a repair operation is performed depends on the nature of the failure being corrected. The
normal repair action is to replace a component in the system. If this is physical hardware, the
repair craftsperson should identify the defective component and replace it. Diagnostic tests
should be run to test the new component without affecting the running system. Then the system
should be configured. This could include activating the new component as the spare (redundant)
component, activating load sharing, updating routing information, or even making the new
component the active one by forcing a switchover. These are all case-by-case actions that an
individual implementation may choose, and they are supported by the management framework.
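The activation choices listed above can be sketched as operations on a redundancy group. This
is a simplified model under stated assumptions: the Role and RedundancyGroup names are
hypothetical, components are identified by plain strings, and routing updates are left out.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    LOAD_SHARING = "load_sharing"


@dataclass
class RedundancyGroup:
    """Maps component names to their current role (hypothetical model)."""
    roles: dict = field(default_factory=dict)

    def add_as_standby(self, name):
        """Bring the repaired component in as the spare."""
        self.roles[name] = Role.STANDBY

    def add_load_sharing(self, name):
        """Bring the repaired component in as a load-sharing peer."""
        self.roles[name] = Role.LOAD_SHARING

    def force_switchover(self, name):
        """Make `name` the active component, demoting any current active."""
        for other, role in list(self.roles.items()):
            if role is Role.ACTIVE:
                self.roles[other] = Role.STANDBY
        self.roles[name] = Role.ACTIVE
```

For example, a group whose active board is "board-a" could first accept a repaired "board-b"
through add_as_standby("board-b") and later promote it with force_switchover("board-b"), which
leaves "board-a" as the standby.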
In the case of a software component, the failure could be as easy to repair as downloading the
software again. More likely, however, the problem is not that a local copy of the component was
damaged or destroyed, but that a specific sequence of events of which the user was not aware
caused it to fail. This can be corrected through a system patch. Patches are typically small
changes in the code, preferably applied while the system is running or using the component. A
more encompassing action
would be to replace or upgrade the software. The upgrade process is described more specifically in
Section 5.6. A successful software upgrade requires the ability to precisely identify the version that
is running in the system. An upgrade requires understanding the software’s interactions with other