Providing Open Architecture High Availability Solutions

Common recovery techniques start from some level of redundancy. Typically, redundancy for a
recovery action is either in time or in space. A redundant component is one that can be connected
to the same inputs and can provide the same outputs as another component. If this component is
always connected to the inputs, the failover process is simply to throw the output switch from the
currently active device to the redundant backup device (switchover). The state data is maintained
by the input stream, so no interaction between the primary and the backup needs to take place.
This is redundancy in space (spatial redundancy), because the backup already exists alongside the
primary and essentially no time is needed to perform the switchover.
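
The output-switch arrangement described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Counter` component, the `OutputSwitch` class and their method names are hypothetical and do not come from any particular high availability framework.

```python
class Counter:
    """A component whose state is derived entirely from its input stream."""

    def __init__(self):
        self.total = 0

    def feed(self, value):
        self.total += value
        return self.total


class OutputSwitch:
    """Feeds every input to both components and forwards one output."""

    def __init__(self, primary, backup):
        self.components = [primary, backup]
        self.active = 0  # index of the component whose output is forwarded

    def feed(self, value):
        # Both components see the same input, so their states stay in step.
        outputs = [c.feed(value) for c in self.components]
        return outputs[self.active]

    def switchover(self):
        # Throwing the switch is the entire failover; no state is copied.
        self.active = 1 - self.active


switch = OutputSwitch(Counter(), Counter())
before = [switch.feed(v) for v in (1, 2, 3)]
switch.switchover()
after = switch.feed(4)
```

Because the backup has consumed the same inputs, the output after the switchover continues the sequence without any exchange between the two components.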

Another form of spatial redundancy has the active component periodically capture its state and
forward it to a standby that only validates and stores that state. If the active component stops
providing the service, the recovery action is to restart the operation on the standby system from
the last known good state. Both of these approaches can be referred to as active/standby.
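
A minimal sketch of the checkpointing variant follows. The `Active` and `Standby` classes, the checkpoint interval of three work units and the validation rule are all assumptions made for illustration; a real system would define its own state format and validity checks.

```python
import copy


class Active:
    def __init__(self):
        self.state = {"processed": 0}

    def work(self):
        self.state["processed"] += 1

    def checkpoint(self):
        # Hand out a copy so later work cannot alter the stored snapshot.
        return copy.deepcopy(self.state)


class Standby:
    def __init__(self):
        self.last_good = None

    def store(self, snapshot):
        # Validate before storing; this check is a hypothetical placeholder.
        count = snapshot.get("processed")
        if isinstance(count, int) and count >= 0:
            self.last_good = snapshot
            return True
        return False

    def take_over(self):
        # Recovery restarts the operation from the last known good state.
        recovered = Active()
        recovered.state = copy.deepcopy(self.last_good)
        return recovered


active, standby = Active(), Standby()
for step in range(1, 8):
    active.work()
    if step % 3 == 0:  # checkpoint every third unit of work
        standby.store(active.checkpoint())

# The active fails after step 7; the standby resumes from the step-6 snapshot.
recovered = standby.take_over()
```

The work done between the last checkpoint and the failure is lost, which is the cost of this approach compared with keeping the backup connected to the inputs at all times.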

To continue with the networking theme and carry forward from the isolation phase, re-routing and
the acknowledgement or negative acknowledgement of TCP packets are recovery actions as well. In
this case the protocol is built to be a reliable transport, so a negative acknowledgement, or the
lack of any acknowledgement, causes the protocol to start its recovery action. This is redundancy
in time, or temporal redundancy.
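
Temporal redundancy can be illustrated by a sender that repeats a transmission until it is acknowledged. The sketch below is not TCP and omits timers and sequence numbers; `send_reliably` and `LossyLink` are hypothetical names, and a return value of `True` simply stands for "acknowledgement received".

```python
def send_reliably(send, payload, max_attempts=5):
    """Resend payload until send() reports an acknowledgement.

    Returns the number of attempts used, or raises if none succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        if send(payload):  # True stands for "acknowledgement received"
            return attempt
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


class LossyLink:
    """Hypothetical link that drops the first `drops` transmissions."""

    def __init__(self, drops):
        self.drops = drops
        self.delivered = []

    def send(self, payload):
        if self.drops > 0:
            self.drops -= 1
            return False  # lost: the sender never sees an acknowledgement
        self.delivered.append(payload)
        return True


link = LossyLink(drops=2)
attempts = send_reliably(link.send, "segment-1")
```

No second component is involved here; the same operation is repeated later in time, which is what distinguishes this from the spatial approaches.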

The following sections address the reloading or restarting of an application, a protocol or even
the operating system. Each of these typically takes progressively more time to complete the
recovery.

6.4.5 Techniques

Switchover. This technique is used in a redundant operation and can refer to the switchover from
any component to another, whether the backup has been performing the same operation and only the
output is switched, or it recovers from a known state. The component could be hardware or
software, and the redundancy can be of a peripheral component or even of the core processing
element. This is typically identified as 2N redundancy.

Re-routing. This technique could also be referred to as load balancing or load sharing. It is used
with N + 1 or, more generally, N + M redundancy. This means that there are N components providing
service, and any one of those components could be replaced by one of M other components.
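
As a rough sketch of N + M redundancy, the pool below keeps N active components and M shared spares. The `Pool` class, the component names and the modulo-based routing rule are assumptions chosen for brevity, not a prescribed design.

```python
class Pool:
    """N active components backed by M shared spares."""

    def __init__(self, active, spares):
        self.active = list(active)
        self.spares = list(spares)

    def fail(self, name):
        """Replace a failed active component with a spare, if one is left."""
        self.active.remove(name)
        if self.spares:
            self.active.append(self.spares.pop(0))

    def route(self, request_id):
        # Spread requests over whatever components are active right now.
        if not self.active:
            raise RuntimeError("no component available")
        return self.active[request_id % len(self.active)]


pool = Pool(active=["a", "b", "c"], spares=["s1"])  # N = 3, M = 1
pool.fail("b")  # s1 takes over; capacity stays at N
pool.fail("a")  # no spare left; the load is shared by the survivors
routes = [pool.route(i) for i in range(3)]
```

With M = 1 the first failure is absorbed by the spare, while the second reduces capacity and the remaining components share the load.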

Software Rejuvenation. The simplest example of this technique is a screen refresh, which
rejuvenates the image. In the general case this would be the restart or reloading of an
application, dynamic library or protocol to cause a system to reset and initialize the resources
it uses to perform the service.

Reboot. This is the most radical of the rejuvenation steps. This restarts the operating system as
well as the protocols, libraries and applications.
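
Rejuvenation and reboot can be arranged as a graduated sequence that tries the least disruptive step first. The sketch below assumes hypothetical step names and a simulated fault; in practice each action would restart or reload a real application, library or operating system.

```python
def recover(steps, is_healthy):
    """Try recovery steps from least to most disruptive.

    `steps` is an ordered list of (name, action) pairs. Returns the names
    of the steps that were run; stops as soon as the service is healthy.
    """
    tried = []
    for name, action in steps:
        action()
        tried.append(name)
        if is_healthy():
            return tried
    raise RuntimeError("service still unhealthy after: " + ", ".join(tried))


# Hypothetical fault that only clears once the library has been reloaded.
fault = {"cleared": False}

steps = [
    ("restart application", lambda: None),
    ("reload library", lambda: fault.update(cleared=True)),
    ("reboot", lambda: fault.update(cleared=True)),
]

tried = recover(steps, lambda: fault["cleared"])
```

Here the application restart does not clear the fault, the library reload does, and the reboot is never reached.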

6.4.6 Dependencies

Recovery depends on the policies, redundancy elements and techniques, which are typically part of
a set of policy descriptions used when the system is configured. It is the last of the process
steps for service restoration.