Providing Open Architecture High Availability Solutions

Common recovery techniques start from some level of redundancy. Typically, redundancy for a
recovery action is either in time or in space. A redundant component is one that can be connected
to the same inputs and can provide the same outputs as another component. If this component is
always connected to the inputs, failover consists simply of throwing the output switch from the
currently active device to the redundant backup device (switchover). The state data is maintained
by the input stream, so no interaction between the primary and the backup needs to take place.
This is redundancy in space (spatial redundancy): a second component runs alongside the first, and
essentially no time is needed to perform the switchover.
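The switchover described above can be illustrated with a minimal sketch. The names Accumulator and
OutputSwitch are hypothetical and chosen only for this example; the sketch assumes that a
component's state is derived entirely from the input stream, as the paragraph above describes.

```python
class Accumulator:
    """Hypothetical stateful component whose state comes only from its inputs."""

    def __init__(self):
        self.total = 0

    def process(self, value):
        self.total += value
        return self.total


class OutputSwitch:
    """Feeds every input to both replicas and forwards only the selected output."""

    def __init__(self, primary, backup):
        self.replicas = [primary, backup]
        self.active = 0  # index of the replica whose output is forwarded

    def feed(self, value):
        # Both replicas see the same input stream, so their state stays identical.
        outputs = [replica.process(value) for replica in self.replicas]
        return outputs[self.active]

    def switchover(self):
        # No state transfer is needed: only the output selection changes.
        self.active = 1 - self.active


switch = OutputSwitch(Accumulator(), Accumulator())
switch.feed(5)
switch.feed(7)
switch.switchover()          # primary is assumed to have failed here
result = switch.feed(1)      # backup continues from the same running total
```

Because the backup has processed every input, the value forwarded after the switchover is the same
one the primary would have produced.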
In another form of spatial redundancy, the active component periodically captures its state and
forwards it to a standby, which only validates and stores that state. If the active component
stops providing the service, the recovery action is to restart the operation on the standby system
from the last known good state. Both of these approaches can be referred to as
active/standby.
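The checkpoint-based form of active/standby might be sketched as follows. The classes Active and
Standby, the dictionary-shaped state and the validation rule are all assumptions made for
illustration; a real system would define its own state format and validity checks.

```python
import copy


class Active:
    """Hypothetical active component with a small piece of service state."""

    def __init__(self):
        self.state = {"count": 0}

    def handle(self, n):
        self.state["count"] += n

    def checkpoint(self):
        # Capture a snapshot that is independent of later changes.
        return copy.deepcopy(self.state)


class Standby:
    """Validates and stores checkpoints; does no service work until takeover."""

    def __init__(self):
        self.last_good = None

    def receive(self, snapshot):
        # Assumed validity rule: the counter must be a non-negative integer.
        count = snapshot.get("count")
        if isinstance(count, int) and count >= 0:
            self.last_good = snapshot
            return True
        return False

    def take_over(self):
        # Restart the operation from the last known good state.
        recovered = Active()
        if self.last_good is not None:
            recovered.state = copy.deepcopy(self.last_good)
        return recovered


active = Active()
standby = Standby()
active.handle(3)
accepted = standby.receive(active.checkpoint())
active.handle(4)                             # work done after the last checkpoint
rejected = standby.receive({"count": -1})    # invalid state is not stored
recovered = standby.take_over()              # the active is assumed to have failed
```

Work performed after the last checkpoint is lost on takeover, which is the trade-off against the
always-connected backup: less duplicated processing, but recovery restarts from an older state.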
To continue with the networking theme and carry forward from the isolation phase, re-routing and
the acknowledgement or negative acknowledgement of TCP packets are recovery actions as well. In
this case the protocol is built to be a reliable transport, so a negative acknowledgement, or the
lack of any acknowledgement, causes the protocol to start the recovery action by retransmitting.
This is redundancy in time, or temporal redundancy.
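Temporal redundancy can be illustrated with a simplified stop-and-wait sender that repeats a
transmission until it is acknowledged. This is a sketch of the general idea only, not of TCP's
actual retransmission machinery; send_reliably and LossyChannel are hypothetical names.

```python
class LossyChannel:
    """Hypothetical channel that drops a fixed number of packets, then delivers."""

    def __init__(self, drops):
        self.drops = drops
        self.delivered = []

    def __call__(self, packet):
        if self.drops > 0:
            self.drops -= 1
            return False        # no acknowledgement comes back
        self.delivered.append(packet)
        return True             # acknowledgement received


def send_reliably(packet, channel, max_attempts=5):
    """Repeat the same transmission in time until it is acknowledged."""
    for attempt in range(1, max_attempts + 1):
        if channel(packet):
            return attempt
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


channel = LossyChannel(drops=2)
attempts = send_reliably("seg-1", channel)
```

The recovery here costs time rather than extra hardware: the same resource is used again until the
operation succeeds or the attempt limit is reached.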
The following sections address the reloading or restarting of an application, a protocol or even
the operating system. Each successive level typically takes longer to complete the recovery.
6.4.5 Techniques
Switchover. This technique is used in a redundant configuration and refers to switching from any
component to another, whether the redundant component performs the same operation and only the
output is switched, or it recovers from a known state. The component can be hardware or software,
and the redundancy can cover a peripheral component or even the core processing element. This is
typically identified as 2N redundancy.
Re-routing. This technique could also be referred to as load balancing or load sharing. It is used
with N + 1 or, more generally, N + M redundancy. This means that there are N components
providing service and any one of those components can be replaced by one of M other components.
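An N + M arrangement might be sketched as a pool that shares load across N active components and
draws on M spares when one fails. The Pool class, the round-robin routing policy and the component
names are assumptions made for this example.

```python
class Pool:
    """Hypothetical N + M pool: N active components share load, M spares wait."""

    def __init__(self, active, spares):
        self.active = list(active)
        self.spares = list(spares)
        self._next = 0

    def route(self):
        # Simple round-robin load sharing across the active components.
        if not self.active:
            raise RuntimeError("no component available to provide service")
        worker = self.active[self._next % len(self.active)]
        self._next += 1
        return worker

    def fail(self, worker):
        # Replace the failed component with a spare if one remains;
        # otherwise continue with reduced capacity.
        index = self.active.index(worker)
        if self.spares:
            self.active[index] = self.spares.pop(0)
        else:
            del self.active[index]


pool = Pool(active=["a", "b", "c"], spares=["s1"])   # N = 3, M = 1
pool.fail("b")
routed = [pool.route() for _ in range(3)]
```

Once the spares are exhausted, further failures reduce capacity instead of interrupting service,
because the remaining components continue to share the load.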
Software Rejuvenation. The simplest example of this technique is a screen refresh, which
rejuvenates the image. In the general case it is the restart or reload of an application, dynamic
library or protocol, which causes the system to reset and initialize the resources it uses to
perform the service.
Reboot. This is the most radical of the rejuvenation steps. This restarts the operating system as well
as the protocols, libraries and applications.
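The graduated nature of rejuvenation, from restarting an application up to rebooting the operating
system, can be sketched as an escalation ladder. The step names and the recover function are
hypothetical; the sketch assumes a health check is available after each step.

```python
# Assumed ordering from least to most disruptive; each step typically takes longer.
LADDER = [
    "restart application",
    "restart protocol stack",
    "reboot operating system",
]


def recover(is_healthy, perform, ladder=LADDER):
    """Try each step in order and stop at the first one that restores service."""
    for step in ladder:
        perform(step)
        if is_healthy():
            return step
    return None  # every step was tried and service is still down


performed = []


def perform(step):
    # Stand-in for the real restart or reboot action.
    performed.append(step)


def is_healthy():
    # In this example the fault clears only once the protocol stack is restarted.
    return "restart protocol stack" in performed


fixed_by = recover(is_healthy, perform)
```

Starting with the least disruptive step keeps recovery time short in the common case and reserves
the reboot for faults that the lighter steps cannot clear.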
6.4.6 Dependencies
Recovery depends on the policies, redundancy elements and techniques, which are typically part
of a set of policy descriptions used when the system is configured. It is the last of the process
steps for service restoration.