Providing Open Architecture High Availability Solutions

11.3 State Preservation
A common method for state preservation should be provided for an application. This preservation
should allow for an application to restart at a known state. This preservation may require some
level of replication of the data. Where an application is to restart on the current
processor, the state could be preserved in volatile or non-volatile memory, or on a storage medium
such as disk or tape. For an application that is started or running on a standby processor, the state
preservation data would be transported between the active and standby processor. This is referred
to as checkpointing. The underlying mechanism that provides this service must ensure that the
information arrives as supplied by the source (data integrity) and in a timely manner. Checkpointing
a suite of programs raises further considerations. If there is more than one program,
synchronization, sequencing, and data committing need to be addressed.
Checkpointing should also include a mechanism that identifies whether the application is providing
the service, sharing the service, or backing up the service with another process.
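
The checkpointing requirements above (integrity, timeliness, sequencing, and an indication of the role
a process plays) can be sketched as a small record format. This is a minimal illustration, not a
defined interface: the names Role, make_checkpoint, and read_checkpoint are hypothetical, JSON is
assumed as the encoding, and the SHA-256 digest detects accidental corruption only, not deliberate
tampering.

```python
import hashlib
import json
import time
from enum import Enum


class Role(Enum):
    """Hypothetical labels for how a process relates to the service."""
    ACTIVE = "active"    # providing the service
    SHARED = "shared"    # sharing the service
    STANDBY = "standby"  # backing up the service


def make_checkpoint(state, sequence, role):
    """Serialize state with a sequence number, timestamp, role, and digest."""
    body = {
        "sequence": sequence,
        "timestamp": time.time(),
        "role": role.value,
        "state": state,
    }
    payload = json.dumps(body, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    envelope = {"digest": digest, "payload": payload}
    return json.dumps(envelope).encode("utf-8")


def read_checkpoint(record, last_sequence=-1, max_age_seconds=None):
    """Verify integrity, ordering, and optional freshness; return the body."""
    envelope = json.loads(record.decode("utf-8"))
    payload = envelope["payload"]
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != envelope["digest"]:
        raise ValueError("checkpoint failed integrity check")
    body = json.loads(payload)
    if body["sequence"] <= last_sequence:
        raise ValueError("checkpoint is out of sequence")
    if max_age_seconds is not None:
        if time.time() - body["timestamp"] > max_age_seconds:
            raise ValueError("checkpoint is stale")
    return body
```

In this sketch the active process would call make_checkpoint and transport the resulting bytes to the
standby, which calls read_checkpoint with the last sequence number it accepted. Rejecting old sequence
numbers is one simple way to handle ordering; a suite of programs would likely need a coordinated
commit on top of it.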
11.4 Recovery
An application’s ability to recover from faults in the system depends on the capabilities
of the system. From the application’s perspective, the underlying system should provide basic tools
to manage resources, and it should not waste or lose those resources. The objective of the recovery
process is to restore the system to an operating state, even if at reduced capacity.
An application in a highly available system needs to be able to control the flow of data, start or
stop other processes, replace or upgrade itself, restart, and even reboot the underlying system.
A running operating system must be kept from becoming congested. Flow control keeps a system
from failing at the application level, whether through exhaustion of all the resources or through
blocked I/O activity. The application needs to use non-blocking I/O or timeouts on I/O processing
to prevent this from occurring.
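
As an illustration of a timeout on I/O processing, the following sketch bounds how long a socket read
may wait. The function name receive_with_timeout is hypothetical; the calls used are from the Python
standard library.

```python
import socket


def receive_with_timeout(sock, max_bytes, timeout_seconds):
    """Return received bytes, or None if nothing arrived within the timeout.

    An empty bytes object means the peer closed the connection.
    """
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(max_bytes)
    except socket.timeout:
        # The read did not complete in time; the caller stays in control
        # and can retry, shed load, or report a fault.
        return None
```

Because the call returns instead of blocking indefinitely, a stalled peer cannot hold the application
in a read forever.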
In some application environments, the ability to start or stop processes from an application
(process control) needs to be supported. A primary use of this would be to launch diagnostics
when a component has been removed from operation because of a fault.
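
A minimal sketch of process control for diagnostics follows. The function name run_diagnostic and the
placeholder diagnostic command are assumptions made for illustration; a real system would launch its
own diagnostic program. A timeout is applied so that a hung diagnostic cannot stall recovery.

```python
import subprocess
import sys


def run_diagnostic(command, timeout_seconds):
    """Run a diagnostic command.

    Returns (exit_code, output), or (None, "") if the command timed out.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising on timeout.
        return None, ""
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Placeholder diagnostic: a child Python process that prints a status line.
    code, output = run_diagnostic(
        [sys.executable, "-c", "print('link ok')"], timeout_seconds=10
    )
    print(code, output.strip())
```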
Applications need to keep track of the versions of their content. For example, versions of files and
libraries used to build an application should be kept. Based on this information a recovery
operation can replace the entire application or just specific pieces of it.
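
One way to picture this is a comparison of the installed versions against a known-good manifest. The
sketch below is illustrative only: the function names, the manifest layout (component name mapped to a
version string), and the rule for choosing a full replacement are all assumptions.

```python
def components_to_replace(installed, known_good):
    """Return sorted names whose installed version differs from the manifest."""
    return sorted(
        name
        for name, version in known_good.items()
        if installed.get(name) != version
    )


def plan_recovery(installed, known_good):
    """Return (scope, names), where scope is "none", "partial", or "full"."""
    stale = components_to_replace(installed, known_good)
    if not stale:
        return "none", stale
    if len(stale) == len(known_good):
        # Every component is wrong or missing: replace the entire application.
        return "full", stale
    return "partial", stale
```

For example, with an installed set of {"app": "2.1", "libfoo.so": "1.4"} and a manifest that lists
"libfoo.so" at "1.5", only that library would be selected for replacement.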
Restarting an application can be an important step in the recovery of a system service. The faster
the application restarts, the sooner the service is restored. If the setup and processing of an
application are stateless or require minimal state information, then a cold restart can be performed.
This can be either on the current processor or on a redundant processor. If a larger amount of state
information is part of the operation or processing of the application, a warm restart of the
application is necessary. The interface to the checkpointed information is then needed to reactivate
the application. In the event a redundant process is used for either the recovery of an application or
the upgrade of an application, careful consideration is needed to address the use of resources such
as file locks, semaphores, or device handles. Understanding the specific information about the
device handle, semaphore or memory resource and how it will be accommodated on the receiving
side is critical to a rapid restart.
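
The choice between a cold and a warm restart can be sketched as follows. This is a simplified
illustration under stated assumptions: the checkpoint is a JSON file on local storage, and the names
save_checkpoint and restart_state are hypothetical. Only plain data is stored; resources such as file
locks, semaphores, or device handles are not captured here and would have to be re-acquired by the
restarted process.

```python
import json
import os


def save_checkpoint(checkpoint_path, state):
    """Write state to a temporary file, then rename it into place."""
    temp_path = checkpoint_path + ".tmp"
    with open(temp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # os.replace swaps the file in one step, so a reader does not
    # see a partially written checkpoint.
    os.replace(temp_path, checkpoint_path)


def restart_state(checkpoint_path, defaults):
    """Return (mode, state): "warm" if a usable checkpoint exists, else "cold"."""
    try:
        with open(checkpoint_path, "r", encoding="utf-8") as handle:
            saved = json.load(handle)
    except (OSError, ValueError):
        # Missing or unreadable checkpoint: start from defaults.
        return "cold", dict(defaults)
    if not isinstance(saved, dict):
        return "cold", dict(defaults)
    state = dict(defaults)
    state.update(saved)
    return "warm", state
```

Falling back to a cold restart when the checkpoint is missing or damaged is a design choice of this
sketch; it trades lost state for a restored service, possibly at reduced capacity.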