Providing Open Architecture High Availability Solutions
Sooner or later, a situation will occur that requires rejuvenation of the system. An application
must therefore be able to trigger a reboot, to be used as a last resort within a recovery
operation.
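The idea of a reboot as the last step of a recovery operation can be sketched as an escalation
ladder. This is a minimal illustration, not a platform interface: the names recover, actions, and
reboot are hypothetical, and each recovery action is assumed to return True when it succeeds.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical type: a named recovery step that returns True on success.
RecoveryAction = Tuple[str, Callable[[], bool]]


def recover(actions: Sequence[RecoveryAction], reboot: Callable[[], None]) -> str:
    """Try each recovery action in order; reboot only if every one fails."""
    for name, action in actions:
        try:
            if action():
                return name  # recovery succeeded, no further escalation
        except Exception:
            pass  # a failing action is treated like an unsuccessful one
    reboot()  # last resort: rejuvenate the whole system
    return "reboot"
```

Ordering the actions from least to most disruptive keeps the reboot at the end of the ladder, so
it is reached only when the milder steps have been exhausted.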
11.5 Resilience
Resilience is the property of a component that allows it to continue full or partial function after
one or more faults occur. Resilient components therefore provide a higher level of availability than
comparable non-resilient components.
At its most basic, resilience concerns the actions taken within an individual operation. Beyond
coding style, the platform should provide features for common operations. Every interaction with
the high availability interfaces needs to check its inputs and validate its outputs. In addition,
common techniques for redundancy and voting can be provided to improve the resilience of an
application.
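Input checking, output validation, and voting across redundant units can be combined in a few
lines. The sketch below is only one possible arrangement under stated assumptions: the function
names (vote, redundant_call) and the validator parameters are hypothetical, and a strict majority
of all replicas, including those that failed, is required for a result to be accepted.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def vote(results: Sequence[Hashable], total: int) -> Hashable:
    """Return the value reported by a strict majority of `total` replicas."""
    if not results:
        raise RuntimeError("no valid replica results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= total:
        raise RuntimeError("no majority among replica results")
    return value


def redundant_call(
    replicas: Sequence[Callable],
    arg,
    valid_input: Callable[[object], bool],
    valid_output: Callable[[object], bool],
):
    """Run the same operation on every replica and vote on the outputs."""
    if not valid_input(arg):  # check the input before any work is done
        raise ValueError(f"rejected input: {arg!r}")
    results = []
    for replica in replicas:
        try:
            result = replica(arg)
        except Exception:
            continue  # a crashed replica simply loses its vote
        if valid_output(result):  # discard outputs that fail validation
            results.append(result)
    return vote(results, total=len(replicas))
```

With three replicas, this arrangement masks one crashed or incorrect replica. If no value reaches
a majority, the call raises an error instead of returning a possibly wrong result.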
Properties of resilient components include the ability to detect a fault and to compensate for it.
Compensation may mean using alternate resources or correcting the error. For example:
• Error-correcting memory is an example of a resilient component. If errors are detected and
corrected in an area of memory, that area can be taken out of service and its contents moved to
another section of memory that is operating correctly. The memory addresses are then re-mapped to
use the good memory.
• In software, database data is required to be correct, but data with interdependencies cannot be
written to disk atomically. Databases are resilient to non-atomic writes because they adhere to a
strict sequence of writes, so that any errors can be corrected.
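The strict write sequence in the database example can be illustrated with a small journaling
sketch. It is a simplified stand-in rather than how any particular database works: the class
JournaledStore, its file layout (one file per record plus journal.json), and the crash_after
parameter used to simulate a fault are all assumptions made for this illustration. The sequence is
to log the intended changes, then apply them, then delete the log. Directory syncing is omitted
for brevity.

```python
import json
import os
import tempfile


def write_file_atomically(path, payload):
    """Replace one file so that it holds either the old or the new content."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())  # force the bytes to disk before the rename
    os.replace(tmp, path)  # a single-file rename is atomic


class JournaledStore:
    """Hypothetical store: one file per record plus a journal of pending changes."""

    def __init__(self, directory):
        self.directory = directory
        self.journal = os.path.join(directory, "journal.json")
        self.recover()  # finish any update interrupted by an earlier fault

    def _path(self, key):
        return os.path.join(self.directory, key + ".rec")

    def read(self, key):
        with open(self._path(key), encoding="utf-8") as f:
            return json.load(f)

    def update(self, changes, crash_after=None):
        """Apply interdependent changes; crash_after=N stops after N record writes."""
        write_file_atomically(self.journal, changes)  # 1. log the intent
        for written, (key, value) in enumerate(changes.items()):
            if crash_after is not None and written == crash_after:
                return  # simulated fault: journal is left on disk
            write_file_atomically(self._path(key), value)  # 2. apply
        os.remove(self.journal)  # 3. commit: the update is complete

    def recover(self):
        """Replay a leftover journal; replaying is safe because writes are idempotent."""
        if not os.path.exists(self.journal):
            return
        with open(self.journal, encoding="utf-8") as f:
            changes = json.load(f)
        for key, value in changes.items():
            write_file_atomically(self._path(key), value)
        os.remove(self.journal)
```

If a fault interrupts update() after only some records are written, the journal is still on disk.
The next JournaledStore opened on the same directory replays it, which restores the records to a
mutually consistent state.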
Using these and similar design methods can make an application or system resilient to faults.
Analyzing the faults that could occur in a system and putting methods in place to resolve them
automatically increases system availability. The highest service availability values are usually
reached with a combination of resilience, redundancy, and good design practice.