Providing Open Architecture High Availability Solutions

error and recovery might be communicated and captured as warning information. Based upon a
frequency or rate of change threshold, this type of warning might become a stronger alert and then
an alarm notification to the layers and management interfaces above it.
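As a rough illustration of threshold-based escalation, the sketch below counts recovered errors in a sliding time window and raises the severity as the count grows. The class name, window length and thresholds are assumptions chosen for the example; the text does not prescribe specific values.

```python
from collections import deque


class EscalatingCounter:
    """Map the recent rate of a recovered error to a severity level.

    All names and default thresholds here are hypothetical.
    """

    def __init__(self, window_s=60.0, alert_at=3, alarm_at=6):
        self.window_s = window_s    # length of the sliding window, in seconds
        self.alert_at = alert_at    # events in window that trigger an alert
        self.alarm_at = alarm_at    # events in window that trigger an alarm
        self.events = deque()       # timestamps of recent events, oldest first

    def record(self, timestamp):
        """Record one recovered error and return the resulting severity."""
        self.events.append(timestamp)
        # Discard events that have fallen out of the window.
        while timestamp - self.events[0] > self.window_s:
            self.events.popleft()
        count = len(self.events)
        if count >= self.alarm_at:
            return "alarm"
        if count >= self.alert_at:
            return "alert"
        return "warning"
```

A single recovered error stays a warning; only a burst within the window is escalated to the layers above.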
If the disk read condition is passed up to the I/O driver in the OS layer, the driver might attempt its
own form of error recovery, perhaps resetting the controller and trying again. Hardened drivers
check error codes and will eventually time out and report an error. If the driver does not resolve the
error within a bounded period of time, it communicates an error that is reported and
captured.
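The driver behavior described above can be sketched as a bounded retry loop. This is a minimal sketch, not real driver code: `read_block`, `reset_controller`, `DeviceError` and the limits are hypothetical stand-ins supplied by the caller.

```python
import time


class DeviceError(Exception):
    """Hypothetical error raised by a failed low-level read."""


def read_with_recovery(read_block, reset_controller,
                       max_attempts=3, deadline_s=2.0, clock=time.monotonic):
    """Retry a read, resetting the controller between attempts.

    Gives up after max_attempts reads or once deadline_s has elapsed,
    so the caller always gets a result or an error in bounded time.
    """
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        if attempt > 0:
            # Before each retry, check the time budget and reset the hardware.
            if clock() - start > deadline_s:
                break
            reset_controller()
        try:
            return read_block()
        except DeviceError as exc:
            last_error = exc  # remember the cause for the final report
    raise TimeoutError("read not recovered within retry limits") from last_error
```

Bounding both the attempt count and the elapsed time keeps the driver from hanging on a device that never responds.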
At the OS layer, if the device is a virtual device, the OS might attempt its own recovery by
reissuing the request on another interface. If still unresolved, the disk read error would be
passed to the calling thread as an error return code, captured in the system log, and reported as a
fault to at least the middleware management interface.
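One way to picture the OS-layer sequence is a read that tries each interface to a virtual device in turn and, if all fail, returns an error code, logs the failure and reports a fault. In this sketch the function name, the `(name, read_fn)` path list and the `report_fault` callback are illustrative assumptions; only `errno` and `logging` are standard library facilities.

```python
import errno
import logging

log = logging.getLogger("os_layer")  # stands in for the system log


def read_virtual_device(paths, block, report_fault):
    """Try each interface in order; return (status, data).

    paths is a list of (name, read_fn) pairs, one per interface.
    report_fault is a hypothetical callback to the middleware
    management interface.
    """
    for name, read_fn in paths:
        try:
            return 0, read_fn(block)
        except OSError as exc:
            log.warning("read of block %d failed on %s: %s", block, name, exc)
    # Every interface failed: log it, report the fault, return an error code.
    log.error("block %d unreadable on all interfaces", block)
    report_fault({"event": "disk_read_error", "block": block})
    return errno.EIO, None
```

The calling thread sees only the status code, while the log entry and the fault report carry the detail to the management layers.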
At the application layer, the return code will probably invoke some form of error trapping and
processing to prevent the propagation of this fault to other software components.
At the middleware layer, the notification of the fault event will start an appropriate fault
management process (for example, trying to read or write another file, or unmounting and checking
the disk), which might include the escalation of this event to a higher level in the form of a trouble
ticket or repair request.
6.6.6 Dependencies
Notification depends on the components that handle faults to generate the appropriate
messages. It also depends on the communications and messaging systems to carry a message
from the component that is handling a fault to the components that need this information.
6.7 Prediction
6.7.1 Introduction
Prediction is the process of observing the operation of the system and determining when a
component will need to be replaced, repaired or subjected to further diagnostics.
6.7.2 Objective
The objective of prediction is to reduce the occurrence of faults by preemptive notification and
repair.
6.7.3 Concepts
Data Collection. Health data must be collected in the system model to make it possible to
predict faults.
Data Analysis. Models of the types of failures