Providing Open Architecture High Availability Solutions

error and recovery might be communicated and captured as warning information. Based upon a frequency or rate of change threshold, this type of warning might become a stronger alert and then an alarm notification to the layers and management interfaces above it.
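The threshold-based escalation described above can be illustrated with a minimal sketch. It assumes a sliding time window and simple count thresholds; the class name `EscalationTracker`, the parameter names, and the default values are hypothetical and chosen only for illustration, not taken from any particular platform.

```python
from collections import deque


class EscalationTracker:
    """Escalate recovered errors by how many occur within a sliding time window."""

    def __init__(self, window_s=60.0, alert_count=3, alarm_count=6):
        # All thresholds here are illustrative assumptions.
        self.window_s = window_s
        self.alert_count = alert_count
        self.alarm_count = alarm_count
        self._events = deque()

    def record(self, timestamp):
        """Record one recovered error and return the resulting severity."""
        self._events.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while self._events and timestamp - self._events[0] > self.window_s:
            self._events.popleft()
        count = len(self._events)
        if count >= self.alarm_count:
            return "alarm"
        if count >= self.alert_count:
            return "alert"
        return "warning"
```

An isolated recovered error stays a warning, while a burst of them inside the window is reported upward as an alert and then an alarm. A real system might instead track the rate of change of the error count, as the text allows.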

If the disk read condition is passed up to the I/O driver in the OS layer, the driver might attempt its own form of error recovery, perhaps by resetting the controller and trying again. Hardened drivers check error codes and will eventually time out and report an error. If the driver does not resolve the error within a defined period of time, it communicates an error that is reported and captured.
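The driver behavior described above (retry after a controller reset, bounded by a time limit) might be sketched as follows. This is an illustration only: `read_block` and `reset_controller` are hypothetical callables standing in for hardware access, `DeviceError` is an assumed exception type, and the attempt count and deadline are arbitrary.

```python
import time


class DeviceError(Exception):
    """Hypothetical error raised when a device read fails."""


def read_with_recovery(read_block, reset_controller,
                       deadline_s=2.0, max_attempts=3, clock=time.monotonic):
    """Try a read; on failure reset the controller and retry within a deadline."""
    start = clock()
    last_error = None
    for attempt in range(max_attempts):
        if attempt > 0:
            # Give up once the recovery time budget is exhausted.
            if clock() - start > deadline_s:
                break
            reset_controller()
        try:
            return read_block()
        except DeviceError as exc:
            last_error = exc
    # Recovery failed: report the error to the layer above.
    raise DeviceError(f"read failed after recovery attempts: {last_error}")
```

The key property is that recovery always terminates: the driver either returns data or reports an error after a bounded number of attempts and a bounded time.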

At the OS layer, if the device is a virtual device, the OS might attempt its own recovery by replicating the request on another interface. If the error is still unresolved, the disk read error would be passed to the calling thread as an error return code, captured in the system log, and reported as a fault to at least the middleware management interface.
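As a rough sketch of the OS-layer handling described above, the following function tries the request on each available interface and, if all fail, logs the error, notifies a management interface, and returns an error code to the caller. The function name `virtual_device_read`, the `notify_fault` callback, the logger name, and the use of `errno.EIO` as the return code are all assumptions made for illustration.

```python
import errno
import logging

log = logging.getLogger("os_layer")  # hypothetical logger name


def virtual_device_read(paths, notify_fault):
    """Try each interface in turn; return (0, data) or (errno.EIO, None)."""
    for read_via_path in paths:
        try:
            return 0, read_via_path()
        except OSError as exc:
            # One interface failed: record it and try the next one.
            log.warning("read failed on one interface: %s", exc)
    # Every interface failed: capture the error and report the fault upward.
    log.error("disk read failed on all interfaces")
    notify_fault("disk read error")
    return errno.EIO, None
```

Here each entry in `paths` is a callable that performs the read over one interface. The calling thread sees only the return code, while the log entry and the fault notification carry the event to the other layers.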

At the application layer, the return code will probably invoke some form of error trapping and processing to prevent the propagation of this fault to other software components.

At the middleware layer, the notification of the fault event will start an appropriate fault management process (for example, trying to read or write another file, or unmounting and checking the disk), which might include the escalation of this event to a higher level in the form of a trouble ticket or repair request.

6.6.6 Dependencies

Notification depends on the components that handle faults to generate the appropriate messages. It also depends on the communications and messaging systems to carry a message from the component that is handling a fault to the components that need this information.

6.7 Prediction

6.7.1 Introduction

Prediction is the process of observing the operation of the system and determining when a component will need to be replaced, repaired or subjected to further diagnostics.

6.7.2 Objective

The objective of prediction is to reduce the occurrence of faults by preemptive notification and repair.

6.7.3 Concepts

• Data Collection. Health data must be collected in the system model to make it possible to predict faults.

• Data Analysis. Models of the types of failures