Providing Open Architecture High Availability Solutions

Information Context / Content

The content and context of the notification should be appropriate to the management interface. A
non-recoverable media fault reported from the I/O driver to the calling thread would typically be
limited to return code error information (lightweight). On the system console, the notification of
the same event might indicate that a disk error was encountered during a read, together with the
time and date. (The affected process did not need this state information, since the calling thread
already knew whether it had issued a read or a write.) Even more information might be recorded in
the system log (heavyweight), while only a terse event notification is sent to the middleware and
external interfaces, on the assumption that they will make a directed or polled request to the
system log if they want more information and context.
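As a rough illustration of matching content to the interface, the following Python sketch renders one fault record four ways. The `FaultEvent` fields, the function names, and the message formats are hypothetical and are not taken from any particular driver or operating system; only `errno.EIO` is a standard constant.

```python
import errno
from dataclasses import dataclass, asdict
from datetime import datetime


@dataclass(frozen=True)
class FaultEvent:
    """One fault, captured once at the point of detection (illustrative fields)."""
    device: str
    operation: str   # "read" or "write"
    block: int
    when: datetime
    detail: str      # driver-level diagnostic text


def for_calling_thread(ev: FaultEvent) -> int:
    # Lightweight: the caller already knows which operation it issued,
    # so a negative error code is all it receives.
    return -errno.EIO


def for_console(ev: FaultEvent) -> str:
    # Adds the operation, device, and timestamp that an operator lacks.
    return f"{ev.when:%Y-%m-%d %H:%M:%S} disk error during {ev.operation} on {ev.device}"


def for_system_log(ev: FaultEvent) -> dict:
    # Heavyweight: the full record, kept for later directed or polled requests.
    record = asdict(ev)
    record["when"] = ev.when.isoformat()
    return record


def for_middleware(ev: FaultEvent) -> dict:
    # Terse: just enough to identify the event and look it up in the log.
    return {"kind": "media_fault", "device": ev.device, "when": ev.when.isoformat()}
```

Keeping a single event record and deriving each view from it means the console, the log, and the middleware cannot disagree about what happened; they differ only in how much of the record they expose.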

Levels of Fault Management

Implicit in this system and fault management model is the concept of multiple levels of fault
management that may be implemented in each of the layers and have different objectives,
requirements and responses to the same fault conditions.

In general, if a layer capability can autonomously handle the fault, it should independently do so
while reporting the fault and recording the fault management actions. If it cannot autonomously
handle the fault, one of the following conditions may be the cause:

• the fault’s type or severity is beyond the layer’s capability

• the autonomous fault management capabilities of that component have been explicitly disabled by a fault management entity at a higher level (explicit prohibition)

• the pre-set error thresholds have been exceeded (conditional prohibition)

• the autonomous fault management capability is either not present or not functional because of temporal or resource conflicts with other layers and service-oriented priorities
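The four conditions above can be read as a check that a layer performs before acting on its own. The sketch below is one possible encoding in Python; the `LayerState` fields, the default threshold, and the order in which the conditions are tested are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Refusal(Enum):
    BEYOND_CAPABILITY = "fault type or severity is beyond the layer's capability"
    EXPLICIT_PROHIBITION = "disabled by a higher-level fault management entity"
    CONDITIONAL_PROHIBITION = "pre-set error threshold exceeded"
    UNAVAILABLE = "capability not present, or blocked by a temporal or resource conflict"


@dataclass
class LayerState:
    max_severity: int         # highest severity this layer can handle (hypothetical scale)
    enabled: bool = True      # cleared by a higher-level entity
    error_count: int = 0      # errors seen so far
    error_threshold: int = 3  # hypothetical pre-set limit
    available: bool = True    # False if absent or currently blocked


def why_not_autonomous(layer: LayerState, severity: int) -> Optional[Refusal]:
    """Return None if the layer may handle the fault itself, else the reason it may not."""
    if not layer.available:
        return Refusal.UNAVAILABLE
    if not layer.enabled:
        return Refusal.EXPLICIT_PROHIBITION
    if layer.error_count > layer.error_threshold:
        return Refusal.CONDITIONAL_PROHIBITION
    if severity > layer.max_severity:
        return Refusal.BEYOND_CAPABILITY
    return None
```

Returning the reason rather than a bare boolean lets the layer include it when it reports the fault upward, so the higher-level entity knows why the fault was passed on.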

Higher-level management entities also have more capacity, system knowledge and intelligence.
However, the latency of recovery from the fault tends to grow as more layers become involved.
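The trade-off between capability and latency can be sketched as a simple escalation chain in Python. The layer names, the handlers, and the latency figures below are invented for illustration and do not represent measurements.

```python
from typing import Callable, List, Optional, Tuple

# (layer name, handler returning True if it recovered the fault, illustrative latency in ms)
Layer = Tuple[str, Callable[[str], bool], float]


def escalate(fault: str, layers: List[Layer]) -> Tuple[Optional[str], float, List[str]]:
    """Offer the fault to each layer in turn, lowest first.

    Returns the name of the layer that handled it (or None), the latency
    accumulated along the way, and a record of the actions taken.
    """
    elapsed_ms = 0.0
    actions: List[str] = []
    for name, handler, latency_ms in layers:
        elapsed_ms += latency_ms
        if handler(fault):
            actions.append(f"{name}: handled {fault}")
            return name, elapsed_ms, actions
        actions.append(f"{name}: could not handle {fault}, escalating")
    return None, elapsed_ms, actions


# Hypothetical three-layer stack: each higher layer handles more, but costs more time.
LAYERS: List[Layer] = [
    ("hardware", lambda f: f == "transient_seek", 1.0),
    ("operating_system", lambda f: f in ("transient_seek", "bad_block"), 10.0),
    ("middleware", lambda f: True, 100.0),
]
```

With these made-up figures, a transient seek error is cleared at the hardware layer after 1 ms, whereas a fault that only the middleware can handle incurs the latency of every layer beneath it as well.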

Using the disk error example started above, let’s examine the layers of fault management and
notification in a disk read error. In general, errors that occur while reading file system data
structures are propagated all the way up to the original request. Any request to write to a file
system data structure that fails because of a read or write error could leave the file system
inconsistent. A file system that can no longer be accessed because of errors can be unmounted, and
then remounted once the cause of the errors (perhaps a power failure or a bad connection) has been
removed and the file system has been checked.

All device errors are reported from the lowest possible level (the device driver or device manager)
to the console, the system log and/or the middleware management layer. It does not make sense to
report the error again at every step of its propagation to the user program. The original cause of
the error is the most useful information for a system administrator or management interface.
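A minimal Python sketch of this report-once rule follows. The `DeviceError` class, the `reports` list standing in for the console, system log, or middleware channel, and the function names are all hypothetical.

```python
import errno

BAD_BLOCKS = {7}   # hypothetical set of unreadable blocks
reports = []       # stands in for the console, system log, or middleware channel


class DeviceError(OSError):
    """An I/O error that was already reported by the layer that detected it."""


def driver_read(block: int) -> bytes:
    # Lowest level: report the original cause once, then raise.
    if block in BAD_BLOCKS:
        reports.append(f"driver: media fault reading block {block}")
        raise DeviceError(errno.EIO, f"read failed at block {block}")
    return b"\x00" * 512


def fs_read_structure(block: int) -> bytes:
    # Intermediate layer: lets the error propagate without reporting it again.
    return driver_read(block)


def user_request(block: int) -> int:
    # Top of the call chain: the caller sees only a lightweight return code.
    try:
        fs_read_structure(block)
        return 0
    except DeviceError as err:
        return -err.errno
```

Calling `user_request(7)` returns the negated `errno.EIO` value and leaves exactly one entry in `reports`, written by the driver; the intermediate layer adds nothing, so the record points at the original cause.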

More specifically, after the initial fault at the hardware layer, the controller or drive might
retry x times to eliminate transient and simple seek errors, and then adapt the skew in an attempt
to recover the requested block. Devices with ECC, parity, advanced data encoding, mirrored drives,
or RAID-like striping might recover the data automatically and repair the fault by relocating the
information, without reporting an in-band error. However, the detailed context of the