Providing Open Architecture High Availability Solutions

Information Context / Content
The content and context of a notification should be appropriate to the management interface that
receives it. A non-recoverable media fault reported from the I/O driver to the calling thread would
typically be limited to return-code error information (lightweight). On the system console, the
notification of the same event might indicate that a disk error was encountered during a read, along
with the time and date. The read/write state is unnecessary in the context of the affected process,
since the calling thread knows which operation it requested. Even more information might be
recorded in the system log (heavyweight), while only a terse event notification might be sent to the
middleware and external interfaces, on the assumption that if they want more information and
context they will make a directed or polled request to the system log.
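The idea of one event rendered differently per interface can be illustrated with a minimal Python sketch. It is not taken from any particular system: the `FaultEvent` type, its fields, the function names and the device name `disk0` are all hypothetical, chosen only to mirror the lightweight, console, heavyweight and terse views described above.

```python
import errno
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FaultEvent:
    """One fault, described once; each interface renders its own view."""
    device: str            # hypothetical device name
    operation: str         # "read" or "write"
    block: int
    occurred_at: datetime
    detail: str            # driver-level diagnostic text


def for_calling_thread(event: FaultEvent) -> int:
    """Lightweight: the caller knows the operation, so a return code suffices."""
    return -errno.EIO


def for_console(event: FaultEvent) -> str:
    """Adds the operation, time and date for an operator."""
    stamp = event.occurred_at.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp} disk error during {event.operation} on {event.device}"


def for_system_log(event: FaultEvent) -> str:
    """Heavyweight: full context for later directed or polled requests."""
    return (f"{event.occurred_at.isoformat()} device={event.device} "
            f"op={event.operation} block={event.block} detail={event.detail!r}")


def for_middleware(event: FaultEvent) -> dict:
    """Terse: identifies the event; the details stay in the system log."""
    return {"event": "media_fault", "device": event.device}


event = FaultEvent("disk0", "read", 4711, datetime(2001, 1, 15, 9, 30, 0),
                   "unrecoverable media error")
```

Keeping a single event record and deriving each view from it means the views cannot disagree about what happened; they differ only in how much of it they expose.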
Levels of Fault Management
Implicit in this system and fault management model is the concept of multiple levels of fault
management that may be implemented in each of the layers and have different objectives,
requirements and responses to the same fault conditions.
In general, if a layer capability can autonomously handle the fault, it should independently do so
while reporting the fault and recording the fault management actions. If it cannot autonomously
handle the fault, one of the following conditions may be the cause:
• the fault’s type or severity is beyond the layer’s capability
• the autonomous fault management capabilities of that component have been explicitly
  disabled by a fault management entity at a higher level (explicit prohibition)
• the pre-set error thresholds have been exceeded (conditional prohibition)
• the autonomous fault management capability is either not present or not functional because of
  temporal or resource conflicts with other layers and service-oriented priorities
Higher-level management entities also have more capacity, system knowledge and intelligence.
However, the latency of recovery from the fault tends to grow as more layers become involved.
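The escalation rule described above (handle the fault locally when possible, otherwise pass it upward for one of the listed reasons) can be sketched as follows. This is an illustrative model only: the `Layer` class, its fields, the numeric severity scale and the layer names are assumptions made for the example, not part of the model in this document.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Escalation(Enum):
    BEYOND_CAPABILITY = "fault type or severity beyond layer capability"
    EXPLICITLY_DISABLED = "disabled by a higher-level entity"
    THRESHOLD_EXCEEDED = "pre-set error threshold exceeded"
    UNAVAILABLE = "capability not present or not functional"


@dataclass
class Layer:
    name: str
    max_severity: int        # highest severity this layer can handle (assumed scale)
    error_threshold: int     # faults tolerated before escalating
    enabled: bool = True     # cleared by a higher-level manager
    available: bool = True   # False if absent or blocked by conflicts
    error_count: int = 0
    log: List[str] = field(default_factory=list)

    def reason_to_escalate(self, severity: int) -> Optional[Escalation]:
        if not self.available:
            return Escalation.UNAVAILABLE
        if not self.enabled:
            return Escalation.EXPLICITLY_DISABLED
        if severity > self.max_severity:
            return Escalation.BEYOND_CAPABILITY
        if self.error_count > self.error_threshold:
            return Escalation.THRESHOLD_EXCEEDED
        return None

    def handle(self, severity: int) -> Optional[Escalation]:
        """Count the fault, then either handle it or record why it escalates."""
        self.error_count += 1
        reason = self.reason_to_escalate(severity)
        if reason is None:
            self.log.append(f"{self.name}: handled severity {severity}")
        else:
            self.log.append(f"{self.name}: escalating ({reason.value})")
        return reason


def manage_fault(layers: List[Layer], severity: int) -> Optional[str]:
    """Offer the fault to each layer, lowest first; return the handler's name."""
    for layer in layers:
        if layer.handle(severity) is None:
            return layer.name
    return None


stack = [Layer("driver", max_severity=1, error_threshold=3),
         Layer("os", max_severity=3, error_threshold=10),
         Layer("middleware", max_severity=5, error_threshold=100)]
handler = manage_fault(stack, severity=2)  # the driver escalates, the OS handles it
```

Every layer records what it did, whether it handled the fault or passed it on, which matches the requirement to report the fault and record the fault management actions. Each additional layer visited adds another step before recovery, which is the latency cost noted above.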
Using the disk error example started above, let’s examine the layers of fault management and
notification in a disk read error. In general, errors that occur while reading file system data
structures are propagated all the way up to the original request. Any request to write to a file system
data structure that fails because of a write or read error could result in an inconsistent file system. A
file system that can no longer be accessed because of errors can be unmounted and subsequently
remounted after the reason for the errors has been removed (perhaps a power failure or bad
connection) and the file system is checked.
All device errors are reported from the lowest possible level (the device driver or device manager)
to the console, system log and/or to the middleware management layer. It does not make sense to
report the error at every step of its propagation to the user program. The original cause of the error
is the most useful information to a system administrator or management interface.
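A small sketch of this reporting rule, under the assumption that errors travel upward as exceptions: the lowest level reports the original cause exactly once, and the intermediate levels only propagate it. The names `DeviceError`, `driver_read`, `filesystem_read` and the `reports` list (standing in for the console, system log and middleware) are hypothetical.

```python
class DeviceError(Exception):
    """Carries the original cause up the call chain."""

    def __init__(self, cause: str):
        super().__init__(cause)
        self.cause = cause


reports = []  # stands in for the console, system log and middleware


def driver_read(block: int) -> bytes:
    # Lowest level: the only place where the fault is reported.
    cause = f"media fault reading block {block}"
    reports.append(cause)
    raise DeviceError(cause)


def filesystem_read(block: int) -> bytes:
    # Intermediate level: propagates the error without reporting it again.
    return driver_read(block)


def user_program() -> str:
    # The original requester sees the failure and its original cause.
    try:
        filesystem_read(42)
    except DeviceError as err:
        return f"request failed: {err.cause}"
    return "ok"
```

After one call to `user_program()`, `reports` holds a single entry, however many layers the error passed through on its way up.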
More specifically, after the initial fault at the hardware layer, the controller/drive might retry
x times to eliminate transient and simple seek errors; then it might adapt the skew to attempt to
recover the requested block. Devices with ECC, parity, advanced data encoding, mirrored drives,
and RAID-like devices with striping might recover the data automatically and repair the fault by
relocating the information without reporting an in-band error. However, the detailed context of the