Providing Open Architecture High Availability Solutions

9.1.6 Appropriate panic() Behavior
A catastrophic system failure routine, panic(), is invoked when a failure occurs that cannot
easily be recovered from. Frequently these failures are corruptions of system data structures. The
panic() routine crashes the system, with an obvious impact on availability.
By design, an HA-specific OS seeks to minimize panics by replacing most system exception
conditions with appropriate autonomous fault management behavior. However, if the fault is such
that correct operation of the subsystem is no longer ensured, an indication should be set so that
further attempts to access the subsystem fail. If the OS as a whole truly cannot continue
operating, then an OS panic() is the only acceptable action. This forces redundant
OS components to assume the role of the failed system. Whenever possible, the panic routine
should capture as much descriptive information as it can about the abnormal termination. It is
highly desirable to support implementation-specific flexibility in how and where this information
is captured.
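The policy above can be sketched as follows. This is a minimal illustration, not a kernel implementation: the names Subsystem, handle_fault, PanicError, and the capture callback are hypothetical, and raising an exception merely stands in for halting the system.

```python
import json
import time


class SubsystemFailedError(Exception):
    """Raised when a caller touches a subsystem that has been marked failed."""


class PanicError(Exception):
    """Stands in for an OS panic; a real kernel would halt instead."""


class Subsystem:
    def __init__(self, name):
        self.name = name
        self.failed = False  # the "indication" that blocks further access

    def mark_failed(self):
        self.failed = True

    def access(self):
        if self.failed:
            raise SubsystemFailedError(f"{self.name} is marked failed")
        return f"{self.name}: ok"


def handle_fault(subsystem, os_can_continue, capture):
    """Contain the fault in one subsystem, or panic if the whole OS is unsafe.

    `capture` is a caller-supplied callable (an assumption of this sketch),
    so where the record goes stays implementation-specific.
    """
    if os_can_continue:
        subsystem.mark_failed()  # later access attempts now fail fast
        return
    record = {
        "subsystem": subsystem.name,
        "time": time.time(),
        "reason": "unrecoverable fault",
    }
    capture(json.dumps(record))  # record the details before going down
    raise PanicError(subsystem.name)
```

Passing the capture target as a parameter is one way to keep the "how and where" of crash information open to each implementation.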
9.1.7 Handling of Spurious Events
The OS should recognize and handle potentially spurious events, such as spurious IRQs and
controller and bus glitches.
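One possible way to handle spurious IRQs is sketched below, assuming the OS can ask the device whether it actually raised the interrupt. The IrqLine class, the spurious_limit threshold, and the masking policy are illustrative assumptions, not requirements stated in this document.

```python
class IrqLine:
    """Tracks unclaimed interrupts on one line and masks it if they persist."""

    def __init__(self, spurious_limit=3):
        self.spurious_limit = spurious_limit  # hypothetical tuning value
        self.spurious_count = 0
        self.masked = False

    def on_interrupt(self, device_claims_irq, service):
        """Return 'masked', 'handled', or 'spurious' for one interrupt."""
        if self.masked:
            return "masked"
        if device_claims_irq:
            service()
            self.spurious_count = 0  # a genuine interrupt resets the streak
            return "handled"
        self.spurious_count += 1
        if self.spurious_count >= self.spurious_limit:
            self.masked = True  # stop a glitching line from flooding the CPU
        return "spurious"
```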
9.2 Notification
Regardless of how fault domains are constructed, a critical hardware capability is the detection of
fault domain failures and the communication of those failures. A failure may be communicated
out-of-band through a management data channel, or in-band via unambiguous observable behavior (or
non-behavior). A primary example of the latter is fail-safe behavior, where a fault domain contains a
self-checking capability that causes it to shut down promptly when a fault is detected. The
resulting shutdown is then observed by other parts of the system.
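In-band detection by non-behavior can be illustrated with a simple silence detector: a fail-safe domain that has shut down stops producing its normal signal, and an observer notices the gap. The SilenceDetector class and its timeout parameter are hypothetical; timestamps are passed in explicitly to keep the sketch deterministic.

```python
class SilenceDetector:
    """Declares a fault domain failed when it has been silent too long."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # domain name -> time of last observed activity

    def heartbeat(self, domain, now):
        """Record that `domain` showed normal behavior at time `now`."""
        self.last_seen[domain] = now

    def failed_domains(self, now):
        """Return the domains whose silence exceeds the timeout."""
        return sorted(
            d for d, t in self.last_seen.items() if now - t > self.timeout
        )
```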
Beyond the immediate communication required for fault diagnosis and isolation, hardware fault
domain failures must also be communicated to appropriate middleware (or people) in order to
trigger recovery and repair actions.
When a system contains fault domains that are effectively in standby mode, there is a need to
detect latent faults in these domains. That is, if the primary failure detection mechanism is
observation of normal operating behavior, the hardware may need to provide a separate mechanism
for detection of faults in fault domains that are not normally operating.
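A separate mechanism for standby domains might take the form of a periodic self-test, sketched here. The probe_standby function, the "standby" state string, and the self_test callback are assumptions made for illustration only.

```python
def probe_standby(domains, self_test):
    """Run self_test on each standby domain; return names with latent faults.

    `domains` maps a domain name to its state string; `self_test` is a
    caller-supplied callable that returns True when the domain is healthy.
    """
    latent = []
    for name, state in domains.items():
        if state != "standby":
            continue  # active domains are covered by normal observation
        try:
            healthy = self_test(name)
        except Exception:
            healthy = False  # a test that cannot run counts as a fault
        if not healthy:
            latent.append(name)
    return latent
```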
9.3 HA-Enhanced OS Services
Many of the software mechanisms required to support fault management operate at the OS level
because they are intimately related to kernel activities, or are provided at this level for performance
reasons. OS services such as protected address spaces or process and data resiliency provide
inherent mechanisms that aid in fault management. Other OS layer capabilities like a secure file
system are relatively independent and usually require no directed intervention by middleware. An
example would be a journaling file system that speeds system restart and file system checking. The
middleware does not have to control these functions because they are included in the OS. Hardened
device drivers that check for hardware errors may be required to report exception information
directly to the middleware layer. Open-standard interfaces provide a well-documented and
consistent means of fault reporting and fault logging, as well as mechanisms for OS layer
capabilities to register with the middleware layer.
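The registration and reporting relationship described above might look like the following sketch. It does not reproduce any particular open-standard interface; FaultMiddleware, register, and report_fault are hypothetical names chosen for illustration.

```python
class FaultMiddleware:
    """Toy middleware layer that accepts registrations and fault reports."""

    def __init__(self):
        self.registered = {}  # capability name -> optional fault callback
        self.log = []         # fault log as (capability, detail) tuples

    def register(self, capability, on_fault=None):
        """An OS layer capability announces itself to the middleware."""
        self.registered[capability] = on_fault

    def report_fault(self, capability, detail):
        """Log a fault from a registered capability and notify its callback."""
        if capability not in self.registered:
            raise KeyError(f"unregistered capability: {capability}")
        self.log.append((capability, detail))
        handler = self.registered[capability]
        if handler is not None:
            handler(detail)
```

A hardened device driver would call register once at load time and report_fault whenever it detects a hardware error, leaving the recovery policy to the middleware.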