Providing Open Architecture High Availability Solutions

9.1.6 Appropriate panic() Behavior

A panic() routine, which signals a catastrophic system failure, is used when a failure occurs that
cannot easily be recovered from. Frequently these failures are corruptions of system data
structures. The panic() routine crashes the system, with the obvious impact on availability.

By design, an HA-specific OS seeks to minimize panics by replacing most system exception
conditions with appropriate autonomous fault management behavior. However, if the fault is such
that correct operation of a subsystem is no longer ensured, an indication should be set so that
further attempts to access the subsystem fail. If the entire OS genuinely cannot continue
operating, then an OS panic() is the only acceptable action. This forces redundant OS components
to assume the role of the failed system. Whenever possible, the panic routine should capture as
much descriptive information as it can about the abnormal termination. It is highly desirable to
support implementation-specific flexibility in how and where this information is captured.
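
The two behaviors described above can be sketched in C. This is a minimal illustration under
stated assumptions, not an interface defined by this document: the names struct subsystem,
subsystem_mark_failed(), subsystem_access(), set_panic_capture() and os_panic() are all
hypothetical. A failed subsystem refuses further access instead of bringing the system down, and
the panic path records its information through a replaceable hook, so that how and where the data
is captured remains implementation-specific.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-subsystem state; every name in this sketch is illustrative. */
struct subsystem {
    const char *name;
    bool failed;   /* set once correct operation is no longer ensured */
};

/* Replaceable hook that decides how and where panic information is captured. */
typedef void (*panic_capture_fn)(const char *reason, const char *file, int line);

static void capture_to_stderr(const char *reason, const char *file, int line)
{
    fprintf(stderr, "panic: %s (%s:%d)\n", reason, file, line);
}

static panic_capture_fn panic_capture = capture_to_stderr;

/* Install an implementation-specific capture routine; NULL restores the default. */
void set_panic_capture(panic_capture_fn fn)
{
    panic_capture = fn ? fn : capture_to_stderr;
}

/* Set the failure indication so that later accesses are refused. */
void subsystem_mark_failed(struct subsystem *s)
{
    s->failed = true;
}

/* Every access is gated on the failure indication. */
int subsystem_access(const struct subsystem *s)
{
    if (s->failed)
        return -EIO;   /* fail the request rather than use suspect state */
    /* ... normal work would go here ... */
    return 0;
}

/* Last resort: capture what is known, then stop so a redundant system takes over. */
void os_panic_at(const char *reason, const char *file, int line)
{
    panic_capture(reason, file, line);
    abort();
}

#define os_panic(reason) os_panic_at((reason), __FILE__, __LINE__)
```

In this sketch a driver that detects corruption in its own tables would call
subsystem_mark_failed() and keep the rest of the system running; os_panic() is reserved for the
case where the OS as a whole cannot continue.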

9.1.7 Handling of Spurious Events

The OS should recognize and handle potentially spurious events, such as spurious IRQs and
controller and bus glitches.
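
One common way to handle a spurious IRQ is to check whether any device actually claims the
interrupt, count the unclaimed ones, and mask the line if they keep arriving. The sketch below
illustrates that idea only. The struct irq_line type, the irq_dispatch() function and the
SPURIOUS_LIMIT threshold are assumptions made for this example, and the pending_status field
stands in for a real device status register.

```c
#include <stdbool.h>
#include <stdint.h>

#define SPURIOUS_LIMIT 100u   /* assumed threshold before the line is masked */

/* Hypothetical bookkeeping for one interrupt line. */
struct irq_line {
    uint32_t pending_status;  /* stand-in for a device status register */
    uint32_t handled;         /* interrupts with a real cause */
    uint32_t spurious;        /* interrupts that no device claimed */
    bool     masked;          /* line disabled after too many spurious events */
};

enum irq_result { IRQ_HANDLED, IRQ_SPURIOUS, IRQ_MASKED };

enum irq_result irq_dispatch(struct irq_line *irq)
{
    if (irq->masked)
        return IRQ_MASKED;

    if (irq->pending_status == 0) {
        /* No device claims the interrupt: count it instead of acting on it. */
        if (++irq->spurious >= SPURIOUS_LIMIT)
            irq->masked = true;   /* isolate a stuck or glitching line */
        return IRQ_SPURIOUS;
    }

    irq->pending_status = 0;      /* acknowledge the real event */
    irq->handled++;
    return IRQ_HANDLED;
}
```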

9.2 Notification

However fault domains are constructed, a critical hardware capability is the detection of fault
domain failures and the communication of those failures. A failure may be communicated
out-of-band through a management data channel, or in-band via unambiguous observable behavior (or
non-behavior). A primary example of the latter is fail-safe behavior, where a fault domain
contains a self-checking capability that causes it to shut down promptly when a fault is
detected. The resulting shutdown is then observed by other parts of the system.

Beyond the immediate communication required for fault diagnosis and isolation, hardware fault
domain failures must also be communicated to appropriate middleware (or people) in order to
trigger recovery and repair actions.
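
In-band detection through non-behavior can be illustrated with a simple heartbeat monitor: a
fail-safe domain that has shut itself down stops producing activity, and an observer treats
prolonged silence as a failure and passes it on to middleware. The sketch below is one possible
arrangement, not a mechanism specified here. The struct domain_monitor type, the tick-based
timeout and the callback signature are assumptions, and count_notification() is only a stand-in
for a real middleware notification path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback through which middleware (or an operator) is told. */
typedef void (*failure_notify_fn)(int domain_id, void *ctx);

struct domain_monitor {
    int               domain_id;
    uint64_t          last_heartbeat;  /* tick of the last observed activity */
    uint64_t          timeout;         /* longer silence counts as a shutdown */
    bool              reported;        /* a failure is reported only once */
    failure_notify_fn notify;
    void             *ctx;
};

/* Called whenever normal behavior of the domain is observed. */
void monitor_heartbeat(struct domain_monitor *m, uint64_t now)
{
    m->last_heartbeat = now;
}

/* Returns true if the domain is considered failed at tick `now`. */
bool monitor_check(struct domain_monitor *m, uint64_t now)
{
    if (now - m->last_heartbeat <= m->timeout)
        return false;

    if (!m->reported) {
        m->reported = true;
        if (m->notify)
            m->notify(m->domain_id, m->ctx);  /* trigger recovery and repair */
    }
    return true;
}

/* Stand-in notifier for the example: counts notifications in *ctx. */
void count_notification(int domain_id, void *ctx)
{
    (void)domain_id;
    ++*(int *)ctx;
}
```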

When a system contains fault domains that are effectively in a standby mode, there is a need to
detect latent faults in these domains. That is, if the primary failure detection mechanism is
observation of normal operating behavior, the hardware may need to provide a separate mechanism
for detecting faults in fault domains that are not normally operating.
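
A periodic scan that exercises only the standby domains is one way to surface latent faults. The
sketch below assumes each domain exposes some diagnostic entry point; the struct fault_domain
type, the self_test callback and scan_standby_domains() are hypothetical names, and
read_health_flag() is a stand-in for a real hardware self-test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical diagnostic: returns true if the unit passes its self-test. */
typedef bool (*self_test_fn)(void *unit);

struct fault_domain {
    const char  *name;
    bool         active;     /* active domains are checked by normal operation */
    bool         faulty;     /* set when a latent fault has been found */
    self_test_fn self_test;
    void        *unit;
};

/* Exercise every standby domain; returns the number of newly found faults. */
size_t scan_standby_domains(struct fault_domain *domains, size_t n)
{
    size_t found = 0;

    for (size_t i = 0; i < n; i++) {
        struct fault_domain *d = &domains[i];

        if (d->active || d->faulty || d->self_test == NULL)
            continue;        /* not a standby domain, or nothing new to learn */

        if (!d->self_test(d->unit)) {
            d->faulty = true;
            found++;
        }
    }
    return found;
}

/* Stand-in diagnostic for the example: the unit is just a bool health flag. */
bool read_health_flag(void *unit)
{
    return *(bool *)unit;
}
```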

9.3 HA-Enhanced OS Services

Many of the software mechanisms required to support fault management operate at the OS level
because they are intimately related to kernel activities, or are provided at this level for
performance reasons. OS services such as protected address spaces or process and data resiliency
provide inherent mechanisms that aid in fault management. Other OS layer capabilities, like a
secure file system, are relatively independent and usually require no directed intervention by
middleware. An example would be a journaling file system that speeds system restart and file
system checking. The middleware does not have to control these functions because they are
included in the OS. Hardened device drivers that check for hardware errors may be required to
report exception information directly to the middleware layer. Open-standard interfaces provide
a well-documented and consistent set of interfaces for fault reporting and fault logging, as well
as mechanisms for OS layer capabilities to register with the middleware layer.
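
The shape of such an interface can be sketched as follows. This is an illustration only and does
not reproduce any particular open standard: struct middleware, mw_register(), mw_report_fault()
and the severity levels are hypothetical. An OS layer component, such as a hardened device
driver, first registers with the middleware and then uses the returned identifier to report
faults, which the middleware logs.

```c
#include <stddef.h>

#define MAX_COMPONENTS 16   /* assumed limits for the sketch */
#define LOG_CAPACITY   32

enum fault_severity { FAULT_INFO, FAULT_MINOR, FAULT_MAJOR, FAULT_CRITICAL };

struct fault_record {
    int                 component;  /* identifier returned by mw_register() */
    enum fault_severity severity;
    int                 code;       /* component-specific error code */
};

/* Hypothetical middleware state: registered components and a fault log. */
struct middleware {
    const char         *components[MAX_COMPONENTS];
    size_t              component_count;
    struct fault_record log[LOG_CAPACITY];
    size_t              log_count;
};

/* Register an OS layer component; returns its identifier, or -1 on failure. */
int mw_register(struct middleware *mw, const char *name)
{
    if (name == NULL || mw->component_count == MAX_COMPONENTS)
        return -1;
    mw->components[mw->component_count] = name;
    return (int)mw->component_count++;
}

/* Report and log a fault; returns 0 on success, -1 if it was rejected. */
int mw_report_fault(struct middleware *mw, int component,
                    enum fault_severity severity, int code)
{
    if (component < 0 || (size_t)component >= mw->component_count)
        return -1;           /* reports from unregistered sources are refused */
    if (mw->log_count == LOG_CAPACITY)
        return -1;           /* log full; a real system would rotate or spill */

    mw->log[mw->log_count].component = component;
    mw->log[mw->log_count].severity  = severity;
    mw->log[mw->log_count].code      = code;
    mw->log_count++;
    return 0;
}
```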