Providing Open Architecture High Availability Solutions

9.0 Layer-Specific Capabilities – Operating System

The operating system hosts applications and provides process scheduling and resource control for applications, middleware and device drivers. An HA-aware OS provides typical OS services as well as services that are specifically designed to provide fault-management capabilities, either directly or by escalating information to other layers for fault management and resolution.

At the OS layer, service availability can be enhanced by capabilities that improve system reliability. This section describes incremental software capabilities that can be used in the OS and applications to improve the effective-MTBF across a variety of HA system configurations. The term robustness is used to describe these additional fault-avoidance and reliability capabilities at the OS layer. Similar design considerations may be used in the development of HA applications.
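
To make the link between MTBF and service availability concrete, the following is a minimal arithmetic sketch. It assumes the common steady-state relation availability = MTBF / (MTBF + MTTR) and treats effective-MTBF as the mean time between service-affecting failures; both are assumptions made for illustration, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the service is up.

    Assumes availability = MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours must be >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected downtime per year implied by the availability figure."""
    unavailable_fraction = 1.0 - availability(mtbf_hours, mttr_hours)
    return unavailable_fraction * HOURS_PER_YEAR * 60
```

Under these assumptions, raising the effective MTBF while holding repair time constant reduces expected yearly downtime; the robustness capabilities described in this section aim at that effect.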

The OS also isolates, prevents the propagation of, or masks the impact of potential hardware and software faults. These protection facilities help prevent errant applications and faulted hardware from bringing the entire system down.
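
As a user-level illustration of this kind of containment, the sketch below runs untrusted work in a separate child process so that a crash or hang in the child cannot take down the caller. It relies only on the process isolation the OS already provides; the helper name `run_isolated` and its return format are hypothetical.

```python
import subprocess
import sys


def run_isolated(source: str, timeout_s: float = 5.0) -> dict:
    """Run Python `source` in a child interpreter and report how it ended.

    A crash, non-zero exit, or hang in the child is contained: the caller
    only sees a result record and keeps running.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it exceeds this
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "reason": "timeout", "returncode": None}
    return {
        "ok": proc.returncode == 0,
        "reason": "exit",
        "returncode": proc.returncode,
    }
```

The caller can then decide whether to restart the work, fail over, or escalate, rather than sharing the fate of the errant task.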

Another area of HA-specific OS capabilities is support for dynamic reconfiguration. In a system model where redundant components are expected to fail, spare units must be rapidly enabled to mitigate the fault, and the defective unit must eventually be repaired. Facilities to support error detection, termination, and recovery must be independent of system topology. The OS may provide dynamic topology and resource management facilities and enhanced device drivers for the graceful replacement of failed hardware, or the initialization of newly inserted devices. Many HA configurations depend upon the hot-swap of components, frequently referred to as Field-Replaceable Units (FRUs). This functionality considerably increases up-time by allowing in-service replacement and reprovisioning of malfunctioning hardware without having to power down, re-initialize or reboot the entire system.
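
The fail-over and hot-swap sequence described above can be sketched as a small state model. This is an illustrative sketch only, not an OS interface: the state names, the `FruPool` class, and its methods are all hypothetical.

```python
from enum import Enum


class FruState(Enum):
    ACTIVE = "active"    # carrying service
    STANDBY = "standby"  # initialized spare, ready to take over
    FAILED = "failed"    # faulted, awaiting replacement
    EMPTY = "empty"      # slot has no unit installed


class FruPool:
    """Tracks a set of redundant slots (hypothetical model)."""

    def __init__(self, states):
        self.states = dict(states)  # slot name -> FruState

    def report_fault(self, slot):
        """Mark `slot` failed; if it was active, promote a standby unit.

        Returns the promoted slot name, or None if no promotion happened.
        """
        was_active = self.states[slot] is FruState.ACTIVE
        self.states[slot] = FruState.FAILED
        if not was_active:
            return None
        for name, state in self.states.items():
            if state is FruState.STANDBY:
                self.states[name] = FruState.ACTIVE
                return name
        return None  # no spare available: service is degraded

    def extract(self, slot):
        """Remove a unit in service-preserving fashion."""
        if self.states[slot] is FruState.ACTIVE:
            raise RuntimeError(f"{slot} is in service; fail over before extracting")
        self.states[slot] = FruState.EMPTY

    def insert(self, slot):
        """Initialize a newly inserted unit as a spare."""
        if self.states[slot] is not FruState.EMPTY:
            raise RuntimeError(f"{slot} is already occupied")
        self.states[slot] = FruState.STANDBY
```

In this model a faulted unit is replaced while the promoted spare carries service, and the replacement rejoins as a standby, so no step requires the rest of the system to power down or reboot.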

Finally, the operating system needs to provide services for autonomous fault management (when appropriate), services to report faults externally, and control interfaces for the directed management of faults and resources from the middleware and application layers.
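
One way to picture how these three kinds of service fit together is the sketch below: faults are handled locally when a handler is registered, every fault is reported to external listeners whether or not it was handled, and upper layers direct policy by registering handlers. The `FaultManager` class and its method names are hypothetical and do not correspond to any particular OS interface.

```python
class FaultManager:
    """Hypothetical fault-management hub for illustration."""

    def __init__(self):
        self._handlers = {}   # fault kind -> callable(fault) -> bool
        self._listeners = []  # callables notified of every fault

    def register_handler(self, kind, handler):
        """Control interface: an upper layer installs recovery policy."""
        self._handlers[kind] = handler

    def subscribe(self, listener):
        """External reporting: a listener receives every fault record."""
        self._listeners.append(listener)

    def raise_fault(self, kind, detail):
        """Attempt autonomous handling, then report the outcome."""
        fault = {"kind": kind, "detail": detail, "handled": False}
        handler = self._handlers.get(kind)
        if handler is not None:
            fault["handled"] = bool(handler(fault))
        # Report even when handled locally, so the fault is never masked.
        for listener in self._listeners:
            listener(fault)
        return fault
```

Reporting unconditionally is a deliberate choice in this sketch: a fault that was corrected locally may still matter to management software that tracks trends across the system.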

High availability designs impose unique and new requirements upon an operating system. To support high availability, an operating system must have facilities to minimize downtime in light of underlying hardware that will fail. In an HA design, system devices and supporting software must be dynamic, and the topology and state of a system’s hardware and software need to be maintained and reported to management software and system operators.

9.1 OS Robustness

Improving the robustness of the OS is the process of modifying existing software to improve its reliability, its availability, and its behavior in the presence of faults. This process consists of adding to the OS the ability to:

• Detect, diagnose, isolate, and recover from faults in both hardware and software

• Avoid accidental fault masking; if a fault occurs, ensure that it is appropriately reported and handled

• Provide warning of unusual conditions that may, when combined with other higher-level system knowledge, warn of an impending fault

• Provide information to assist isolation of the failure post-mortem

• Provide dynamic kernel and application event trace and profiling facilities
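
Two of the abilities listed above, warning of unusual conditions and retaining information for post-mortem analysis, can be sketched with a bounded event trace and a simple threshold on corrected errors. This is a minimal sketch under stated assumptions: the `EventTrace` class, its method names, and the default threshold are hypothetical, and a real facility would also record timestamps.

```python
from collections import deque


class EventTrace:
    """Bounded in-memory trace with a corrected-error warning threshold."""

    def __init__(self, capacity=1024, warn_threshold=3):
        self._events = deque(maxlen=capacity)  # oldest entries are dropped
        self._corrected = {}                   # source -> corrected-error count
        self.warn_threshold = warn_threshold

    def record(self, source, message):
        """Append one event to the trace."""
        self._events.append((source, message))

    def record_corrected_error(self, source):
        """Log a corrected error instead of silently masking it.

        Returns True once the count for `source` reaches the threshold,
        which a higher layer may treat as a sign of an impending fault.
        """
        self.record(source, "corrected error")
        self._corrected[source] = self._corrected.get(source, 0) + 1
        return self._corrected[source] >= self.warn_threshold

    def dump(self):
        """Return the retained events, oldest first, for post-mortem use."""
        return list(self._events)
```

The fixed capacity bounds memory use, at the cost of losing the oldest events once the buffer is full.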