Providing Open Architecture High Availability Solutions

9.0 Layer-Specific Capabilities – Operating System
The operating system hosts applications and provides process scheduling and resource control for
applications, middleware and device drivers. An HA-aware OS provides typical OS services as
well as services that are specifically designed to provide fault-management capabilities either
directly, or by escalating information to other layers for fault management and resolution.
At the OS layer, service availability can be enhanced by capabilities that improve system
reliability. This section describes incremental software capabilities that can be used in the OS and
applications to improve the effective MTBF (mean time between failures) across a variety of HA system configurations. The
term robustness is used to describe these additional fault avoidance and reliability capabilities at
the OS layer. Similar design considerations may be used in the development of HA applications.
The OS also isolates, prevents the propagation of, or masks the impact of potential hardware and
software faults. These protection facilities help prevent errant applications and faulted hardware
from bringing the entire system down.
Another area of HA-specific OS capabilities is the support of dynamic reconfiguration. In a
system model where redundant components are expected to fail, spare units must be rapidly
enabled to mitigate the fault, and the defective unit must eventually be repaired. Facilities to support error
detection, termination, and recovery must be independent of system topology. The OS may provide
dynamic topology and resource management facilities and enhanced device drivers for the graceful
replacement of failed hardware, or the initialization of newly inserted devices. Many HA
configurations depend upon the hot-swap of components, which are frequently referred to as
Field-Replaceable Units (FRUs). This functionality considerably increases uptime by allowing
in-service replacement and reprovisioning of malfunctioning hardware without having to
power down, re-initialize or reboot the entire system.
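The spare-activation and hot-swap behavior described above can be illustrated with a small state model. This is a minimal sketch, not an OS interface: the names `UnitState`, `RedundancyGroup`, `insert`, `fail`, and `replace` are hypothetical and chosen only for illustration, and a real OS would drive these transitions from device drivers and hardware events.

```python
from dataclasses import dataclass, field
from enum import Enum


class UnitState(Enum):
    ACTIVE = "active"
    SPARE = "spare"
    FAILED = "failed"


@dataclass
class RedundancyGroup:
    """Tracks the state of each unit (FRU) in one redundant group (hypothetical model)."""
    units: dict = field(default_factory=dict)   # unit name -> UnitState
    events: list = field(default_factory=list)  # topology changes, kept for reporting

    def insert(self, name, state=UnitState.SPARE):
        # A newly inserted unit joins as a spare unless told otherwise.
        self.units[name] = state
        self.events.append(("inserted", name))

    def fail(self, name):
        """Mark a unit failed and, if it was active, enable a spare in its place."""
        was_active = self.units[name] is UnitState.ACTIVE
        self.units[name] = UnitState.FAILED
        self.events.append(("failed", name))
        if not was_active:
            return None
        for other, state in self.units.items():
            if state is UnitState.SPARE:
                self.units[other] = UnitState.ACTIVE
                self.events.append(("activated", other))
                return other
        return None  # no spare left: the fault must be escalated

    def replace(self, old, new):
        """Hot-swap: remove a failed unit and insert its replacement as a spare."""
        if self.units.get(old) is not UnitState.FAILED:
            raise ValueError(f"{old} is not in the FAILED state")
        del self.units[old]
        self.events.append(("removed", old))
        self.insert(new)
```

In this model the group keeps running on the promoted spare while the failed unit is swapped out, and the `events` list stands in for the topology reporting that management software would consume.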
Finally, the operating system needs to provide services that are specifically designed to provide
autonomous fault-management capabilities (when appropriate), services to report faults externally,
and control interfaces for the directed management of faults and resources from the middleware
and application layers.
High availability designs impose unique and new requirements upon an operating system. To
support high availability, an operating system must have facilities to minimize downtime given that
the underlying hardware will eventually fail. In an HA design, system devices and supporting software
must be dynamic, and the topology and state of a system’s hardware and software need to be maintained
and reported to management software and system operators.
9.1 OS Robustness
Improving OS robustness means modifying existing software so that it behaves more reliably, remains
available, and handles faults predictably. This process consists of adding to the OS the
ability to:
• Detect, diagnose, isolate, and recover from faults in both hardware and software
• Avoid accidental fault masking; if a fault occurs, ensure that it is appropriately reported and
  handled
• Provide warning of unusual conditions that may, when combined with other higher-level
  system knowledge, warn of an impending fault
• Provide information to assist isolation of the failure post-mortem
• Provide dynamic kernel and application event trace and profiling facilities
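Several of these abilities can be illustrated together: logging every fault so that none is masked, keeping a record for post-mortem analysis, and warning when correctable errors cluster in a way that may indicate an impending failure. The following is a minimal sketch under stated assumptions. The class `FaultMonitor`, its `report` method, the returned action strings, and the threshold and window values are all hypothetical and are not drawn from any particular operating system.

```python
from collections import defaultdict, deque


class FaultMonitor:
    """Records every fault report and warns when correctable errors cluster."""

    def __init__(self, threshold=3, window=60.0):
        self.threshold = threshold   # correctable errors tolerated per window (assumed value)
        self.window = window         # window length in seconds (assumed value)
        self.log = []                # every report, kept for post-mortem analysis
        self._recent = defaultdict(deque)  # component -> recent timestamps

    def report(self, component, timestamp, correctable=True):
        """Log a fault and return the action the caller should take."""
        self.log.append((timestamp, component, correctable))
        if not correctable:
            return "escalate"        # never mask an uncorrectable fault
        recent = self._recent[component]
        recent.append(timestamp)
        # Drop timestamps that have aged out of the window.
        while recent and timestamp - recent[0] > self.window:
            recent.popleft()
        if len(recent) >= self.threshold:
            return "warn"            # clustered errors: possible impending failure
        return "logged"
```

A caller, such as a driver's error handler in this sketch, would pass each detected error to `report` and act on the result. The design choice is that even a corrected error is recorded, so the higher layers can combine the history with their own system knowledge.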