Providing Open Architecture High Availability Solutions
OSs may also support a structured, polled system MIB. This MIB typically organizes the kernel
information according to a published structure. This information can be directly incorporated into
the element management mechanism, or parsed ad hoc by the middleware, to gather required
system status and state information. This management interface to the OS layer typically conforms
to one or more of the industry-standard network, element, or web management protocols.
A desirable capability at the OS layer is the autonomous generation and communication of kernel
information when certain configurable thresholds are approached, matched, or exceeded. This OS
layer management capability is often used to provide immediate, proactive notification of
unusual conditions that may, when combined with other higher-level system knowledge, be used to
predict impending faults and to take corrective action in advance of actual failures.
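
The threshold behavior described above can be illustrated with a minimal sketch. It is not taken from any particular OS; the names `ThresholdState`, `classify`, `check_and_notify`, and the `warn_fraction` parameter are hypothetical, and "approached" is assumed here to mean reaching 90% of the threshold.

```python
from enum import Enum


class ThresholdState(Enum):
    NORMAL = "normal"
    APPROACHED = "approached"
    MATCHED = "matched"
    EXCEEDED = "exceeded"


def classify(value, threshold, warn_fraction=0.9):
    """Classify a kernel metric against its threshold.

    warn_fraction is an assumed definition of "approached":
    the metric has reached that fraction of the threshold.
    """
    if value > threshold:
        return ThresholdState.EXCEEDED
    if value == threshold:
        return ThresholdState.MATCHED
    if value >= warn_fraction * threshold:
        return ThresholdState.APPROACHED
    return ThresholdState.NORMAL


def check_and_notify(name, value, threshold, notify, warn_fraction=0.9):
    """Call notify() only when the metric is no longer in the normal range."""
    state = classify(value, threshold, warn_fraction)
    if state is not ThresholdState.NORMAL:
        notify(name, value, threshold, state)
    return state
```

In this sketch the `notify` callback stands in for whatever channel carries the event to the middleware, which can then combine it with higher-level system knowledge.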
Structured Error Logging
The purpose of the event logging function is to provide a mechanism for logging system,
application and exception information and for subsequent processing of that information. The
active system log should itself be highly available to accept event records. There are typically
extended functions to archive and distribute the system event log. Fault logs can be invaluable
when trying to repair a system. Because faults due to configuration errors may occur at boot time, the
fault logging facility should be available as soon as possible after the initialization of the kernel.
If an abnormal condition is detected upon checking a return code, argument, or data structure, the
condition must be reported to the user and recorded in the system log. This is true even if an error is
recoverable, as an HA system must not hide errors from the prediction and detection devices.
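
As a hedged illustration of this rule, the sketch below records a fault in the log whether or not it is recovered. Python surfaces failures as exceptions rather than return codes, so `OSError` stands in for a failed check; the helper `checked_call`, its `recover` parameter, and the logger name `ha.syslog` are hypothetical.

```python
import logging

# Hypothetical logger name standing in for the system log.
log = logging.getLogger("ha.syslog")


def checked_call(operation, *args, recover=None):
    """Run operation; record any failure in the log, even when recovered."""
    try:
        return operation(*args)
    except OSError as exc:
        if recover is None:
            # No recovery path: log the fault and propagate it.
            log.error("unrecoverable fault in %s: %s", operation.__name__, exc)
            raise
        # Recoverable: still log it, so fault prediction can see the event.
        log.warning("recovered fault in %s: %s", operation.__name__, exc)
        return recover(exc)
```

The design point is that the recovery path never suppresses the log entry, only the propagation of the error.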
Desirable OS layer enhancements to the system log function would continually parse the entries by
type, frequency and severity and could provide asynchronous notification of exceptions to the
middleware layer. Ultimately, the OS should provide both asynchronous and directed access to this
information.
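
One way to picture such an enhancement is the sketch below, which counts entries by type and raises a notification on high severity or on reaching a frequency limit. The class `LogMonitor`, its limits, and the numeric severity scale are assumptions made for illustration; the synchronous `notify` callback stands in for asynchronous delivery to the middleware.

```python
from collections import Counter


class LogMonitor:
    """Tracks log entries by type and flags exceptions to a callback."""

    def __init__(self, notify, severity_limit=3, frequency_limit=5):
        self.notify = notify
        self.severity_limit = severity_limit    # assumed numeric scale
        self.frequency_limit = frequency_limit  # entries per type
        self.counts = Counter()

    def record(self, event_type, severity):
        """Account for one log entry and notify if a limit is reached."""
        self.counts[event_type] += 1
        count = self.counts[event_type]
        if severity >= self.severity_limit:
            self.notify(event_type, severity, count, "severity")
        elif count == self.frequency_limit:
            # Notify once, at the moment the frequency limit is reached.
            self.notify(event_type, severity, count, "frequency")

    def query(self, event_type):
        """Directed (on-demand) access: entries seen so far for a type."""
        return self.counts[event_type]
```

Here `record` models the notification path and `query` models directed access to the same information.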
Crash Dump Information
In the unavoidable event of a system failure (e.g., OS panic) or application fault (e.g., core dump),
as much information on the error as possible should be captured, reported, and saved (if only
locally) for post-mortem fault debugging and diagnostics. To speed debugging, it may be
desirable to have the option of limiting how much information is dumped.
9.3.6 Configurable Restart/Reboot Behavior
Mechanisms to control the OS behavior upon reboot (typically after a crash) or restart are valuable
in an HA system. Being able to specify that a faulted system should not automatically reboot after a
failure may allow analysis of the failure and potential diagnosis of the exception condition. It also
avoids potential introduction of an unstable component into the overall system. After an in-place
upgrade or a fault, it should be possible to do either a fast restart, a rollback, or a failsafe restart.
9.4 Hot-Swap Software Requirements
Several key software features facilitate hardware hot-swap. Hot-swap for HA requires that system
software resources allotted to a board be reclaimed when the board is extracted, and added when a
new board is inserted. It is required that device control software be installed dynamically and
linked to a running operating system upon hardware insertion, and that applications can be
quiesced (or at least notified) such that pending communication/operations with the board be halted