Providing Open Architecture High Availability Solutions

OSs may also support a structured and polled system MIB. This MIB typically structures the kernel

information according to a published structure. This information can be directly incorporated into

the element management mechanism, or parsed ad hoc by the middleware, to garner required

system status and state information. This management interface to the OS layer typically conforms

to one or more of the industry-standard network, element, or web management protocols.
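
As an illustration of how middleware might parse a structured system MIB ad hoc, the sketch below models a polled snapshot as a mapping from object identifiers to values and extracts one subtree. The identifiers are the standard MIB-II `sysDescr`, `sysUpTime`, and `ifNumber` objects; the values, the `parse_oid` and `walk` helpers, and the in-memory dictionary are assumptions made for this example, not part of any particular OS or management protocol.

```python
from typing import Dict, List, Tuple

Oid = Tuple[int, ...]


def parse_oid(text: str) -> Oid:
    """Convert a dotted identifier such as '1.3.6.1.2.1.1.3.0' to a tuple."""
    return tuple(int(part) for part in text.strip(".").split("."))


def walk(mib: Dict[Oid, object], prefix: Oid) -> List[Tuple[Oid, object]]:
    """Return every entry below `prefix`, ordered by identifier."""
    matches = [(oid, val) for oid, val in mib.items()
               if oid[:len(prefix)] == prefix]
    return sorted(matches, key=lambda item: item[0])


# Hypothetical snapshot of a polled MIB (values are made up).
snapshot = {
    parse_oid("1.3.6.1.2.1.1.3.0"): 12345,         # sysUpTime
    parse_oid("1.3.6.1.2.1.1.1.0"): "example-os",  # sysDescr
    parse_oid("1.3.6.1.2.1.2.1.0"): 2,             # ifNumber
}

# Extract only the 'system' group for status reporting.
system_group = walk(snapshot, parse_oid("1.3.6.1.2.1.1"))
```

A real deployment would obtain the snapshot through whichever management protocol the OS exposes; only the lookup logic is sketched here.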

A desirable capability at the OS layer is the autonomous generation and communication of kernel

information when certain installable characteristics are approached, matched, or exceeded. This OS

layer management capability is often used to provide immediate, proactive notification of

unusual conditions that may, when combined with other higher-level system knowledge, be used to

predict impending faults and to take corrective action in advance of actual failures.
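
The approached, matched, and exceeded conditions described above can be sketched as a small threshold monitor that notifies a listener whenever a reading changes state. This is a minimal sketch, not an OS interface: the `Level` names, the `margin` parameter (the fraction of the threshold treated as "approaching"), and the callback signature are all assumptions introduced for illustration.

```python
from enum import Enum
from typing import Callable


class Level(Enum):
    NORMAL = "normal"
    APPROACHED = "approached"
    MATCHED = "matched"
    EXCEEDED = "exceeded"


def classify(value: float, threshold: float, margin: float = 0.1) -> Level:
    """Classify a reading against a threshold (margin is hypothetical)."""
    if value > threshold:
        return Level.EXCEEDED
    if value == threshold:
        return Level.MATCHED
    if value >= threshold * (1.0 - margin):
        return Level.APPROACHED
    return Level.NORMAL


class ThresholdMonitor:
    """Calls `notify(level, value)` when a reading enters a new non-normal level."""

    def __init__(self, threshold: float,
                 notify: Callable[[Level, float], None],
                 margin: float = 0.1) -> None:
        self.threshold = threshold
        self.margin = margin
        self.notify = notify
        self._last = Level.NORMAL

    def update(self, value: float) -> Level:
        level = classify(value, self.threshold, self.margin)
        # Notify only on a change of level, so a steady reading is not repeated.
        if level is not Level.NORMAL and level is not self._last:
            self.notify(level, value)
        self._last = level
        return level
```

Higher layers could combine these notifications with their own system knowledge before deciding on corrective action; the monitor itself only reports.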

Structured Error Logging

The purpose of the event logging function is to provide a mechanism for logging system,

application and exception information and for subsequent processing of that information. The

active system log should itself be highly available to accept event records. There are typically

extended functions to archive and distribute the system event log. Fault logs can be invaluable

when trying to repair a system. Since faults due to configuration errors may occur at boot time, the

fault logging facility should be available soon after the initialization of the kernel.

If an abnormal condition is detected upon checking a return code, argument, or data structure, the

condition must be reported to the user and recorded in the system log. This is true even if an error is

recoverable, as an HA system must not hide errors from the prediction and detection devices.
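
The rule that even recoverable errors must be recorded can be illustrated with a short sketch. Here a failed file read is recovered by falling back to a default value, yet the condition is still written to the log. The logger name `ha.syslog`, the `read_config` function, and the fallback behavior are assumptions chosen for this example.

```python
import logging

log = logging.getLogger("ha.syslog")  # hypothetical logger name


def read_config(path: str, default: str = "") -> str:
    """Read a file, falling back to `default` on failure.

    The error is recoverable, but it is still logged so that fault
    prediction and detection mechanisms can see it.
    """
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        log.warning("recoverable error reading %s: errno=%s (%s)",
                    path, exc.errno, exc.strerror)
        return default
```

Swallowing the exception without the `log.warning` call would keep the application running just as well, but it would hide the fault from any component that analyzes the log.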

Desirable OS layer enhancements to the system log function would continually parse the entries by

type, frequency, and severity, and could provide asynchronous notification of exceptions to the

middleware layer. Ultimately, the OS should provide both asynchronous and directed access to this

information.
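
A minimal sketch of such a log analysis function follows. It classifies entries by type and severity, counts how often each type occurs, pushes a notification when a count reaches a limit, and also supports direct queries. The `LogEntry` fields, the numeric severity scale, the limits, and the callback are assumptions for illustration. The notification is modeled as a plain callback; a real system would more likely deliver it through a queue, signal, or message.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LogEntry:
    kind: str       # entry type, e.g. "disk" or "net" (hypothetical)
    severity: int   # higher means more severe (hypothetical scale)
    message: str


class LogAnalyzer:
    def __init__(self, notify: Callable[[str, int], None],
                 max_count: int = 3, min_severity: int = 2) -> None:
        self._notify = notify
        self._max_count = max_count
        self._min_severity = min_severity
        self._counts: Counter = Counter()
        self._entries: List[LogEntry] = []

    def record(self, entry: LogEntry) -> None:
        """Store an entry and notify once when a type reaches the limit."""
        self._entries.append(entry)
        if entry.severity < self._min_severity:
            return  # kept for queries, but not counted toward alerts
        self._counts[entry.kind] += 1
        if self._counts[entry.kind] == self._max_count:
            self._notify(entry.kind, self._counts[entry.kind])

    def query(self, kind: str) -> List[LogEntry]:
        """Directed access: return all stored entries of one type."""
        return [e for e in self._entries if e.kind == kind]
```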

Crash Dump Information

In the unavoidable event of a system failure (e.g., OS panic) or application fault (e.g., core dump),

as much information on the error as possible should be captured, reported, and saved (if only

locally) for post-mortem fault debug and diagnostics. To provide more rapid debugging, it may be

desirable to have the option of limiting information dumping during debug.
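
At the application level, capturing failure information with an option to limit its volume might look like the sketch below. It builds a record from a Python exception and can either keep the full call stack or only the innermost frames. The record layout, the `full` and `max_frames` parameters, and the JSON file format are assumptions for this example; an OS crash dump would contain far more state.

```python
import json
import traceback
from datetime import datetime, timezone


def build_crash_record(exc: BaseException, full: bool = True,
                       max_frames: int = 3) -> dict:
    """Collect post-mortem information about an exception.

    With full=False only the innermost `max_frames` frames are kept,
    which limits the amount of information dumped.
    """
    frames = list(traceback.extract_tb(exc.__traceback__))
    if not full:
        frames = frames[-max_frames:]
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "type": type(exc).__name__,
        "message": str(exc),
        "frames": [
            {"file": f.filename, "line": f.lineno, "func": f.name}
            for f in frames
        ],
    }


def save_crash_record(record: dict, path: str) -> None:
    """Save the record locally for later diagnosis."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
```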

9.3.6 Configurable Restart/Reboot Behavior

Mechanisms to control the OS behavior upon reboot (typically after a crash) or restart are valuable

in an HA system. Being able to specify that a faulted system should not automatically reboot after a

failure may allow analysis of the failure and potential diagnosis of the exception condition. It also

avoids potential introduction of an unstable component into the overall system. After an in-place

upgrade or a fault, it should be possible to do either a fast restart, a rollback, or a failsafe restart.
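
One way to picture configurable restart behavior is as a policy function that maps the cause of the restart and a few settings to a restart mode. The sketch below is only an assumed policy: the text names the modes (no automatic reboot, fast restart, rollback, failsafe restart) but does not say when each should be chosen, so the `failed_boots` counter, the `max_failed_boots` limit, and the ordering of the rules are hypothetical.

```python
from enum import Enum


class RestartMode(Enum):
    HALT = "halt"          # stay down so the failure can be analyzed
    FAST = "fast"          # normal fast restart
    ROLLBACK = "rollback"  # return to the pre-upgrade software
    FAILSAFE = "failsafe"  # minimal known-good configuration


def choose_restart(cause: str, auto_reboot: bool, failed_boots: int,
                   max_failed_boots: int = 2) -> RestartMode:
    """Pick a restart mode under a hypothetical policy.

    cause is "fault" or "upgrade"; failed_boots counts consecutive
    unsuccessful restarts since the last healthy run.
    """
    if cause not in ("fault", "upgrade"):
        raise ValueError(f"unknown cause: {cause!r}")
    if cause == "fault" and not auto_reboot:
        return RestartMode.HALT
    if failed_boots >= max_failed_boots:
        return RestartMode.FAILSAFE
    if cause == "upgrade" and failed_boots > 0:
        return RestartMode.ROLLBACK
    return RestartMode.FAST
```

Keeping the decision in one place makes the behavior easy to configure and audit, and the `HALT` branch reflects the point that a faulted system should not be reintroduced automatically when the operator has disabled auto-reboot.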

9.4 Hot-Swap Software Requirements

Several key software features facilitate hardware hot-swap. Hot-swap for HA requires that system

software resources allotted to a board be reclaimed when the board is extracted, and added when a

new board is inserted. It is required that device control software be installed dynamically and

linked to a running operating system upon hardware insertion, and that applications can be

quiesced (or at least notified) such that pending communication/operations with the board be halted