Providing Open Architecture High Availability Solutions

System Log. Event information, exception conditions, state changes and context information

should be reported to and recorded in a structured event log, such as a system log.

6.6.4 Approach

State changes (whether generated by faults or not) of hardware and software resources within the

defined system model may signify increased or diminished capabilities, and should generate

immediate autonomous notification messages.

Detection of a fault should generate an immediate autonomous notification message. Depending

upon the type and severity, the message could range from an entry in the system log, to a report on

the system console, to an alarm sent to the middleware layer or to one or more of the management

interfaces (Section 5.5). As described in this section, the notification message may be delivered as

a best-effort one-way communication, or as a two-way acknowledged, persistent, or periodic

communication. Likewise, any unresolved fault that straddles more than one layer of the system

should be reported both in-line to related components and out-of-band to the appropriate management

interfaces, based upon its type and severity.
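
The severity-based delivery described above can be sketched as a small routing function. This is a minimal illustration only: the `Severity` levels, the `FaultEvent` fields, the destination names and the thresholds are all assumptions made for the example and are not defined by this document.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; the document does not define one.
    INFO = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass
class FaultEvent:
    source: str         # component that detected the fault
    severity: Severity  # assumed classification of the fault
    detail: str         # free-form context information


def route(event):
    """Return the destinations that should receive this notification.

    The thresholds are illustrative assumptions: every event is logged,
    and more severe events are also escalated to further destinations.
    """
    destinations = ["system_log"]
    if event.severity >= Severity.MINOR:
        destinations.append("system_console")
    if event.severity >= Severity.MAJOR:
        destinations.append("middleware_alarm")
    if event.severity >= Severity.CRITICAL:
        destinations.append("management_interface")
    return destinations
```

For example, under these assumed thresholds `route(FaultEvent("nic0", Severity.MAJOR, "link down"))` yields the system log, the system console and a middleware alarm, but not the management interface.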

After a fault has been detected and reported, the middleware and other management functions may

use component interfaces in a directed manner to invoke the fault management process. Directed

invocation is typically used to sequence the fault management process and to contain propagation of the fault.

In general, system components with a notification capability should retain their fault, status,

performance and state information until: a) it is cleared by a directed command or by an

acknowledgement; b) retention capacity is exceeded (capacity may be none, or may be some

form of wrap-around register or buffer); or c) notification arrives that the system fault has been repaired. This

concept could apply to a communications controller chip, a driver with state and error registers, a

protocol stack, kernel, system log, or application.
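
The three retention conditions can be sketched with a bounded, wrap-around store. This is an illustrative sketch only: the class name `FaultRecordStore` and its methods are hypothetical, and a real driver or controller would implement the same idea in registers or a fixed buffer.

```python
from collections import deque


class FaultRecordStore:
    """Retains fault records until cleared, overwritten, or repaired."""

    def __init__(self, capacity):
        # A capacity of 0 models a component with no retention at all;
        # a deque with maxlen=0 silently discards every appended record.
        self._records = deque(maxlen=capacity)

    def record(self, fault_id, info):
        # Condition (b): when the buffer is full, the oldest record
        # is dropped to make room (wrap-around behaviour).
        self._records.append((fault_id, info))

    def _clear(self, fault_id):
        kept = [r for r in self._records if r[0] != fault_id]
        self._records = deque(kept, maxlen=self._records.maxlen)

    def acknowledge(self, fault_id):
        # Condition (a): cleared by a directed command or acknowledgement.
        self._clear(fault_id)

    def repaired(self, fault_id):
        # Condition (c): notification that the fault has been repaired.
        self._clear(fault_id)

    def pending(self):
        # Records still retained, oldest first.
        return list(self._records)
```

With a capacity of 2, recording three faults leaves only the two most recent in `pending()`; acknowledging or repairing a fault removes its record explicitly.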

In the fault management process, it is typical to use request/acknowledge, or other forms of reliable

communications methods, as available. After the initial report of a fault, it is not immediately

known whether the notification reports the original fault, or a subsequent fault that has resulted

from the original (as yet) undetected fault, and whether any other components may have faulted

because of the original fault. Using directed, reliable communications methods in the fault

management process helps ensure that otherwise undetected faults are recognized. A failure to respond to a

directed-acknowledge request within a reasonable period of time becomes a method of detecting

failure. It is important to understand that a failure to respond could indicate that the unit expected

to respond is faulty, that the communication link is unavailable, or that the reception path on the

original sender’s interface is faulty.
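
Timeout-based detection with a directed request can be sketched as follows. The sketch is an assumption-laden illustration: `directed_probe`, its parameters, and the use of an in-process `queue.Queue` as the acknowledgement channel are all hypothetical stand-ins for whatever transport a real system provides.

```python
import queue

# Possible explanations for a missing acknowledgement, as listed in the
# text above. The probe alone cannot tell them apart.
NO_RESPONSE_CAUSES = (
    "responding unit faulty",
    "communication link unavailable",
    "sender reception path faulty",
)


def directed_probe(send_request, ack_queue, timeout_s):
    """Send a directed request and wait a bounded time for its acknowledgement.

    send_request: hypothetical callable that transmits the request.
    ack_queue:    queue.Queue on which acknowledgements arrive.
    timeout_s:    the "reasonable period of time" to wait, in seconds.
    """
    send_request()
    try:
        ack_queue.get(timeout=timeout_s)
    except queue.Empty:
        # No acknowledgement in time: treat the target as suspect and
        # report every candidate cause rather than assuming one.
        return ("suspect", NO_RESPONSE_CAUSES)
    return ("acknowledged", ())
```

Returning all three candidate causes, rather than declaring the remote unit failed, reflects the ambiguity noted above; further directed probes over other paths would be needed to narrow the cause.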

6.6.5 Techniques

“Go out with a bang, not a whimper”

Whenever possible, abnormal termination of software components or the OS should be recorded with

as much context and detail as possible to allow for diagnosis and debugging of the fault. Typically this

type of information is found in application core and OS crash files. Source file information is

usually needed to interpret these files fully.
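
As an application-level analogue of a core or crash file, a process can record its own context when it terminates on an unhandled error. The following is a minimal sketch using Python's `sys.excepthook`; the record fields, the function names `build_crash_record` and `install_crash_hook`, and the JSON-lines output format are assumptions for illustration.

```python
import json
import os
import platform
import sys
import time
import traceback


def build_crash_record(exc_type, exc_value, tb):
    """Collect context about an unhandled exception into a plain dict."""
    return {
        "time": time.time(),               # when the failure was recorded
        "pid": os.getpid(),                # which process failed
        "platform": platform.platform(),   # OS and version information
        "argv": sys.argv,                  # how the process was started
        "exception": exc_type.__name__,
        "message": str(exc_value),
        # File names and line numbers here are the "source file
        # information" needed to interpret the record.
        "traceback": traceback.format_exception(exc_type, exc_value, tb),
    }


def install_crash_hook(path):
    """Append a crash record to `path` before the default handler runs."""

    def hook(exc_type, exc_value, tb):
        with open(path, "a") as f:
            f.write(json.dumps(build_crash_record(exc_type, exc_value, tb)) + "\n")
        # Still print the usual traceback to stderr.
        sys.__excepthook__(exc_type, exc_value, tb)

    sys.excepthook = hook
```

A hook of this kind only covers unhandled exceptions in the interpreter; hard crashes such as segmentation faults still depend on core files or facilities such as the standard `faulthandler` module.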