Providing Open Architecture High Availability Solutions
System Log. Event information, exception conditions, state changes and context information
should be reported to and recorded in a structured event log, such as a system log.
6.6.4 Approach
State changes (whether generated by faults or not) of hardware and software resources within the
defined system model may signify increased or diminished capabilities, and should generate
immediate autonomous notification messages.
Detection of a fault should generate an immediate autonomous notification message. Depending
upon the type and severity, the message could range from an entry in the system log or a report on
the system console to an alarm sent to the middleware layer or to one or more of the management
interfaces (Section 5.5). As described in this section, the notification message may be delivered as
a best-effort one-way communication, or as a two-way acknowledged, persistent, or periodic
communication. Likewise, any unresolved fault that straddles more than one layer of the system
should be reported both in-line to the related components and out-of-band to the appropriate
management interfaces, based upon its type and severity.
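
As an illustration of severity-based delivery, the following minimal Python sketch maps a fault's severity to the set of destinations named above. The severity levels, the thresholds, and the destination names (`system_log`, `system_console`, `middleware_alarm`, `management_interface`) are assumptions made for the example; the text does not prescribe them.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical severity scale; the source does not define one.
    INFO = 0
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def route_notification(severity):
    """Return the destinations for a fault notification.

    Every fault is logged; more severe faults are escalated further.
    The thresholds are illustrative assumptions.
    """
    sinks = ["system_log"]
    if severity >= Severity.MINOR:
        sinks.append("system_console")
    if severity >= Severity.MAJOR:
        sinks.append("middleware_alarm")
    if severity >= Severity.CRITICAL:
        sinks.append("management_interface")
    return sinks
```

In a real system the mapping would also depend on the fault type, and each destination would have its own delivery mode (one-way, acknowledged, persistent, or periodic).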
After a fault has been detected and reported, the middleware and other management functions may
use component interfaces in a directed manner to invoke the fault management process. This is
typically used to sequence the fault management process, and to isolate propagation of the fault.
In general, system components with a notification capability should retain this fault, status,
performance and state information until one of the following occurs: a) it is cleared by a directed
command or by an acknowledgement; b) capacity is exceeded (capacity may be none, or may take
the form of a wrap-around register or buffer); c) notification is received that the system fault
has been repaired. This concept could apply to a communications controller chip, a driver with
state and error registers, a protocol stack, kernel, system log, or application.
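
The three retention rules above can be sketched as a small bounded store. This is a minimal sketch, not a prescribed design: the class name `NotificationStore` and its method names are hypothetical, and a capacity of zero is used to model a component with no retention.

```python
from collections import deque


class NotificationStore:
    """Retains notifications until acknowledged, overwritten, or repaired."""

    def __init__(self, capacity):
        # deque(maxlen=n) drops the oldest record when full (wrap-around);
        # maxlen=0 retains nothing at all.
        self._records = deque(maxlen=capacity)
        self._next_id = 0

    def record(self, fault_id, info):
        """Store a notification and return its record id."""
        self._next_id += 1
        self._records.append((self._next_id, fault_id, info))
        return self._next_id

    def _keep(self, predicate):
        kept = [r for r in self._records if predicate(r)]
        self._records = deque(kept, maxlen=self._records.maxlen)

    def acknowledge(self, record_id):
        """Rule a): clear one record by directed command or acknowledgement."""
        self._keep(lambda r: r[0] != record_id)

    def fault_repaired(self, fault_id):
        """Rule c): clear every record belonging to a repaired fault."""
        self._keep(lambda r: r[1] != fault_id)

    def pending(self):
        return list(self._records)
```

Rule b) needs no explicit method here because the bounded `deque` discards the oldest record automatically once capacity is exceeded.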
In the fault management process, it is typical to use request/acknowledge, or other forms of reliable
communications methods, as available. After the initial report of a fault, it is not immediately
known whether the notification reports the original fault or a subsequent fault that has resulted
from the original, as yet undetected, fault, nor whether any other components may have faulted
because of the original fault. Using directed, reliable communications methods in the fault
management process helps ensure that undetected faults are recognized. A failure to respond to a
directed-acknowledge request within a reasonable period of time itself becomes a method of
detecting failure. It is important to understand that a failure to respond could indicate any of
the following: the unit expected to respond is faulty, the communication link is unavailable, or
the reception path on the original sender’s interface is faulty.
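
The use of a missing acknowledgement as a failure detector can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `directed_request`, the `send` callable, and the reply queue are hypothetical stand-ins for whatever transport the system actually provides, and the timeout value is left to the caller.

```python
import queue

# Possible explanations for silence, taken from the paragraph above.
SUSPECTS = ("queried unit", "communication link", "sender's receive path")


def directed_request(send, replies, timeout_s):
    """Send a directed request and wait up to timeout_s for an acknowledgement.

    Returns (True, ()) if an acknowledgement arrives in time, otherwise
    (False, SUSPECTS). A timeout alone cannot tell the suspects apart.
    """
    send("status-request")
    try:
        replies.get(timeout=timeout_s)
    except queue.Empty:
        return False, SUSPECTS
    return True, ()
```

Because a timeout leaves three candidates, a fault manager would typically follow up with further directed requests over a different link or to a different unit to narrow the cause.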
6.6.5 Techniques
“Go out with a bang, not a whimper”
Whenever possible, abnormal termination of software components or the OS should be recorded
with as much context and detail as possible to allow for diagnosis and debugging of the fault.
Typically this type of information is found in application core and OS crash files. Source file
information is usually needed to fully interpret this information.
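
At the application level, the same idea can be sketched as a hook that writes a context-rich record before the process terminates. This is a minimal sketch that assumes a Python component and a JSON-lines crash file; the function names `build_crash_record` and `install_crash_hook` and the record fields are hypothetical choices for the example.

```python
import json
import os
import sys
import time
import traceback


def build_crash_record(exc):
    """Collect context about an unhandled exception for later diagnosis."""
    return {
        "time": time.time(),
        "pid": os.getpid(),
        "argv": sys.argv,
        "type": type(exc).__name__,
        "message": str(exc),
        # File names and line numbers tie the record back to source files.
        "traceback": traceback.format_exception(
            type(exc), exc, exc.__traceback__
        ),
    }


def install_crash_hook(path):
    """Append a crash record to `path` whenever an exception goes unhandled."""
    def hook(exc_type, exc, tb):
        with open(path, "a") as f:
            f.write(json.dumps(build_crash_record(exc)) + "\n")
        # Still print the usual traceback to stderr.
        sys.__excepthook__(exc_type, exc, tb)

    sys.excepthook = hook
```

This captures only what the interpreter can see; faults in native code or the OS itself still require core and crash files.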