Providing Open Architecture High Availability Solutions

5.5.2 Alarms

Alarms are intended to convey critical system exception information in an appropriate, effective

and timely manner.

Definition. Alarms are typically autonomously generated messages that are triggered by a specific

causative stimulus. The stimulus might be: a detected fault; a state change (system power-on

generating a trap, or role change generating an alarm); a threshold crossing (a specified parameter,

or rate of change of that parameter, exceeds or falls below a pre-established value); or the

recording of a particular event type or event severity (for instance, when the storage medium is running

out of available space). A filter process is often used to decide which events raise an alarm and at

what level.
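
As a minimal sketch of the threshold-crossing stimulus described above, the following Python function checks one sample of a monitored parameter against a pre-established value and a maximum rate of change. The names Threshold and check_sample, and the example parameter, are hypothetical. For brevity the sketch covers only the "exceeds" direction; a "falls below" check would mirror it. It assumes dt is greater than zero.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Hypothetical limits for one monitored parameter."""
    name: str
    high: float      # alarm when the value rises above this limit
    max_rate: float  # alarm when the change per second exceeds this limit


def check_sample(th: Threshold, prev: float, curr: float, dt: float) -> Optional[str]:
    """Return a description of the crossing, or None if no alarm is warranted."""
    # Value crossing: report only the transition, not every sample above the limit.
    if prev <= th.high < curr:
        return f"{th.name}: value {curr} exceeded {th.high}"
    # Rate-of-change crossing: the parameter is moving too fast.
    rate = abs(curr - prev) / dt
    if rate > th.max_rate:
        return f"{th.name}: rate {rate:.2f}/s exceeded {th.max_rate}/s"
    return None
```

Reporting only the transition across the limit, rather than every sample beyond it, is one way a filter process might avoid flooding the management interface with duplicate alarms.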

Concept of Levels. A management scheme may include different escalating levels of alarms.

These may be indicated by different LED colors or by different types (alarms, alerts, warnings, traps). The levels

may be differentiated by severity or type and may be escalated if not resolved after some defined

interval.
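
The escalation rule can be sketched as follows. The three level names and their ordering are assumptions made for illustration (the text does not rank them), as is the policy of raising the level one step per elapsed interval.

```python
from enum import IntEnum


class Level(IntEnum):
    # Hypothetical ordering, lowest to highest.
    WARNING = 1
    ALERT = 2
    ALARM = 3


def escalate(level: Level, raised_at: float, now: float, interval: float) -> Level:
    """Raise the level one step for each full interval the condition stays unresolved."""
    steps = int((now - raised_at) // interval)
    # Clamp at the highest defined level.
    return Level(min(level + steps, Level.ALARM))
```

For example, with a 60-second interval a WARNING raised at time 0 would still be a WARNING at 59 seconds and would become an ALERT at 60 seconds.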

Context/Content. To be useful, the alarm should include sufficient information on the context of

the exception, its type and severity (if available), and specific information on its location.
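
One possible shape for such an alarm record is sketched below. The class name, field names and example values are hypothetical; only the four kinds of information (context, type, severity if available, location) come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlarmRecord:
    alarm_type: str                 # e.g. "threshold-crossing" or "state-change"
    location: str                   # where the exception occurred
    context: str                    # what was observed when it occurred
    severity: Optional[str] = None  # included only if available

    def summary(self) -> str:
        """Render the record as a single line for a management interface."""
        sev = self.severity or "unspecified"
        return f"[{sev}] {self.alarm_type} at {self.location}: {self.context}"
```

Making severity optional reflects the "if available" qualifier: a record without it is still valid, and the summary marks the severity as unspecified.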

Communication Method. The alarm information can be broadcast to numerous management

interfaces and to any management entity. For instance, a power-on trap message might be

published for everyone to hear, and might be published without regard to whether any entity

received it (unreliable). Alarms can also be directed to particular management entities, and can be

unreliable (as described above), periodic (generated at an interval until resolved), or can be

persistent (continuing to alarm until the alarm is acknowledged, even if the fault has not been

resolved). For instance, the loss of the co-generation facility at a central office would generate an

alarm message that would be both persistent (ensuring that all appropriate management interfaces,

local and remote, received the message) and periodic (re-alarming at regular intervals while

continuing to operate on backup battery power).
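
The three delivery behaviors can be compared with a small decision function. This is a sketch, not a defined protocol: the function name and flags are hypothetical, and the handling of an alarm that is both periodic and persistent (it stops only once it is both resolved and acknowledged) is one reading of the co-generation example above.

```python
def still_alarming(periodic: bool, persistent: bool,
                   resolved: bool, acknowledged: bool) -> bool:
    """Decide whether an alarm that has already been sent once should be sent again."""
    # Periodic: keep repeating at each interval until the fault is resolved.
    if periodic and not resolved:
        return True
    # Persistent: keep alarming until a management entity acknowledges it.
    if persistent and not acknowledged:
        return True
    # Unreliable (neither flag set): publish once and do not follow up.
    return False
```

Under these assumptions, an acknowledged persistent alarm goes quiet even though the fault remains, whereas a periodic alarm keeps repeating after acknowledgment until the fault is cleared.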

5.5.3 Integration with External Network Management Systems

It is critical that a system not only be able to control and monitor itself, but also that it work in a

network of other systems. There are several standard methods that external management systems

use to communicate with the systems they monitor. Most of these methods have defined messages

with room for expansion, and when they are expanded, there is usually an attempt to standardize

messages so that compatibility is optimized. The following standards are in use today:

Simple Network Management Protocol (SNMP). SNMP is a set of protocols that pass

information about whether or not a component is operating properly. SNMP uses Management

Information Bases (MIBs) to store data about a system.

Remote Monitoring (RMON). RMON allows network usage information to be gathered by

providing new sets of MIBs. Network devices in a system must be RMON-compliant, or RMON

will not work.

Common Management Information Protocol (CMIP). CMIP is an OSI standard protocol used

with CMIS (Common Management Information Service). CMIS defines a system of network

management information services. CMIP was proposed as a replacement for the less sophisticated

SNMP, but has taken root only in the telecom space. CMIP provides improved security and better

reporting of unusual network conditions.
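
To make the SNMP entry above more concrete, the following toy model treats a MIB as an ordered table of object identifiers (OIDs) and values. It is an in-memory illustration only, not an SNMP implementation: no packets are sent, the function names are hypothetical, and the stored values are invented. The three OIDs are the standard MIB-II system-group objects sysDescr.0, sysUpTime.0 and sysName.0.

```python
from typing import Dict, Optional, Tuple

Oid = Tuple[int, ...]

# Toy MIB contents; the values are illustrative.
MIB: Dict[Oid, object] = {
    (1, 3, 6, 1, 2, 1, 1, 1, 0): "example chassis",  # sysDescr.0
    (1, 3, 6, 1, 2, 1, 1, 3, 0): 123456,             # sysUpTime.0 (hundredths of a second)
    (1, 3, 6, 1, 2, 1, 1, 5, 0): "node-a",           # sysName.0
}


def get(oid: Oid) -> Optional[object]:
    """Exact lookup, in the spirit of an SNMP GET."""
    return MIB.get(oid)


def get_next(oid: Oid) -> Optional[Tuple[Oid, object]]:
    """Return the first entry whose OID sorts after `oid`, in the spirit of GETNEXT."""
    # Python compares tuples lexicographically, which matches OID ordering.
    for candidate in sorted(MIB):
        if candidate > oid:
            return candidate, MIB[candidate]
    return None
```

Repeatedly calling get_next with the OID it last returned walks the whole table, which is how a manager can discover objects without knowing them in advance.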