Providing Open Architecture High Availability Solutions

The Telcordia and ITU standards for alarming go well beyond defining hardware capabilities,

describing a complete approach to fault management. A complete treatment of this topic is beyond

the scope of this paper, but the standards themselves are a resource that any organization working on fault management standards should consider.

The OSSGR Section 9 Specification defines an alarm as “a trouble or immediate condition that has

an immediate or potential effect on the operation of the operator services system, and requires

some action by a craftsperson to restore normal operation or prevent degradation of service.”

There are three alarm levels defined:

• Critical. Alarm shall be used to indicate a severe service-affecting condition, which requires

immediate corrective action, regardless of time of day or day of week.

• Major. Alarm shall be used to indicate a failure in a major redundant service such that a

further failure would create a critical condition. These troubles may require immediate

craftsperson attention to restore or maintain system capability.

• Minor. Alarm shall be used to indicate troubles that do not have a serious effect on service to customers, or troubles in services that are not essential to the primary operation.

The alarm levels are exclusive, with critical taking precedence over major, and major taking

precedence over minor.
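
The precedence rule can be illustrated with a short sketch. It assumes one reading of "exclusive": when several alarms are active at once, only the highest level is reported. The names `Severity` and `effective_severity` are hypothetical and do not come from the OSSGR or ITU documents.

```python
from enum import IntEnum


class Severity(IntEnum):
    """The three alarm levels, ordered so that a higher value takes precedence."""
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def effective_severity(active):
    """Return the single level to report for a set of active alarms.

    Assumption: critical outranks major, and major outranks minor, so the
    reported level is simply the maximum. Returns None when nothing is active.
    """
    return max(active, default=None)


if __name__ == "__main__":
    level = effective_severity([Severity.MINOR, Severity.CRITICAL, Severity.MAJOR])
    print(level.name)
```

Using an ordered enumeration keeps the precedence rule in one place, so the reporting logic does not need a chain of explicit comparisons.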

These standards also define an alarm handling strategy with three levels:

Unit Level. Software recognizes faults in both the hardware and the transmitted data, correlates them to determine whether multiple ports are seeing the same condition, and takes corresponding action. It may switch traffic to protection hardware and then raise an alarm to the next higher level, which in this case is the system level.

System Level. At the system level, faults are again recognized and correlated; that is, the system may map multiple faults to a single action. Actions could include bringing protection hardware online, changing a cross-connect, or simply raising an alarm at the system level. They could also include sending a message to the network level that an alarm exists.

Network Level. The network level replicates what is done at the system level: faults are again recognized and correlated. It may map multiple faults to a single action, which could include bringing protection hardware online, changing a cross-connect, or simply raising an alarm at the network level.

Each level can make decisions and enforce actions independently, but is also required to interact with and inform all the other levels.
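
The three-level strategy can be sketched as a single correlation step applied at each level in turn. This is a minimal illustration under stated assumptions, not the procedure the standards prescribe: faults are grouped by symptom, a hypothetical `threshold` decides between switching to protection hardware and simply raising an alarm, and every correlated group is forwarded upward as one alarm. The names `Fault`, `Action`, `correlate`, and the action strings are all invented for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Fault(NamedTuple):
    source: str   # where the fault was seen, e.g. a port or a lower level
    symptom: str  # what was seen, e.g. "loss_of_signal"


class Action(NamedTuple):
    level: str    # level that took the action
    symptom: str  # symptom the action responds to
    kind: str     # "switch_to_protection" or "raise_alarm" (assumed labels)


def correlate(level, faults, threshold=2):
    """Map multiple faults to a single action per symptom at one level.

    Returns (actions taken at this level, alarms to pass to the next level).
    Assumption: seeing the same symptom from `threshold` or more distinct
    sources justifies switching to protection hardware.
    """
    sources_by_symptom = defaultdict(set)
    for fault in faults:
        sources_by_symptom[fault.symptom].add(fault.source)

    actions, upward = [], []
    for symptom, sources in sorted(sources_by_symptom.items()):
        if len(sources) >= threshold:
            kind = "switch_to_protection"
        else:
            kind = "raise_alarm"
        actions.append(Action(level, symptom, kind))
        # Inform the next higher level with one alarm per correlated group.
        upward.append(Fault(level, symptom))
    return actions, upward


if __name__ == "__main__":
    unit_a = [Fault("port-1", "loss_of_signal"), Fault("port-2", "loss_of_signal")]
    unit_b = [Fault("port-7", "loss_of_signal")]

    a_actions, a_up = correlate("unit-A", unit_a)
    b_actions, b_up = correlate("unit-B", unit_b)
    sys_actions, sys_up = correlate("system", a_up + b_up)
    net_actions, _ = correlate("network", sys_up)

    for action in a_actions + b_actions + sys_actions + net_actions:
        print(action)
```

Each level decides independently using the same function, and the `upward` list is the only coupling between levels, which mirrors the requirement that every level both acts on its own and informs the others.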

8.5.4 Interconnects

There are upcoming standards in interconnect technology that will become important in HA

systems. Some of the more prominent ones are listed in this section.

Gigabit Ethernet

Gigabit Ethernet is derived from the original 10 Mb/s and the later 100 Mb/s Ethernet. The link

layer protocol and the packet structure are preserved. Ethernet transceivers can drive copper cable or

FR4, and the technology can be used for backplane interconnect. Today most Ethernet networks are

implemented as switched fabrics, and the same approach would be used for backplane

interconnect. Switched gigabit Ethernet uses point-to-point connections between nodes. Switches