Providing Open Architecture High Availability Solutions
The Telcordia and ITU standards for alarming go well beyond defining hardware capabilities and
describe a complete approach to fault management. A complete treatment of this topic is beyond
the scope of this paper, but these standards are a resource that any organization working on
standards for fault management should consider.
The OSSGR Section 9 Specification defines an alarm as “a trouble or immediate condition that has
an immediate or potential effect on the operation of the operator services system, and requires
some action by a craftsperson to restore normal operation or prevent degradation of service.”
There are three alarm levels defined:
• Critical. Alarm shall be used to indicate a severe service-affecting condition, which requires
immediate corrective action, regardless of time of day or day of week.
• Major. Alarm shall be used to indicate a failure in a major redundant service such that a
further failure would create a critical condition. These troubles may require immediate
craftsperson attention to restore or maintain system capability.
• Minor. Alarm shall be used to indicate troubles, which do not have a serious effect on service
to customers, or troubles in services that are not essential to the primary operation.
The alarm levels are exclusive, with critical taking precedence over major, and major taking
precedence over minor.
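The precedence rule can be illustrated with a minimal Python sketch. It is not taken from the OSSGR text: the names `Severity` and `effective_severity` are hypothetical, and the numeric values are an assumption chosen only so that ordinary comparison reflects the stated precedence.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Severity(IntEnum):
    """The three alarm levels; larger value means higher precedence (assumed encoding)."""
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


def effective_severity(active: Iterable[Severity]) -> Optional[Severity]:
    """Return the single level to report for a set of active alarm conditions.

    Critical takes precedence over major, and major over minor.
    Returns None when no alarm condition is active.
    """
    return max(active, default=None)
```

With this encoding, a system that has both a minor and a critical condition active would report only the critical level.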
These standards also define an alarm handling strategy with three levels:
Unit Level. Software recognizes faults in both the hardware and the transmitted data, and correlates
these faults to determine whether multiple ports are seeing the same condition. It then takes the
corresponding action. For example, it may switch traffic to protection hardware and then raise an
alarm to the next higher level, which in this case is the system level.
System Level. At the system level, the faults are again recognized and correlated, so that multiple
faults may be mapped to a single action. The actions could include bringing protection hardware
on-line, changing a cross-connect, or simply raising an alarm at the system level. They could also
include sending a message to the network level that an alarm exists.
Network Level. The network level repeats what is done at the system level: the faults are again
recognized and correlated. It may map multiple faults to a single action that could include
bringing protection hardware on-line, changing a cross-connect, or simply raising an alarm at the
network level.
Each level can make decisions and enforce actions independently, but is also required to interact
with and inform the other levels.
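The three-level strategy can be sketched in Python as a chain of correlators, each of which acts locally and then informs the level above. This is only an illustration under stated assumptions, not an implementation of the Telcordia or ITU procedures: the `Fault` and `Level` classes, the `threshold` parameter (how many distinct sources must report the same fault before it counts as correlated), and the fault label `loss_of_signal` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Fault:
    source: str  # port, unit, or system that observed the fault
    kind: str    # hypothetical fault label, e.g. "loss_of_signal"


@dataclass
class Level:
    name: str
    threshold: int = 2  # assumption: distinct sources needed to correlate a fault
    parent: Optional["Level"] = None
    actions: List[str] = field(default_factory=list)
    _seen: Dict[str, Set[str]] = field(
        default_factory=lambda: defaultdict(set), init=False
    )
    _handled: Set[str] = field(default_factory=set, init=False)

    def report(self, fault: Fault) -> None:
        """Record a fault, correlate it by kind, and act once per correlated kind."""
        self._seen[fault.kind].add(fault.source)
        correlated = len(self._seen[fault.kind]) >= self.threshold
        if correlated and fault.kind not in self._handled:
            self._handled.add(fault.kind)
            # Many faults map to one local action (e.g. switch to protection hardware).
            self.actions.append(f"{self.name}: protection action for {fault.kind}")
            # Inform the next higher level, which repeats the same process.
            if self.parent is not None:
                self.parent.report(Fault(source=self.name, kind=fault.kind))


# Two units feed one system, which feeds the network level.
network = Level("network")
system = Level("system", parent=network)
unit_a = Level("unit-a", parent=system)
unit_b = Level("unit-b", parent=system)

for unit in (unit_a, unit_b):
    unit.report(Fault("port-1", "loss_of_signal"))
    unit.report(Fault("port-2", "loss_of_signal"))
```

In this run each unit acts once after two of its ports report the same fault, the system acts once after both units have escalated, and the network level records the system's alarm but takes no action because only one system has reported.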
8.5.4 Interconnects
There are upcoming standards in interconnect technology that will become important in HA
systems. Some of the more prominent ones are listed in this section.
Gigabit Ethernet
Gigabit Ethernet is derived from the original 10 Mb/s and the later 100 Mb/s Ethernet. The link
layer protocol and the packet structure are preserved. Ethernet transceivers can drive copper cable or
FR4 board traces, so the technology can be used for backplane interconnect. Today most Ethernet
networks are implemented as switched fabrics. Likewise, a switched implementation would be used for backplane
interconnect. Switched gigabit Ethernet uses point-to-point connections between nodes. Switches