User guide

Cisco Media Gateway Manager 5.0 User Guide
OL-5461-02
Chapter 9 Managing Faults
9.1 What Is Fault Management?
Service assurance is the overall process of ensuring that the purchased level of service is delivered. The
Element Management System (EMS) plays a key role in maintaining the health of both network elements
and transmission facilities. This is done in conjunction with other systems, typically at the network
management layer and service management layer. The EMS can be the primary repository for the detailed
history of NE-specific faults and events, technician actions, and performance data.
The steps for successful fault management are:
1. Identify a problem by gathering data about the state of the network (polling and trap generation).
2. Restore any services that have been lost.
3. Isolate the cause, and decide if the fault should be managed.
4. Correct the fault if possible.
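The four steps above can be sketched as a small workflow. This is an illustrative model only, not part of Cisco MGM: the Fault class, its fields, and handle_fault are hypothetical names chosen for the example, and the fault record stands in for data that would really come from polling or traps.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Step(Enum):
    IDENTIFY = 1
    RESTORE = 2
    ISOLATE = 3
    CORRECT = 4


@dataclass
class Fault:
    # All fields are hypothetical, chosen only to illustrate the steps.
    ne_id: str
    description: str
    service_lost: bool = False
    should_manage: bool = True
    correctable: bool = True
    history: List[Step] = field(default_factory=list)


def handle_fault(fault: Fault) -> List[Step]:
    """Walk one fault through the four steps and return the steps taken."""
    # 1. Identify: the Fault record represents data gathered by polling or traps.
    fault.history.append(Step.IDENTIFY)
    # 2. Restore lost services before looking for the cause.
    if fault.service_lost:
        fault.history.append(Step.RESTORE)
        fault.service_lost = False
    # 3. Isolate the cause and decide whether the fault should be managed.
    fault.history.append(Step.ISOLATE)
    # 4. Correct the fault only if it is to be managed and correction is possible.
    if fault.should_manage and fault.correctable:
        fault.history.append(Step.CORRECT)
    return fault.history
```

Note that restoration comes before isolation in this ordering, so service is brought back first and the root cause is investigated afterward.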
9.1.1 What the NE Provides
Currently deployed, intelligent NEs provide the management system with the following capabilities,
which are required for effective fault management:
Detection of the four main types of failure:
Equipment failure—Detected through failure detection mechanisms built into the hardware, and
through routine exercises and diagnostics.
Software failure—Detected through failure of software checks, and through routine audits.
Communications failure—Detected by the signal processing chip sets through defects in the
incoming signal, or through defects communicated from the distant end of a trail in an embedded
operations channel, and through continuous or periodic measurements of incoming and outgoing
signal characteristics. Defects include line coding errors, framing bit errors, parity errors,
cyclic redundancy check errors, and addressing errors. Signal characteristics include optical or
electrical power, analog signal-to-noise ratio, and deviation from required voltage or
wavelength.
Environmental failure—Defects could include power faults and overheating.
Notification of failure—NEs notify Cisco MGM when a failure occurs by generating an alarm
report. The NE can also report a summary of current fault states, or replay its log of historical
failures and clears.
Notification of changes in the operational state of the NE's components—If a component of the NE
is in the fault state, Cisco MGM should not receive (or expect to receive) further alarms, alerts,
or scheduled performance data from that component until the alarm is cleared.
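The notification behavior described in this list can be modeled with a short sketch. It assumes a simplified NE that records raised and cleared alarms in a log, can summarize its current fault states, and can replay its history. The class and method names (AlarmEvent, NetworkElement, raise_alarm, and so on) are hypothetical and do not correspond to any Cisco MGM interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AlarmEvent:
    component: str      # hypothetical component name, for example "card-3"
    failure_type: str   # "equipment", "software", "communications", or "environmental"
    cleared: bool       # False for a raised alarm, True for a clear


class NetworkElement:
    """Simplified NE that logs alarms and tracks which faults are still active."""

    def __init__(self) -> None:
        self._log: List[AlarmEvent] = []
        self._active: Dict[Tuple[str, str], AlarmEvent] = {}

    def raise_alarm(self, component: str, failure_type: str) -> AlarmEvent:
        event = AlarmEvent(component, failure_type, cleared=False)
        self._log.append(event)
        self._active[(component, failure_type)] = event
        return event

    def clear_alarm(self, component: str, failure_type: str) -> AlarmEvent:
        event = AlarmEvent(component, failure_type, cleared=True)
        self._log.append(event)
        self._active.pop((component, failure_type), None)
        return event

    def current_faults(self) -> List[Tuple[str, str]]:
        """Summary of fault states that have not been cleared."""
        return sorted(self._active)

    def replay_log(self) -> List[AlarmEvent]:
        """Historical failures and clears, in the order they occurred."""
        return list(self._log)

    def in_fault_state(self, component: str) -> bool:
        """True while the component has at least one uncleared alarm."""
        return any(name == component for name, _ in self._active)
```

Under this model, a management system would skip a component for which in_fault_state returns True when deciding where to expect scheduled performance data, and resume once the alarm is cleared.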
Cisco MGM forwards information northbound and integrates with third-party management systems to
provide options that are not directly available in Cisco MGM.
9.1.2 Fault Notification and Maintenance
Fault notification and maintenance can be proactive or reactive:
Proactive notification—Where X notifies Y of a problem regarding a service delivered from X to Y.
Reactive maintenance—Where Y contacts X to query X on potential problems in X’s domain.
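The difference between the two modes is essentially push versus pull. The sketch below is a minimal illustration under that assumption: Provider plays the role of X, and any callback registered with it plays the role of Y. All names are hypothetical.

```python
from typing import Callable, List


class Provider:
    """Party X, which delivers a service to one or more parties Y."""

    def __init__(self) -> None:
        self._problems: List[str] = []
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        """Register Y to be told about problems as they occur."""
        self._subscribers.append(callback)

    def report_problem(self, problem: str) -> None:
        """Proactive notification: X pushes the problem to every subscribed Y."""
        self._problems.append(problem)
        for callback in self._subscribers:
            callback(problem)

    def query_problems(self) -> List[str]:
        """Reactive maintenance: Y asks X about known problems in X's domain."""
        return list(self._problems)
```

In the proactive case Y learns of the problem without asking. In the reactive case Y initiates the exchange, typically after noticing degraded service on its own side.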