Cisco Active Network Abstraction Fault Management User Guide
Version 3.6 Service Pack 1

THE SPECIFICATIONS AND INFORMATION REGARDING THE PRODUCTS IN THIS MANUAL ARE SUBJECT TO CHANGE WITHOUT NOTICE. ALL STATEMENTS, INFORMATION, AND RECOMMENDATIONS IN THIS MANUAL ARE BELIEVED TO BE ACCURATE BUT ARE PRESENTED WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. USERS MUST TAKE FULL RESPONSIBILITY FOR THEIR APPLICATION OF ANY PRODUCTS.

Americas Headquarters
Cisco Systems, Inc.
THE SOFTWARE LICENSE AND LIMITED WARRANTY FOR THE ACCOMPANYING PRODUCT ARE SET FORTH IN THE INFORMATION PACKET THAT SHIPPED WITH THE PRODUCT AND ARE INCORPORATED HEREIN BY THIS REFERENCE. IF YOU ARE UNABLE TO LOCATE THE SOFTWARE LICENSE OR LIMITED WARRANTY, CONTACT YOUR CISCO REPRESENTATIVE FOR A COPY. The Cisco implementation of TCP header compression is an adaptation of a program developed by the University of California, Berkeley (UCB) as part of UCB’s public domain version of the UNIX operating system.
CONTENTS

About This Guide
  Obtaining Documentation, Obtaining Support, and Security Guidelines

CHAPTER 1 Fault Management Overview
  Managing Events
  Basic Concepts and Terms
  Alarm
  Event
  Event Sequence
  Repeating Event Sequence
  Flapping Events
  Correlation By Root Cause
  Ticket
  Sequence Association and Root Cause Analysis
  Severity Propagation

CHAPTER 2 Fault Detection and Isolation
  Unreachable Network Elements
  Sources of Alarms On a Device
  Alarm Inte
  Correlating TCA

CHAPTER 4 Advanced Correlation Scenarios
  Device Unreachable Alarm
  Connectivity Test
  Device Fault Identification
  Device Unreachable Example 1
  Device Unreachable Example 2
  IP Interface Failure Scenarios
  IP Interface Status Down Alarm
  Correlation of Syslogs and Traps
  All IP Interfaces Down Alarm
  IP Interface Failure Examples
  Interface Example 1
  Interface Example 2
  Interface Example 3
  Interface Example 4
  Interface
CHAPTER 5 Correlation Over Unmanaged Segments
  Cloud VNE
  Types of Unmanaged Networks Supported
  Fault Correlation Across the Frame Relay or ATM or Ethernet Cloud
  Cloud Problem Alarm
  Cloud Correlation Example

CHAPTER 6 Event and Alarm Configuration Parameters
  Alarm Type Definition
  Event (Sub-Type) Configuration Parameters
  General Event Parameters
  Root Cause Configuration Parameters
  Correlation Configuration Parameters
  Network Correlation Paramete
APPENDIX B Event and Alarm Correlation Flow
  Software Function Architecture
  Event Correlation Flow
  Event Creation (VNE level)
  Event Correlation
  Local Correlation (Event Correlator)
  Network Correlation (Event Correlator, Flow)
  Correlation Logic (Event Correlator)
  Alarm Sending (Event Correlator)
  Post-Correlation Rule (Event Correlator)
About This Guide

This guide includes the following chapters:
• Chapter 1, “Fault Management Overview”: Describes how to manage events, and introduces some of the key concepts of Cisco ANA alarm management.
• Chapter 2, “Fault Detection and Isolation”: Describes unreachable network elements and the sources of alarms on devices.
Obtaining Documentation, Obtaining Support, and Security Guidelines
CHAPTER 1 Fault Management Overview

This chapter describes the challenge of managing an overabundance of events, and introduces some of the key concepts of Cisco ANA alarm management.
• Managing Events: Describes how to manage events effectively.
• Basic Concepts and Terms: Describes the basic concepts and terms used throughout this guide.
• Severity Propagation: Describes the concept of severity, and how severity is propagated.
Figure 1-1 Event Flood
[Figure labels: IP Backbone; Unmanaged Network; Lost Connectivity; Trap: Link Down; Syslog: Lost Neighbor; Syslog: Lost OSPF neighbor; Syslog: Lost BGP Neighbor; Syslog: LSP Reroute; Syslog: HSRP Standby -> Active; Ping: Device Unreachable]
• Card out

An alarm is composed of a sequence of events, each representing a specific point in the alarm’s lifecycle.

Event

An event is an indication of a distinct occurrence at a specific point in time. Events are derived from incoming traps and notifications, and from detected status changes. Examples of events include:
• Port status change.
Note: The event types that belong to each sequence can be configured in the system registry.

An event sequence can consist of a single event (for example, “device reset”). The set of events that participate in Cisco ANA alarm processing can be configured in the system registry.
Figure 1-4 Flapping Event

Correlation By Root Cause

Root cause correlation is determined between alarms or event sequences. It represents a causal relationship between an alarm and the consequent alarms that occurred because of it. For example, a card-out alarm can be the root cause of several link-down alarms, which in turn can be the root cause of multiple route-lost and device unreachable alarms, and so on.
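The causal chain in the card-out example can be pictured as a small tree in which each alarm points at the alarm that caused it. The following Python sketch is illustrative only: the alarm identifiers, the `caused_by` mapping, and both helper functions are hypothetical names and do not reflect how Cisco ANA stores correlation internally.

```python
from typing import Dict, List, Optional

# Hypothetical alarm identifiers. Each alarm maps to the alarm that caused it;
# None marks an alarm with no known cause (a root cause).
caused_by: Dict[str, Optional[str]] = {
    "card-out": None,
    "link-down-1": "card-out",
    "link-down-2": "card-out",
    "route-lost-1": "link-down-1",
    "device-unreachable-1": "link-down-2",
}


def root_cause(alarm: str, links: Dict[str, Optional[str]]) -> str:
    """Follow the causal links upward until an alarm with no cause is reached."""
    seen = set()
    current = alarm
    while links[current] is not None:
        if current in seen:
            raise ValueError(f"causal loop detected at {current}")
        seen.add(current)
        current = links[current]
    return current


def consequences(root: str, links: Dict[str, Optional[str]]) -> List[str]:
    """Return, sorted by name, every alarm whose causal chain ends at root."""
    return sorted(a for a in links if a != root and root_cause(a, links) == root)
```

Walking upward from any consequent alarm ends at the same root, which is why a single root cause can stand for the whole group of alarms.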
From an operator’s point of view, the managed entity is always a complete ticket. Operations such as Acknowledge, Force-clear, or Remove are always applied to the whole ticket. The ticket also assumes an overall, propagated severity.
The propagated severity of the alarm (the whole event sequence) is always determined by the last event in the sequence. In the above example, when the link-down alarm is open it has critical severity; when it clears, it moves to normal severity. An exception to this rule is the informational event (severity level of info), such as a user acknowledge event, which does not change the propagated severity of the sequence (the alarm).
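The rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming severities are represented as plain strings; the function name and the string values are hypothetical and are not taken from the product.

```python
from typing import List, Optional

INFO = "INFO"  # informational events never change the propagated severity


def propagated_severity(event_severities: List[str]) -> Optional[str]:
    """Return the severity of the last non-informational event in the sequence.

    event_severities is ordered from the oldest event to the newest.
    Returns None when the sequence holds no non-informational event.
    """
    for severity in reversed(event_severities):
        if severity != INFO:
            return severity
    return None
```

For a link-down alarm that is opened, acknowledged by a user, and later cleared, the sequence would be `["CRITICAL", "INFO", "CLEARED"]`; the acknowledge event is skipped at every stage.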
CHAPTER 2 Fault Detection and Isolation

This chapter describes unreachable network elements and the sources of alarms on devices. In addition, it describes alarm integrity and the integrity service:
• Unreachable Network Elements: Describes how the various VNEs use reachability to check connectivity with the NEs.
• Sources of Alarms On a Device: Describes the four basic alarm sources that indicate problems in the network.
Table 2-1 Unreachable Network Elements (continued)

VNE Type | Checks reachability using | When the NE fails to respond
Generic VNE | SNMP only (default). During the SNMP test, the unit performs an “SNMP get” on the sysoid of the NE and expects to receive a response. | General polling is suspended, and a VNE Unreachable alarm is sent to the Cisco ANA Gateway.
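The generic VNE row can be read as a small state machine: probe the NE, and on failure suspend polling and raise one alarm. The Python sketch below illustrates that reading only. The `GenericVne` class, its fields, and the injected `probe` callable are hypothetical; the probe stands in for the SNMP get of the sysoid, and resuming polling when the NE answers again is an assumption that the table does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GenericVne:
    """Toy stand-in for a generic VNE; not the Cisco ANA implementation."""

    # Hypothetical probe: returns True when the SNMP get of the sysoid is answered.
    probe: Callable[[], bool]
    polling_suspended: bool = False
    alarms_sent: List[str] = field(default_factory=list)

    def check_reachability(self) -> bool:
        reachable = self.probe()
        if not reachable and not self.polling_suspended:
            # First failure: suspend general polling and report it once.
            self.polling_suspended = True
            self.alarms_sent.append("VNE Unreachable")
        elif reachable and self.polling_suspended:
            # Assumption: polling resumes once the NE answers again.
            self.polling_suspended = False
        return reachable
```

Injecting the probe keeps the sketch free of any particular SNMP library and makes the suspend-and-alarm behavior easy to exercise in isolation.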
Sources of Alarms On a Device

The following basic sources of alarms indicate a problem in the network:
• Service Alarms: Alarms generated by the VNE as a result of polling (for example, SNMP or Telnet). Usually such alarms (for example, link down, card out, device unreachable, and so on) are configured so that they can become root cause alarms, according to the correlation algorithms.
For example, this line in crontab runs the file every_12_hours.cmd at 11:00 AM and 11:00 PM:

    0 11,23 * * * local/cron/every_12_hours.cmd > /dev/null 2>&1

The integrity service tests can be defined inside the cmd file, for example:

    echo "`date '+%d/%m/%y %H:%M:%S -'` running integrity.executeTest alarm"
    cd ~/Main ; ./mc.csh localhost 8011 integrity.
CHAPTER 3 Cisco ANA Event Correlation and Suppression

This chapter describes how Cisco ANA performs correlation logic decisions:
• Event Suppression: Describes enabling or disabling port-down, port-up, link-down and link-up alarms on a selected port.
• Root-Cause Correlation Process: Describes the root-cause correlation concept.
• Root-Cause Alarms: Describes the root-cause alarm and weights concepts.
• Correlation Flows: Describes correlation by flow and correlation by key.
Root-Cause Correlation Process

Root-cause correlation is implemented in various stages within the Cisco ANA VNEs. Initially, the system tries to find the root-cause alarm. When a VNE detects a fault and opens an alarm, it attempts to find another open alarm within the same device that qualifies as the root cause of the new alarm.
Root-Cause Alarms

Potential root-cause alarms have a determined weight according to the specific event customization. Refer to Chapter 6, “Event and Alarm Configuration Parameters” for additional information about setting the weights.
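As a rough illustration of how weights could be used, the Python sketch below picks a root cause for a new alarm from the open alarms on the same device. It rests on one assumption that this section does not state: that the candidate with the highest weight is preferred. The `OpenAlarm` type, its fields, and `pick_root_cause` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OpenAlarm:
    name: str
    device: str
    weight: int  # configured per event type; see Chapter 6


def pick_root_cause(new_alarm_device: str,
                    open_alarms: Iterable[OpenAlarm]) -> Optional[OpenAlarm]:
    """Pick a root-cause candidate among open alarms on the same device.

    Assumption (not stated in this section): the highest weight wins.
    Returns None when the device has no open alarm to correlate to.
    """
    same_device = [a for a in open_alarms if a.device == new_alarm_device]
    if not same_device:
        return None
    return max(same_device, key=lambda a: a.weight)
```

A weight of 0 in this sketch simply ranks last; whether such an alarm may act as a root cause at all is governed by the event configuration, which the sketch does not model.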
This method is usually applicable for problems in the network layer and above (OSI network model) that might be caused by a problem upstream or downstream. An example is an OSPF Neighbor Down event caused by a link-down problem in an upstream router. Another important distinction between Cisco ANA PathTracer and the correlation flow is that the correlation flow may run on a historical snapshot of the network.
CHAPTER 4 Advanced Correlation Scenarios

This chapter describes the specific alarms which use advanced correlation logic on top of the root cause analysis flow:
• Device Unreachable Alarm: Describes the device unreachable alarm and its correlation, and provides various examples.
• IP Interface Failure Scenarios: Describes the ip interface status down alarm and its correlation. In addition, it describes the all ip interfaces down alarm and its correlation, and provides several examples.
Device Fault Identification

When a network element stops responding to queries from the management system, one of two things has happened:
• Connectivity to that device has been lost.
• The device itself has crashed or restarted.

Cisco ANA implements an algorithm that uses additional data to heuristically resolve the ambiguity and declare the root cause correctly.
Figure 4-2 Device Unreachable Example 2
[Figure labels: ANA Unit; SW1; SW2; R1; R2; L1; L2; L3; L4; legend: Physical links, Management connectivity]

Note: If the device has a single link and it is being managed through that link (in-band management), there is no way to determine if the device is unreachable due to a link down, or the link is down because the device is unreachable.
Table 4-1 IP Interface Status Down Alarm

Name | Description | Ticketable | Correlation allowed | Correlated to | Severity
Interface status down/up | Sent when an IP interface changes oper status to “down” | Yes | Yes | Link Down/Device unreachable/Configuration changed | Major

The alarm’s description includes the full name of the IP interface, for example Serial0.
In a multipoint setup, if only some circuits under an IP interface go down and the state of the IP interface does not change to down, no “ip interface status down” alarm is created. All the circuit down syslogs correlate by flow to the possible root cause, for example, a device unreachable alarm on a customer edge (CE) device.
Interface Example 1

In this example there is multipoint connectivity between a PE and a number of CEs through an unmanaged Frame Relay network. All the CEs (Router2 and Router3) have logical connectivity to the PE through a multipoint subinterface on the PE (Router10). The keepalive option is enabled for all circuits. A link inside the unmanaged network is disconnected, which causes all the CEs to become unreachable.
• An ip interface status down alarm is generated on the PE.

The following correlation information is provided:
• The root cause is device unreachable:
  – The ip interface status down alarm is correlated to the device unreachable alarm.
  – The syslogs and traps for the related subinterfaces are correlated to the ip interface status down alarm.
[Figure labels: unreachable Router2 10.222.1.2; unreachable Router2 10.222.1.3]

Interface Example 4

[Figure labels: Mixed Multipoint and Point-to-point connectivity; interface Serial0/0.100 point-to-point 10.200.1.2; interface Serial0/0.101 point-to-point 10.200.1.3; interface Serial0/0.55 multipoint 10.200.1.10; Frame-Relay cloud; Link is "down"; Router10 10.222.1.]
The following correlation information is provided:
• Device unreachable on the CE:
  – The Syslog alarm is correlated by flow to the possible root cause, for example, a device unreachable alarm on CE1.

ATM Examples

Similar examples involving ATM technology have the same result, assuming that a failure in an unmanaged network causes the status of the IP interface to change to down (ILMI is enabled).
The following failures are identified in the network:
• A link-down alarm is generated on the PE.
• An ip interface status down alarm is generated on the PE.
• A device unreachable alarm is generated on the CE.

The following correlation information is provided:
• Link down on the PE:
  – The ip interface status down alarm on the PE is correlated to the link-down alarm.
Multi Route Correlation

The correlation mechanism supports multi route scenarios, thereby eliminating false correlation and ensuring that the correct root cause alarm is reported. If multi route segments exist, all the alarms found on a certain path (after eliminating invalid paths) are collected into an alarm set.
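The collection step described above can be sketched as follows. This is a minimal illustration in Python, assuming each path is a list of network element names and that invalid paths are identified by their index; the function and parameter names are hypothetical. How the resulting alarm sets are then compared is outside the scope of this sketch.

```python
from typing import Dict, List, Optional, Set


def collect_alarm_sets(paths: List[List[str]],
                       alarms_by_element: Dict[str, Set[str]],
                       invalid_paths: Optional[Set[int]] = None) -> List[Set[str]]:
    """Build one alarm set per valid path.

    paths             -- each path is the ordered list of elements it crosses
    alarms_by_element -- open alarms found on each element
    invalid_paths     -- indexes of paths that were eliminated beforehand
    """
    invalid = invalid_paths or set()
    alarm_sets: List[Set[str]] = []
    for index, path in enumerate(paths):
        if index in invalid:
            continue  # eliminated paths contribute no alarm set
        found: Set[str] = set()
        for element in path:
            found |= alarms_by_element.get(element, set())
        alarm_sets.append(found)
    return alarm_sets
```

Keeping one set per path, rather than one merged set, preserves the information about which alarms lie on which route.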
Figure 4-11 Multi Route Correlation Example 2
[Figure labels: CE1; CE2; PE1; PE2; PE3; P1 to P8; MNG core; Link down #1; Link down #2; Link down #3; Device unreachable]

In this case the system will provide the following report:
• Root cause: Device Unreachable.
Multi Route Correlation Example 4

In this example, two paths exist from CE1 to PE2. Several links went down, and there is an MPLS black hole in the multi route segment. As a result, router CE1 became unreachable.
Note: The GRE Tunnel Down alarm is supported only on GRE tunnels that are configured with keepalive. When keepalive is configured on the GRE tunnel edge, if a failure occurs in the GRE tunnel link, both IP interfaces of the GRE tunnel will be in Down state.
GRE Tunnel Down Correlation Example 2

This example provides a real-world scenario in which multiple GRE tunnels cross a physical link. When this link is shut down by an administrator, many alarms are generated. All the alarms are correlated to the root cause ticket “Link down due to admin down”, as illustrated in Figure 4-15.
Figure 4-16 shows the Correlation tab of the Ticket Properties dialog box, which displays all the alarms that are correlated to the ticket, including the correlation for each GRE tunnel and its interface status.
BGP Process Down Alarm

The BGP process down alarm is issued when the BGP process is shut down on a device. When this happens, the BGP neighbor down events correlate to it, as do all the device unreachable alarms from the CE devices that lost connectivity to the VRF because the BGP process is down on the route reflector.
CHAPTER 5 Correlation Over Unmanaged Segments

This chapter describes how Cisco ANA performs correlation decisions over unmanaged segments, namely, clouds.
• Cloud VNE: Describes managing more than one network segment that interconnects with others, over another network segment which is not managed.
• Cloud Problem Alarm: Describes the cloud problem alarm and its correlation, and provides an example.
Table 5-1 Cloud Types Supported

Columns: Technology Type | Supported When... | Logical Inventory | Physical Inventory

ATM: An ATM cloud (representing unmanaged network segments) comprised of ATM switches is connected to routers (managed segments) with ATM interfaces. The ATM interface or sub-interface in the router is IP over an ATM VC encapsulation interface with a VC (VPI or VCI) or VP (VPI) configuration.
Cloud Problem Alarm

For some events, when there is no root cause found, a special cloud problem alarm is created. These events are then correlated to the alarm. The cloud problem alarm has a major severity, and is automatically cleared after a delay.

Note: When required, a correlation filter filters the cloud problem. This enables or disables the ability of an alarm to create a cloud problem alarm, and to correlate to it.
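The fallback behavior can be sketched as below. This Python illustration assumes a root-cause lookup and a per-event filter are supplied as callables; `correlate`, `find_root_cause`, and `may_use_cloud_problem` are hypothetical names, and the automatic clearing after a delay is not modeled.

```python
from typing import Callable, Dict, Iterable, Optional

CLOUD_PROBLEM = "cloud problem"  # the special fallback alarm (major severity)


def correlate(events: Iterable[str],
              find_root_cause: Callable[[str], Optional[str]],
              may_use_cloud_problem: Callable[[str], bool]) -> Dict[str, Optional[str]]:
    """Map each event to its root cause, falling back to the cloud problem alarm.

    find_root_cause       -- hypothetical lookup; returns None when nothing is found
    may_use_cloud_problem -- hypothetical stand-in for the correlation filter
    """
    result: Dict[str, Optional[str]] = {}
    for event in events:
        cause = find_root_cause(event)
        if cause is None and may_use_cloud_problem(event):
            cause = CLOUD_PROBLEM
        result[event] = cause
    return result
```

Events for which the filter disables the fallback stay uncorrelated (`None`) in this sketch.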
CHAPTER 6 Event and Alarm Configuration Parameters

This chapter describes the different options that exist to modify the alarm behavior by editing the appropriate alarm parameters in the system registry.
• Alarm Type Definition: Describes the alarm type concept.
• Event (Sub-Type) Configuration Parameters: Describes the event and alarm configuration parameters and values that can be controlled through the registry.
Event (Sub-Type) Configuration Parameters

General Event Parameters

Parameter Name | Description | Permitted Values
severity | Severity level of the event. | Either: CRITICAL, MAJOR, MINOR, WARNING, CLEARED, UNKNOWN, INFO
is-ticketable | Determines whether the alarm will generate a new ticket, if there is no root-cause alarm to correlate to. | True (ticketable)
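The interplay of the two parameters above can be illustrated with a short Python sketch. The severity names come from the table; the `Severity` enum and the `opens_new_ticket` helper are hypothetical and only restate the is-ticketable description as code.

```python
from enum import Enum


class Severity(Enum):
    """Permitted values of the severity parameter, as listed in the table."""
    CRITICAL = "CRITICAL"
    MAJOR = "MAJOR"
    MINOR = "MINOR"
    WARNING = "WARNING"
    CLEARED = "CLEARED"
    UNKNOWN = "UNKNOWN"
    INFO = "INFO"


def opens_new_ticket(is_ticketable: bool, root_cause_found: bool) -> bool:
    """A ticketable alarm opens a new ticket only when it has no root cause.

    An alarm that correlates to a root cause joins that alarm's ticket instead.
    """
    return is_ticketable and not root_cause_found
```

In this reading, a ticketable link-down alarm with no root cause opens its own ticket, while the same alarm correlated to a card-out alarm does not.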
Name | Description | Permitted Values
select-root-cause-method | Used to determine the most fitting alarm from the set of possible root causes. This set may be a result of a correlation flow or may represent all alarms in the local Event Correlator component having a correlation key that matches one of the EventData object correlation keys. | Select the class name to be used from the set of classes.
Flapping Event Definitions Parameters

If a flapping event application is enabled on an event, then the following parameters control the alarm’s behavior regarding its flapping state:

Name | Description | Permitted values
Flapping interval | The maximum amount of time in milliseconds between two alarms which can be considered as a flapping change. | Positive integer
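To make the flapping interval concrete, the Python sketch below counts how many consecutive alarms arrive within that interval of each other. The function names are hypothetical, and the `threshold` argument is an assumed parameter used only to turn the count into a yes/no answer; it is not one of the parameters shown in the table above.

```python
from typing import List


def count_flapping_changes(timestamps_ms: List[int], flapping_interval_ms: int) -> int:
    """Count consecutive alarm pairs that are at most flapping_interval_ms apart.

    timestamps_ms must be sorted in ascending order.
    """
    return sum(
        1
        for earlier, later in zip(timestamps_ms, timestamps_ms[1:])
        if later - earlier <= flapping_interval_ms
    )


def is_flapping(timestamps_ms: List[int], flapping_interval_ms: int, threshold: int) -> bool:
    """Hypothetical decision rule: flapping once enough quick changes are seen."""
    return count_flapping_changes(timestamps_ms, flapping_interval_ms) >= threshold
```

With alarms at 0 ms, 500 ms, 900 ms, and 5000 ms and a flapping interval of 1000 ms, the first two gaps count as flapping changes and the last one does not.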
CHAPTER 7 Impact Analysis

This chapter describes the impact analysis functionality:
• Impact Analysis Options: Describes automatic and proactive impact analysis.
• Impact Report Structure: Describes the structure of the impact report that is generated.
• Affected Severities: Describes the severities used for automatic impact analysis.
• Impact Analysis GUI: Describes how the user can view impact analysis information in Cisco ANA NetworkVision.
Note: Each fault that has been identified as potentially service affecting triggers an impact analysis calculation, even if it is recurring in the network.

This chapter describes the automatic impact analysis. For more information about proactive impact analysis, refer to the Cisco Active Network Abstraction NetworkVision User Guide.
Impact Analysis GUI

The Impact Analysis GUI is available in Cisco ANA NetworkVision and displays the list of affected service resources which are embedded in the ticket information. This section describes this list.

Affected Parties Tab

The Affected Parties tab displays the service resources (affected pairs) that are affected (automatic impact analysis) for an event, an alarm, or a ticket, depending on which properties window is opened.
• Name: The subinterface (site) name or business tag name of the affected element, if it exists. For more information, refer to the Cisco Active Network Abstraction Managing MPLS User Guide.
• Type: The business tag type.
• IP Address: If the affected element is an IP interface, the IP address of the subinterface site is displayed. For more information, refer to the Cisco Active Network Abstraction Managing MPLS User Guide.
Figure 7-2 Detailed Report For the Affected Pair

The following fields are displayed at the top of the Affected Parties Destination Properties dialog box:
• Affected Pair: The details of A side and Z side of the affected pair.
• Alarm Clear State: An indication for each pair of the clear state of the alarm. The following states exist:
  – Not Cleared: There are one or more alarms that have not been cleared for this pair.
Disabling Impact Analysis

You can disable impact analysis for a specific alarm. This option can be set in the Cisco ANA Registry. If impact analysis is disabled, the system will report the event with no impact information. The settings can be changed dynamically during system runtime.
• Link A down includes the accumulation of the report of its own event sequence. It also includes the report of the BGP neighbor loss.

Accumulating the Affected Parties In an Alarm

When two events form part of the same event sequence in a specific alarm, the recurring affected pairs are displayed only once in the Affected Parties tab.
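The accumulation rule amounts to an order-preserving union of the affected pairs reported by each event. The Python sketch below illustrates it under the assumption that an affected pair is an (A side, Z side) tuple of names; the type alias and function name are hypothetical, and the sketch does not treat (A, Z) and (Z, A) as the same pair.

```python
from typing import Iterable, List, Tuple

AffectedPair = Tuple[str, str]  # (A side, Z side), as illustrative names


def accumulate_affected_pairs(reports: Iterable[Iterable[AffectedPair]]) -> List[AffectedPair]:
    """Merge the affected pairs of several events, listing each pair once.

    The first occurrence decides the position of a pair in the result.
    """
    seen = set()
    accumulated: List[AffectedPair] = []
    for report in reports:
        for pair in report:
            if pair not in seen:
                seen.add(pair)
                accumulated.append(pair)
    return accumulated
```

Two events in the same sequence that both report the pair (site-1, site-2) therefore contribute it to the accumulated list only once.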
APPENDIX A Supported Service Alarms

This appendix provides the list of service alarms that are supported by Cisco ANA 3.6.

Note: If the source of the alarm is an interface with technology which is not supported by Cisco ANA, then the alarm will not be generated.

Note: If the source of the alarm is an entity which is not modeled by Cisco ANA, for example, an unsupported module, then the alarm will not be generated.
Table A-1 Service Alarms (continued)

Alarm | Description | is-correlation-allowed | correlate | is-ticketable | severity | weight
 | Sent when all IP interfaces configured above a physical port change operating status to down. | true | true | true | MAJOR | 750
 | Sent when an IP interface changes operating status to down.
Table A-1 Service Alarms (continued)

Alarm | Description | is-correlation-allowed | correlate | is-ticketable | severity | weight
 | The port discard packets value has passed the configured settings. | false | true | true | MINOR | 0
Dropped Packets | The port dropped packets value has passed the configured settings. | false | false | true | MINOR | 0
MPLS interface removed/MPLS interface added | When the MPLS interface is removed and there is no MPLS TE tunnel on the same interface. | true
Table A-1 Service Alarms (continued)

Alarm | Description | is-correlation-allowed | correlate | is-ticketable | severity | weight
 | When a traffic engineering tunnel goes down. | true | true | true | MAJOR | 800
LDP Neighbor Down/Up | If a session to an LDP neighbor goes down as the result of a failure in the TCP connection used by the LDP session, or if the interface is no longer running MPLS.
Rx Dormant

An Rx Dormant alarm is issued when the traffic received over a physical port (measured as a percentage of the port’s capacity) drops below a predefined threshold. The alarm description includes the current traffic percentage compared with the defined threshold. This alarm provides service providers with a method for identifying customer services that have slowed down significantly or stopped altogether.
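The threshold comparison behind this alarm is simple arithmetic, sketched below in Python. The function and parameter names are hypothetical, and the units (bits per second for both the received rate and the capacity) are an assumption made for the example.

```python
def rx_utilization_percent(rx_bps: float, capacity_bps: float) -> float:
    """Received traffic as a percentage of the port's capacity."""
    if capacity_bps <= 0:
        raise ValueError("port capacity must be positive")
    return 100.0 * rx_bps / capacity_bps


def rx_dormant(rx_bps: float, capacity_bps: float, threshold_percent: float) -> bool:
    """True when received traffic has dropped below the configured threshold."""
    return rx_utilization_percent(rx_bps, capacity_bps) < threshold_percent
```

For example, a 100 Mb/s port receiving 1 Mb/s is at 1 percent utilization, which is below a 5 percent threshold, so the condition for the alarm holds.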
APPENDIX B Event and Alarm Correlation Flow

This appendix describes in detail the flow of alarms and events during the correlation process.
• Software Function Architecture: Provides an event correlation flow diagram.
Software Function Architecture

Figure B-1 Event Correlation Flow (VNE level)
[Figure labels: New Event; Event Correlation Application; If correlation enabled (Yes/No); Read Registry parameters; Continue; Store alarm for future correlation to it; Is correlation allowed (Yes/No); Send alarm to gateway; Pass to event correlation application; Correlate (correlation delay of two minutes); Wait for flow results; Start correlation flow; Flow]
Event Correlation Flow

Event Creation (VNE level)

An event (EventCorrelationData) is created at the VNE level by three different sources:
• Device Component (DC): When processing service alarms.
• EventProcessor: After parsing Syslog and SNMP trap.
• TCA Extension: After identifying a change in a property in the IMO.
If it is box-level correlation, the event is stored in the application for the correlation delay period, and during this period it collects all possible root causes having the same correlation delay. If it is flow-level correlation, the flow starts after the correlation delay.
2. The flow starting and ending points are defined by the event correlation parameters (see Table B-1).
3.