By Huang Bin
IP network troubleshooting is typically an uphill climb. Fault location is hindered by an excess or absence of alarms, while fault determination is basically a tedious process of elimination. Both are slow and inefficient; a solution would be welcome.
The architecture of an IP network is highly stratified. A virtual leased line (VLL) service must traverse multiple layers of processing, including the physical layer, link layer, routing protocol, MPLS, and VLL. If a physical fiber breaks, the physical layer, link layer, IP transport layer, and VLL will all be affected. Each will send out a large number of TRAP messages.
Correlation between protocols is also complex. A fiber break will generally cause the routing protocol to reconverge, which in turn changes multiprotocol label switching (MPLS) and label distribution protocol (LDP) state and triggers a further plethora of TRAP messages.
The absence of an alarm, however, is much more complex. A fault can be defined as the failure of a network element to perform as expected, but what if there are no expectations? IP architecture, thanks to its fluid nature, makes expectations nearly impossible to systematically define.
The control plane determines the path from the source to the destination. On a traditional circuit-switched (CS) network, the administrator configures the active and standby paths. For each packet, its next hop can be clearly expected, along either the active or standby path. With IP networking, the routing protocol selects the path; the router "knows" only the next hop, without knowing the expected service path. Therefore, when a breakdown causes route convergence or a path computation error results in divergence, the router fails to generate an alarm.
Huawei once encountered an NGN voice service failure that lasted for more than 40 minutes, yet the IP bearer network generated no alarm. The culprit was found to be an error in label-switched path (LSP) computation, which left the LSP computation result out of step with the intermediate system-to-intermediate system (IS-IS) result. No alarm occurred because the protocol that established the LSP did not know the expected path.
In the forwarding plane, because IP networking is asynchronous, the forwarding mechanism does not provide clear expectations either. A packet from router X may be destined for router Y, but router Y will not be aware that the packet is coming and therefore will not generate an alarm if it fails to arrive.
The most common fault of this kind involves degradation between routers, resulting in packet loss. Without an alarm, such a fault may not be noticed or located for quite some time. Engineers will have to check each router along the path, without any hint as to where to begin.
Root alarm identification
When a flood of alarms comes in, the root alarm must be determined. According to Huawei statistics, most IP network faults stem from hardware and link degradation. Long-distance links are particularly sensitive to their surrounding environment; breakdowns occur from time to time. This generates a large number of alarms and causes interior gateway protocol (IGP) convergence, which leads to an increase in the number of alarms as IGP alarms trigger LSP alarms. In other words, a link alarm can bring about a multitude of protocol alarms.
For this issue, Huawei proposes a two-pronged approach. First, alarms must be classified (environment, hardware, software, interface, link, protocol, or service), with environment and hardware alarms having priority. When a higher-priority alarm is resolved, the correlated protocol alarms should disappear automatically. This approach is simple and practical, and should produce the desired result in a short time. Second, vendors should establish an alarm correlation system based on protocol and service relationships. Correlated alarms would be displayed under the root alarm, so the administrator need only deal with the root alarm directly. This approach is certainly more complete, but it is arduous and time-consuming to establish.
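As a rough illustration of the two prongs, the following Python sketch classifies alarms by category and displays correlated alarms under a root alarm. It is a minimal sketch under stated assumptions, not Huawei's implementation: the `Alarm` fields, the `resource` identifier, and the full priority ordering (beyond "environment and hardware first") are hypothetical, and correlating by shared resource is a stand-in for the protocol- and service-based correlation described above.

```python
from dataclasses import dataclass, field

# Categories from the text. Only "environment and hardware first" comes from
# the article; the ordering of the remaining categories is an assumption.
CATEGORY_PRIORITY = ["environment", "hardware", "software", "interface",
                     "link", "protocol", "service"]


@dataclass
class Alarm:
    alarm_id: str
    category: str                 # one of CATEGORY_PRIORITY
    resource: str                 # hypothetical ID of the affected link/device
    children: list = field(default_factory=list)


def priority(alarm):
    """Lower number means the alarm should be handled earlier."""
    return CATEGORY_PRIORITY.index(alarm.category)


def group_under_roots(alarms):
    """Group alarms by resource; the highest-priority alarm becomes the root."""
    by_resource = {}
    for alarm in alarms:
        by_resource.setdefault(alarm.resource, []).append(alarm)

    roots = []
    for group in by_resource.values():
        group.sort(key=priority)  # stable sort keeps arrival order on ties
        root, rest = group[0], group[1:]
        root.children = rest      # correlated alarms shown under the root
        roots.append(root)
    return sorted(roots, key=priority)
```

With a link alarm, a protocol alarm, and a service alarm all raised against the same link, only the link alarm surfaces as a root; the other two hang beneath it.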
Path expectation & detection
When a fault fails to trigger an alarm, current status must be compared against expectations, from both the control plane and forwarding plane perspectives. Although the IP control plane runs a dynamic protocol, it is still based on the physical topology and uses the shortest path first (SPF) algorithm. The simpler a network is, the clearer the path expectation will be. Small and medium-sized metropolitan area networks (MANs) typically have fewer layers, while active/standby links are usually adopted between layers for protection purposes. For such a network, faults can be effectively handled as long as network topology diagrams are accurate.
For a large and complex network, the service path is extremely hard to identify from the distribution of physical links. Network simulation can help calculate the expected path once network configurations and topologies are imported into the simulation software.
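To make the idea of a computed path expectation concrete, here is a minimal sketch of an SPF calculation using Dijkstra's algorithm over an imported topology. The topology format (a dictionary of per-neighbor link costs) and the function name are assumptions for illustration; a real simulator would also model protocol configuration, policies, and equal-cost paths, which this sketch does not.

```python
import heapq


def shortest_paths(topology, source):
    """Dijkstra's algorithm: return {node: (cost, path)} from source.

    topology maps each node to {neighbor: link_cost} (hypothetical format).
    Only one lowest-cost path per destination is kept.
    """
    best = {source: (0, [source])}
    queue = [(0, source, [source])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if cost > best[node][0]:
            continue                      # stale queue entry, already improved
        for neighbor, link_cost in topology.get(node, {}).items():
            new_cost = cost + link_cost
            if neighbor not in best or new_cost < best[neighbor][0]:
                new_path = path + [neighbor]
                best[neighbor] = (new_cost, new_path)
                heapq.heappush(queue, (new_cost, neighbor, new_path))
    return best


# Example topology: four routers with symmetric link costs.
TOPOLOGY = {
    "A": {"B": 10, "C": 10},
    "B": {"A": 10, "D": 10},
    "C": {"A": 10, "D": 20},
    "D": {"B": 10, "C": 20},
}
```

Calling `shortest_paths(TOPOLOGY, "A")["D"]` yields the expected path from A to D through B at a total cost of 20.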
After a path expectation is determined, OSS software regularly obtains path status for comparison against it; any mismatch triggers an alarm and prompts the administrator. Tracert can be adopted for small and medium-sized networks with simple architecture, but for large and complex networks where equal-cost multi-paths (ECMPs) occur, Tracert should be combined with forwarding table queries for service path assessment. The assessment can also be done by analyzing flooded IGP packets and calculating the forwarding path from the routing algorithm and configuration.
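The comparison step itself can be sketched in a few lines. The example below assumes the observed path has already been collected (by Tracert, forwarding table queries, or IGP analysis, none of which is shown) and that the expectation may contain several acceptable paths when ECMP is in use. The function name and the alarm fields are hypothetical.

```python
def check_service_path(service, expected_paths, observed_path):
    """Return None if the observed path is expected, else an alarm record.

    expected_paths: list of hop lists; more than one entry models ECMP.
    observed_path: hop list gathered from the live network.
    """
    expected = {tuple(p) for p in expected_paths}
    if tuple(observed_path) in expected:
        return None
    return {
        "service": service,
        "observed": list(observed_path),
        "expected": [list(p) for p in expected_paths],
        "reason": "observed path matches no expected path",
    }
```

An OSS poller would run such a check at a regular interval per service and forward any non-empty result to the administrator.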
Forwarding expectation & detection
In the forwarding plane, expectation is closely related to detection, which can be done through non-service-aware OAM, service-aware OAM, or service quality monitoring.
Non-service-aware OAM – This involves OAM-detection packet injection into the network so that expectations are predefined. The recipient therefore already has details concerning the detection packets, including their size and interval. When a received packet defies expectations, it qualifies as a fault.
This method is easy to deploy, as each network layer has an OAM protocol, such as Bidirectional Forwarding Detection (BFD), EthOAM, Internet Control Message Protocol (ICMP) ping, or MPLS OAM. However, mere OAM packets cannot fully illustrate the service situation, as they may not reflect certain service failures, at least not immediately.
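The principle of a predefined expectation can be sketched as a receiver that knows the agreed probe interval and declares a fault when probes stop arriving. This is a simplified model loosely inspired by the detection-multiplier idea in BFD, not an implementation of BFD or any other OAM protocol; the class and parameter names are hypothetical.

```python
class ProbeMonitor:
    """Declare a fault when expected probe packets stop arriving.

    interval: agreed seconds between probes.
    detect_multiplier: number of intervals allowed to pass without a probe.
    """

    def __init__(self, interval, detect_multiplier=3):
        self.interval = interval
        self.detect_multiplier = detect_multiplier
        self.last_seen = None

    def on_probe(self, now):
        """Record the arrival time of a probe packet."""
        self.last_seen = now

    def is_faulty(self, now):
        """True if no probe arrived within interval * detect_multiplier."""
        if self.last_seen is None:
            return False          # no probe seen yet, so no expectation exists
        return (now - self.last_seen) > self.interval * self.detect_multiplier
```

Because the receiver only watches the probes, a fault that affects service packets but not probe packets goes unnoticed, which is the limitation noted above.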
Service-aware OAM – This method directly measures the service stream. A typical example would be the loss measurement function defined in the ITU-T Y.1731 standard. Simply put, it is a conservation principle for packets: the number of packets received should equal the number sent. In terms of implementation, the sender and recipient both tally the service packets. The sender regularly sends its count to the recipient for cross-checking; if the tallies don't match, a fault is declared.
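The conservation check reduces to comparing counter deltas over a measurement period. The sketch below follows that idea in simplified form; it is not a Y.1731 implementation, the counter and function names are hypothetical, and it ignores counter wraparound and packets still in flight when the counters are sampled.

```python
def frame_loss(tx_prev, tx_now, rx_prev, rx_now):
    """Packets lost in one measurement period.

    tx_*: sender's transmit counter at the previous and current report.
    rx_*: recipient's receive counter sampled at the same two reports.
    """
    sent = tx_now - tx_prev
    received = rx_now - rx_prev
    return sent - received


def loss_detected(tx_prev, tx_now, rx_prev, rx_now, tolerance=0):
    """True if loss in the period exceeds the tolerated packet count."""
    return frame_loss(tx_prev, tx_now, rx_prev, rx_now) > tolerance
```

For instance, if the sender's counter advances from 1000 to 1500 while the recipient's advances from 990 to 1485, then 500 packets were sent, 495 were received, and 5 were lost in that period.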
Service quality monitoring – With this method, service data is measured and compared with predefined thresholds. With IPTV service, for example, dedicated hardware is connected to the device port to directly measure IPTV traffic across the network; for a VoIP service, indicators might include the MOS value. This method best reflects real-world conditions, but the deployment and maintenance of dedicated equipment is expensive. It also requires deep packet inspection (DPI), as actual packets must be sampled or analyzed against predefined expectations.
These three methods can coexist, but whether that is desirable depends on the operator's service SLA goals and its procurement and maintenance budget.
Furthermore, the control plane interrelates with the forwarding plane, as the operation of the former directly affects traffic distribution in the latter. Any problems with the control plane may lead to traffic congestion and device/link faults. Huawei, by integrating expectation determination and status detection for the control and forwarding planes, has developed a visualized IP network O&M solution ("path+traffic") which provides all-around fault monitoring and location capabilities.
China Mobile, for example, has been able to slash its average MAN fault ticket tally from 500 per day to 10, thanks to Huawei's ability to reduce the number of false alarms. This solution has also helped operators better visualize their IP network O&M. Usually, it takes operators hours or even days to troubleshoot the most common faults, such as link error, link interruption, component failure, and route error. With Huawei "path+traffic," such common faults can be rectified within minutes, as indicated by internal testing.
Currently, this solution operates on several commercial networks on a pilot basis. Results thus far have been significant: the maintenance process has been simplified, and the troubleshooting process has been accelerated.