This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2008 proceedings.
Localization of IP Links Faults Using Overlay Measurements Mohammad Fraiwan and G. Manimaran Department of Electrical and Computer Engineering Iowa State University Ames, IA 50010 Email: mfraiwan,
[email protected]
Abstract—Accurate fault detection and localization is essential to the efficient and economical operation of ISP networks. It also affects the performance of Internet applications such as VoIP and online gaming. Fault detection algorithms typically depend on spatial correlation to produce a set of fault hypotheses, whose size grows with the number of lost and spurious symptoms and with the overlap among network paths. The network administrator is left with the task of accurately locating and verifying these fault scenarios, which is tedious and time-consuming. In this paper, we formulate the problem of finding a set of overlay paths that can debug the set of suspected faulty IP links. These overlay paths are chosen from the set of existing measurement paths, which makes overlay measurements meaningful and useful for fault debugging. We study the overlap among overlay paths using real-life Internet topologies of two major service carriers in the U.S. We found that, with a reasonable number of concurrent failures, it is possible to identify the location of the IP link faults with a 60% to 95% success rate. Finally, we identify some interesting research problems in this area.
I. INTRODUCTION
Accurate fault detection and localization affects the performance of Internet applications (e.g., VoIP, video streaming, and online gaming) and of ISP networks as a whole [1][2]. There is an ever-increasing need to reduce rerouting and maintenance times. To achieve stable operating environments for ISP networks and the emerging applications they support, we need to accurately locate IP link faults. IP link faults can be caused by many factors, such as fiber cuts, router crashes or misconfiguration, very heavy congestion, or maintenance activities with unintended effects. These kinds of failures occur on a daily basis [3], and they may affect packet forwarding even when backup paths exist, due to the overlap among network paths [4].
The process of fault monitoring goes through three steps. The first step is fault detection, which is done through IP-level management agents via management protocol messages (e.g., SNMP trap and CMIP EVENT-REPORT), or through application-level overlay monitoring [5]. These agents generate a set of alarms. The second step is fault identification through alarm correlation. The output of the second step is a set of possible fault scenarios. The majority of fault identification algorithms and systems [5][6][7] rely on spatial correlation of observed symptoms and possible fault scenarios. These systems typically generate a set of equally plausible fault locations. This
set of possible faults is nontrivial due to lost and spurious symptoms, the overlap among network paths, and network heterogeneity. The final step, which traditionally has been done by network operators or administrators, is fault verification through debugging. Fault debugging locates the exact faulty link(s). The focus of this research is fault debugging.
Internet Service Providers (ISPs) support their networks with elaborate overlay monitoring systems as part of a Network Measurement Infrastructure (NMI). The goals of such systems range from security monitoring to measuring loss, delay, bandwidth, etc. of a subset or all of the network paths. One of the main features of overlay networks that has been extensively used in overlay-related research is the high overlap among overlay paths, and the high overlap between seemingly disjoint overlay paths at the IP level [8].
In this paper, we aim to use the synergy of spatial correlation agents and overlay monitoring to design a practical fault debugging system. In the context of fault debugging, spatial correlation agents behave poorly, as they output a growing number of possible fault hypotheses. Overlay networks, on the other hand, provide both a challenge and an opportunity. The challenge is the high overlap among overlay paths at the IP level, which limits the coverage and diversity of the overlay paths. The opportunity lies in their efficiency, practicality, and path diversity (if well designed). The contributions of this paper are as follows:
• We introduce the problem of debugging suspected faulty IP links using overlay measurement paths.
• We present an algorithm for finding the set of overlay paths to be used for debugging the suspected IP links.
• We perform several experiments, using real Internet topologies of two major U.S. service carriers, to evaluate our solution and study the amount of overlap between overlay paths at the IP level.
• We point out several interesting research problems.
The remainder of this paper proceeds as follows. Section II gives a motivational example. Section III presents our network and monitoring model and its assumptions. The link debugging problem formulation, along with the solution, is presented in Section IV. Section V presents the performance evaluation results. Section VI discusses the related work. We conclude in Section VII.
US Government Work Not Protected by US Copyright
II. MOTIVATION
In this section, we expose some peculiarities of overlay networks. We also show the insufficiency and inefficiency of existing fault detection and localization schemes in the context of overlay networks. There are two dependent aspects to the problem: the subset of overlay paths being monitored, and the IP links identified as faulty by the spatial correlation-based management system.
Fig. 1 shows an example overlay network and the underlying IP network. Let us assume that the overlay monitoring algorithm has chosen overlay paths E-C-A and B-A as a subset of paths that will cover most of the underlying IP links, or for any other consideration. The properties of the remaining overlay links can be calculated (e.g., link AC), estimated [9], or probed explicitly (e.g., link DC). Consider a scenario in which a fault in an IP link has caused path E-C-A to go down:
• Performing spatial correlation using overlay paths E-C-A and B-A and the underlying IP links will generate a list of equally "good" suspected faulty IP links. This set includes links E-N2, N2-C, and C-N1. Note that this set could also have been generated by a management system at the IP layer.
• A network administrator will need to debug these suspected faulty IP links, either by physically checking each link one by one (i.e., white box testing) or through end-to-end measurements (i.e., black box testing).
• Probing overlay path D-C-A would be sufficient. Because node C is an overlay node, it can report back whether link C-N1 is faulty. If C does not report back, link N2-C is assumed to be faulty. If path D-C-A is working, link E-N2 is faulty.
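The reasoning in the last item can be written as a small decision rule. The sketch below is illustrative only: the function name and its two boolean inputs (whether the end-to-end probe over D-C-A succeeds, and whether overlay node C reports back) are assumptions introduced here, and it assumes that exactly one of the three suspected links is faulty.

```python
def locate_fault(path_dca_works: bool, c_reports_back: bool) -> str:
    """Map the outcome of probing overlay path D-C-A to one suspected link.

    Assumes a single fault among E-N2, N2-C, and C-N1 (the suspected list
    from the example). Both inputs are hypothetical probe results.
    """
    if path_dca_works:
        # D-C-A crosses N2-C and C-N1, so both work; only E-N2 remains.
        return "E-N2"
    if c_reports_back:
        # The probe reached C, so N2-C works and the fault lies beyond C.
        return "C-N1"
    # C was never reached: N2-C is assumed to be the faulty link.
    return "N2-C"


print(locate_fault(path_dca_works=False, c_reports_back=True))
```

With more than one simultaneous fault this rule is no longer sufficient, which is one reason the general problem is formulated over a set of paths.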
From this simple example, we can identify several issues involving this problem:
1) A poor choice would have been for node B to probe nodes C and D. Since there is no overlay link between nodes B and C, or between nodes B and D, the paths will have to go through node A first (because of overlay routing), which results in a higher link stress (i.e., the number of probes crossing the same link). For example, link A-N1 would have a stress of 4 probes, and link N1-C would have a stress of 2 probes.
2) The existence of overlay link DC has greatly improved the solution. This overlay link might not have existed in a given overlay topology.
3) The overlay needs to cover the underlying IP network, or the portion containing the suspected list of IP links. This coverage needs to be sufficient for fault debugging (i.e., a sufficient number of good paths).
In this paper, we focus on the research problem of choosing a subset of the existing overlay paths that can cover and debug the suspected faulty set of IP links.
Fig. 1: A Motivational Example. Nodes N1 and N2 are non-overlay nodes. (Panels: overlay network and IP network; legend: monitoring paths, debugging paths.)
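The link stress figures quoted in issue 1 can be reproduced with a short count. This is a minimal sketch under stated assumptions: the IP-level route of each overlay link (B-A via N1, A-C via N1, C-D via N2) is our reading of Fig. 1, and the names `OVERLAY_LINK_ROUTE` and `link_stress` are hypothetical. Only the forward direction of each probe is counted.

```python
from collections import Counter

# Assumed IP-level route of each overlay link, read off Fig. 1.
OVERLAY_LINK_ROUTE = {
    ("B", "A"): ["B", "N1", "A"],
    ("A", "C"): ["A", "N1", "C"],
    ("C", "D"): ["C", "N2", "D"],
}


def link_stress(overlay_paths):
    """Count how many times probes cross each undirected IP link."""
    stress = Counter()
    for path in overlay_paths:
        # Walk the overlay path one overlay link at a time.
        for hop in zip(path, path[1:]):
            route = OVERLAY_LINK_ROUTE[hop]
            # Expand the overlay link into the IP links it traverses.
            for u, v in zip(route, route[1:]):
                stress[frozenset((u, v))] += 1
    return stress


# Node B probes C and D; overlay routing forces both probes through A.
stress = link_stress([["B", "A", "C"], ["B", "A", "C", "D"]])
print(stress[frozenset(("A", "N1"))])  # 4
print(stress[frozenset(("N1", "C"))])  # 2
```

Link A-N1 is crossed twice per probe (once into A and once out of it), which is why its stress is double that of N1-C.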
III. NETWORK MODEL AND ASSUMPTIONS
Throughout this paper, we make the following assumptions, which are commonly used in the literature:
1) We represent the IP-level network as an undirected graph $G_{IP} = (V_{IP}, E_{IP})$, where $V_{IP}$ is the set of physical nodes and $E_{IP}$ is the set of physical edges. We are also given the undirected overlay network graph $G_o = (V_o, E_o)$, where $V_o \subseteq V_{IP}$, but $E_o \not\subseteq E_{IP}$ in general.
2) The set of suspected faulty IP links is generated by a fault detection and alarm correlation system. These alarms are produced by management agents via management protocol messages (e.g., SNMP trap and CMIP EVENT-REPORT). Such systems typically operate at the IP level of the network, as opposed to the application-level overlay monitoring discussed in the motivational example. This suspected list will need to be debugged and verified operational.
3) Network-level topology information is available [8][10][11]. Also, tools that provide topology information have been developed [12].
4) Overlay paths are stable for a reasonable amount of time (i.e., on the order of minutes) [13], to allow for the aggregation of the monitoring results. This is true since path changes are triggered by quality changes discovered by probing, while the opposite does not necessarily hold [8].
IV. THE LINK DEBUGGING PROBLEM AND SOLUTION
We are given the set of overlay paths $P$ and the set of suspected faulty IP links $S$ that are covered by $P$, along with the overlay and IP-level graphs. The goal is to use these overlay paths to debug the suspected set of faulty IP links, without adding new overlay paths or changing the mapping between IP links and the overlay paths they constitute.
A. Problem Formulation
The following variables are used in the formulation:
• $S = \{S_1, \ldots, S_{|S|}\}$: The set of IP links that need debugging (i.e., suspected faulty links).
• $P = \{P_1, \ldots, P_{|P|}\}$: The set of available overlay paths.
• $x_{ij}$: $x_{ij} = 1$ if IP link $S_i$ is to be debugged by path $P_j$, and $x_{ij} = 0$ otherwise, where $i \in \{1, 2, \ldots, |S|\}$ and $j \in \{1, 2, \ldots, |P|\}$.
• $w_{ij}$: The link weights, which indicate the degree of fault suspicion in the link, or the importance of the link when adding new paths. These weights are important for ongoing work, but not for the scope of this paper. Thus, they are set to 1 for the rest of the paper.
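We formulate the link debugging problem as a generalized weighted bipartite matching problem. The corresponding LP formulation is as follows:
$$\text{Maximize} \quad \sum_{i,j} w_{ij} x_{ij}$$
$$\text{Subject to} \quad \sum_{j} x_{ij} = 1, \quad \forall\, S_i \in S$$
$$\sum_{i} x_{ij} = 1, \quad \forall\, P_j \in P$$
$$x_{ij} \in \{0, 1\}, \quad \forall\, i, j$$
$$0 \le w_{ij} \le 1$$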
Note that the above formulation does not guarantee path disjointness. It is useful and efficient when the underlying IP network is well covered by the overlay and each link appears in a number of paths.
B. Flow Network Solution
The solution involves transforming the LP into a flow network on which the bipartite matching is conducted. The flow network is constructed as follows. In the first bipartite group, we add a node for each suspected IP link in $S$, while in the second bipartite group we add a node for each overlay path in $P$. A unit-capacity edge is added between the node representing $S_i \in S$ and the node representing $P_j \in P$ if IP link $S_i$ is part of the overlay path $P_j$. A dummy source $s$ and a dummy sink $t$ are added, with unit-capacity edges to all nodes representing elements in $S$ and $P$, respectively. The flow network construction procedure is shown in Alg. 1, with a complexity of $O(|S||P|)$. The maximum matching can now be found using a maximum flow algorithm (e.g., Ford-Fulkerson [16]), which has a complexity of $O((|S| + |P|)|E|)$, where $|E|$ is the number of edges in the bipartite graph.
input : The set of overlay paths $P = \{P_1, P_2, \ldots, P_{|P|}\}$; the set of suspected links $S = \{S_1, S_2, \ldots, S_{|S|}\}$; $G_{IP}$, $G_o$.
output: The set of verifiable links $S_v$, and the set of debugging paths $P_v$.
Construct the flow network $G = (V, E)$ as follows:
• For each path $P_j \in P$, create path vertex $u_j$.
• For each link $S_i \in S$, create link vertex $u_i$.
foreach $u_i : S_i \in S$ do
    foreach $P_j \in P$ do
        if $S_i$ is part of $P_j$ then
            Create an edge between link vertex $u_i$ and path vertex $u_j$ with unit capacity and weight $w_{ij} = 1$
        end
    end
end
• Add a dummy source $s$ with a unit-capacity edge to each link vertex.
• Add a dummy sink $t$ with a unit-capacity edge to each path vertex.
• Solve the flow network problem using the max flow algorithm [16].
• $S_v = \{S_i\}$, $P_v = \{P_j\}$ : $x_{ij} = 1$.
Algorithm 1: Flow network construction
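The construction in Alg. 1 can be sketched in a few lines of Python. This is an illustrative implementation under assumptions, not the code used in the paper: the function names, the dictionary-based residual graph, and the three-link instance at the bottom are all hypothetical, and augmenting paths are found with breadth-first search (the Edmonds-Karp variant of Ford-Fulkerson).

```python
from collections import defaultdict, deque


def build_flow_network(suspected, paths):
    """Build the unit-capacity network s -> link vertices -> path vertices -> t.

    suspected: list of IP-link names.
    paths: dict mapping a path name to the set of IP links it traverses.
    """
    cap = defaultdict(dict)

    def add_edge(u, v):
        cap[u][v] = 1            # forward edge, unit capacity
        cap[v].setdefault(u, 0)  # residual (reverse) edge

    for link in suspected:
        add_edge("s", ("link", link))
        for name, links in paths.items():
            if link in links:    # S_i is part of P_j
                add_edge(("link", link), ("path", name))
    for name in paths:
        add_edge(("path", name), "t")
    return cap


def max_flow(cap, source="s", sink="t"):
    """Ford-Fulkerson with BFS augmenting paths; all capacities are 1."""
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        v = sink
        while parent[v] is not None:  # push one unit along the path found
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1


def matched_pairs(cap):
    """Read x_ij = 1 off the saturated link -> path edges."""
    return {u[1]: v[1]
            for u in list(cap) if isinstance(u, tuple) and u[0] == "link"
            for v, c in cap[u].items()
            if isinstance(v, tuple) and v[0] == "path" and c == 0}


# Hypothetical instance: three suspected links and three overlay paths.
suspected = ["L1", "L2", "L3"]
paths = {"Pa": {"L1"}, "Pb": {"L1", "L2"}, "Pc": {"L2", "L3"}}
cap = build_flow_network(suspected, paths)
flow = max_flow(cap)
pairs = matched_pairs(cap)
print(flow)  # 3
print(pairs)
```

The matching only pairs each link with some path that contains it. Whether the matched path also contains other suspected links still has to be checked separately, as the illustrative example in the text shows.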
Fig. 2: An insufficient number of paths. IP links N2-C and C-N1 are indistinguishable. Node C is a non-overlay node. (Panels: overlay network and IP network; the suspected links are marked S1, S2, and S3.)
C. An Illustrative Example
The example in Fig. 2 is a slight modification of the motivational example in Section II. In this example, there are four overlay paths, hence P = {A-B, D-A, D-E, E-A}. Node C has not been selected as an overlay node. The list of suspected faulty links is the same as before (i.e., S = {E-N2, N2-C, C-N1}). The corresponding flow network is shown in Fig. 3. Only suspected link S1 is debugged successfully; the other two links, S2 and S3, are each matched with a path containing the other. Thus, it is impossible to know which one is faulty (with the current problem setup). The performance evaluation results will show that it is nevertheless possible to debug links with a high success rate.
Fig. 3: The flow network representation for the example in Fig. 2. Only link S1 is debugged successfully. The edge labels represent (weight, capacity). Link vertices: S1 = E-N2, S2 = N2-C, S3 = C-N1. Path vertices: P1 = D-E, P2 = D-A, P3 = E-A, P4 = B-A.
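The outcome of this example can be checked mechanically. The sketch below rests on two assumptions that are not stated in the text: the IP-level route of each overlay path is our reading of Fig. 2, and the matching S1-P1, S2-P2, S3-P3 is one maximum matching consistent with Fig. 3. The function names are hypothetical. The second function encodes the pairwise condition used in the evaluation: two links can be told apart only if each lies on some path that avoids the other.

```python
SUSPECTED = {"S1": "E-N2", "S2": "N2-C", "S3": "C-N1"}

# Assumed IP links traversed by each overlay path (read off Fig. 2).
PATH_LINKS = {
    "P1": {"D-N2", "E-N2"},                   # D-E
    "P2": {"D-N2", "N2-C", "C-N1", "N1-A"},   # D-A
    "P3": {"E-N2", "N2-C", "C-N1", "N1-A"},   # E-A
    "P4": {"B-N1", "N1-A"},                   # B-A
}

# One maximum matching consistent with the description of Fig. 3 (assumed).
MATCHING = {"S1": "P1", "S2": "P2", "S3": "P3"}


def debugged_links(matching, suspected, path_links):
    """Links whose matched path contains no other suspected link."""
    result = []
    for s, p in matching.items():
        others = {link for key, link in suspected.items() if key != s}
        if suspected[s] in path_links[p] and not (others & path_links[p]):
            result.append(s)
    return sorted(result)


def separable(a, b, path_links):
    """True if some path has a but not b, and some path has b but not a."""
    only_a = any(a in links and b not in links for links in path_links.values())
    only_b = any(b in links and a not in links for links in path_links.values())
    return only_a and only_b


print(debugged_links(MATCHING, SUSPECTED, PATH_LINKS))  # ['S1']
print(separable("N2-C", "C-N1", PATH_LINKS))            # False
print(separable("E-N2", "N2-C", PATH_LINKS))            # True
```

Under these assumed routes, every path that crosses N2-C also crosses C-N1, so no choice of matching can separate S2 from S3 without adding paths or overlay nodes.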
V. PERFORMANCE EVALUATION
We study the link debugging problem on real Internet ISP topologies of two major U.S. carriers provided by Rocketfuel [17]: AT&T (AS# 7018, 11800 nodes) and Sprint (AS# 1239, 10332 nodes). A shortest path algorithm, based on hop count, is used for the IP-layer routing. The set of overlay nodes is selected uniformly at random. The number of overlay nodes (N) is set to 16 and 64. The overlay topology is varied in two ways: a complete graph (denoted as complete), where each node maintains overlay links to every other node (e.g., RON), and a random graph (denoted as log), where each node maintains $\log_2 N$ neighbors.
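The setup above can be sketched on a toy graph. This is a minimal illustration, not the evaluation code: the function names and the four-node line topology are hypothetical stand-ins for the Rocketfuel maps, the "log" overlay lets each node pick $\log_2 N$ random neighbors (so a node may end up with more), and the empirical CDF is computed only over IP links used by at least one overlay path. The last function counts how many overlay paths cross each IP link, which is the quantity whose distribution is examined in this section.

```python
import math
import random
from collections import Counter, deque


def hop_count_route(adj, src, dst):
    """Shortest path by hop count (BFS). Assumes dst is reachable from src."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    route, node = [], dst
    while node is not None:
        route.append(node)
        node = parent[node]
    return route[::-1]


def overlay_links(nodes, kind, rng):
    """Overlay links for a 'complete' or a 'log' overlay topology."""
    links = set()
    if kind == "complete":
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                links.add((u, v))
    else:  # "log": each node picks log2(N) random neighbors
        k = int(math.log2(len(nodes)))
        for u in nodes:
            others = [v for v in nodes if v != u]
            for v in rng.sample(others, k):
                links.add(tuple(sorted((u, v))))
    return links


def sharing_cdf(adj, links):
    """Map c to the fraction of used IP links crossed by at most c paths."""
    usage = Counter()
    for u, v in links:
        route = hop_count_route(adj, u, v)
        for a, b in zip(route, route[1:]):
            usage[frozenset((a, b))] += 1
    counts = sorted(usage.values())
    return {c: sum(x <= c for x in counts) / len(counts)
            for c in sorted(set(counts))}


# Hypothetical line topology n0 - n1 - n2 - n3; every node is an overlay node.
adj = {"n0": ["n1"], "n1": ["n0", "n2"], "n2": ["n1", "n3"], "n3": ["n2"]}
nodes = ["n0", "n1", "n2", "n3"]
rng = random.Random(0)
complete = overlay_links(nodes, "complete", rng)
log_graph = overlay_links(nodes, "log", rng)
cdf = sharing_cdf(adj, complete)
print(cdf)
```

On this toy line, the middle link carries four of the six overlay paths and each end link carries three, which already shows how quickly sharing builds up in a complete overlay.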
Path Sharing
The amount of sharing among overlay paths at the IP level is an important aspect. Together with the amount of coverage of the underlying IP network, it gives an insight into the disjointness of overlay paths, which is necessary to achieve successful debugging. Fig. 4 shows the cumulative distribution function (CDF) of the number of links shared by a given number of overlay paths, for various network sizes and overlay topologies. It is clear from this figure that there is substantial sharing among overlay paths. For example, for a complete overlay graph of 64 nodes over the AT&T network, 65% of links participate in about 25 paths or fewer; this number drops to 5 paths or fewer for an overlay of size 16, which is a very small overlay size. Note that the paths may or may not be the same. If we change the overlay topology to a random graph, the same numbers drop to 12 paths and 4 paths, respectively. The Sprint network exhibits similar properties, but those figures have been removed due to lack of space.
Verifying Faulty Links
How much path disjointness is it possible to get from an available set of overlay paths? Recall that in order to verify any two links, we need two paths such that each path contains one of the links, but not the other. We show the results for the AT&T network, as the Sprint network follows the same trend. We set the number of links requiring debugging (from the currently covered set of IP links) to 5, 10, 15, and 20. Fig. 5a shows the normalized average number of successfully debugged IP links versus the number of suspected IP links for a complete overlay of sizes 16, 32, and 64 nodes, while Fig. 5b shows the same results for a random graph (denoted as log) for the same numbers of nodes. This figure shows that even though there is significant overlap in the overlay, we are still able to find a good number of disjoint paths. For example, for 5 suspected links, close to 90% of the links were debugged successfully using a complete overlay of size 64 or 32; the percentage drops to 68% for the modest overlay size of 16. For the random log graph, the numbers are slightly less impressive than for the complete graph.
Fig. 4: The cumulative distribution function (CDF) of the number of IP links shared by a given number of overlay paths. Panels: (a) AT&T, Complete, Network size = 16; (b) AT&T, Complete, Network size = 64; (c) AT&T, log, Network size = 16; (d) AT&T, log, Network size = 64. Axes: number of paths (x) versus CDF (y).
VI. RELATED WORK
The area of fault detection and localization has been very active in the past decade or so. An excellent survey of such
VII. CONCLUSION AND FUTURE WORK
Timely and accurate fault debugging is an important aspect of the operation of current ISP networks and the applications they support. This work introduced the problem of debugging faulty IP links using overlay measurement paths. We have formulated this problem as a maximum bipartite matching problem. The results show that, even though there is a large amount of overlap at the IP level among overlay paths, it is still possible to achieve high debugging capabilities. There are still many aspects of this problem that we are exploring and would like to explore in the future. First, we will look into methods to improve the coverage of the overlay network through multiple disjoint overlay networks, as opposed to a single monolithic overlay, in addition to other topology design ideas. Second, we would like to introduce more constraints into the problem, such as the link stress associated with debugging. Last but not least, we would like to tackle the problem of overlay design for network debugging and fault verification.
REFERENCES
Fig. 5: The average link debugging capability for various sizes and topologies of the overlay network. Panels: (a) a complete overlay network, with number of nodes 16, 32, and 64; (b) a random overlay network, where each node maintains log2 N neighbors. Axes: number of suspected links (x) versus normalized number of verifiable links (y).
algorithms has been presented by Steinder and Sethi in [5]. SCORE [6][14] is one of the most recent related studies. In SCORE, spatial correlation on a bipartite fault graph is used to identify fault hypotheses that best explain the failure signature with the least number of candidate faults. Such systems were used to detect black holes in MPLS networks [14], or to detect link faults in ATM networks. In another paper, Wang and AlShaer [15] extend the bipartite fault propagation graph to a probabilistic model that can be used to handle lost and spurious symptoms. We use such spatial correlation engines as an input to our problem. Overlay network monitoring has received a lot of attention in recent years for its versatility, flexibility, and ease of deployment. Simple monitoring algorithms (e.g., RON [13]) are not particularly concerned about the measurement overhead. For example, RON uses pair-wise probing with a complexity of $O(N^2)$, and assumes that $N$, the number of overlay nodes, is small (i.e., less than 100 nodes). However, the constant factor in the complexity function is high and may reach the order of megabytes for bandwidth measurements [11]. Several algorithms have been proposed to make use of the link overlap at the IP (Internet Protocol) and the overlay levels [10]. By choosing a subset K