monitoring methods lead to unacceptable degradation in router performance. ..... without compromising performance, reliability, restoration speed, and software.
Preconfiguring IP-over-Optical Networks to Handle Router Failures and Unpredictable Traffic M. Kodialam
T. V. Lakshman
James B. Orlin
Sudipta Sengupta
Bell Laboratories, Lucent Technologies, NJ, USA Massachusetts Institute of Technology, Cambridge, MA, USA Abstract We consider the realization of traffic-oblivious routing in IP-over-Optical networks where routers are interconnected over a switched optical backbone. The traffic-oblivious routing we consider is a scheme where incoming traffic is first distributed in a preset manner to a set of intermediate nodes. The traffic is then routed from the intermediate nodes to the final destination. This splitting of the routing into two phases simplifies network configuration significantly. In implementing this scheme, the first and second phase paths are realized at the optical layer with router packet grooming at a single intermediate node only. Studies like [13] indicate that IP routers are 200 times more unreliable than traditional carriergrade switches and average 1219 minutes of down time per year. Given this unreliability of routers, we consider how two-phase routing in IP-over-Optical networks can be made resilient against router node failures. We propose two different schemes for provisioning the optical layer to handle router node failures – one that is failure node independent and static, and the other that is failure node dependent and dynamic. We develop linear programming formulations for both schemes and a fast combinatorial algorithm for the second scheme so as to maximize network throughput. In each case, we determine (i) the optimal distribution of traffic to various intermediate routers for both normal (no-failure) and failure conditions, and (ii) provisioning of optical layer circuits to provide the needed inter-router links. We evaluate the performance of the two router failure protection schemes (in terms of throughput) and compare it with that of unprotected routing.
Note: An earlier (and shorter) version of this paper appeared in IEEE Infocom 2006 [10]. In this version, we have improved the running time of the combinatorial algorithm in Section VIII from polynomial to strongly polynomial, included proofs for all theorems, added evaluation results on some additional network topologies, and improved the writing/presentation in some parts of the paper. We have also addressed comments from the first round of JSAC review. I. I NTRODUCTION With the increasing use of the Internet for carrying real-time traffic such as Voice-over-IP, it has become necessary for Internet Service Providers (ISPs) to make their data networks highly reliable. Also, service providers have to engineer their networks to handle multiple traffic patterns while avoiding congestion. This requires constant traffic monitoring, traffic forecasts and adapting
the network routing to changing traffic conditions and to failures. This considerably increases the operational complexity. Ideally, ISPs would like to provision their networks so that network operation is robust to changes in traffic patterns (avoiding the need for frequent reconfiguration) while also accommodating failures in a fast and efficient manner. The need to accommodate multiple traffic patterns has led to interest in the hose traffic model [4] where the only traffic assumptions needed are the total amount of traffic entering and leaving each network ingress or egress node. The actual traffic matrix itself need not be known. Several routing and capacity allocation schemes for the hose model have been proposed recently. An important scheme that allows the network to be statically configured so as to accommodate multiple traffic patterns is two-phase routing [8]. Here traffic entering the network, instead of being directly sent to an egress-router, is first sent to an intermediate node, and from there is sent to the final egress-router. The distribution of traffic to the intermediate nodes in the first phase is done in predetermined proportions that depend on the intermediate nodes. In this paper, we focus on two-phase routing in IP-over-Optical networks where routers are interconnected over a switched optical backbone, also called IP-over-OTN (Optical Transport Network). In this architecture, the first and second phase paths are realized at the optical layer with router packet grooming at a single intermediate node only. Studies like [13] indicate that IP routers are 200 times more unreliable than traditional carrier-grade switches and average 1219 minutes of down time per year. Given this unreliability of routers, we consider how two-phase routing in IP-over-OTN can be made resilient against router node failures. The main contribution of this paper is in incorporating mechanisms for guaranteed Quality-of-Service (QoS) routing despite router failures in IP-over-OTN while preserving the traffic-independence properties of two-phase routing. We propose two different schemes for provisioning the optical layer to handle router node failures – one that is failure node independent and static (called failure independent provisioning), and the other that is failure node dependent and dynamic (called failure dependent provisioning). We develop linear programming formulations for both schemes and a fast combinatorial algorithm for the second scheme so as to maximize network throughput. In each case, we determine (i) the optimal distribution of traffic to various intermediate routers for both normal (no-failure) and failure conditions, and (ii) provisioning of optical layer circuits to provide the needed interrouter links. We view this as important progress for two-phase routing in IP-over-OTN towards achieving carrier-class reliability so as to facilitate its future deployment in ISP networks. The combinatorial algorithm developed for failure dependent provisioning is a Fully Polynomial Time Approximation Scheme (FPTAS). An FPTAS is an algorithm that finds a solution with objective function value within (1 + ²)-factor of the optimal solution and runs in time that
is a polynomial function of the input parameters and 1² . The input parameters in our problem are the number of nodes n and links m in the network, and the size (number of bits) of the input numbers (link capacities and node ingress-egress capacities). The value of ² can be chosen to provide the desired degree of optimality for the solution. We assume a single router failure model under which bandwidth can be shared across different router node failure scenarios. The focus on shared backup bandwidth allocation in this paper is because of its reduced cost and the increased complexity of the optimization problems that arises from sharing backup bandwidth. We focus on throughput (which is the reciprocal of maximum link utilization) as the performance metric because it is directly related to link congestion. Also, it is among the most common metrics used in the literature. The paper is structured as follows. In Section II, we discuss some aspects of the inherent difficulty in measuring traffic and introduce the traffic variation model. Related work is reviewed in Section III. In Section IV, we briefly discuss the two-phase routing scheme so as to provide context for this paper. IP-over-OTN architecture and the application of two-phase routing to it is discussed in Section V. In Section VI, we propose two schemes for protecting against single router node failures in two-phase routing in IP-over-OTN. Algorithms for maximum throughput routing under these two schemes are developed in Sections VII and VIII respectively. We evaluate the performance of these router node failure protection schemes (in terms of throughput) and compare it with that of unprotected two-phase routing in Section IX. For our experiments, we use actual ISP network topologies collected for the Rocketfuel project [14]. We conclude and point to future work in Section X. Proofs of theorems establishing the performance guarantees and running times of the combinatorial algorithm in Section VIII are presented in the Appendix. We briefly describe some notation below before moving on to the next section. A. Notation We assume that we are given a network G = (N, E) with node set N and (directed) edge set E where each node in the network can be a source or destination of traffic. The network G represents the physical WDM topology in IP-over-OTN. Let |N | = n and |E| = m. The nodes in N are labeled {1, 2, . . . , n}. The sets of incoming and outgoing edges at node i are denoted by E − (i) and E + (i) respectively. We let (i, j) represent a directed link in the network from node i to node j. To simplify the notation, we will also refer to a link by e instead of (i, j). The capacity of link (i, j) will be denoted by uij . The utilization of a link is defined as the traffic (sum of working traffic and maximum restoration traffic due to any single router node failure) on the link divided by its capacity.
II. T RAFFIC M EASUREMENT AND VARIABILITY In an utopian network deployment scenario where complete traffic information is known and does not change over time, we can optimize the routing for that single traffic matrix – a large volume of research has addressed this problem. The most important innovation of the two-phase routing scheme is the handling of traffic variability in a capacity efficient manner through static preconfiguration of the network and without requiring either (i) measurement of traffic in realtime or (ii) reconfiguration of the network in response to changes in it. We address the difficulties associated with (i) in this section and then introduce the traffic variation model. The difficulties associated with (ii) for IP-over-Optical networks are addressed in Section V-B. A. Difficulties in Measuring Traffic Network traffic is not only hard to measure in real-time but even harder to predict based on past measurements. Direct measurement methods do not scale with network size as the number of entries in a traffic matrix is quadratic in the number of nodes. Moreover, such direct real-time monitoring methods lead to unacceptable degradation in router performance. In reality, only aggregate link traffic counts are available for traffic matrix estimation. SNMP (Simple Network Management Protocol) provides these data via incoming and outgoing byte counts computed per link every 5 minutes. To estimate the traffic matrix from such link traffic measurements, the best techniques today give errors of 20% or more [15], [21]. The emergence of new applications on the Internet like voice-over-IP, peer-to-peer, and videoon-demand has reduced the time-scales at which traffic changes dynamically, making it impossible to extrapolate past traffic patterns to the future. Currently, ISPs handle such unpredictability in network traffic by gross over-provisioning of capacity. This has led to ISP networks being under-utilized to as low as 20% [15]. B. Traffic Variation Model We consider a traffic variation model where the total amount of traffic that enters (leaves) an ingress (egress) node in the network is bounded. This is known as the hose model and was proposed by Fingerhut et al . [5] and subsequently used by Duffield et al. [4] as a method for specifying the bandwidth requirements of a Virtual Private Network (VPN). Note that the hose model naturally accommodates the network’s ingress-egress capacity constraints. One can obtain such capacity bounds by using, for example, the total capacity of all external ingress links at each node, and compensating for (leaving out) local router traffic. (Two-phase routing does not depend on how these bounds are obtained.)
We denote the upper bounds on the total amount of traffic entering and leaving the network at node i by Ri and Ci respectively. The point-to-point matrix for the traffic in the network is thus constrained by these ingress-egress link capacity bounds. These constraints are the only known aspects of the traffic to be carried by the network, and knowing these is equivalent to knowing the row and column sum bounds on the traffic matrix. That is, any allowable traffic matrix T = [tij ] for the network must obey X
tij ≤ Ri ,
j∈N,j6=i
X
tji ≤ Ci ∀ i ∈ N
j∈N,j6=i
For given Ri and Ci values, denote the set of all such matrices that are partially specified by their row and column sums by T (R, C), that is T (R, C) = {[tij ] |
X j6=i
tij ≤ Ri and
X
tji ≤ Ci ∀ i}
j6=i
We will use λ · T (R, C) to denote the set of all traffic matrices in T (R, C) with their entries multiplied by λ. Note that the traffic distribution T could be any matrix in T (R, C) and could change over time. Two-phase routing provides a routing architecture that does not make any assumptions about T apart from the fact that it is partially specified by row and column sum bounds and can provide QoS guarantees for routing all matrices in T (R, C) without requiring any detection of changes in traffic patterns or dynamic network reconfiguration in response to it. III. R ELATED W ORK Direct routing from source to destination (instead of in two phases) along fixed paths for the hose traffic model has been considered by Duffield et al. [4] and Kumar et al. [7]. In related work, Azar et al. [2] consider direct source-destination routing along fixed paths and show how to compute relative guarantees for routing an arbitrary traffic matrix with respect to the best routing for that matrix. However, they do not provide absolute bandwidth guarantees for routing variable traffic under the hose model. Two aspects of direct source-destination path routing, namely, (i) the source needs to know the final destination of a packet for routing it, and (ii) the bandwidth requirements of the (fixed) paths change with traffic variations, render them unsuitable for some network architectures and applications. Because of (i), these methods cannot be used to provide indirection in specialized service overlay models like i3 where the destination of a packet is not known at the source. Because of (ii), the adaptation of these methods for IP-over-Optical networks necessitates detection of changes in traffic patterns and dynamic reconfiguration of the provisioned optical layer circuits in response to it, a functionality that is not present in current IP-over-Optical network deployments.
Our current work is a sequel to [8]. In [20], a restricted version of the scheme with equal traffic split ratios of n1 and equal ingress-egress capacities (Ri = Ci = c for all i) is considered. The authors in [20] work with a full-mesh IP layer topology (fully connected complete graph), so that the Phase 1 and Phase 2 paths are one hop in length. These paths need to be routed (via multi-hop paths) on the physical WDM topology (which is a sparse graph), an important aspect which they do not consider. Our problem formulation for two-phase routing in [8] (and in this paper) models the multi-hop routing of Phase 1 and Phase 2 paths on the physical WDM topology. IV. OVERVIEW OF T WO -P HASE ROUTING In this section, we give an overview of the two-phase routing scheme from [8]. As mentioned earlier, the scheme does not require the network to detect changes in the traffic distribution or reconfigure the network in response to it. The only assumption about the traffic is the limits imposed by the ingress-egress constraints at each node, as outlined in Section II-B. As is indicative from the name, the routing strategy operates in two phases: •
•
Phase 1: A predetermined fraction αj of the traffic entering the network at any node is distributed to every node j independent of the final destination of the traffic. Phase 2: As a result of the routing in Phase 1, each node receives traffic destined for different destinations that it routes to their respective destinations in this phase.
This is illustrated in Figure 1. A simple method of implementing this routing scheme in the network is to form fixed bandwidth tunnels between the nodes. In order to differentiate the tunnels carrying Phase 1 and Phase 2 traffic, we will refer to these tunnels as Phase 1 and Phase 2 paths respectively. (Since each intermediate node forwards the packet to its final destination along Phase 2 paths, this cannot lead to any loops in the routing.) The critical reason the twophase routing strategy works is that the bandwidth required for these tunnels only depends on the ingress-egress capacities Ri , Ci and not on the (unknown) individual entries in the traffic matrix. P Note that the traffic split ratios α1 , α2 , . . . , αn in Phase 1 of the scheme are such that ni=1 αi = 1. Let us elaborate on the routing procedure. Consider a node i with maximum incoming traffic Ri . Node i sends αj Ri amount of this traffic to node j during the first phase for each j ∈ N . Thus, the demand from node i to node j as a result of Phase 1 is αj Ri . At the end of Phase 1, node i has received αi Rk traffic from any other node k. Out of this, the traffic destined for node j is αi tkj since all traffic is initially split without regard to the final destination. Thus, the maximum traffic that needs to be routed from node i to node j during P Phase 2 is k∈N αi tkj = αi Cj . Thus, the traffic demand from node i to node j during Phase 2 is αi Cj .
Source Node
Phase 1 Routing
Intermediate Node
Phase 2 Routing
Destination Node
Physical View
Fig. 1.
Source Node
Phase 1 Path
Intermediate Node
Phase 2 Path
Destination Node
Logical View
Two-Phase Routing.
Thus, the maximum demand from node i to node j as a result of routing in Phases 1 and 2 is t0ij = αj Ri + αi Cj . Note that this does not depend on the traffic matrix T ∈ T (R, C). Thus, the scheme handles variability in traffic matrix T ∈ T (R, C) by effectively routing a transformed matrix T 0 = [t0ij ] that depends only on aggregate ingress-egress traffic constraints and the distribution ratios α1 , α2 , . . . , αn , and not on the specific matrix T ∈ T (R, C). This is what makes the routing scheme oblivious to changes in the traffic distribution. An instance of the scheme requires specification of the traffic split ratios α1 , α2 , . . . , αn and routing of the Phase 1 and Phase 2 paths. In [9], linear programming formulations and fast combinatorial algorithms are developed for computing the above so as to maximize network throughput. In [12], we extend the routing scheme to providing resiliency against link failures through two different fast restoration mechanisms – local (link/span) based and end-to-end (path) based. The traffic split ratios αi can be generalized to depend on source and/or destination nodes of the traffic, as proposed in [8]. We consider the problem of maximum throughput two-phase routing with generalized traffic split ratios in [11]. We address some practical aspects of two-phase routing related to packet reordering and endto-end delay in the remainder of this section. Packet Reordering In two-phase routing, the source node splits traffic to different intermediate nodes regardless of the final destination. Thus, packets belonging to the same end-to-end connection can arrive out of order at the destination node if the traffic is split within the same connection. Packet reordering can affect the performance of TCP (Transmission Control Protocol) and other traffic
that relies on packet ordering. Packet reordering can be avoided in two-phase routing by splitting traffic at the application flow level at the source node (rather than at the packet level), using the 5-tuple identifier (sourceIP, destIP, sourcePort, destPort, protocolID). Two-phase routing, with per-flow splitting, can still provide bandwidth guarantees to all traffic matrices within the network’s natural ingress-egress capacity constraints. This is because we are advocating the routing scheme for core networks where recent advances in DWDM (Dense Wavelength Division Multiplexing) transmission technologies have resulted in link bandwidths of 10 Gbps and heading higher. Individual flows at best are in the Mbps range and hence small compared to link rates – tens of thousands of flows share a single 10 Gbps link. Increase in End-to-End Delay Because of the two-phase nature of the routing scheme, packets might incur up to about twice the delay in end-to-end routing compared to direct source-destination path routing. Consider an ISP backbone spanning the transcontinental US. A ping from MIT (US east coast) to UC Berkeley (US west coast) gives a round-trip time of about 90 msec. This round trip time includes two traversals of the long-haul network and two traversals each of Boston and Oakland metro access networks. Thus, the one-way end-to-end traversal time with two-phase routing deployed in the long-haul network can be expected to be much less than 90 msec. An end-to-end delay of up to 100 msec is acceptable for most applications. Moreover, the bandwidth guarantees provided by two-phase routing under highly variable traffic reduce the delay variance (jitter). This reduction in jitter and the guarantee of predictable performance to unpredictable traffic provides a reasonable trade-off for the fixed increase in propagation delay. The effect of bursty traffic on jitter is mitigated by the source splitting of traffic to multiple intermediate nodes in two-phase routing. Delay-sensitive traffic that cannot tolerate traversal of the core backbone twice can be routed along shortest paths in a hybrid architecture that accommodates both two-phase routing and direct source-destination path routing. V. T WO -P HASE ROUTING
IN
IP- OVER -O PTICAL N ETWORKS
In this section, we introduce the IP-over-Optical network architecture, discuss the unsuitability of related routing methodologies for IP-over-Optical networks, and then describe the realization of two-phase routing in such networks.
Fig. 2.
Routers interconnected over a switched optical backbone in IP-over-Optical networks.
A. IP-over-Optical Networks Core IP networks are often deployed by interconnecting routers over a switched optical backbone, also called IP-over-OTN (Optical Transport Network). This is illustrated in Figure 2. Because a router line card is typically 3-4 times more expensive than an optical switch card, an IP-over-OTN architecture reduces network cost by keeping traffic mostly in the optical layer [18]. Moreover, the ability of router technology to scale to port counts consistent with multiterabit capacities without compromising performance, reliability, restoration speed, and software stability is questionable [16]. By removing transit traffic from the routers to the optical switches, the requirement to upgrade router PoP configurations with increasing traffic is minimized (since optical switches are more scalable with increasing port count than routers). Also, since optical switches are known to be much more reliable compared to routers [13], this makes the architecture more robust and reliable. Routing in IP-over-OTN needs to make a compromise between keeping traffic at the optical layer (for the above reasons) and using intermediate routers for packet grooming in order to achieve efficient statistical multiplexing of data traffic [18]. B. Comparison with Direct Source-Destination Routing We return to related work on direct source-destination path routing for variable traffic (discussed in Section III) and point out why such methods cannot meet certain requirements in IP-over-OTN. Direct source-destination routing, when applied to IP-over-Optical networks, routes packets from source to destination along direct paths in the optical layer. Note that even though the paths are fixed a priori and do not depend on the traffic matrix, their bandwidth requirements change with variations in the traffic matrix. Thus, bandwidth needs to be deallocated from some paths and assigned to other paths as the traffic matrix changes. (Alternatively, paths between every source-destination pair can be provisioned a priori to handle the maximum traffic between them, but this leads to gross overprovisioning of capacity, since all source-destination pairs cannot simultaneously reach their peak traffic limit in the hose traffic model.) This necessitates dynamic
reconfiguration of the provisioned optical layer circuits (i.e., change in bandwidth) in response to traffic variations. To illustrate this point, consider the scenario in Figure 2, where router A is connected to router C using 3 OC-48 connections and to Router D using 1 OC-12 connection, so as to meet the traffic demand from node A to nodes C and D of 7.5 Gbps and 600 Mbps respectively. Suppose that at a later time, traffic from A to C decreases to 5 Gbps, while traffic from A to D increases to 1200 Mbps. Then, the optical layer must be reconfigured so as to delete one OC-48 connection between A and C and creating a new OC-12 connection between A and D. There has been promising and continuing research on IP-Optical integration and standardization has been proposed under the GMPLS framework for utilizing the optical control plane to provide bandwidth provisioning in real-time to the IP layer. We view our two-phase routing approach (with statically provisioned optical layer) as complementary to network control plane based (dynamic) mechanisms for reconfiguring the network in response changing traffic. Either or a combination of the two approaches may need to be adopted in a network deployment scenario depending on the time-scales and intensity of traffic variation. C. Two-Phase Routing for IP-over-OTN Two-phase routing scheme, as envisaged for IP-over-OTN, establishes the fixed bandwidth Phase 1 and Phase 2 paths at the optical layer. Thus, the optical layer is statically provisioned and does not need to be reconfigured in response to traffic changes. IP packets are routed endto-end with IP layer processing at a single intermediate node only. While in transit at the optical layer inside either Phase 1 or Phase 2 tunnels, packets do enter the router but appear as transit traffic at the Optical Cross-Connect (OXC) only (see Figure 3). The IP layer packet processing at an intermediate node works as follows. The optical layer circuit is dropped at the IP router at the node (through OXC-to-router links), wherein the packets are multiplexed back to the OXC (through router-to-OXC links) to be routed through direct optical layer circuits to their final destinations. This architecture provides the desirable statistical multiplexing properties of packet switching for handling highly variable traffic without significantly increasing the IP layer transit traffic. Compare this with the high levels of IP layer transit traffic in an IP-over-WDM architecture where routers are directly connected to WDM systems and need to process packets at each hop. In summary, two-phase routing when applied to IP-over-OTN leads to an architecture with the following salient features: •
IP traffic is routed “mostly” at the optical layer from source to destination routers with packet grooming at one intermediate router only.
Fig. 3. •
•
Intermediate node packet processing for Two-Phase Routing in IP-over-Optical networks.
The optical layer (circuits and their bandwidth) are statically provisioned a priori to provide bandwidth guarantees for end-to-end IP traffic. Routing at the IP layer is static – there is no need to detect changes in traffic or reconfigure the routing in response to it. Bandwidth guarantees are provided for routing all traffic matrices within the network’s natural ingress-egress capacity constraints.
VI. M AKING T WO -P HASE ROUTING R ESILIENT TO ROUTER FAILURES IN IP- OVER -OTN In this section, we consider extending the two-phase routing scheme for protecting against router node failures. In the term “router node failure”, node refers to a PoP, hence it includes the failure of all routers in a PoP. When a router at any node f fails, any other node i cannot split any portion of its originating traffic to intermediate node f . Hence, it must redistribute the traffic split ratio αf among other nodes j 6= f . Accordingly, let βjf denote the portion of αf that is redistributed to node j when node f fails. Then, we must have X
βjf = αf ∀ f ∈ N
j∈N,j6=f
Note that since only the router (and not the OXC) at node f fails, this node can continue to be on the Phase 1 and Phase 2 paths for optical layer switching. However, the total traffic that was supposed to originate at that node, i.e., Rf , no longer enters the network. We propose two different schemes for provisioning the optical layer in IP-over-OTN in order to handle the redistribution of traffic split ratios after router node failures. We discuss these next. Algorithms for maximizing the throughput under the two protection schemes are presented in Sections VII and VIII respectively. A. Failure Independent Provisioning In the first scheme, called “Failure Independent Provisioning”, for each given i, j ∈ N , the restoration demand from node i to node j at the optical layer is provisioned a priori so as to
handle the “worst case” router node failure scenario. Under this, the maximum additional traffic split ratio that node j needs to handle is maxf ∈N βjf . Thus, the modified split ratio αj0 associated with each node j is αj0 = αj + max βjf ∀ j ∈ N f ∈N
(1)
The demand that needs to be statically provisioned at the optical between nodes i and j so as to protect against any single router node failure is αj0 Ri + αi0 Cj . Since this does not depend on which router node fails, hence the name of the scheme. In equation (1), the value of the traffic split ratio αj0 for intermediate node j is determined by the node f that achieves the maximum value of βjf . For different nodes j, the maximum value of βjf could be achieved by different (failure) nodes f . Thus, for a given router node failure, all the split ratios αj0 may not be required to be as high as the maximum value given by RHS of (1). Also, in this scheme, optical layer bandwidth is shared at the connection/path level (Phase 1 and Phase 2 demands between node pairs) and not at the link level across different failure scenarios. Because of these reasons, failure independent provisioning may not achieve the most capacity efficient sharing of restoration bandwidth across different router node failure scenarios. However, the advantage of the scheme is that it preserves the static nature of the original two-phase routing scheme. B. Failure Dependent Provisioning In the second scheme, called “Failure Dependent Provisioning”, the restoration demand from node i to node j at the optical layer depends on the node f which failed and is provisioned in a reactive manner after the node f fails, the value of the demand being βjf Ri + βif Cj , which could be different for different failed nodes f . By “reactive”, we mean that cross-connects need to be setup for redistributing traffic to other intermediate nodes after failure – this is because backup bandwidth is shared at the link level in the optical layer. The scheme allows increased sharing of restoration bandwidth across different router node failure scenarios and may lead to increased throughput. As we shall show in Section VIII-B, the maximum throughput routing problem for this scheme also admits a fast combinatorial algorithm (FPTAS). VII. M AXIMIZING T HROUGHPUT FOR FAILURE I NDEPENDENT P ROVISIONING Given a network with link capacities ue and constraints Ri , Ci on the ingress-egress traffic, we consider the problem of routing with protection against router node failures through failure independent provisioning so as to maximize throughput. The throughput is the maximum multiplier λ such that all matrices in λ · T (R, C) can be feasibly routed with protection against
router node failures under the scheme. We first give an alternative (but equivalent) definition of throughput and then present a linear programming formulation for maximizing it. Because other intermediate nodes should be able to take up the traffic split ratios of the failed node f , a necessary and sufficient condition for the αj0 values to be feasible is X
αj0 ≥ 1 ∀ f ∈ N
j∈N,j6=f
It suffices to have the minimum of these n sums exactly equal to 1. Thus, in the case that this is not so, the traffic split ratios can be divided by X
λ = min f ∈N
αj0
j∈N,j6=f
(normalized) to achieve the desired result – in which case all traffic matrices in λ · T (R, C) can be feasibly routed. Thus, the appropriate measure of throughput in this case is the quantity λ as defined above. We adopt the standard network flow terminology from [1]. Let xij e denote the flow on link e 0 0 for routing the demand of αj Ri + αi Cj from node i to node j. Then, the problem of two-phase routing with failure independent provisioning so as to maximize the network throughput can be formulated as the following linear program:
maximize λ subject to λ
≤
X
αj0 ∀ f ∈ N
(2)
j∈N,j6=f
X e∈E + (k)
xij e −
X
xij e
e∈E − (k)
X
xij e
0 0 αj Ri + αi Cj 0 = −αj Ri − αi0 Cj 0 ≤ ue ∀ e ∈ E
if k = i if k = j
∀ i, j, k ∈ N
(3)
otherwise (4)
i,j∈N
This linear program can be solved in polynomial using a general linear programming algorithm [17]. Let λU N P and λF IP denote the maximum throughputs respectively for two-phase routing for the unprotected case and for router node failure protection using failure independent provisioning. The following theorem (proved in the Appendix) establishes an upper bound on λF IP relative to λU N P .
Theorem 1: The throughput λF IP of two-phase routing with failure independent provisioning to protect against router node failures is at most n−1 times the throughput λU N P for the n unprotected case. As we will see in Section IX, the ratio λλUFNIPP is within 2% of this theoretical upper bound for seven out of the nine evaluated topologies. We provide an intuitive explanation that gives the quantity n−1 as a rough theoretical estimate n λF IP of the ratio λU N P . The average split ratio per node under normal (no-failure) conditions is n1 . When a router node fails, each of the other n − 1 nodes takes up on the average an additional 1 split ratio of n(n−1) . Thus, the estimate for the ratio λλUFNIPP is 1 n
1 n
+
1 n(n−1)
=
n−1 n
VIII. M AXIMIZING T HROUGHPUT FOR FAILURE D EPENDENT P ROVISIONING Given a network with link capacities ue and constraints Ri , Ci on the ingress-egress traffic, we consider the problem of routing with protection against router node failures through failure dependent provisioning so as to maximize throughput. For failure dependent provisioning, we work explicitly with both normal (no-failure) traffic split ratios αj and failure redistribution ratios βjf . Suppose we relax the requirement that the traffic split ratios αj sum to 1 in a feasible solution of the problem. Consider the sum λ=
X
αi
i∈N
The traffic split ratios αj can be normalized (divided) by λ so that they sum to 1, in which case all matrices in λ · T (R, C) can be feasibly routed. (The failure redistribution ratios βjf values are scaled by the same amount.) Thus, the appropriate measure of throughput in this case is the quantity λ above when the traffic split ratios αj are not constrained to sum to 1. We first present a path flow based linear programming formulation for this problem. This will be subsequently used to develop the fast combinatorial algorithm (FPTAS) in Section VIII-B. A. Path Indexed Linear Programming Formulation Let x(P ) denote the flow on path P under normal (no-failure) conditions. Let yf (P ) be the restoration flow that appears on path P after failure of node f . Let Pij denote the set of all paths from node i to node j. Then, the problem of two-phase routing with failure dependent provisioning so as to maximize the network throughput can be formulated as the following path-indexed linear program:
maximize
P i∈N
αi
subject to X
x(P ) = αj Ri + αi Cj ∀ i, j ∈ N
(5)
P ∈Pij
X
yf (P ) = βjf Ri + βif Cj ∀ i, j, f ∈ N
(6)
P ∈Pij
X
X
βjf = αf ∀ f ∈ N
j∈N,j6=f
X
x(P ) +
i,j P ∈Pij ,P 3e
X
X
yf (P ) ≤ ue ∀ e ∈ E, f ∈ N
(7) (8)
i,j P ∈Pij ,P 3e
αi , βif ≥ 0 ∀ i, f ∈ N x(P ), yf (P ) ≥ 0 ∀ P ∈ Pij , ∀ i, j, f ∈ N
(9) (10)
In general, a network can have an exponential number of paths (in the size of the network). Hence, this linear program can have possibly exponential number of variables and is not suitable for running on medium to large sized networks. The path-indexed formulation can be converted to a polynomial size link-indexed program, thus allowing it to be solved in polynomial time using a general linear programming algorithm [17]. We omit this for lack of space. It is well known that running times of general linear programming based algorithms for network problems do not scale well with increasing network size. In Section VIII-B, we state the dual of the above path-indexed linear program. The usefulness of the primal and dual formulation is in designing a fast combinatorial algorithm for the problem. B. Combinatorial Algorithm In this section, we develop a fast combinatorial algorithm (FPTAS) for failure dependent provisioning. We begin with the dual formulation of the above linear program. The primal-dual approach we develop is adapted from the technique in Garg and K¨onemann [6] for solving the maximum multicommodity flow problem, where flows are augmented in the primal solution and dual variables are updated in an iterative manner. The dual formulation of the linear program associates a variable πij with each demand constraint in (5), a variable γijf with each demand constraint in (6), a variable σf with each split redistribution constraint in (7), and a non-negative variable w(e, f ) with each link capacity constraint in (8).
Ri i Pi
P'i
k
f
Q'j
Qj
j Cj Fig. 4.
One Step in the Primal-Dual Computation.
For each node i, j ∈ N , denote by SP (i, j) the cost of the shortest path from node i to node P j under link costs c(e) = f ∈N w(e, f ) ∀ e ∈ E. That is, SP (i, j) = min
P ∈Pij
X X
w(e, f )
e∈P f ∈N
Also, let SPf (i, j) denote the cost of the shortest path from node i to node j under link costs c(e) = w(e, f ) ∀ e ∈ E. That is, SPf (i, j) = min
P ∈Pij
X
w(e, f )
e∈P
We define two more quantities before arriving at the dual linear program formulation. For given weights w(e, f ) and any node k ∈ N , we define V (k) as V (k) =
X
Ri SP (i, k) +
i:i6=k
X
Cj SP (k, j) ∀ k ∈ N
j:j6=k
and, for any k, f ∈ N, k 6= f , we define W (k, f ) as W (k, f ) =
X
Ri SPf (i, k) +
i:i∈{k,f / }
X
Cj SPf (k, j)
j:j ∈{k,f / }
∀ k, f ∈ N, k 6= f After simplification and removal of the dual variables πij , γijf , and σf , the dual linear program can be written as:
minimize
X X e∈E f ∈N
ue w(e, f )
subject to V (f ) + W (k, f ) ≥ 1 ∀ k, f ∈ N, k 6= f w(e, f ) ≥ 0 ∀ e ∈ E, f ∈ N
(11) (12)
Given any set of weights w(e, f ), note that the quantities V (k) and W (k, f ) above can be computed in polynomial time by simple shortest path computations. Let U (k, f ) denote the left-hand-side (LHS) of constraint (11) for any k, f ∈ N, k 6= f , that is U (k, f ) = V (f ) + W (k, f ) ∀ k, f ∈ N, k 6= f The algorithm works as follows. Start with initial weights w(e, f ) = uδe ∀ e ∈ E, f ∈ N (the quantity δ depends on ² and is derived later). Repeat the following until the dual objective function value becomes greater than 1: 1) Compute nodes f = f¯ and k = k¯ for which U (k, f ) is minimum. This also identifies (i) paths Pi from node i to node f¯ for all i 6= f¯, (ii) paths Qj from node f¯ to node j for ¯ f¯}, and (iv) paths Q0 from all j 6= f¯, (iii) paths Pi0 from node i to node k¯ for all i ∈ / {k, j ¯ ¯ ¯ node k to node j for all j ∈ / {k, f }. (These are the corresponding shortest paths used in ¯ ¯ evaluating U (k, f ) as described above.) This is illustrated in Figure 4. 2) For a traffic split ratio of 1 for intermediate node f¯, the traffic on path Pi is Ri for all i 6= f¯ and the traffic on path Qj is Cj for all j 6= f¯. Using this, compute the traffic s(e) on link e under normal (no-failure) conditions per unit split ratio αf¯ for intermediate node f¯ as s(e) =
X
Ri +
i6=f¯,Pi 3e
X
Cj ∀ e ∈ E
(13)
j6=f¯,Qj 3e
3) For every unit of traffic split ratio for intermediate node f¯ that is redistributed to inter¯ f¯} mediate node k¯ when router node f¯ fails, the traffic on path Pi0 is Ri for all i ∈ / {k, ¯ f¯}. Using this, compute the traffic s0 (e, f ) and the traffic on path Q0j is Cj for all j ∈ / {k, that appears on link e after failure of router node f¯ per unit split ratio αf¯ of intermediate node f¯ that is redistributed to intermediate node k¯ as s0 (e) =
X ¯ f¯},P 0 3e i∈{ / k, i
Ri +
X
Cj ∀ e ∈ E
(14)
¯ f¯},Q0 3e j ∈{ / k, j
4) Compute the maximum value α for the traffic split ratio for intermediate node f¯ that is redistributed to intermediate node k¯ after failure of router node f¯ such that for every link e, the sum of working flow and restoration flow due to failure of router node f¯ sent during
this iteration is at most the link capacity ue . The working flow uses intermediate node f¯ and is sent along paths Pi , Qj and the restoration flow uses intermediate node k¯ and is sent along paths Pi0 , Q0j . The maximum value of α is given by ue α = min (15) e∈E s(e) + s0 (e) 5) For this value α of the traffic split ratio for intermediate node f¯, send αRi amount of working flow on path Pi for all i 6= f¯ and αCj amount of working flow on path Qj for all j 6= f¯. Compute the total working flow on link e as ∆(e) = αs(e) for all e ∈ E. 6) The traffic split ratio α for intermediate node f¯ is redistributed to intermediate node k¯ after ¯ f¯} / {k, failure of router node f¯. Send αRi amount of restoration flow on path Pi0 for all i ∈ ¯ f¯}. Compute the total and αCj amount of restoration flow on path Q0j for all j ∈ / {k, restoration flow that appears on link e after failure of router node f¯ as ∆0 (e, f¯) = αs0 (e) for all e ∈ E. 7) For each e ∈ E, update weights w(e, f¯) as Ã
²[∆(e) + ∆0 (e, f¯)] w(e, f¯) ← w(e, f¯) 1 + ue
!
8) For each e ∈ E, f ∈ N, f 6= f¯, update weights w(e, f ) as Ã
²∆(e) w(e, f ) ← w(e, f ) 1 + ue
!
9) Increment by α both the traffic split ratio αf¯ associated with intermediate node f¯ and the redistribution ratio βk¯f¯ to intermediate node k¯ after failure of router node f¯. When the above procedure terminates, primal link capacity constraints will be violated, since we were working with the original (and not residual) link capacities at each stage. To remedy this, we scale down the normal split ratios αi and and failure redistribution ratios βjf uniformly so that capacity constraints are obeyed. Note that since the algorithm maintains primal and dual solutions at each step, the optimality gap can be estimated by computing the ratio of the primal and dual objective function values. The computation can be terminated immediately after the desired closeness to optimality is achieved. The pseudo-code for the above procedure, called Algorithm FDP (for Failure Dependent Provisioning), is provided in the box below. Arrays work(e) and bkp(e, f ) keep track respectively of the working traffic on link e and the restoration traffic that appears on link e due to failure of router node f as the algorithm progresses. The variable D is initialized to 0 and remains less than 1 as long as the dual objective function value is less than 1. After the while loop terminates, the factor by which the capacity constraint on each link e gets violated is computed into array
scale(e). Finally, the αi and βjf values are divided by the maximum capacity violation factor and the resulting values output. Algorithm FDP: αk ← 0 ∀ k ∈ N ; βkf ← 0 ∀ k, f ∈ N, k 6= f ; w(e, f ) ← uδe ∀ e ∈ E, f ∈ N ; work(e) ← 0 ∀ e ∈ E ; bkp(e, f ) ← 0 ∀ e ∈ E, f ∈ N ; D←0; while D < 1 do P For each i, j ∈ N , compute shortest path from i to j under link costs c(e) = f ∈N w(e, f ) and denote its cost by SP (i, j) ; For each i, j ∈ N , compute shortest path from i to j under link costs c(e) = w(e, f ) for each f ∈ N and denote its cost by SPf (i, j) ; P P U (k, f ) ← i6=k Ri SP (i, k) + j6=k Cj SP (k, j)+ P P i∈{k,f / } Ri SPf (i, k) + j ∈{k,f / } Cj SPf (k, j) ∀ k, f ∈ N, k 6= f ; ¯ f¯) ← arg mink,f ∈N,k6=f U (k, f ) ; (k, (This identifies paths Pi , Qj , Pi0 , and Q0j as defined earlier.) P P s(e) = i6=f¯,Pi 3e Ri + j6=f¯,Qj 3e Cj ∀ e ∈ E ; P P s0 (e) = i∈{ ¯ f¯},Q0 3e Cj ∀ e ∈ E ; ¯ f¯},P 0 3e Ri + j ∈{ / k, / k, j i α ← mine∈E
ue s(e)+s0 (e)
;
∆(e) ← αs(e) ∀ e ∈ E ; ∆0 (e, f¯) ← αs0 (e) ∀ e ∈ E ; work(e) ← work(e) + ∆(e) ∀ e ∈ E ; bkp(e, f¯) ← bkp(e, f¯) + ∆0 (e, f¯) ∀ e ∈ E ; w(e, f ) ← w(e, f )(1 + ²∆(e) ) ∀ e ∈ E, ∀ f ∈ N, f 6= f¯ ; ue 0 (e,f¯)] w(e, f¯) ← w(e, f¯)(1 + ²[∆(e)+∆ ) ∀e∈E ; ue αf¯ ← αf¯ + α ; βk¯f¯ ← βk¯f¯ + α ; P P D ← e∈E f ∈N ue w(e, f ) ; end while bkp max(e) ← maxf ∈N bkp(e, f ) ∀ e ∈ E ;
max(e) scale(e) ← work(e)+bkp ∀e∈E ; ue scale max ← maxe∈E scale(e) ; αk ← scaleαkmax for all k ∈ N ; β βkf ← scalekfmax for all k, f ∈ N, k 6= f ; Output traffic split ratios αk and failure redistribution ratios βkf ;
The values of ² and δ are related, in the following theorem, to the approximation factor guarantee of Algorithm FDP. Theorem 2: For any given ²0 > 0, Algorithm FDP computes a solution with objective function value within (1 + ²0 )-factor of the optimum for δ=
1+² 1 √ and ² = 1 − [(1 + ²)m]1/² 1 + ²0
We end this section with a bound on the running time of Algorithm FDP. Theorem 3: For any given ² > 0 chosen to provide the desired approximation factor guarantee in accordance with Theorem 2, Algorithm FDP runs in time µ
1 O 2 n3 m(m + n log n) log nm ²
¶
which is strongly polynomial. Proofs of the above theorems are provided in the Appendix. IX. E VALUATION
ON
ISP T OPOLOGIES
In this section, we evaluate the throughput performance of the two schemes for protecting against router node failures in two-phase routing. In order to make exact comparisons with the throughput for the unprotected case, we compute the throughput for both the schemes using the link flow based linear programming formulations developed in this paper. For the throughput of the unprotected scheme, we use the linear programming formulation from [9]. We use CPLEX [22] to solve all linear programs. A. Topologies and Link/Ingress-Egress Capacities For our experiments, we use six ISP topologies collected by Rocketfuel, an ISP topology mapping engine [14]. These topologies list multiple intra-PoP (Point of Presence) routers and/or multiple intra-city PoPs as individual nodes. We coalesced PoPs into nodes corresponding to cities so that the topologies represent geographical PoP-to-PoP ISP topologies. Some data about the original Rocketfuel topologies and their coalesced versions is provided in Table I.
Topology
Routers
Links
PoPs
Links
(original)
(inter-router)
(coalesced)
(inter-PoP)
Telstra (Australia) 1221
108
306
57
59
Sprintlink (US) 1239
315
1944
44
83
Ebone (Europe) 1755
87
322
23
38
Tiscali (Europe) 3257
161
656
50
88
Exodus (Europe) 3967
79
294
22
37
Abovenet (US) 6461
141
748
22
42
TABLE I ROCKETFUEL TOPOLOGIES : O RIGINAL NUMBER OF ROUTERS AND INTER - ROUTER LINKS , AND NUMBER OF COALESCED P O P S AND INTER -P O P LINKS .
Fig. 5.
Three research network topologies.
We also use three research network topologies, namely, the UK research network JANET [23], the US research backbone ABILENE [24], and the European research network GEANT [25]. These are shown in Figure 5. The Rocketfuel topologies are router-level (IP layer) topologies. The PoP-to-PoP topologies we obtained as above all have average node degrees less than 4. Physical WDM topologies of ISPs are characterized by small average node degrees (typically less than 4). Assuming that all physical WDM links appear in the IP topology, it is conceivable that “most” of the links in the PoP-to-PoP Rocketfuel topologies correspond to physical WDM links (instead of multihop physical layer paths). We make this assumption for our experiments with IP-over-Optical networks. ISPs regard their topologies as proprietary information – we are not aware of other credible sources for information about actual ISP topologies. Link capacities, which are required to compute the maximum throughput, are not available for these topologies. Rocketfuel computed OSPF/IS-IS link weights for the topologies so that
shortest cost paths match observed routes. In order to deduce the link capacities from the weights, we assumed that the given link weights are the default setting for OSPF weights in Cisco routers, i.e., inversely proportional to the link capacities [3]. The link capacities obtained in this manner turned out to be symmetric, i.e., uij = uji for all (i, j) ∈ E. There is also no available information on the ingress-egress traffic capacities at each node for the Rocketfuel topologies. Because ISPs commonly engineer their PoPs to keep the ratio of add/drop and transit traffic approximately fixed, we assumed that the ingress-egress capacity at a node is proportional to the total capacity of network links incident at that node. We also assume that Ri = Ci for all nodes i – since network routers and switches have bidirectional ports (line P cards), hence the ingress and egress capacities are equal. Thus, we have Ri (= Ci ) ∝ e∈E + (i) ue . In contrast, the three research network topologies are physical WDM topologies. Link capacities are available for all three networks. Ingress-egress capacities are available for the JANET topology only. B. Experiments and Results We denote the throughput values for the unprotected and router node failure protected versions of two-phase routing as follows: (i) λU N P for unprotected, (ii) λF IP for protecting router node failures with failure independent provisioning, and (iii) λF DP for protecting router node failures with failure dependent provisioning. Clearly, λU N P > λF DP ≥ λF IP (the last inequality follows from the nature of failure dependent provisioning as explained in Section VI-B). We are also interested in the number of intermediate nodes i with non-zero traffic split ratios αi (under normal no-failure conditions), which we denote for the three cases by NU N P , NF IP , and NF DP respectively. For the Sprintlink 1239 and Tiscali 3257 topologies, the CPLEX processes for solving the linear program for failure dependent provisioning ran out of memory and were killed on a 2.4GHz Dual Xeon machine with 1GB of RAM and running Linux. This was the fastest machine with the highest RAM that we had access to for running CPLEX. Hence, for these two topologies, we used the combinatorial algorithm to obtain solutions within 5% of optimality. The corresponding entries for the two topologies are marked with an asterisk (∗ ) in Tables II and III respectively. For these two topologies, the value of λF DP , computed approximately as described above, turned out to be less than λF IP . 1) Throughput: In Table II, we list the relative increase in throughput of two-phase routing, compared to the unprotected case, for the two schemes for protecting against router node failures for the six Rocketfuel topologies and three research network topologies. We also list the closeness of the ratio λλUFNIPP to the theoretical upper bound of n−1 – for seven topologies, this is less than n
λF IP λU N P
Topology
λF DP λU N P
n−1 n
Closeness of λF IP λU N P
to
n−1 n
λF DP λF IP
Telstra (Australia) 1221
0.9744
0.9750
0.9825
0.82%
1.0006
Sprintlink (US) 1239
0.9683
0.9294∗
0.9773
0.92%
0.9598∗
Ebone (Europe) 1755
0.9500
0.9524
0.9565
0.68%
1.0025
Tiscali (Europe) 3257
0.9677
0.9293∗
0.9800
1.26%
0.9603∗
Exodus (Europe) 3967
0.9375
0.9412
0.9545
1.79%
1.0039
Abovenet (US) 6461
0.9369
0.9448
0.9545
1.85%
1.0084
JANET (8-node)
0.8671
0.9100
0.8750
0.90%
1.0495
ABILENE (11-node)
0.7979
0.8182
0.9091
12.23%
1.0254
GEANT (33-node)
0.9224
0.9310
0.9697
4.88%
1.0093
TABLE II T HROUGHPUT OF T WO -P HASE ROUTING FOR FAILURE INDEPENDENT (λF IP ) AND FAILURE DEPENDENT (λF DP ) PROVISIONING SCHEMES FOR PROTECTING AGAINST ROUTER NODE FAILURES COMPARED TO UNPROTECTED CASE
Topology
OF IP
Telstra (Australia) 1221
2.56%
2.50%
Sprintlink (US) 1239
3.17%
7.06%∗
Ebone (Europe) 1755
5.00%
4.76%
Tiscali (Europe) 3257
3.23%
7.07%∗
Exodus (Europe) 3967
6.25%
5.88%
Abovenet (US) 6461
6.31%
5.52%
JANET (8-node)
13.29%
9.00%
ABILENE (11-node)
20.21%
18.18%
GEANT (33-node)
7.76%
6.90%
(λU N P ).
OF DP
TABLE III OVERHEAD OF FAILURE INDEPENDENT (OF IP ) AND FAILURE DEPENDENT (OF DP ) PROVISIONING SCHEMES FOR PROTECTING AGAINST ROUTER NODE FAILURES COMPARED TO UNPROTECTED CASE FOR
T WO -P HASE ROUTING .
2%. The throughput of failure dependent provisioning (λF DP ) is higher than that of failure independent provisioning (λF IP ) by less than 5% for the seven topologies for which we could compute λF DP exactly by solving the corresponding linear programs in CPLEX. The overhead of protecting against router node failures can be measured by the percentage decrease in network throughput over that for the unprotected case. For failure independent F IP provisioning, this is OF IP = λU NλPU−λ . For failure dependent provisioning, this is OF DP = NP λU N P −λF DP . These values are listed in Table III. For both failure independent and failure depenλU N P dent provisioning schemes, the overhead range is about 2-8% for seven of the topologies. (It is somewhat higher for the JANET and ABILENE topologies because these have small number of nodes.) Thus, it is relatively inexpensive to provide resiliency against router node failures in two-phase routing.
Topology
NU N P
NF IP
NF DP
Telstra (Australia) 1221
1
38
38
Sprintlink (US) 1239
5
30
24
Ebone (Europe) 1755
4
19
19
Tiscali (Europe) 3257
7
30
29
Exodus (Europe) 3967
3
16
16
Abovenet (US) 6461
7
17
18
TABLE IV N UMBER OF I NTERMEDIATE N ODES IN T WO -P HASE ROUTING FOR UNPROTECTED (NU N P ), AND FAILURE INDEPENDENT (NF IP ), FAILURE DEPENDENT (NF DP ) PROVISIONING SCHEMES FOR PROTECTING AGAINST ROUTER NODE FAILURES .
We also observe that the overhead of failure dependent provisioning is only marginally lower than that for failure independent provisioning (less than 5%) for the seven topologies for which we could compute λF DP exactly. Hence, given the static optical layer provisioning property of failure independent provisioning in handling router node failures, it might be the preferred one among the two schemes. 2) Number of Intermediate Nodes: In Table IV, we list the number of intermediate nodes i with non-zero traffic split ratios (under normal no-failure conditions) for the three cases for the six Rocketfuel topologies. In our experiments with both the schemes for protecting against router node failures, we observed that some intermediate nodes have quite small αi values – they carry less than a percentage point of total network traffic. In practice, the traffic split ratios associated with these intermediate nodes can be redistributed to other nodes without any significant decrease in throughput. Hence, in Table IV, for the NF IP and NF DP values, we plot the number of intermediate nodes with the largest αi values (normalized) that sum to at least 0.95, i.e., the nodes with largest traffic split ratios that together carry at least 95% of total network traffic. Interestingly, the number of intermediate nodes increases when we provide protection against router node failures (it is almost always the same for the failure independent and failure dependent schemes). The explanation is as follows. For the unprotected case, a small number of intermediate nodes is chosen by the algorithm for maximizing throughput – these nodes are presumably located at geographical center(s) of the network and provide the best opportunities for serving as intermediate nodes without increasing the length of end-to-end paths or decreasing throughput significantly. However, a small number of intermediates nodes is associated with a relatively large traffic split ratio for each such node. Thus, in the event of a route node failure at any of these intermediate nodes, a relatively large fraction of the traffic needs to be restored, thus increasing the resources reserved for restoration. Hence, in an effort to maximize the throughput, the algorithms developed in this paper intelligently spread the traffic to many intermediate nodes
so as to prevent any single traffic split ratio from becoming too large. X. C ONCLUSION The two-phase routing scheme was recently proposed for handling highly variable traffic and allows preconfiguration of the network such that all traffic patterns, permissible within the network’s natural ingress-egress capacity constraints, can be handled in a capacity efficient manner without the necessity to detect any traffic changes in real-time. The scheme has the desirable property of supporting static optical layer provisioning in IP-over-Optical networks. For the Rocketfuel topologies, the throughput of two-phase routing is within 6% of the optimal scheme among the class of all schemes that are allowed to reconfigure the network in response to traffic changes [9]. In this paper, we considered the realization of two-phase routing in IP-over-OTN networks where routers are interconnected over a switched optical backbone. Given the high unreliability of routers, we proposed two different schemes for making two-phase routing in IP-over-OTN resilient against router node failures – one that is failure node independent and static, and the other that is failure node dependent and requires limited optical layer reconfiguration after failure. We developed linear programming formulations for both schemes and a fast combinatorial algorithm for the second scheme so as to maximize network throughput. In each case, we determine (i) the optimal distribution of traffic to various intermediate routers for both normal (no-failure) and failure conditions, and (ii) provisioning of optical layer circuits to provide the needed interrouter links. We view this as important progress for two-phase routing in IP-over-OTN towards achieving carrier-class reliability so as to facilitate its future deployment in ISP networks. One would ideally like the network to be quasi-static in its configuration and not require frequent adaptation to network events. ISPs typically use a combination of overprovisioning and dynamic network adaptation to avoid network congestion caused by unpredicted events. However, both overprovisioning and frequent adaptation lead to increased costs. In particular, frequent adaptation incurs high operational costs and risks further instability elsewhere in the network. Two-phase routing, with its extensions for router failure resiliency, can handle extreme traffic variability and router failures in a network with an almost static network configuration and without requiring high capacity overprovisioning. The ability to handle traffic variation without almost no routing adaptation will lead to more stable and robust Internet behavior. We evaluated the throughput performance of two-phase routing with the two proposed schemes for protecting against router node failures on actual ISP topologies and compared it with the unprotected case. We conclude that it is relatively inexpensive to provide resiliency against router node failures int two-phase routing. Also, the overhead of failure dependent provisioning is only marginally lower than that for failure independent provisioning. Hence, given the static
optical layer provisioning property of failure independent provisioning in handling router node failures, it might be the preferred one among the two proposed schemes. We also rationalized the observation that when maximizing throughput, the number of intermediate nodes is significantly higher for protecting against router node failures compared to the unprotected case. R EFERENCES [1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications, Prentice Hall, February 1993. [2] Y. Azar, E. Cohen, A. Fiat, H. Kaplan, and H. R¨acke, “Optimal oblivious routing in polynomial time”, 35th ACM Symposium on the Theory of Computing (STOC), 2003. [3] “Configuring OSPF”, Cisco Systems Product Documentation, http://www.cisco.com/univercd/home/home.htm. [4] N. G. Duffield, P. Goyal, A. G. Greenberg, P. P. Mishra, K. K. Ramakrishnan, J. E. van der Merwe, “A flexible model for resource management in virtual private network”, ACM SIGCOMM 1999, August 1999. [5] J. A. Fingerhut, S. Suri, and J. S. Turner, “Designing Least-Cost Nonblocking Broadband Networks”, Journal of Algorithms, 24(2), pp. 287-309, 1997. [6] N. Garg and J. K¨onemann, “Faster and Simpler Algorithms for Multicommodity Flow and other Fractional Packing Problems”, 39th Annual Symposium on Foundations of Computer Science (FOCS), 1998. [7] A. Kumar, R. Rastogi, A. Silberschatz , B. Yener, “Algorithms for provisioning VPNs in the hose model”, ACM SIGCOMM 2001, August 2001. [8] M. Kodialam, T. V. Lakshman, and S. Sengupta, “Efficient and Robust Routing of Highly Variable Traffic”, Third Workshop on Hot Topics in Networks (HotNets-III), November 2004. [9] M. Kodialam, T. V. Lakshman, J. B. Orlin, and S. Sengupta, “A Versatile Scheme for Routing Highly Variable Traffic in Service Overlays and IP Backbones”, IEEE Infocom 2006, April 2006. [10] M. Kodialam, T. V. Lakshman, J. B. Orlin, and S. Sengupta, “Preconfiguring IP-over-Optical Networks to Handle Router Failures and Unpredictable Traffic”, IEEE Infocom 2006, April 2006. [11] M. Kodialam, T. V. Lakshman, and S. Sengupta, “Maximum Throughput Routing of Traffic in the Hose Model”, IEEE Infocom 2006, April 2006. [12] M. Kodialam, T. V. Lakshman, and Sudipta Sengupta, “Throughput Guaranteed Restorable Routing Without Traffic Prediction”, IEEE ICNP 2006, November 2006. [13] C. Labovitz, A. Ahuja, and F. Jahanian, ”Experimental Study of Internet Stability and Backbone Failures”, Proceedings of 29th International Symposium on Fault-Tolerant Computing (FTCS), Madison, Wisconsin, pp. 278-285, June 1999. [14] N. Spring, R. Mahajan, D. Wetherall, and T. Anderson, “Measuring ISP Topologies with Rocketfuel”, IEEE/ACM Transactions on Networking, vol. 12, no. 1, pp. 2-16, February 2004. [15] A. Medina, N. Taft, K. Salamatian, S. Bhattacharyya, C. Diot, “Traffic Matrix Estimation: Existing Techniques and New Directions”, ACM SIGCOMM 2002, August 2002. [16] M. Reardon and S. Saunders, ”Terabit Trouble”, Data Communications, August 1999, pp. 11-16. [17] A. Schrijver, Theory of Linear and Integer Programming, John Wiley & Sons, 1986. [18] S. Sengupta, D. Saha, and V. P. Kumar, “Switched Optical Backbone for Cost-effective Scalable Core IP Networks”, IEEE Communications Magazine, June 2003. [19] I. Stoica, D. Adkins, S. Zhuang, S. Shenker, S. Surana, “Internet Indirection Infrastructure”, ACM SIGCOMM 2002, August 2002. [20] R. Zhang-Shen and N. McKeown “Designing a Predictable Internet Backbone Network”, Third Workshop on Hot Topics in Networks (HotNets-III), November 2004. [21] Y. Zhang, M. Roughan, C. Lund, and D. Donoho, “An Information-Theoretic Approach to Traffic Matrix Estimation”, ACM SIGCOMM 2003, pp. 301-312, August 2003.
[22] ILOG CPLEX, http://www.ilog.com. [23] JANET, http://www.ja.net. [24] ABILENE, http://abilene.internet2.edu. [25] GEANT, http://www.geant.net.
A PPENDIX A. Proof of Theorem 1 Let αi0 be the traffic split ratios in an optimal solution for the linear program of Section VII for maximizing throughput in failure independent provisioning. Summing constraints (2) for all f ∈ N , we have nλF IP ≤ (n − 1)
X
αi0
i∈N
In the linear programming formulation for maximizing throughput for the unprotected case [9], the throughput is the sum of the traffic split ratios (when they are not constrained to sum to 1). Also, the linear program for failure independent provisioning differs from that for the unprotected case only in the definition of throughput (in the former case, the throughput is the minimum value of the n possible sums of any n − 1 of the traffic split ratios). Thus, we must have X
αi0 ≤ λU N P
i∈N
Using this in the first inequality above, we obtain the upper bound claimed in the theorem. B. Proofs of Theorems 2 and 3 In this section, we provide proofs for Theorems 2 and 3 from Section VIII for the approximation factor guarantee and running time of Algorithm FDP. We begin with some notation, then state some useful lemmas, and finally conclude with the proofs of the main theorems. Given a set of dual weights w(e, f ), let D(w) denote the dual objective function value and let Γ(w) denote the minimum value of the LHS of dual program constraint (11) over all nodes k, f ∈ N, k 6= f . Then, solving the dual program is equivalent to finding a set of weights w(e, f ) such that D(w)/Γ(w) is minimized. Denote the optimal objective function value of the latter by θ, i.e., θ = minw D(w)/Γ(w). We introduce some more notation before stating an important lemma. Let wt−1 denote the weight function at the beginning of iteration t of the while loop, and let At−1 be the value P of j∈N αj (primal objective function) up to the end of iteration t − 1. Suppose the algorithm terminates after iteration L.
Lemma 1: At the end of every iteration t, 1 ≤ t ≤ L, of Algorithm FDP, the following holds t Y
² [1 + (Aj − Aj−1 )] θ j=1 ¯ ¯ Proof: Let f , k ∈ N be the nodes for which LHS of dual constraint (11) is minimum and let Pi , Qj , Pi0 , Q0j be the corresponding paths (as defined earlier) along which flow is augmented during iteration t. Recall that the weights are updated as: D(wt ) ≤ mnδ
Ã
²∆(e) wt (e, f ) ← wt−1 (e, f ) 1 + ue
!
∀ e ∈ E, ∀ f ∈ N, f 6= f¯
Ã
!
²[∆(e) + ∆0 (e, f¯)] ¯ ¯ wt (e, f ) ← wt−1 (e, f ) 1 + ∀e∈E ue where ∆(e) is the total working flow on link e, and ∆0 (e, f¯) is the total restoration flow on link e due to failure of router node f¯ (both sent during iteration t). Using this, we have X
D(wt ) =
ue wt (e, f )
e∈E,f ∈N
X
=
X
wt−1 (e, f )∆(e) +
e∈E,f 6=f¯
e∈E,f ∈N
²
X
ue wt−1 (e, f ) + ²
wt−1 (e, f¯)[∆(e) + ∆0 (e, f¯)]
e∈E
X
= D(wt−1 ) + ²
e∈E,f ∈N
X
= D(wt−1 ) + ² X
¯ f¯},P 0 3e i∈{ / k, i
e∈E
X
= D(wt−1 ) + ²α X
wt−1 (e, f¯)[
X
X
Ri +
Cj ] +
j6=f¯,Qj 3e
X
Ri +
¯ f¯},P 0 3e i∈{ / k, i
e∈E
X
i6=f¯,Pi 3e
e∈E,f ∈N
²α
αCj ]
¯ f¯},Q0 3e j ∈{ / k, j
wt−1 (e, f )[
αCj ] +
j6=f¯,Qj 3e
X
αRi +
X
αRi +
i6=f¯,Pi 3e
X
wt−1 (e, f¯)[
wt−1 (e, f¯)∆0 (e, f¯)
e∈E
X
wt−1 (e, f )[
e∈E,f ∈N
²
X
wt−1 (e, f )∆(e) + ²
Cj ]
¯ f¯},Q0 3e j ∈{ / k, j
We interchange the order of summation in the second and third terms on RHS of the above equation and first sum along links on paths Pi , Qj , and then over i, j respectively. Similarly, for the third term, we interchange the order of summation and first sum along links on paths Pi0 , Q0j , and then over i, j respectively. This gives D(wt ) = D(wt−1 ) + ²α[ X ¯ f¯} i∈{ / k,
Ri
X
X
i6=f¯
X
Ri
wt−1 (e, f ) +
e∈Pi ,f ∈N
wt−1 (e, f¯) +
e∈Pi0
¯ f¯) = D(wt−1 ) + ²αU (k,
X
¯ f¯} j ∈{ / k,
Cj
X
X j6=f¯
X
Cj
wt−1 (e, f ) +
e∈Qj ,f ∈N
wt−1 (e, f¯)]
e∈Q0j
(16)
= D(wi−1 ) + ²αΓ(wi−1 ) = D(wi−1 ) + ²(At − At−1 )Γ(wi−1 ) ¯ f¯ and associated paths Pi , Qj , P 0 , Q0 The step leading to (16) follows from the choice of nodes k, i j that minimize U (k, f ) for the set of weights wt−1 (e, f ) at the beginning of iteration t. Using this for each iteration down to the first one, we have D(wt ) = D(w0 ) + ²
t X
(Aj − Aj−1 )Γ(wj−1 )
(17)
j=1 j−1 ) , whence Γ(wj−1 ) ≤ 1θ D(wj−1 ). Also, D(w0 ) = From the definition of θ, we have θ ≤ D(w Γ(wj−1 ) mnδ. Using these in equation (17), we have
t ²X D(wt ) ≤ mnδ + (Aj − Aj−1 )D(wj−1 ) θ j=1
(18)
The property claimed in the lemma can now be proved using inequality (18) and mathematical induction on the iteration number t. We omit the details here, but point out that the induction basis case (iteration t = 1) holds since w0 (e, f ) = uδe ∀ e ∈ E, f ∈ N and D(w0 ) ≤ mnδ. We now estimate the factor by which the objective function value value AL in the primal solution needs to be scaled when the algorithm terminates so as to ensure that link capacity constraints are not violated. Lemma 2: When Algorithm FDP terminates, the primal solution needs to be scaled by a factor of at most log1+² 1+² to ensure primal feasibility. δ Proof: Consider any link e and router failure at some node f ∈ N . We will show that the working flow on link e plus the restoration flow on link e due to failure of router node f is at most ue when the primal solution is scaled by the above factor. The value of w(e, f ) changes (due to an update) when flow is augmented on edge e under either or both of the following circumstances: •
•
Link e appears on any of the paths Pi , Qj , in which case the flow is working traffic on this link, or Link e appears on any of the paths Pi0 , Q0j , in which case the flow appears as restoration traffic on link e after failure of router node f .
Let the sequence of flow augmentations (working plus restoration) on link e that require update P of weight w(e, f ) be ∆1 , ∆2 , . . . , ∆r , where r ≤ L. Let rt=1 ∆t = κue , i.e., the total flow (working traffic plus restoration traffic after failure of router node f ) routed on link e exceeds its capacity by a factor of κ. Because of the way in which α is chosen in accordance with equations (13)-(15), we have ∆i ≤ ue for all i. Hence, dual weights are updated by a factor of at most 1 + ² after each
iteration. Since the algorithm terminates when D(w) ≥ 1, and since dual weights are updated by a factor of at most 1 + ² after each iteration, we have D(wL ) < 1 + ². Since the weight w(e, f ), with coefficient ue , is one of the summing components of D(w), we have ue wL (e, f ) < 1 + ². Also, the value of wL (e, f ) is given by wL (e, f ) =
r δ Y ∆t (1 + ²) ue t=1 ue
Using the inequality (1 + cx) ≥ (1 + x)c ∀ x ≥ 0 and any 0 ≤ c ≤ 1 and setting x = ² and t c= ∆ ≤ 1, we have ue r 1+² δ Y > wL (e, f ) ≥ (1 + ²)∆t /ue ue ue t=1
Pr δ (1 + ²) t=1 ∆t /ue ue δ (1 + ²)κ = ue
=
whence, κ < log1+²
1+² δ
Proof of Theorem 2: Using Lemma 1 and the inequality 1 + x ≤ ex for all x ≥ 0, we have D(wt ) ≤ mnδ
t Y
²
e θ (Aj −Aj−1 )
j=1
= mnδe²At /θ The simplification in the above step uses telescopic cancellation of the sum (Aj − Aj−1 ) over j. Since the algorithm terminates after iteration L, we must have D(wL ) ≥ 1. Thus, 1 ≤ D(wL ) ≤ mnδe²AL /θ whence, ² θ ≤ 1 AL ln mnδ
(19)
From Lemma 2, the objective function value of the feasible primal solution after scaling is at least AL log1+² 1+² δ
The approximation factor for the primal solution is at most the gap (ratio) between the dual and primal solutions. Using the lower bound on the primal solution and inequality (19), this is at most ² log1+² 1+² θ δ ≤ 1 AL ln 1+² mnδ log 1+²
δ
=
ln 1+² ² δ 1 ln(1 + ²) ln mnδ
1 1 The quantity ln 1+² / ln mnδ equals 1−² for δ = (1 + ²)/[(1 + ²)nm]1/² . Using this value of δ, the δ approximation factor is upper bounded by ² 1 ≤ (1 − ²) ln(1 + ²) (1 − ²)2
Setting 1 + ²0 = 1/(1 − ²)2 and solving for ², we get the value of ² stated in the theorem. Proof of Theorem 3: We first consider the running time of each iteration of the algorithm during which nodes f¯, k¯ and associated paths Pi , Qj , Pi0 , Q0j are chosen to augment flow. Computation of the minimum value of U (k, f ) over all k, f ∈ N, k 6= f involves n all-pairs shortest path computations which can be implemented in O(n2 m + n3 log n) time using Dijkstra’s shortest path algorithm with Fibonacci heaps [1]. It can be verified that all other operations within an iteration are absorbed by the time taken for these n all-pairs shortest path computations, leading to a total of O(n2 (m + n log n)) time per iteration. We next estimate the number of iterations before the algorithm terminates. Recall that in each iteration, flow is augmented along paths Pi , Qj , Pi0 , Q0j corresponding to the maximum value of intermediate node split ratio α such that the working flow ∆(e) plus the restoration flow ∆0 (e, f¯) sent on link e during that iteration is at most ue . Thus, for at least one link e, the total flow sent equals ue and the weight w(e, f¯) increases by a factor of 1 + ². Accordingly, with each iteration, we can associate a weight w(e, f ) which increases by a factor of 1 + ². Consider the weight w(e, f ) for fixed e ∈ E, f ∈ N . Since w0 (e, f ) = uδe and wL (e, f ) < 1+² ue (as deduced in the proof of Lemma 2), the maximum number of times that this weight can be associated with any iteration is 1 1 1+² = (1 + log1+² nm) = O( log1+² nm) log1+² δ ² ² Since there are a total of nm weights w(e, f ), hence the total number of iterations is upper bounded by O( nm log1+² nm). Multiplying this by the running time per iteration, we obtain the ² overall algorithm running time as O( 1² n3 m(m + n log n) log1+² nm). Since ln(1 + ²) = Θ(²), this is O( ²12 n3 m(m + n log n) log nm).