An Integrated IP/Optical Approach for Efficient Access Router Failure Recovery

Panagiotis Sebos, Jennifer Yates, Guangzhi Li, Dan Rubenstein and Monica Lazer

AT&T Labs-Research, 180 Park Ave, Florham Park, NJ 07932, {psebos,jyates,gli}@research.att.com, [email protected]
*EE Dept, Columbia Univ., 500W 120th Str., Room 1312, NY, NY 10027; [email protected]

© 2003 Optical Society of America
OCIS codes: (060.4250) Networks; (060.4510) Optical communications

Abstract: We propose and experimentally demonstrate a novel integrated IP/optical architecture to cost-effectively recover from access router failures in large ISP networks.

1. Motivation

The reliability expectations for IP networks are rapidly changing as enterprises increasingly rely on IP to support business-critical applications. Integrated IP and optical architectures have been widely proposed in the research literature to recover from failures within Internet Service Providers’ (ISP) networks. However, little attention has been given to how integrated mechanisms can be used at the edge of an ISP’s network. In this paper, we propose a novel integrated IP/optical scheme for protecting against failures of access routers, which are typically a single point of failure in today’s ISP networks.

Fig. 1 illustrates a typical ISP network, showing both the transport and IP network elements. Customer routers connect to access routers (AR) in the ISP network using private line circuits, known as access circuits. These access circuits are carried over metropolitan and local area transport networks, which are typically implemented using SONET/SDH ring or mesh networks or emerging carrier technologies such as Gigabit Ethernet. Protection from transport failures is typically achieved using ring or mesh restoration. Within an ISP’s network, failures are typically handled by re-routing traffic around the failed elements using IP routing protocols such as Open Shortest Path First (OSPF). However, hardware and software failures can cause the interfaces on ARs connected to customers to fail, or even entire ARs to fail. In such scenarios, customers connected to the impacted AR typically lose service until the AR is manually repaired.

Fig. 1: Typical ISP Network Topology (figure labels: BR, AR, IP Backbone, Metro Transport, Customer facing IP interface, Customer routers, Transport ADM/XC)
One approach used today to recover from such failures is for customers to be multi-homed to two or more ARs. However, this solution is expensive, as it requires multiple dedicated access circuits and multiple ports on the customer routers and ARs. Multi-homing also introduces additional configuration complexity on customer routers. Although router vendors have implemented schemes to recover from interface and router failures without requiring multi-homing from customers [1], these schemes typically either require dedicated backup resources, and are thus expensive, or require packet transport mechanisms, which are not widely used in carrier transport networks today.

In [2], we addressed these issues for interface failures by demonstrating the use of integrated IP and optical schemes to provide 1:N interface protection for cost-effective yet rapid recovery from AR interface failures. In 1:N interface protection, a single spare interface is used to protect against the failure of any one of N working interfaces [2]. In this paper, we extend the basic concepts of 1:N interface protection from [2] to demonstrate a novel scheme to cost-effectively recover from complete AR failures. Using this approach, a single backup router can be used to recover from the failure of any one of multiple working routers. The sharing of backup resources makes this scheme more economical than existing dedicated recovery mechanisms. More generally, restoration is performed on an individual interface basis, which means that customers connected to a single failed router can be recovered across spare interfaces on multiple working routers, according to bandwidth and port availability. For simplicity, however, we discuss here only the case in which N working routers are protected by a single backup router.

Beyond the clear economic and reliability advantages, the scheme proposed here is not restricted by the transport technology used for the access circuits. Additionally, it does not require any special configuration, hardware or software in the customer routers, although we propose some protocol extensions that can be used to reduce recovery times.
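As an illustration of the per-interface restoration mentioned above, the following minimal Python sketch places the access circuits of a failed router on spare interfaces of other routers according to bandwidth and port availability. The greedy largest-first policy, the `Router` class, the `assign_circuits` function and all bandwidth figures are assumptions made for illustration only; the paper does not specify an assignment algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical view of a router's spare capacity."""
    name: str
    spare_ports: int          # free customer-facing interfaces
    spare_bandwidth: float    # unused capacity (illustrative unit: Mb/s)
    assigned: list = field(default_factory=list)


def assign_circuits(circuits, routers):
    """Greedily place failed access circuits on routers with spare capacity.

    circuits: list of (customer, bandwidth) tuples from the failed router.
    Largest circuits are placed first; each goes to the eligible router
    with the most spare bandwidth. Returns (placement, unplaced).
    """
    placement, unplaced = {}, []
    for customer, bw in sorted(circuits, key=lambda c: c[1], reverse=True):
        candidates = [r for r in routers
                      if r.spare_ports > 0 and r.spare_bandwidth >= bw]
        if not candidates:
            unplaced.append(customer)   # no port or bandwidth left anywhere
            continue
        target = max(candidates, key=lambda r: r.spare_bandwidth)
        target.spare_ports -= 1
        target.spare_bandwidth -= bw
        target.assigned.append(customer)
        placement[customer] = target.name
    return placement, unplaced
```

In the simplified case discussed in this paper, `routers` would contain a single backup router holding all of the spare ports.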

2. Proposed Scheme

In our proposed integrated IP/optical AR recovery scheme, when a primary AR (PAR) fails, the reconfigurable optical/transport network and IP routing protocols are used together to move the impacted customer connections from the failed PAR to a backup AR (BAR). The BARs may be co-located with the PARs, or may be in a geographically remote location, enabling recovery from catastrophic central office failures.

The first step of the recovery process is for the BAR to detect the failure. Multiple mechanisms can be used for this, such as keep-alive messages between the BAR and the PARs, or existing routing protocols such as OSPF to inform the BAR of the failure. Once the PAR failure has been identified, the customer’s access circuit is re-routed from the PAR to the BAR over a reconfigurable optical network. Either the IP or optical layer can initiate the reconfiguration, potentially through tight integration using protocols such as the Optical Internetworking Forum (OIF)’s User to Network Interface (UNI) [3] or Generalized Multi-Protocol Label Switching (GMPLS). After the physical layer is established, the datalink layer must be resynchronized. The mechanisms used depend on the datalink technology (e.g., Packet Over SONET (POS), Gigabit Ethernet) [2].

Each AR contains significant amounts of customer-specific information, including IP addresses and security and routing policies. This customer-specific information must be instantiated and activated on the BAR. In ISP ARs with large numbers of customers, hardware limitations will likely mandate that this configuration be transferred to the BAR after failure. This can be achieved using an external configuration server, which tracks the customer-specific information associated with ARs and downloads the configuration to the BAR upon failure. Once the customer-specific configuration is established on the BAR, we need to ensure that customer traffic can be routed via the BAR.
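The recovery steps described so far, together with the BGP re-advertisement discussed next, form an ordered sequence. The sketch below shows one way such a sequence could be driven and timed in Python. The step names, the `run_recovery` function and the handler interface are hypothetical; each handler is a stand-in for real IP and optical control-plane actions, not the implementation used in our experiments.

```python
import time

# Ordered recovery steps as described in the text. The names are
# hypothetical labels, not identifiers from any real system.
RECOVERY_STEPS = [
    "detect_par_failure",        # e.g. keep-alives or OSPF
    "reroute_access_circuit",    # optical/transport reconfiguration
    "resync_datalink",           # e.g. POS or Gigabit Ethernet
    "download_customer_config",  # from the external configuration server
    "readvertise_via_bgp",       # BAR announces the moved customers
]


def run_recovery(handlers, clock=time.monotonic):
    """Run every step in order and record how long each one took.

    handlers: dict mapping step name to a zero-argument callable that
    returns True on success. Raises RuntimeError at the first failed
    step, since later steps depend on earlier ones.
    Returns a list of (step, elapsed_seconds).
    """
    timeline = []
    for step in RECOVERY_STEPS:
        start = clock()
        if not handlers[step]():
            raise RuntimeError(f"recovery step failed: {step}")
        timeline.append((step, clock() - start))
    return timeline
```

Recording per-step durations makes it straightforward to see which step dominates the overall restoration time.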
Each AR advertises the customers connected to it throughout the ISP’s network. This is typically achieved using the Border Gateway Protocol (BGP). BGP exchanges routing information so that the network is able to send traffic to the customer and vice versa. Once the customers are moved from the failed PAR to the BAR, the BAR must re-advertise these customers via BGP so that traffic can be sent to the customers via the BAR. In some cases, the customers also use BGP to communicate with the ISP’s ARs. Although this process will work with existing BGP implementations and configurations, we can optimize the BGP implementation used within the ISP to reduce the recovery times (e.g., by dynamically adjusting BGP’s LOCAL_PREF attribute in the BAR). Further reductions in the recovery times can be achieved if we can also adapt customer BGP implementations, by extending the BGP graceful re-start procedures currently being standardized [4]. The graceful re-start mechanisms are designed to avoid having BGP implementations drop all of the routing information when a failure is detected; we use them here so that the customer keeps the routing information from before the failure, reducing the amount of traffic lost. However, to make this work in our scenario, we had to extend the BGP re-start mechanism so that it allows a different router to be used at the end of the BGP session, also known as an adjacency (i.e., the customer is talking to a different router, the BAR, after it recovers physical connectivity).

3. Experimental Results

To investigate the feasibility of our proposed IP/optical AR recovery mechanism, we implemented and experimentally demonstrated our proposed scheme using the experimental setup depicted in Fig. 2. The backbone IP network, representative of an ISP’s backbone, consists of four core routers and four ARs all running OSPF and BGP.
We have three representative customer routers, C1, C2, and C3, two of which are connected to our ARs over a four-node reconfigurable transport network. BGP is executed between each customer and the AR to which it is connected. Our routers are all Linux PCs, as is our configuration server, which is connected to the ARs via Fast Ethernet. The transport network used in our experiments consists of Gigabit Ethernet (GbE) cross-connects (XCs) and our prototype control plane [5]. Two PARs (AR1, AR2) and the BAR (AR3) are connected to a common XC. Bidirectional traffic sent between C1 and C3 is used to measure packet loss and thus estimate restoration times. Signaling between the ARs and their adjacent XCs is achieved through extensions to the OIF UNI 1.0 [3].

Fig. 2: Experimental Setup (figure labels: configuration server; IP Backbone with B1 to B4; AR1 to AR4; transport network with GbE XCs; customer routers C1 to C3)

The restoration time comprises the time for failure detection, establishment of physical connectivity between the customer and the BAR, downloading of configuration state to the BAR and finally the re-routing of traffic via BGP. We failed AR1, which was detected by our BAR, AR3, via OSPF. The detection time depends on the network topology and various OSPF timers, but was measured to be approximately 7 seconds in our experiments. After the failure was detected, the access circuit was switched from AR1 to AR3. The switching time is bounded by the signaling time through the transport network and the cross-connection times, and was measured to be well below a second in our setup. Data link layer resynchronization was then completed in a few seconds, without optimizations [2]. Our external configuration server was used to download customer configuration to the BAR after failure in less than a second.

Fig. 3a: BGP without graceful restart

Fig. 3b: BGP with graceful restart extensions

[Figs. 3a and 3b each plot Packets Received (0 to 200) against Time (0 to 80 seconds), with separate upstream and downstream curves.]

Figure 3: Throughput curves with and without extensions in the customer router

Figs. 3a and 3b depict sample throughput curves as a function of time for traffic sent between C1 and C3, without and with the BGP graceful restart extensions implemented between the customer and AR, respectively. Note that the actual restoration times vary according to when the failure occurs relative to different BGP timeouts. In both cases, the failure of AR1 occurred at t = 5 seconds, and no traffic could be routed to the destination while the physical layer and customer configuration were being moved to the BAR. However, the dominant contribution to recovery times came from BGP: establishing the BGP session between the customer and the BAR and then exchanging routing information over this session.

In both figures, we observe that the restoration times are longer for the downstream traffic (from the network to the customer router C1; the solid line in Figs. 3a and 3b) than for the upstream traffic (from C1 to the network; the dashed line in Figs. 3a and 3b). This is primarily because the BAR has to learn the customer routing information and propagate it throughout the entire backbone network and to other customers before traffic can be routed to the customer. In contrast, the customer only has to learn the network routing information directly from the adjacent BAR.

We also observe from Fig. 3a that the upstream traffic is successfully received by the destination between t = 17 and t = 58 seconds, is lost again at t = 58, and is finally recovered at t = 66. This is because once physical connectivity is established to the BAR, the customer sends traffic into the network until the customer’s BGP session detects that it is now talking to a new router (the BAR), whereupon the customer drops its BGP session, believing that there has been an error (t = 58). The BGP session is eventually re-established with the BAR and traffic flows again (t = 66).
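Since restoration times are estimated from packet loss, the outage periods can be read off a per-second count of received packets. The following Python sketch shows one possible way to do this. The function names are hypothetical, and the trace is synthetic: it reproduces only the event times quoted above for the upstream curve of Fig. 3a (failure at t = 5, traffic restored at t = 17, lost at t = 58, restored at t = 66), with an arbitrary rate of 100 packets per second. It is not measured data.

```python
def outage_intervals(samples, threshold=0):
    """Find periods during which no traffic was received.

    samples: list of (time_s, packets_received) pairs in time order.
    Returns a list of (start, end) pairs, where start is the first sample
    at or below the threshold and end is the first later sample above it
    (None if traffic never resumed).
    """
    intervals = []
    start = None
    for t, pkts in samples:
        if pkts <= threshold and start is None:
            start = t                      # outage begins
        elif pkts > threshold and start is not None:
            intervals.append((start, t))   # outage ends
            start = None
    if start is not None:
        intervals.append((start, None))
    return intervals


def total_outage(intervals):
    """Sum the durations of all outages that ended within the trace."""
    return sum(end - start for start, end in intervals if end is not None)


# Synthetic upstream trace built from the event times quoted in the text;
# the rate of 100 packets per second is an arbitrary placeholder.
trace = [(t, 0 if (5 <= t < 17 or 58 <= t < 66) else 100) for t in range(80)]
intervals = outage_intervals(trace)
print(intervals)                # [(5, 17), (58, 66)]
print(total_outage(intervals))  # 20
```

On this synthetic trace, the second outage of 8 seconds is the portion that the graceful re-start extensions are intended to remove.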
Without the BGP re-start mechanism, the customer drops all of its routes and thus cannot send traffic into the network until the session is re-established with the BAR. By extending the graceful re-start mechanism to inform the customer that the change in router is deliberate, we can avoid the customer dropping all of the routing information and traffic, thereby dramatically reducing the recovery times. Figs. 3a and 3b demonstrate the impact that our graceful re-start extensions, implemented on the customer router and AR, have in significantly reducing the average recovery times.

4. Conclusions

This paper demonstrated a novel scheme to cost-effectively recover from failures of access routers (ARs). The approach requires tight integration between the IP and optical layers, but does not necessarily require extensions to the routing protocols used in IP networks today. However, simple routing protocol extensions on ARs and even customer routers can be used to improve recovery times.

5. Bibliography

[1] e.g. Cisco Systems, “Hot Standby Router Protocol (HSRP),” http://www.cisco.com/en/US/tech/tk648/tk362/tk321/tech_protocol_home.html
[2] P. Sebos et al., “Ultra-fast IP link and interface provisioning with applications to IP restoration,” OFC 2003
[3] Bala Rajagopalan, editor, “User Network Interface (UNI) 1.0 Signaling Specification,” OIF2000.125, 2000
[4] Srihari R. Sangli et al., “Graceful Restart Mechanism for BGP,” IETF Draft, Aug 2003
[5] J. Yates et al., “A GMPLS-based Control Plane Prototype For End-to-end Services,” to appear, Optical Networks Magazine