Fault Tolerance and Resilience Issues in IP-Based Networks

Achim Autenrieth
Institute of Communication Networks, Munich University of Technology
Arcisstr. 21, 80290 Munich, Germany
Phone: +49 89 289 23504, Fax: +49 89 289 23523, E-mail: [email protected]

Andreas Kirstädter
Corporate Research, Information and Communication, Siemens AG
Otto-Hahn-Ring 6, 81730 Munich, Germany
Phone: +49 89 636 47484, Fax: +49 89 636 51115, E-mail: [email protected]

Abstract

Considerable effort is currently being put into research on Quality of Service issues in the internet. Until recently, little attention was paid to the resilience of IP-based networks. However, with the growing commercial importance of the internet and the development of new real-time, connection-oriented services such as streaming technologies or IP telephony, resilience is becoming a key issue in the design of IP-based networks. This paper discusses some aspects of IP resilience and Internet survivability and defines the resilience requirements and objectives for a next-generation internet architecture. The factors that influence resilience within the IP layer are discussed, also taking into consideration the network layers below IP. Finally, some proposals are made to increase the resilience of IP-based networks.

Introduction

The internet started from the military ARPA network. The main requirement of the ARPA network was robustness: good reachability of individual hosts and reliable services even with unreliable network elements (such as routers or links). The internet soon started to grow drastically, and the number of hosts and routers increased explosively. The internet changed from a military and university network to a global information network. The goal in the development of intra-domain and inter-domain IP routing mechanisms was maximum robustness against congestion problems and topology changes. However, reliability performance in terms of fast reconfiguration after network element failures was of less importance, since the main services offered had little or no real-time constraints.

A statement about the current ’internet availability’ is very difficult to make due to the distributed administration of the internet [1]. So far, no central institution exists to which large Network Service Providers (NSPs) have to report network outages in their administrative domain. Such reports could contain the failure which caused the outage and an indication of the effects of the outage (e.g. lost traffic). The only indications of the health of the internet are so-called internet weather reports, in which the current load of major IP backbone routers is analyzed and congestion situations can be observed. However, if a host is unreachable, an end user currently cannot distinguish whether the host itself is down, whether a network element along the route to the host has failed, or whether the problem is only a temporary overload on some link or router.

IP Resilience requirements

In recent years new services and applications have been developed with strong real-time, connection-oriented characteristics. Such services include Voice-over-IP and the Real Time Streaming Protocol (RTSP), which was submitted to the IETF by RealNetworks and Netscape in 1996. New protocols have also been developed to support real-time services at the transport level (e.g., Real-time Transport Protocol, RTP) and to meet quality-of-service requirements (e.g., DiffServ, RSVP) in the internet. The failure of a major link or backbone router may have severe effects on these services and protocols.

Classical, reliable (or robust) IP applications and services like FTP and HTTP in most cases only perceive a delay until the affected traffic is rerouted. After the rerouting is completed, the services may experience a graceful degradation of their quality of service, since the alternative route can be longer or more heavily utilized. It must be mentioned that traffic which is not directly affected by the failure but runs (partly) over the alternative route is also subject to this graceful degradation. On the other hand, the duration of the outage due to a link or node failure is in most cases too long for real-time services to hold up their sessions, which will then be dropped. QoS flows could experience an unacceptable reduction of their quality of service on the alternative route and can therefore not be reestablished. For these new services and applications, advanced rerouting mechanisms have to be developed which realize faster rerouting and traffic recovery, so that the sessions are not impaired. Additionally, the design of the internet architecture and the capacity planning should take alternative routes into account for IP flows with quality-of-service guarantees.

Another major trend in the internet is the growing importance of e-commerce. Many companies use e-commerce for their business-to-business trade, others offer their services to customers via the internet. A new kind of online company has emerged, which sells its products exclusively via the internet (e.g. Amazon.com). For all these companies the availability of internet access and the connectivity to their customers is critical for their business. Network outages result in direct loss of revenue and, moreover, loss of reputation. According to Cisco, the annual loss of a large e-commerce site with a best-effort availability of only 99% can be estimated at 3.7 million dollars [2]. Such e-commerce companies require a non-stop internet availability of up to 99.999%. A main obstacle to such a high internet availability is that a Network or Internet Service Provider can only guarantee the availability of its own network domain. Today it is not possible to guarantee network availability to an arbitrary end user.

Current IP Robustness

From the above considerations it can be concluded that resilience is a clear requirement for current and future IP networks. Unfortunately, since the Internet was designed for maximum connectivity and robustness, mechanisms for a fast recovery of traffic affected by network failures were not foreseen. One problem is that the detection of faults and failures in IP is very slow. Routing mechanisms depend on the exchange of reachability or hello messages to verify that adjacent routers are up. To reduce the signaling overhead caused by the routers and to increase the stability of the routing protocols, the intervals at which these messages are sent are relatively large (in the range of several seconds). For Open Shortest Path First (OSPF), a common Interior Gateway Protocol, three timers affect failure detection: the Advertisement Interval, the Hello Interval and the Router Dead Interval. In the OSPF protocol specification, no specific values for these timers are given. Commonly, the Advertisement Interval is set to about 5 seconds, and the Hello Interval is usually between 10 and 30 seconds, with a resulting Router Dead Interval in the range of 40 seconds to two minutes [3]. Only after the Router Dead Interval has expired is the respective router assumed to be unreachable, and the routes are recalculated. Obviously, such large failure detection times lead to very poor service recovery performance and are intolerable for mission-critical or real-time internet services.

Service restoration relies on the inherent self-healing capabilities of IP routing protocols. The routes are recalculated in one router and advertised to the other routers in the network, which in turn update their routing databases. Service restoration is complete when the routing protocols have converged.

Traffic Engineering and Resilience Provisioning using MPLS

A current effort of the Internet Engineering Task Force is the development of Multi-Protocol Label Switching (MPLS), which integrates layer 3 routing and layer 2 switching functionality [4,5]. MPLS introduces connection-oriented characteristics into IP by replacing the routing of IP packets based on the IP header information with switching based on a short, generic 4-byte label. The technology is independent of the layer 2 technology used, and several implementation proposals have been made, for example for ATM and Frame Relay. The path which an IP packet follows through the network and which is defined by the label is called a Label Switched Path (LSP). Labels may also be stacked, allowing tunneling and nesting of LSPs [4]. An LSP is set up either by a special Label Distribution Protocol (LDP) and its extensions for Constraint-based Routing (CR-LDP), or by a Traffic Engineering extension of the Resource Reservation Protocol (TE-RSVP). A main benefit of MPLS for Internet Service Providers is the ability to introduce traffic engineering concepts into internet network design [6].
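To make the 4-byte label and the stacking of labels more concrete, the following sketch packs and unpacks label stack entries. It assumes the field layout later standardized in RFC 3032 (20-bit label, 3-bit experimental field, 1-bit bottom-of-stack flag, 8-bit TTL), which the text above does not specify; the function names are hypothetical and chosen only for illustration.

```python
import struct


def pack_entry(label, exp=0, bottom=False, ttl=64):
    """Pack one label stack entry into 4 bytes (network byte order).

    Assumed layout (RFC 3032): label(20) | exp(3) | bottom-of-stack(1) | ttl(8).
    """
    if not 0 <= label < 2 ** 20:
        raise ValueError("label must fit in 20 bits")
    if not 0 <= exp < 8:
        raise ValueError("exp must fit in 3 bits")
    if not 0 <= ttl < 256:
        raise ValueError("ttl must fit in 8 bits")
    word = (label << 12) | (exp << 9) | (int(bottom) << 8) | ttl
    return struct.pack("!I", word)


def unpack_entry(data):
    """Decode a 4-byte label stack entry into its fields."""
    (word,) = struct.unpack("!I", data)
    return {
        "label": word >> 12,
        "exp": (word >> 9) & 0x7,
        "bottom": bool((word >> 8) & 0x1),
        "ttl": word & 0xFF,
    }


def pack_stack(labels, ttl=64):
    """Encode a label stack, outermost label first.

    Only the last (innermost) entry carries the bottom-of-stack flag.
    """
    last = len(labels) - 1
    return b"".join(
        pack_entry(label, bottom=(i == last), ttl=ttl)
        for i, label in enumerate(labels)
    )


def unpack_stack(data):
    """Decode entries until the bottom-of-stack flag is seen."""
    entries = []
    for offset in range(0, len(data), 4):
        entry = unpack_entry(data[offset:offset + 4])
        entries.append(entry)
        if entry["bottom"]:
            break
    return entries


# A tunnel label (100) pushed on top of an inner LSP label (200).
stack = pack_stack([100, 200])
print(len(stack), [e["label"] for e in unpack_stack(stack)])
```

Nesting an LSP inside a tunnel then amounts to pushing one more 4-byte entry in front of the existing stack, which is what makes the tunneling mentioned above cheap to implement in the forwarding path.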
With MPLS it is possible to assign different paths through the network to packets with the same source and destination addresses, based for example on their QoS requirements. Using the LSP concept, resilience can also be provisioned in a way similar to classical link restoration or protection switching mechanisms [7]. With link restoration, a new LSP tunnel could be set up after a network failure for a group of failed LSPs to route the traffic around the failed network element. If an end-to-end backup LSP is set up at the same time as the working LSP, path protection can be realized. Several IETF drafts and a framework proposal are being discussed in the MPLS working group.

MPLS offers interesting possibilities for resilience provisioning in IP. Since the concepts have only recently been issued as Internet Drafts and the discussion is ongoing, it remains to be seen which mechanisms are feasible and will be incorporated into the MPLS framework and architecture. A drawback of resilience provisioning using MPLS, however, is that these concepts are only valid within MPLS domains. At this time it is not clear to what extent MPLS will be incorporated into the future IP architecture. Surely, many Internet domains will continue to use pure IP routing such as OSPF. Moreover, since the objective is to provide end-to-end resilience, the interdomain routing protocols interconnecting MPLS and non-MPLS domains must also be taken into consideration.

Aspects of Reliability and Survivability in IP-based networks

In case of a failure in an IP-based network, the affected traffic may be recovered in multiple layers. In the physical layer, transmission technologies like SDH/SONET, ATM or WDM offer fast protection and restoration mechanisms, which can recover the disrupted connection within 50 ms of failure detection. In IP, service restoration requires the convergence of routing protocols like OSPF or BGP, which takes considerably longer, i.e. several seconds or more.

Multilayer Resilience and Layer Interworking

The presence of resilience mechanisms in both the IP layer and the lower layers leads to some new multilayer issues. When deciding on an internet architecture, the advantages and disadvantages of resilience mechanisms in the IP layer and in lower layers have to be weighed, as well as the question in which layer resilience mechanisms will most probably reside.

• Resilience mechanisms present in lower layers only: The advantage of this scenario is that well-established resilience mechanisms with short recovery times, like Automatic Protection Switching (APS) or Multiplex Section Protection (MSP), can be utilized. The connection-oriented network technologies have a sophisticated network management with performance monitoring and failure management, which allows fast and efficient detection and isolation of failures and recovery of disrupted traffic. The main drawback regarding IP services is that resilience in lower layers can only be offered at a very coarse granularity and with low protection flexibility. It is not possible to protect individual services, but only connections at STM-1 level or higher bandwidth, which results in a high cost for the resilience. A second drawback is that failures in the IP layer, like a router outage, cannot be resolved in the lower layers. The highly competitive market will lead ISPs and NSPs to use mainly unprotected connections for Internet services and to rely on IP rerouting for IP service recovery.

• Resilience mechanisms present in IP layer only: In this scenario, resilience mechanisms are employed in the IP layer, and no recovery is done in the lower layers. With mechanisms such as MPLS, the recovery performance will be of the same order as that of, for example, SDH protection switching mechanisms. An advantage is that resilience can be employed selectively for the individual services requiring such high availability. Such differentiated resilience leads to a cost advantage in comparison to the previous scenario. The unprotected services with lower resilience requirements will be rerouted by the routing protocol in use and may experience a graceful degradation of their service quality. A drawback is that, in order to protect native traffic in the lower layers, these layers must be able to offer a certain degree of protection selectivity. That is, the native traffic has to be protected, while at the same time unprotected traffic is carried. Another drawback of protection in the IP layer is that in case of a failure at the physical layer (e.g. a cable break), a large number of individual flows must be recovered in the IP layer. The lower layer could recover from such failures much more easily.

Both scenarios have their advantages and drawbacks. Moreover, in a real network it is often not possible to ensure that resilience mechanisms are present exclusively in a single layer. In many cases, different mechanisms will operate at the different layers in parallel. Ideally, this would make it possible to exploit the advantages of both layers: a fast recovery from physical failures and the additional recovery from failures in the IP layer. However, the presence of resilience mechanisms in multiple layers leads to contention between the mechanisms. This contention may lead to sub-optimal recovery, in some cases with the mechanisms locking each other or causing routing instabilities. Additionally, spare resources and available network bandwidth could be used inefficiently. Therefore, this multilayer resilience scenario requires interaction between, or at least awareness of, the resilience strategies present in the different layers.

Well-defined signaling interfaces between IP and the lower layers would allow failure signaling and interworking between the layers. For example, IP routing mechanisms could react directly to operation, administration and maintenance signals from the lower layer, either delaying or triggering the rerouting process. In the simplest case, such an interworking could be a configurable hold-off time for the mechanisms in one layer, allowing the other layer to complete its recovery process. In [8] a framework for multilayer interworking is introduced, which can also be mapped to IP-based multilayer networks.

Failure detection and notification

To increase overall IP resilience, the first requirement is to speed up the detection of failures. As shown in a previous section, failure detection in IP is too slow for services requiring a high degree of resilience. Interlayer failure signaling is proposed as a simple and helpful form of layer interworking that reduces the failure detection time. Failure detection could be improved if the failure is detected at the physical layer and notified to the IP router. In transmission networks like ATM, SDH or WDM, failure detection and propagation within a layer and between the different technologies is well defined. The failures detected at the physical interfaces of IP routers could be signaled to the router software and processed by the routing protocols. Upon receiving a lower-layer alarm, the IP router need not wait for the expiration of multiple hello intervals, but can immediately set the state of the network element identified by the alarm signal to failed. Subsequently, the router recalculates the routes affected by the failed network element (either a link or a node) and advertises the route changes to other routers. Thus, the use of defect signaling from the physical layer to the IP layer may drastically decrease the defect detection time, to the order of one millisecond.
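The difference between hello-based detection and lower-layer alarm signaling can be illustrated with a small simulation. This is only a sketch: the class and function names are hypothetical, and the 10 s Hello Interval, 40 s Router Dead Interval and 1 ms alarm latency are assumed values taken from the ranges quoted in the text, not from any routing implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NeighborMonitor:
    """Tracks the reachability of one adjacent router (times in seconds)."""

    dead_interval: float = 40.0            # assumed Router Dead Interval
    last_hello: float = 0.0                # arrival time of the latest hello
    forced_down_at: Optional[float] = None  # set by a lower-layer alarm

    def on_hello(self, now: float) -> None:
        """A hello message restarts the dead timer and clears any alarm."""
        self.last_hello = now
        self.forced_down_at = None

    def on_lower_layer_alarm(self, now: float) -> None:
        """Treat a physical-layer alarm as a forced timer expiration."""
        if self.forced_down_at is None:
            self.forced_down_at = now

    def declared_down_at(self) -> float:
        """Time at which the neighbor is declared unreachable."""
        timer_expiry = self.last_hello + self.dead_interval
        if self.forced_down_at is None:
            return timer_expiry
        return min(timer_expiry, self.forced_down_at)


def detection_delay(failure_time, hello_interval=10.0, dead_interval=40.0,
                    alarm_latency=None):
    """Delay between the failure and its detection by the router."""
    monitor = NeighborMonitor(dead_interval=dead_interval)
    # Hellos arrive at t = 0, hello_interval, ... up to and including
    # the instant of failure; afterwards the neighbor is silent.
    t = 0.0
    while t <= failure_time:
        monitor.on_hello(t)
        t += hello_interval
    if alarm_latency is not None:
        monitor.on_lower_layer_alarm(failure_time + alarm_latency)
    return monitor.declared_down_at() - failure_time


print(detection_delay(25.0))                       # hello-based: 35.0
print(detection_delay(25.0, alarm_latency=0.001))  # alarm-based: about 0.001
```

In this model the alarm does not change the routing protocol's state machine at all; it only moves the instant at which the existing dead timer is considered expired.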
An advantage of this method is that only very simple changes have to be made to the routing protocols themselves. The failure signal could be interpreted as a forced timer expiration of the router reachability protocol. In OSPF, this would be for example the Router Dead Interval. In BGP and IDRP, a Holding Time is defined, which monitors the timely reception of successive Keepalive or Update messages.

Once a failure is detected and notified to the routing protocol, the affected traffic and services are restored either using the intrinsic self-healing capabilities of the existing IP routing mechanisms or with advanced rerouting strategies. Rerouting is done by recalculating the routes and advertising the new routes to neighbouring routers. Even with a decreased failure detection time, the convergence of the routing protocols may still take several seconds. To increase the resilience of IP, the rerouting performance, i.e. the calculation of alternative routes and the reconfiguration of the routing tables in the router, should thus be improved. This could be achieved by using routing protocols which maintain multiple routes to the same destination. Then, in case of a failure, the IP traffic could be switched from the primary path to an alternative path.

Signaling Issues and Extended QoS

Provisioning an alternative, disjoint path for a flow with resilience requirements results in additional management entries, or in an increased virtual load in the network if bandwidth has to be statically reserved on the alternative path. Thus resilience should only be provided if the application needs it, which requires a signaling method between the application and the network. On closer inspection it becomes obvious that the resilience requirements of individual applications are orthogonal to their “classical” quality-of-service requirements (bandwidth, delay, delay jitter):

                              Application requires resilience
                              yes                                    no

Application        yes        mission-critical VoIP and              “normal” VoIP
requires QoS                  video streaming

                   no         database transactions,                 e-mail, file transfer
                              mission-critical control terminals

Some examples:

• Remote database transactions: Delay jitter and bandwidth fluctuations are less critical, but the database is commonly locked during a transaction in order to avoid consistency problems. If the current transaction is interrupted by a link or node failure, other transactions stay locked out until a time-out occurs. Thus the number of possible transactions per time interval is reduced, and so is the turnover in a commercial application. It has to be mentioned that any failure between the database and the current customer leads to this problem; it is not limited to failures on the access link to the network of the database host.

• Mission-critical voice-over-IP: Calls covering financial matters and discussions between business executives do not tolerate interruptions well. They require both quality-of-service and resilience.

• Other multimedia-over-IP applications: Quality-of-service may be required depending on the user's wishes, but resilience might not be necessary, given that it comes at additional cost and network failures are expected to have a very low probability.
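The two-by-two classification above can be written down as a small data structure. This is only an illustrative sketch; the class name, field names and description strings are hypothetical, and the example entries simply mirror the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtendedQoS:
    """Two independent requirement flags, as in the table above."""

    needs_qos: bool          # bandwidth / delay / jitter guarantees
    needs_resilience: bool   # fast recovery after a network failure

    def describe(self) -> str:
        qos = "QoS" if self.needs_qos else "best effort"
        res = "protected" if self.needs_resilience else "unprotected"
        return f"{qos}, {res}"


# Example applications taken from the table; the mapping itself is an
# illustration, not a normative classification.
EXAMPLES = {
    "mission-critical VoIP": ExtendedQoS(True, True),
    "normal VoIP": ExtendedQoS(True, False),
    "database transactions": ExtendedQoS(False, True),
    "e-mail": ExtendedQoS(False, False),
}


def by_class(examples):
    """Group application names by their extended QoS requirement."""
    groups = {}
    for app, requirement in examples.items():
        groups.setdefault(requirement, []).append(app)
    return groups


for requirement, apps in by_class(EXAMPLES).items():
    print(f"{requirement.describe():24} {', '.join(apps)}")
```

Because the two flags are independent, all four combinations occur; neither requirement can be derived from the other.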

This observation of orthogonality leads to the concept of an extended quality-of-service: the combination of the commonly discussed quality-of-service in terms of bandwidth and delay with the resilience requirements of the application. Thus the appropriate way to signal the resilience requirements is to include them in the quality-of-service signaling between the application and the network. Corresponding to the different quality-of-service approaches (IntServ, DiffServ), this could be done either on a per-flow or on a per-packet basis:

• Signaling per flow and end-to-end: The RSVP message formats are extended so that the end user’s terminal can signal a resilience requirement to the network in addition to the bandwidth requirement. The network then additionally reserves an alternative, disjoint route for this flow and switches the flow to this route in case of a link or network element failure. Note: If no purely QoS-oriented RSVP path has been identified in advance, the network has to determine the path the packets would take under non-failure conditions before the disjoint path can be identified. This could be achieved by observing the flow of RSVP messages.

• Signaling per packet and hop-by-hop: The network management or a special resource control establishes a set of pre-defined routes in advance, together with the reservation of the corresponding bandwidth, according to the estimated or negotiated (by service level agreements) amount of traffic having resilience requirements. The packets with resilience requirements are then marked, either by the application or by the edge device, when they enter the DiffServ network. For this marking, one of the bits in the TOS field of the IP header could be used. In case of a failure of either a link or a network element, the network then switches only those packets to an alternative path that have the resilience bit set in their headers.

Conclusion and Outlook

New internet services with increased resilience requirements are already being offered or are currently emerging. Real-time services cannot tolerate long outages, and global e-commerce and mission-critical internet services require maximum availability and minimum network outage times. MPLS is an example where such resilience requirements are taken into account directly in the development of a new internet protocol. In future, resilience mechanisms in IP protocols with a recovery performance similar to that of SDH or SONET will be available.

The requirement for fast resilience mechanisms in IP raises new issues for IP-based networks. One issue concerns multilayer resilience and the interworking between the IP layer and the lower-layer network. A closer integration and a higher mutual awareness of the involved layers is proposed. Failure signaling from the physical layer to the IP layer could decrease the failure detection time in the IP layer, reducing the service outage. Additionally, an extension of the quality-of-service definition could allow different services to signal their resilience requirements, thus enabling differentiated resilience for individual services.

References

[1] ANSI T1 Committee Report, "A Technical Report on Reliability and Survivability Aspects of the Interactions Between the Internet and the Public Telecommunications Network", Report No. 55, October 1998.
[2] David Passmore, "Scaling Large E-Commerce Infrastructures", Packet Magazine Archives, Cisco Systems, 3rd Quarter 1999. (http://www.cisco.com/warp/public/784/packet/july99/1.html)
[3] Kevin Washburn, Jim Evans, "TCP/IP - running a successful network", Addison-Wesley, England, 1997.
[4] R. Callon et al., "A Framework for Multiprotocol Label Switching", IETF Internet draft, work in progress, September 1999.
[5] E. Rosen et al., "Multiprotocol Label Switching Architecture", IETF Internet draft, work in progress, August 1999.
[6] George Swallow, "MPLS Advantages for Traffic Engineering", IEEE Communications Magazine, Vol. 37, No. 12, December 1999.
[7] Thomas M. Chen, Tae H. Oh, "Reliable Services in MPLS", IEEE Communications Magazine, Vol. 37, No. 12, December 1999.
[8] P. Demeester et al., "Resilience in Multilayer Networks", IEEE Communications Magazine, Vol. 37, No. 8, August 1999.
