Failure Detection and Notification in GMPLS Control Plane
Pawel Rozycki and Janusz Korniak
University of IT and Management, Rzeszow, Poland
Email: [email protected]
Abstract—The paper is concerned with signaling related to failure detection and notification in a GMPLS-based control plane. Limitations of the current methods are discussed and some improvements are proposed. The notification delay issue is investigated by using simulation methods. The simulation results confirm validity of the proposed enhancements.
I. INTRODUCTION
The architecture of the control plane with separated GMPLS planes was discussed in many works, for example [1], [2]. Protection procedures for GMPLS traditionally include the following steps:
• failure detection,
• failure localization,
• hold off,
• notification,
• recovery operation,
• traffic recovery (switchover),
• control plane state recovery.
Each step contributes its own delay to the overall protection delay, which is one of the most important factors in providing high-quality services. Most of these delays depend on the protocols and procedures used in the control plane. The exceptions are the hold-off time and the switchover time (dependent mainly on techniques used in lower layers, e.g., the data link layer). Separation of the planes complicates failure detection, especially in the case of the asymmetrical architecture. A typical method such as exchanging Hello messages between neighbors is not efficient for the asymmetrical architecture. This issue is discussed in detail in Section II. At the same time, separation of the planes offers new possibilities for notification improvements, especially in the context of the asymmetrical architecture. Section III shows that the notification delay may be one of the most important factors in providing high-quality services and achieving GMPLS network survivability. The paper suggests and discusses multiple solutions for the improvement of failure detection and notification procedures. Some of these solutions have been verified and analyzed by simulations.
Andrzej Jajszczyk AGH University of Science and Technology Krakow, Poland
II. FAILURE DETECTION IN CONTROL PLANE BY RSVP HELLO
The Hello extension for RSVP has been proposed in RFC3209 [3]. In the case of in-band signaling this mechanism can provide failure detection for both the data and control planes. For out-of-band signaling this method can be used for control plane failure detection. There is one important difference between in-band and out-of-band signaling. In the case of in-band signaling the RSVP messages are sent to direct neighbors. For example, to set up an LSP tunnel the Path message is sent hop-by-hop along the planned path to the egress router. Therefore, failure detection by exchanging Hello messages between direct neighbors is effective. In the case of out-of-band signaling this method can be ineffective, especially for the asymmetrical architecture.
There are two situations in which exchanging Hello messages between direct neighbors is not efficient. The first case occurs when RSVP nodes are separated by a non-RSVP host. Failure detection is limited in this case because the TTL is set to 1. The second case occurs when RSVP nodes are separated by other RSVP nodes. This situation is presented in Figure 1. If the LSP is created through LSR1 and LSR3, signaling messages are passed through N1, N2 and N3. Therefore, the failure of the link N2–N3 cannot be discovered by N1, because Hello packets are still exchanged between N1 and N2.
There are two possible solutions to this problem. The first one is not to use Hello for failure detection in the control plane. However, reliability of the control plane cannot be achieved without failure detection, so other methods should be used. For example, the Link Management Protocol [4] can replace the RSVP Hello mechanism. The second solution is to allow multi-hop Hello exchange. This can be realized over a direct logical connection created by a technology such as IP or MPLS tunneling.
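The difference between single-hop and multi-hop Hello in the second case above can be illustrated with a minimal sketch. It assumes a three-node control-plane chain N1, N2, N3 (our reading of the Figure 1 example) and models a Hello session as failed when the peer cannot be reached within a hop limit; the function names and the hop limit of 5 are illustrative assumptions, not part of RSVP.

```python
from collections import deque

def reachable(links, src, dst, max_hops):
    """Return True if dst can be reached from src within max_hops working links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    frontier = deque([(src, 0)])
    seen = {src}
    while frontier:
        node, hops = frontier.popleft()
        if node == dst:
            return True
        if hops == max_hops:
            continue  # hop limit reached (TTL exhausted), do not expand further
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return False

def failed_hello_sessions(working_links, hello_peers, max_hops):
    """Hello sessions (pairs of nodes) whose peer is no longer reachable."""
    return [(a, b) for a, b in hello_peers
            if not reachable(working_links, a, b, max_hops)]

# Hypothetical chain N1 - N2 - N3 after the link N2-N3 has failed.
links_after_failure = [("N1", "N2")]

# Single-hop Hello (TTL 1): N1 only talks to N2, so nothing is detected.
single_hop = failed_hello_sessions(links_after_failure, [("N1", "N2")], 1)

# Multi-hop Hello between the signaling peers N1 and N3 detects the failure.
multi_hop = failed_hello_sessions(links_after_failure, [("N1", "N3")], 5)
```

In this model a multi-hop session survives a link failure whenever an alternative control-plane route within the hop limit exists, which matches the idea of monitoring the signaling tunnel rather than a single link.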
Fig. 1. Examples of asymmetrical architecture
The use of RSVP Hello for failure detection in the control plane requires additional modification of the signaling procedures triggered by failures in this plane. The failure of a link in the control plane does not have to result in a switch-over of the data plane traffic. For example, the signaling traffic can be protected, which improves the reliability of the control plane. In order to allow time for protection or convergence, a solution such as the graceful restart mechanism described in [6] is proposed. This procedure makes it possible to avoid an unnecessary switch-over in the data plane. After the control plane fails, the restart timer is activated. During this period the node waits for the establishment of a new session with the neighbor. State information of both the control and data planes is held. In this period, the data plane traffic can be sent successfully but signaling traffic is affected by the failure. Different types of messages may be sent towards the failed link. For example, if an LSP tunnel needs to be released, the Path Tear message is sent and cannot reach the egress router. Similarly, if a new LSP has to be established along a path for which the control link has failed, messages cannot reach the destination (egress router). The protection of the control plane plays an important role in this situation. However, because establishing a backup signaling tunnel takes a finite time, some messages can be lost and can cause a switch-over in the data plane. Therefore, error notification should include information indicating temporary signaling problems. The ingress router can use this information to decide whether to wait for a backup signaling tunnel or to start protection procedures in the data plane. Control plane reliability with detection by RSVP Hello is discussed in Section IV.
III. NOTIFICATION
Another important step in the process of failure neutralization is notification. Reliable and effective methods for transferring such information are a key element of a fast reaction to a failure and of starting a protection procedure. It is also important to note that, for a given connection, the delay of LSP establishment ($t_{Path}$) is not as important for service survivability as the protection delay, so the notification delay (together with the detection delay and other delays associated with the protection mechanism) should be treated as one of the most important factors.
A notification should be generated in the case of a data plane failure and can be triggered by:
• data plane failure detection (for both node and link failures), performed by data plane devices such as OXCs (e.g., Loss of Light) or by dedicated signaling procedures such as those proposed in LMP [4],
• control plane node failure detection, when the restart timer of the Restart_Cap object has expired and the connection has not been reestablished (this means that the control plane module of this node has failed or that no alternative path to the neighbor exists, which should be treated as a failure of the whole node: both the control-part node and the corresponding data-part node).
In the GMPLS environment, with separated functional planes and out-of-band signaling, notification messages should also be generated in the case of a control plane failure to support control plane protection procedures, especially when signaling in the unassociated mode is used. In this case standard failure detection, based on exchanging Hello messages between neighbors, is not sufficient to detect an RSVP session failure when the neighbors, in the context of signaling, are not directly connected. Therefore, some improvements are needed, as discussed in the previous section. Notification may be triggered by the node which detects the failure. The ingress or egress node of the signaling channel may detect the failure based on RSVP Hello sessions with other nodes working on the same interface, or based on LMP Hello [4] failure detection (the suggested failure detection timer should be set to 500 ms for directly connected nodes and to even higher values for signaling in the unassociated mode). The following strategies for informing other nodes about a failure are known:
• per-failure: the notification method described in [5], based on fast flooding of failure information,
• per-LSP: the notification method based on sending signaling messages for each affected LSP. This method, or rather a per-GMPLS-tunnel variant of it, is suggested by the GMPLS standards track [6].
The per-failure mechanism potentially has several advantages, such as a smaller number of notification messages required to inform all nodes about a failure, especially in the case of a failure of a link or node which carries a large number of LSPs. Such a mechanism may be implemented using the LMP protocol by adding some new functionality [7]. This method, however, duplicates the normal task of routing protocols and may dramatically increase the total network traffic. Note also that there is no requirement to send a rapid notification about the fault to all nodes in the given domain, but only to the nodes which participate in the protection mechanisms, such as the ingress node. The rest of the nodes should be informed about the failure by routing mechanisms. The per-LSP notification mechanism follows this approach. The per-LSP notification may, however, be implemented in several ways. The commonly used implementation is based on signaling protocols such as CR-LDP or RSVP-TE, the latter being described in this paper. Other implementations may be associated with underlying, technique-specific operations and management (OAM) mechanisms for both in-band and out-of-band signaling [8]. Note, however, that for all-optical networks, e.g., based on
lambda- or fiber-switching, the standard OAM mechanisms (typically implemented using SDH or OTN header fields) may be limited or unavailable, so mechanisms based on the out-of-band signaling protocols should be defined as the generalized method of notification available in the common control plane.
As mentioned in [9], in the case of end-to-end path protection the Failure Indication message should be sent to the source of the LSP. The path by which this message is transmitted may not be the same as the path of the LSP. This message may contain information about more than one affected LSP. It should be acknowledged by the Failure Acknowledgment message. Next, the LSP source sends the Switchover Request message to the LSP destination. This message contains the ID of the affected LSP. In response, the LSP destination sends a positive or negative Switchover Response message.
In general, RSVP supports two mechanisms for failure notification:
• transfer of the Path Error/Resv Error message, where the path of transmission is the same as the LSP path,
• transfer of the Notify message.
Using the Notify message for notification should be faster because IP packets containing this message are not forced to follow the LSP path. The performance difference between these two options is shown in the simulations.
A. Control Plane Architecture
The type of signaling method (in-fiber or out-of-fiber) and the architecture used (symmetrical or asymmetrical) may be of great importance for the efficiency of the notification mechanism, especially for the notification delay (the time needed to transfer a failure notification). The asymmetrical architecture of GMPLS planes may be the result of using out-of-band signaling. In this case the control plane can be a separate IP network and the topologies of the control plane and the data plane may be different.
This means, in particular, that nodes adjacent in the data plane may not be directly connected in the control plane. In general, the time needed to notify the LSP source about a failure is the sum of the message propagation times along the notification path and the message processing times inside each node. The notification delay is, therefore, described by the following equation:
$$t_N = \sum_{l} t_{prop,l} + \sum_{n} t_{proc,n} \qquad (1)$$
where the sums run over the links $l$ and the nodes $n$ on the notification path. In the case of the symmetrical architecture (without any routing intelligence, for example for in-band signaling) the notification is transferred along the same path as the LSP, so the delay, denoted by $t_{NP}$, is similar to, but a bit smaller than, the time needed to transfer the Path message, denoted by $t_{Path}$ (there are no intermediate nodes where the message must be processed). Note that $t_{Path}$ may be treated as a reference time. In the case of the symmetrical architecture (even in the case of in-band signaling) it is possible to route notification messages through a path that is not the same as the LSP path. This requires some routing intelligence. In such a case the notification delay, denoted by $t_{NS}$, should be smaller than $t_{NP}$. In the asymmetrical architecture, however, where the topology of the control plane is not the same as the topology of the data plane, shorter paths from a given node to the LSP source may exist. In such a case the notification delay $t_{NA}$ may be smaller than $t_{NS}$. The maximum notification delay for a given LSP can be defined as follows:
$$t_N^{max} = \max\{t_{N1}, t_{N2}, \ldots, t_{N(n-1)}, t_{Nn}\} \qquad (2)$$
This notification delay may be calculated by the ingress node based on routing information, so during selection of a path for a new LSP the ingress node is able to take this parameter into account and calculate a path for the new LSP that obtains a minimal $t_N^{max}$ at an acceptable cost.
B. Implementation of Notification for RSVP
The notification mechanism was originally proposed for optical networking in [10]. The authors of this draft suggest creating a new RSVP message called Notify. This message was modified during work on the GMPLS standards track in [6]. In this implementation the Notify message is used to inform non-adjacent nodes about any LSP-related event. This message contains the ERROR_SPEC object with the address of the node that detects the failure and a list of affected RSVP sessions associated with the broken LSP [11]. As mentioned earlier, each Failure Indication message should be acknowledged by sending a Failure Acknowledgment message. According to [6], this may be implemented by using the Ack message defined in RFC2961 [12], which carries MESSAGE_ID_ACK. This object is associated with MESSAGE_ID, described in the same document and referred to as the MESSAGE_ID Extension. This set of procedures is defined for neighboring RSVP nodes, but may easily be applied between remote nodes as well. The ingress node which receives the notification message and is able to switch a given LSP to the backup path should send a notification message, called the Switchover Request message, to the egress node. This message may also be implemented by the Notify message, but it should be extended with additional flags that identify the failed LSP. This message should be acknowledged by the Ack message, too.
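As a worked illustration of equations (1) and (2), the following minimal sketch computes the notification delay of a path from per-link and per-node delays and then takes the maximum over several reporting nodes. The 4 ms link delay and zero processing delay mirror the simulation settings of Section IV; the four-node example and the reading of $t_{Ni}$ as the delay from the i-th node of the LSP are assumptions made only for this sketch.

```python
def notification_delay_ms(link_delays_ms, node_processing_ms):
    """Equation (1): propagation over links plus processing inside nodes."""
    return sum(link_delays_ms) + sum(node_processing_ms)

def max_notification_delay_ms(delays_per_node_ms):
    """Equation (2): worst case over the nodes that may report a failure."""
    return max(delays_per_node_ms)

LINK_MS = 4.0  # control-plane link delay used in the simulations

# Hypothetical LSP: node i is i control-plane links away from the ingress,
# and processing delays are neglected (0 ms per node).
delays = [
    notification_delay_ms([LINK_MS] * hops, [0.0] * hops)
    for hops in (1, 2, 3, 4)
]
worst = max_notification_delay_ms(delays)
```

An ingress node could compare `worst` across candidate paths when it selects a route for a new LSP, as suggested in the text.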
IV. SIMULATIONS
A. Simulation Tool and Used Network Topology
To quantitatively analyze notification performance and the efficiency of multi-hop RSVP Hello, several simulation scenarios have been defined. The simulations were prepared using the Network Simulator (NS-2) environment [13], with an additional patch [14] and extensions prepared by the authors of [15]. These extensions allow us to simulate GMPLS behavior with out-of-band signaling and the asymmetrical architecture. The following functionality has been added:
• LSP setup, modification and release procedures for out-of-band signaling,
• control plane failure detection by RSVP Hello,
• out-of-band link-state routing,
• notifications.
All implemented procedures have been tested and verified. All simulations are based on the same GMPLS network with separated control and data planes. The topology of the simulated network is presented in Figure 2. The data plane contains 24 nodes interconnected by 29 bidirectional links, each of 155.52 Mb/s and 2 ms delay. The control plane may be configured as symmetrical or asymmetrical. Nodes in this plane are connected by 2 Mb/s bidirectional links with 4 ms delay. In the NS-2 environment, as in most other network simulators, the message processing delays are omitted and their values are treated as included in the propagation delays.
Fig. 2. Network topology used in the simulation
B. Failure Detection
In this section the efficiency of multi-hop Hello is measured. The main task of the simulation is to show that the multi-hop Hello mechanism and the modified signaling procedures used by RSVP for failure detection in the control plane improve the survivability of the network. Two main criteria are used to measure these improvements: the number of user data packets lost and the number of switch-overs in the data plane. Three different simulation variants are compared. Table I shows the assumptions made in these variants. Variant A assumes a failure-free control plane. Switch-overs are caused only by link failures in the data plane. Variant B assumes an unreliable control plane. Failures are detected in the control plane by RSVP Hello messages. Only direct neighbors exchange these packets (multi-hop Hello equal to 1). After a link failure is detected, RSVP immediately sends messages to the ingress router to activate traffic switch-over in the data plane. The switch-over in the data plane can be caused by a link failure in either the data or the control plane. Variant C is similar to B, but multi-hop exchange of Hello messages is allowed. In this case, messages including the signaling problem indicator are sent to the ingress router. However, this does not initiate the switch-over procedure for the data plane traffic. In both the B and C variants the control plane is protected by rerouting. Table II presents common parameters for all simulation variants.
TABLE I
DATA OF THREE SIMULATION VARIANTS
                                            A    B    C
Number of link failures in control plane    0    8    8
Number of hops for RSVP Hello               1    1    5
TABLE II
COMMON SIMULATION DATA
Simulation time                             60 s
Number of data flows                        30
Start time of flows                         random, uniform distribution
Duration of flows                           random, normal distribution
Number of link failures in data plane       8
Link down time                              random, uniform distribution
Duration of link down state                 random, normal distribution
Protection of data plane                    preallocated
Control plane: Link down time               random, uniform distribution
Control plane: Duration of link down state  random, normal distribution
Protection of control plane                 by rerouting
Restart time                                200 ms
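One way to obtain a failure pattern that is identical for every simulation variant, with the distributions listed in Table II, is to draw it once from a seeded random generator. The sketch below is only an illustration in Python (the simulations themselves used NS-2); the mean and standard deviation of the down-state duration, the seed and the link list are hypothetical values that the paper does not specify.

```python
import random

def make_failure_pattern(seed, links, n_failures, sim_time_s,
                         mean_down_s, sd_down_s):
    """Draw (link, down_time, duration) tuples once, reproducibly."""
    rng = random.Random(seed)  # private generator, so the seed fixes the pattern
    pattern = []
    for _ in range(n_failures):
        link = rng.choice(links)
        down_time = rng.uniform(0.0, sim_time_s)                 # uniform start
        duration = max(0.0, rng.gauss(mean_down_s, sd_down_s))   # normal duration
        pattern.append((link, down_time, duration))
    return sorted(pattern, key=lambda failure: failure[1])

# Hypothetical inputs: three links, 8 failures in a 60 s run,
# down-state duration of 2 s on average with 0.5 s standard deviation.
example_links = [(0, 1), (1, 2), (2, 3)]
pattern_a = make_failure_pattern(42, example_links, 8, 60.0, 2.0, 0.5)
pattern_b = make_failure_pattern(42, example_links, 8, 60.0, 2.0, 0.5)
```

Because both calls use the same seed, `pattern_a` and `pattern_b` are identical, so every variant can be driven by the same failures.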
First, the flow and failure patterns were generated, and then each variant of the simulation used this data. The source and destination of traffic flows are randomly chosen from the edge nodes. The flows are generated by a CBR traffic generator which sends streams of packets at a constant bit rate (1 Mb/s is assumed) using the UDP protocol. The simulation schedule assumes 0.2 s for LSP establishment before the traffic starts. The backup LSP is requested 0.2 s after the traffic starts. When the traffic flow is terminated the primary and backup LSPs are released. Table III summarizes the results of the simulation.
TABLE III
SIMULATION RESULTS
                                                    A      B      C
Number of switch-overs                              6      15     6
Number of lost packets during switch-overs          17     17     17
Number of lost packets due to signaling problems    635    1394   549
The results show that multi-hop Hello exchange for failure detection is very beneficial for the overall network reliability. The results in Table III show similar values for variants A and C despite the 8 failures in the control plane in variant C. This is possible due to the following two factors:
• the signaling error indicator prevents an immediate switch-over of data plane traffic,
• multi-hop RSVP Hello exchange allows detection of a new signaling tunnel and enables the refresh procedure.
Variant B of the simulation shows the inefficiency of the single-hop RSVP Hello mechanism. The highest number of lost packets and switch-overs for variant B confirms this statement. In all variants 17 packets are lost during the switch-over procedure. Other packets, in variants A and C, are lost because traffic needs to be switched to the backup LSP while it is not yet established. In variant B, more switch-overs and lost packets are observed for two reasons: the one that applies to variants A and C, and errors in primary LSP establishment.
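The behavior that separates variants B and C can be summarized as a small decision rule at the ingress router. The sketch below is a hypothetical simplification, not the RSVP procedure itself: the function name, its arguments and the treatment of an expired restart timer as a trigger for switch-over are assumptions based on the description in Sections II and III.

```python
from enum import Enum

class Action(Enum):
    SWITCH_OVER = "switch over to backup LSP"
    WAIT = "wait for backup signaling tunnel"

def ingress_reaction(data_plane_failed, signaling_problem_indicated,
                     restart_timer_expired):
    """Decide how the ingress reacts to a received error notification."""
    if data_plane_failed:
        # A real data plane failure always needs protection switching.
        return Action.SWITCH_OVER
    if signaling_problem_indicated and not restart_timer_expired:
        # Variant C: the error is marked as a temporary signaling problem,
        # so data plane traffic is left untouched while signaling recovers.
        return Action.WAIT
    # Variant B (no indicator), or the restart timer ran out.
    return Action.SWITCH_OVER
```

For example, `ingress_reaction(False, True, False)` yields `Action.WAIT`, whereas the same control plane failure reported without the indicator leads to an unnecessary switch-over, as observed for variant B.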
TABLE IV
SCENARIOS PARAMETERS
Scenario   Notify MSG   Architecture
A          Not used     Asymmetrical, without 5 links
B          Used         Asymmetrical, without 5 links
C          Not used     Asymmetrical, without 2 links
D          Used         Asymmetrical, without 2 links
E          Not used     Symmetrical
F          Used         Symmetrical
G          Not used     Asymmetrical, with 2 additional links
H          Used         Asymmetrical, with 2 additional links
I          Not used     Asymmetrical, with 6 additional links
J          Used         Asymmetrical, with 6 additional links
C. Notification Performance
To verify the performance of the notification mechanisms, several scenarios based on the same data plane topology have been prepared. The main criterion for performance verification is the notification delay, defined as the period of time between failure detection (and sending the notification) and reception of the notification about the failure by the ingress node. As in the simulations in the previous section, the traffic and failure pattern is the same for all scenarios and was generated with the following parameters:
• traffic: 50 LSPs with preallocated path protection,
• failures: 10 broken links in the data plane.
Table IV shows the variable parameters used in the scenarios. The scenarios differ in two properties. Support for the Notify message is the first parameter. If this option is set, the notification may be obtained by receiving either the Notify message or the Path Error message; otherwise only the Path Error message is used. The performance of these two notification options is checked for various control plane architectures in terms of the symmetry of the functional planes. These architectures can be, therefore, symmetrical or asymmetrical. This factor is used as the second variable of the scenarios. Unfortunately, the asymmetry may have many variants and degrees, and one common coefficient is hard to define. The asymmetry factor is, therefore, specified by description. In the simulation scenarios the following modifications have been introduced in the control plane topology shown in Figure 2:
• scenarios A and B: removed links 0–1, 4–5, 8–9, 7–12 and 2–13;
• scenarios C and D: removed links 2–3, 6–7;
• scenarios G and H: added links 3–6, 2–7;
• scenarios I and J: added links 3–6, 2–7, 7–10, 3–12, 2–15 and 6–13.
The results are presented in Figures 3 and 4. Figure 3 shows the maximum notification delay for a given configuration.
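The effect of adding or removing control plane links on the two notification options can be illustrated with a small sketch. It does not reproduce the topology of Figure 2: the six-node ring, the LSP and the modified links are hypothetical. The model assumes that a Path Error message travels hop-by-hop between consecutive LSP nodes (each hop routed over the control plane), that a Notify message takes the shortest control plane route to the ingress, and that every control plane link adds 4 ms with no processing delay.

```python
from collections import deque

LINK_DELAY_MS = 4.0  # control-plane link delay used in the simulations

def hop_count(links, src, dst):
    """Shortest control-plane distance in links (breadth-first search)."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    raise ValueError("no control-plane path between %r and %r" % (src, dst))

def notify_delay_ms(links, detecting_node, ingress):
    """Notify is routed directly to the ingress over the control plane."""
    return hop_count(links, detecting_node, ingress) * LINK_DELAY_MS

def path_error_delay_ms(links, lsp_path, detecting_node):
    """Path Error visits every upstream LSP node on its way to the ingress."""
    upstream = lsp_path[: lsp_path.index(detecting_node) + 1]
    hops = sum(hop_count(links, b, a) for a, b in zip(upstream, upstream[1:]))
    return hops * LINK_DELAY_MS

# Hypothetical symmetrical control plane: a ring of six nodes.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
lsp = [0, 1, 2, 3, 4]  # ingress is node 0, failure detected by node 4

symmetric = (path_error_delay_ms(ring, lsp, 4), notify_delay_ms(ring, 4, 0))
with_extra_link = notify_delay_ms(ring + [(4, 0)], 4, 0)
without_link = path_error_delay_ms([l for l in ring if l != (1, 2)], lsp, 4)
```

In this toy case the added link shortens only the Notify route, while removing a link on the LSP route lengthens the Path Error route, which is the qualitative behavior reported for the simulated scenarios.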
Fig. 3. Maximum notification delay
If the Notify message is disabled, the maximum notification delay stays more or less constant for the network with additional links and is the same as for the symmetrical architecture (scenarios E, G and I, with the maximum notification delay equal to 31.1 ms, 31.6 ms and 31.2 ms, respectively). If the asymmetry is caused by removing some links, the notification delay increases and for the network with five removed links reaches 51.1 ms (scenario A). For the network where the Notify message may be used, the notification delays are shorter. Moreover, this value decreases for the architecture with additional links and in the presented results oscillates around 20 ms. This happens because Notify messages are transmitted based on the control plane routing, so each additional link in this plane potentially allows a shorter path to the destination (the ingress node in this case) to be found. On the other hand, the delay increases for networks where some links are removed. Figure 4 presents mean notification delays for working LSPs, for backup LSPs and the overall mean for all LSPs. It is important to consider these cases separately because the backup LSP, being usually longer than the working LSP, has a
significant contribution to the overall average value of the notification delay.
Fig. 4. Notification delay for working and backup LSP
As shown in Figure 4, the difference in the notification delay between these kinds of paths stays large for all scenarios without the Notify message, even for the network with additional links. For scenarios with the Notify message these differences are smaller, and for networks with additional links they almost disappear (scenarios H and J). This happens because, when additional links are added, the mean length of the shortest paths in the control plane from all intermediate nodes to the ingress node is similar for both working and backup LSPs, even if the backup LSP is much longer than the working LSP. Note also that two additional links (scenario H) cause a similar notification improvement as six additional links (scenario J). This implies that some links play a more important role in the process of notification than others. In fact, in our network the additional links 3–6 and 2–7 are much more important than links 7–10 or 2–15 because, for example, 15 LSPs pass through nodes 2 and 7 while only 11 LSPs, including working and backup LSPs, pass through nodes 2 and 15. Note that in the simulated network nodes 2, 3, 4, 5, 6, and 7 form a kind of network backbone and the links between these nodes are the most utilized. Additional links between these nodes should, therefore, give the biggest improvement in the notification process.
V. CONCLUSIONS
In order to provide a reliable control plane, protection of the control plane should be used. IP rerouting can be used as the protection method. In this case protection depends on the behavior of the routing protocol. Typically, the time from the protection request to switch-over completion is longer than the value acceptable in the data plane. Therefore, faster detection of a signaling link failure is needed. The results of the simulation confirm the very important role of failure detection in the control plane.
The RSVP Hello mechanism used for failure detection of a link between direct neighbors is not effective. The asymmetrical architecture of the GMPLS network is the reason for this inefficiency. Failure detection for the signaling tunnel is needed. One of the solutions is to allow multi-hop RSVP Hello exchange. The simulation results confirm this statement. Other solutions for signaling tunnel management could also be considered. Signaling failures in the control plane should be notified to the ingress and egress routers. The main reason for this is to allow the ingress router to manage the data LSP for which the signaling tunnel has failed. In some cases the data plane LSP may continue to work unaffected and wait for the establishment of a new signaling tunnel. In other cases, for example if a new LSP has to be created, the protection reaction can be forced in order to provide traffic services immediately. Appropriate error codes should be used to distinguish these notifications from other errors. The presented results also show that the choice of the control plane architecture and of the mechanisms used has an important influence on the notification delay and, consequently, on the protection performance. It should be noted
that using the Notify message to perform notification is a good solution and should be applied. This option decreases the notification time, especially for networks with the asymmetrical architecture, and reduces the differences in the notification delay between alternative paths, which should be taken into consideration during selection of protection strategies. The simulations show, moreover, that additional links improve the notification process, but the improvement depends on the current utilization of the connected nodes. Note, however, that the notification improvement shown in the simulation is not very significant due to the simplification of the simulation model:
• no message processing delays; additional delays will be generated by processing of RSVP messages; this is important because, as mentioned earlier, Notify messages are not processed by the intermediate nodes;
• a very small and regular topology with constant delays, the same bandwidth on each link, etc.; for example, in a larger and less regular network, the backup paths or even working paths may be much longer and much more complicated, so notification delays and delay differences between particular paths should also be much bigger.
REFERENCES
[1] G. Li, J. Yates, D. Wang, C. Kalmanek, “Control plane design for reliable optical networks”, IEEE Communications Magazine, pp. 90-96, February 2002.
[2] A. Jajszczyk, P. Rozycki, “Recovery of the Control Plane after Failures in ASON/GMPLS Networks”, IEEE Network, vol. 20, no. 1, Jan/Feb 2006.
[3] D. Awduche, Ed., “RSVP-TE: Extensions to RSVP for LSP Tunnels”, RFC3209
[4] J. Lang, Ed., “Link Management Protocol (LMP)”, RFC4204
[5] R. Rabbat, V. Sharma, Ed., “A Fault Notification Protocol for GMPLS-based Recovery in Shared Mesh Networks”, draft-rabbat-fault-notification-protocol-05.txt (work in progress)
[6] L. Berger, Ed., “Generalized Multi-Protocol Label Switching (GMPLS) Signaling Resource ReserVation Protocol-Traffic Engineering (RSVP-TE) Extensions”, RFC3473
[7] T. Soumiya, R. Rabbat, Ed., “Extensions to LMP for Flooding-based Fault Notification”, draft-soumiya-lmp-fault-notification-ext-01.txt (work in progress)
[8] T. D. Nadeau, T. Otani, D. Brungard, A. Farrel, “OAM Requirements for GMPLS Networks”, draft-nadeau-ccamp-gmpls-oam-requirements-01.txt (work in progress)
[9] J. Lang, B. Rajagopalan, D. Papadimitriou, Ed., “Generalized Multi-Protocol Label Switching (GMPLS) Recovery Functional Specification”, RFC4426
[10] J. Lang, K. Mitra, J. Drake, “Extensions to RSVP for optical networking”, draft-lang-mpls-rsvp-oxc-00.txt (work in progress)
[11] R. Braden, Ed., “Resource ReSerVation Protocol (RSVP) – Version 1 Functional Specification”, RFC2205
[12] L. Berger, D. Gan, G. Swallow, P. Pan, F. Tommasi, S. Molendini, “RSVP Refresh Overhead Reduction Extensions”, RFC2961
[13] VINT project at LBL, Xerox PARC, UCB and USC/ISI, The Network Simulator ns-2, http://www.isi.edu/nsnam/ns/
[14] C. Callegari, F. Vitucci, RSVP-TE patch for MNS/ns-2, http://netgroupserv.iet.unipi.it/rsvp-te ns/
[15] J. Korniak, P. Rozycki, “GMPLS – simulation tools”, Proceedings of the 1st Conference “Tools of Information Technology”, Rzeszow, Poland, 2006