Resilience in a multi-layer network - CiteSeerX

11 downloads 37195 Views 44KB Size Report
the way a server layer supports the spare capacity of its client layer(s) has .... budget, operators will not be urged to pay for the higher reliability degree ... amongst all connections (under supervision of the recovery scheme) or dedicated to a ...
Resilience in a multi-layer network *

P. Demeester, M. Gryseels , K. Van Doorselaere A. Autenrieth C. Brianza G. Signorelli R. Clemente, M. Ravera A. Jajszczyk, D. Janukowicz G. Kalbe Y. Harada, S. Ohta A.G. Rhissa Correspondence:

University of Gent-IMEC, B Technical University Munich, G Italtel, I SIRTI, I CSELT, I ITTI, P Belgacom, B NTT, J INT, F

Piet Demeester IMEC-University of Gent Department of Information Technology Sint-Pietersnieuwstraat, 41 B-9000, Gent, Belgium Tel.: +32 9 264 3334, Fax : +32 9 264 3593 E-mail: [email protected], WWW: http://www.intec.rug.ac.be

Abstract. The integration of different technologies such as ATM, SDH and WDM

in future multi-layer transport networks requires a better coordination between the individual network layers. Especially in the area of network survivability, it is felt that much can be gained by a better coordination and integration of the healing actions taken by the different layers in the event of failures. We present a framework for survivability strategies in multi-layer networks, identifying different options that can be taken to recover in a network with multiple recovery schemes active in different layers.

1. Introduction The resilience of public telecommunication networks gets intense attention of all major players in the telecom world. The adoption of optical, SDH and ATM broadband technology in transport networks has resulted in an ever-increasing concentration of more traffic on fewer network elements. Consequently, failures can potentially cause more burden than ever before, particularly to customers whose communications are of vital importance. Preventive actions such as fire safety plans, armored cables, etc. are able to reduce the failure occurrence frequency. However, such measures are in general not sufficient for operators (and their customers) for whom even the rarest failure may cause a substantial loss of revenue. As a result, the last years we have witnessed an increased need for reactive actions, healing the network (i.e., recovering the affected communications services) after failure occurrence. In this view, network survivability, which is the ability of a network to recover traffic affected by failures, has gained a critical role in the design of modern telecommunications network. A lot of research has already been done on the design and analysis of survivable network architectures, as well as recovery protocols for the different transport technologies (see e.g. [1] for an excellent survey). As the protection and restoration techniques today implemented and under analysis are mainly addressing a single technology/network layer, the study of new solutions taking into consideration an overall system (ATM, SDH and optical) is the preliminary step to further development in this field (see [1][7]). This paper presents results from the ACTS PANEL research project [8], which studies the integration and coordination of multiple recovery schemes active in different network layers. Some objectives for an integrated approach to multi-layer survivability can be • to avoid contention between the different single-layer recovery schemes, *

Research Assistant of the Fund for Scientific Research (F.W.O.-V.) - Flanders (Belgium)

• to promote cooperation and sharing of spare resources, • to increase the overall availability that can be obtained for a certain investment budget, • to decrease investment costs which are required to ensure a certain survivability target. Past research (see e.g., [1][3][4]) already identified that so-called escalation between recovery schemes active in different layers can enhance the survivability. The recovery interworking between layers starts from the viewpoint that a multi-technology network consists of a stack of single-layer networks (each technology is represented by a network layer). A client/server functional relationship holds between adjacent layers of the multi-layer stack, which consists in each lower network layer providing transport functionality to higher client layers [15]. In each layer network (that incorporates the necessary flexibility for rerouting) a single-layer recovery scheme may be deployed. The presence of multiple single-layer recovery mechanisms in a network may have several causes: • Recovery schemes residing in lower layers (e.g., optical) often enable a more effective recovery of burdensome failures like cable cuts, while failures of higher layer equipment (e.g., an ATM DXC) may demand additional resilience in these higher layers. • Differentiation of service reliability requirements (e.g., for different demand classes) may result in installation of the recovery schemes closer to the layer where traffic is actually injected in the transport network. Moreover, traffic is generally injected at several network layers (e.g., ATM services, SDH 150 Mbit/s leased lines). • The natural evolution of telecommunications networks may result in adding new survivable layers to the existing ones (for example, optical layer survivability). To optimize the overall recovery performance, some coordination is required between these different recovery schemes. We denote this coordination as interworking or escalation between layer networks.

support of client layer spare resources

recovery inter-working strategies

server layer connections Recovery at lowest layer Recovery at highest layer

Recovery at multiple layers

Sequential activation • bottom-up

• top-down • diagnostic Parallel activation

recovery layer assignment • pre-planned

• dynamic

ed ar sh

gle n i s

pre-palnned vs. dynamic route calculation

• separated from native

rt po p su

centralised vs. distributed

• unprotected at server layer

shared vs. dedicated spare resources

• protected at server layer

link vs. path based

ted ca i d de

rt po p su

ns o i t op y r ve o ec rr e y la

Figure 1: Framework for multi-layer survivability

2. Framework for multi-layer survivability This section presents the framework that has been worked out by PANEL for multi-layer survivability strategies. The framework covers the different options in multi-layer survivability, and sets out a number of alternatives that may be followed by a network operator. It is summarized in figure 1 and discussed hereafter in more detail. The options basically fall in two categories, i.e. options pertaining to the single-layer recovery systems and options pertaining to multi-layer interactions. The singlelayer recovery options as such have been studied extensively in the scope of single-layer survivability [1]. The multi-layer options are not well known yet. The following challenges have to be addressed by a network operator facing a multi-layer environment: • For each failure it has to be determined which recovery scheme – at which layer – will be applied. In a multi-layer network, it may be that several schemes and/or layers are responsible for the overall resilience against a certain failure type.

• The responsibility of each survivable layer (i.e., a layer equipped with a recovery scheme) in the overall recovery process has to be delineated. This may require special functionality in the equipment. More in particular, it must be ensured that a recovery at a certain layer is not activated for a failure that is supposed to get resolved at another layer. Indeed, such activation may result in competition for network resources, leading to network congestion and other unwanted behavior of the network. • When the responsibility for resilience is distributed across network layers (so-called escalation), the recovery actions of the single layer mechanisms have to be coordinated. For instance, it may be that a client layer has to wait until its server layer has restored enough client layer spare capacity before it can start its recovery. • Because spare resources are required in every layer where a recovery scheme is active, several ways exist to combine the spare capacity pools present in the different layers. More in particular, the way a server layer supports the spare capacity of its client layer(s) has been identified as yet another option of multi-layer survivability strategies. The following sections present the major „dimensions“ of our framework for multi-layer survivability. These are the recovery approach, the escalation strategy, the recovery layer assignment and the server layer support of client layer spare capacity. For each of them, we discuss some relevant options that can be taken. We refer to the literature for an explanation of the single-layer recovery options, which can be chosen differently for each network layer. Of course, it may still be required to evaluate whether certain combinations of single-layer options are feasible from the viewpoint of investment cost and recovery performance.

0 2.1. Recovery interworking strategies: recovery approaches To define a multi-layer recovery approach, the following question is raised: „For each failure, which layer networks are responsible for its recovery?“ Recovery at the lowest layer A first recovery approach, denoted as recovery at the lowest layer, is to recover the affected network connections in the lowest possible layer. Hereby, the survivability is provided as close as possible to the origin layer of the failure. Because of the coarser switching granularity of lower layers, such an approach is generally simpler in the number of affected connections to be rerouted. In contrast to other recovery approaches, recovery at the lowest layer generally enables to restrict the recovery actions to nodes closer to the failure origin. If a failure occurs only in one layer (e.g., cable cut), recovery at the lowest layer resolves it using a single recovery scheme. However, when „recoverable failure set“ also contains failures affecting several layers at the same time (e.g., complete site breakdown affecting both SDH and ATM equipment), several schemes will have be activated (see [9]). Due to the coexistence of multiple schemes, some interworking functionality needs to be provided to correctly assign and coordinate the recovery responsibilities of each recovery scheme [9]. For instance, the end nodes of disrupted network connections at a client layer can generally not distinguish a client layer failure from a server layer failure. Indeed, in the event of a server layer failure, subsequent alarm messages will be passed through each layer after they are generated (see [10]). These may trigger the client layer recovery scheme. Because the network is planned to resolve server layer failures „at the lowest layer“, this may create a situation for client layer resource competition. In order to activate the client recovery mechanism only in case of client layer failures, some interworking functionality needs to be provided (see section 2.2). A second consequence is the presence of multiple spare capacity pools in the network. Indeed, every survivable network layer requires its own spare resources for the rerouting of affected connections. Due to the independent operation of different layers, a single-layer recovery scheme can not draw directly upon spare resources in other network layers. The spare capacity pool of a higher layer has to be supported through lower layer paths and thus imposes additional investment costs for all layers below (see [11]). If the network is designed in a negligent way, multi-layer survivability may thus require a serious amount of resources. However, a new spare capacity concept, called common pool survivability [11], has been devised that provides the means for an inter-layer spare capacity pool (see section 2.4).

Recovery at the highest layer A second multi-layer recovery approach, denoted recovery at the highest layer, consists in recovering disrupted traffic in a layer closer to the origin of the traffic. A higher layer recovery scheme is (in principle) able to resolve failures occurring in all layers below. For a certain traffic, this approach ensures service resilience against different failures (at the same layer and in lower layers) with a single survivability scheme, i.e. a scheme residing in the highest layer. Providing resilience at the highest possible layer has the following advantages: • A transport network often carries several service classes, each with a different reliability requirement. It is often easier to provide multiple reliability degrees when the survivability schemes reside in higher layers. • Since a single recovery scheme then suffices to provide protection against failures occurring in every layer involved in the transport of the traffic, the implementation complexity of interworking (see also further) between different schemes can be avoided. In this case, interworking merely comprises activating the responsible recovery scheme as fast as possible at the appropriate layer network. • Because the granularity of the higher layers is finer, relatively less spare capacity can be needed (from a single-layer point of view) compared to lower layer recovery. Nevertheless, this does not imply that also from an overall cost point of view, highest layer recovery is cheaper than lowest layer recovery (see e.g. [12]). On the other hand, the finer switching granularity of the higher layers may complicate the rerouting in event of lower layer failures, because many entities are then affected at the same time. For instance, a cable cut may involve a very complex failure scenario at a higher layer, since many paths have to be rerouted. This will probably slow down the recovery time. Secondly, when the server layer paths supporting the higher network layer are rather long, the recovery process may involve reconfiguration in network elements far away from the original failure cause. Most probably, special precautions will have to be taken to adapt the survivability scheme of the higher layer (to solve the lower layer failure scenarios). For example, with restoration at the ATM layer it is envisaged to assemble VPs sharing physical routes into VP groups as a way to reduce the recovery „efforts“ in case of physical failures. In addition, the network design phase has to ensure that the spare and working resources of the higher layer are physically disjoint. This may lead to a higher capacity requirement, because the routing (and rerouting) constraints of higher layer paths are much more severe than in the lowest layer approach (see [12]). Recovery at multiple layers A third approach, denoted recovery at multiple layers, could be to recover disrupted traffic not within a single layer network but distributed at multiple layer networks. This approach tries to combine the advantages of both previous recovery approaches. For instance, since spare capacity is required in the client layer to cope with client failures, less spare capacity might be required for server layer failures (who are partly recovered at the client layer). The joint optimization of spare capacity of client and server layer may lead to some distributed (between the two layers) capacity allocation. Another advantage is that it may also ease the tailoring of the survivability requirements to different service types and needs. The obvious drawback is the increased complexity of this recovery approach. Let us consider an example where a server link failure is resolved in a distributed way by a client and server layer. The server layer will probably recover most of the disrupted server traffic is recovered, but the remaining – affected– server paths are propagated to the client layer introducing consequent client link failures (probably more than one). These secondary client link failures should then recovered at the client layer. Similar as in the previous recovery approach, the interworking functionality is required to activate the appropriate recovery scheme as fast as possible at the appropriate network layer. If the spare resources at the client layer are not affected by the expected single server root link failure (i.e., working and backup route are physically disjoint), client layer recovery of the appropriate traffic (not protected by the server layer) can be started immediately. If the client spare „routes“ are disrupted, the activation of the client layer scheme has to be delayed until the server layer restores the necessary client layer spare resources. A well-designed interworking strategy (see section 2.2) is thus essential

for an effective operation of recovery at multiple layers. In general, the increased complexity and the bottom-up activation will however reduce the recovery speed compared to the previous approaches.

1 2.2. Recovery interworking strategies: interworking - escalation In a multi-layer network, several recovery schemes may be involved to resolve certain failures. This is the case for all failures in case of the recovery at multiple layers, but it is also the case for other recovery approaches in case of failures affecting multiple layers (e.g., complete site failures). An interworking strategy (also called escalation strategy [3][4]) consists of a set of rules describing when to start, stop and how to efficiently coordinate the activities of the different recovery schemes. Two options have been identified concerning the order of activation: starting two or more recovery schemes in parallel or starting them sequentially. The (independent) parallel strategy is fast and requires no communication or coordination between the schemes. The sequential strategy may lead to longer overall recovery times than parallel activation but is generally easier to keep under control. A sequential strategy determines the order of activation of the schemes and provides coordination between the schemes (ensuring that schemes are activated at an appropriate moment). Currently, coordination mechanisms such as hold-off times and/or specific OAM messages (called inter-layer recovery tokens) are envisaged [13]. Coordination between the activated recovery schemes in a parallel interworking approach may or may not be present. If the working and spare resources in the client layers are physically disjoint and there is enough spare capacity in the client layer to recover from server layer failures, recovery in the client and server layers may proceed in parallel without interaction. This approach is denoted independent parallel interworking, and is clearly simple and robust. If the working and spare resources are not disjoint, recovery mismatch between the schemes may occur (e.g., both server and client layers recover the same disrupted traffic) and lead to sub-optimal net results. Coordination may be useful here. This approach is therefore denoted interdependent parallel interworking. Note, in order to solve some of the mismatch problems, TMN may play a decisive role [14]. The sequential interworking approach does prevent blocking problems between recovery mechanisms as only one mechanism is active at a time. On the other hand, it may be slower and requires more coordination, and hence messaging, to stop mechanisms and trigger others. This coordination may be implemented in a distributed or centralized (e.g., via TMN) way. Two questions need to be answered in case of sequential interworking: where to start the recovery and when to „escalate to the next layer“. The question of where to start results in three approaches: bottom-up, top-down and in between (diagnostic). The first two approaches start at a well-defined layer. The bottom-up interworking starts at the layer network which is closest to the failure assuring a very quick activation of the recovery mechanism; the upwards escalation may take place when the recovery in a layer fails, upon the expiration of a time out or by intervention of the TMN. The topdown interworking always starts at the highest layer network and escalates downwards. This permits to recover the highest layer connections as first, allowing a network operator to offer to its customers’ connections with specified protection priority classes. The decision to start at a specific layer depends in diagnostic interworking among others on the received alarms, and previously gathered survivability statistics. When to escalate may be based on either time-out or diagnostic methods. Diagnostic methods take for example into account the ratio of recovered to disrupted connections, the elapsed time and the remaining quantity of spare resources. 2 2.3. Recovery interworking strategies: recovery layer assignment The pre-planned recovery layer assignment pre-determines which disrupted network connections are recovered at the different layer networks for a given failure, whereas in the dynamic recovery layer assignment this is determined on-line. There is a trade-off between complexity and flexibility. Consider for instance a failure disrupting two SDH paths (denoted A and B) transporting ATM VP traffic. In case of pre-planned assignment, path A is for instance always recovered at the SDH VC-4 layer whereas the remaining disrupted traffic (carried by path B) is always recovered at the ATM VP layer. In case of dynamic assignment either path A or path B is recovered at the SDH VC-4 layer, whereas the remaining traffic is recovered at the ATM VP layer. The choice between A and B may for instance depend on the spare capacity present in both layers (which may not be available at the network elements) or may be random (e.g., as an indirect consequence of a dynamic distributed flooding scheme in combination with hold-off time).

3 2.4. Support of client layer spare resources The previous sections revealed that in a multi-layer environment, there exists a wide variety of recovery approaches. Our research has also discovered that in a network with different survivable layers, new concepts are possible (and necessary) with respect to the spare resources. These new concepts enable the deployment of multi-layer survivability at a reasonable investment cost. As a matter of fact, transmission and switching resources are very expensive and optimization of the spare resources is thus very important for network operators (and indirectly for customers as well) since it results in a containment of the network costs and consequently also of the service tariffs. When the spare resources required for multi-layer survivability would involve an unreasonable investment budget, operators will not be urged to pay for the higher reliability degree eventually achieved by it. Secondly, in a network with multiple survivable layers, specific attention has to be paid to avoid the protection of the client spare resources by a server layer recovery scheme. This redundant protection seems neither necessary nor useful, but is often a result of a too naïve provisioning of the multi-layer spare resources. In the past, single-layer recovery methods have been classified based on the fact whether the recovery use dedicated backup facilities (so-called protection schemes) or whether the spare resources are shared between different working connections (so-called restoration scheme that dynamically allocates the backup facilities). On a single-layer basis, the spare resources can thus be either shared amongst all connections (under supervision of the recovery scheme) or dedicated to a certain connection. Generally, sharing the spare resource allows a more effective resource allocation and utilization than with dedicated spare resources, but the control algorithm is also more complicated. We have discovered that a similar classification can be made in the context of multi-layer recovery, depending on whether the spare resources of a certain layer are dedicated to the connections transported in that layer, or whether the spare resources are shared between connections transported in different layers. This second option requires innovative design approaches, but leads to a much more efficient usage of the expensive spare resources - somewhat in the same way as single-layer restoration enables a more efficient usage of spare resources than single-layer protection. Our study on achieving an inter-layer sharing of spare resources resulted in a new survivability concept, called common pool. The basic idea is to support the client spare resources as unprotected preemptible traffic (also known as extra traffic [16]) in the server layer. This results in the sharing of server spare resources with the client recovery scheme. Hence, it enables an inter-layer spare capacity pool as opposed to a traditional provisioning approach with spare resources pools dedicated to a certain network layer. As indicated by [11], common pool survivability can yield considerable cost savings as compared to a traditional multi-layer survivability approach.

3. Outlook - Conclusions This paper has presented results of the PANEL research project, which studies the survivability in multi-layer transport network. The major challenges of multi-layer survivability are to determine which layer(s) is responsible for which failure (recovery approach), the coordination of the singlelayer recovery actions (escalation strategy), to design the multiple spare resource pools in a careful way. These are addressed by the different options in our framework for multi-layer survivability. In order to provide overall service resilience, operators can determine a suited strategy based on this framework. It is an impossible task to state the optimal strategy, since this will differ on a case-to-case base. In addition, one may choose different strategies for different failure types. In general, the set of failures is divided in a set of expected failures (i.e., most probable failures such as cable cuts) that an operator needs to resolve to assure sufficient service availability. The remaining failures, called unexpected failures, are also important to get resolved, but are for budget reasons generally not considered in the network design (more specifically, during the allocation of spare resources). This means that the spare capacity pools – deployed for the expected failures – will generally not allow full recovery of such unexpected failure. While the recovery at multiple layers may seem too complex for the expected failures, this approach may better exploit the spare resources available in the network (e.g., restoring in each layer as much as the available resources allow). As such, the recovery at multiple layers (and the subsequent escalation strategy) is very well qualified as a „second line of defense“ against unexpected – catastrophic – failures.

Some other observations – that can be made on a general basis – are the following. Recovery at the lowest layer seems the most suited approach for a fast and effective recovery of the more burdensome failures like cable cuts. In fact, this is proved by the growing interest in optical recovery, which becomes inherently attractive as the network throughput increases [18]. On the other hand, recovery at the highest layer may be better suited when the reliability requirements are very differentiated and should be tailored to each client. In fact, when client and server network layer are operated by different companies, cases may exist where the client layer operator prefers using his own survivability scheme over unprotected server paths instead of relying on the higher „availability“ of protected server paths. At any rate, the actual comparison between the approaches also depends on the particular network and traffic model, and the single-layer recovery options that have been taken. An operator may also be restricted by the survivability schemes currently installed in the network, which may not allow certain innovative recovery approaches. A definitive comparison between different solutions has to be validated by results from a network evaluation process, where the approaches are measured in a more quantitative way. More specifically, it is needed to evaluate them in terms of investment cost and recovery performance. This can be done using computer-aided network design and simulation tools.

Acknowledgement This work has been supported by the European Commission through the ACTS project AC205/PANEL.

References [1]

T.-H. Wu, „Emerging technologies for fiber network survivability“, IEEE Communications Magazine, vol. 33, no.2, pp. 58-74, 1995.

[2]

D. Johnson, „Survivability strategies for broadband networks“, In Proc. of IEEE Globecom’96 conference, London, 1996.

[3]

J. Manchester and P. Bonenfant, „Fiber Optic Network Survivability: SONET/Optical Protection Layer Interworking“, In Proc. of NFOEC’96, Denver, CO, 1996.

[4]

L. Nederlof et al., "End-to-end survivable broadband Networks", IEEE Communications Magazine, vol. 33, no. 9, pp. 63-70, 1995.

[5]

T. Noh, „End-to-end self-healing SDH/ATM networks“, In Proc. of IEEE Globecom’96 conference, London, 1996

[6]

P. Veitch, D. Johnson and I. Hawker, „Design of resilient core ATM networks“, In Proc. of IEEE Globecom ‘97 conference, Phoenix, 1997.

[7]

T.-H. Wu and N. Yoshikai, „ATM Transport and Network Integrity“, Academic Press, 1997.

[8]

P. Demeester et al., „PANEL - Protection across network layers“, NOC ‘97, European Conference on Networks and Optical Communications, Antwerp, Belgium, 1997.

[9]

K. Struyve et al., „Design and evaluation of multi-layer survivability for SDH-based ATM networks", In Proc. of IEEE Globecom ‘97 conference, Phoenix, 1997.

[10] K. Van Doorselaere et al., „Fault propagation in WDM-based SDH/ATM networks“, Brugge, Belgium, 1998. [11] M. Gryseels et al., „Common pool survivability in ATM on SDH ring networks“, DRCN98 Workshop, Brugge, Belgium, 1998. [12] M. Gryseels et al., „A cost evaluation of service protection strategies in ATM on SDH transport networks“, DRCN98 Workshop, Brugge, Belgium, 1998. [13] A. Autenrieth et al., „Simulation and evaluation of multi-single layer broadband networks“, DRCN98 Workshop, Brugge, Belgium, 1998 [14] A. Jajszcyk, C. Brianza and D. Janukowicz, „TMN-based management of multi-layer communication networks“, DRCN98 Workshop, Brugge, Belgium, 1998.

[15] ITU-T Recommendation G.805, „Generic functional architecture of transport networks“, 1995. [16] ITU-T Recommendation G.841, „Types and characteristics of SDH network protection architectures“, November 1997. [17] K.-I. Sato, „Advances in Transport Network Technologies, Photonic Networks, ATM and SDH“, Artech House, London, 1996. [18] O. Gerstel, „Opportunities for optical protection and restoration“, Proc. of OFC’98, San Jose, 1998.

Suggest Documents