
Towards Resilient Networks and Services

NIG-G5/0699

ACTS GUIDELINE NIG-G5

TOWARDS RESILIENT NETWORKS AND SERVICES

Editors
Panos Georgatsos ([email protected]), Algosystems S.A., Greece
Yves T’Joens ([email protected]), Alcatel, Corporate Research Center, Belgium

Contributing Projects: COBNET, EXPERT, PANEL, REFORM, UPGRADE

Authors
A Autenrieth - Technical University Munich
C. Brianza - Italtel
L. Castagna - SIRTI
R. Clemente - CSELT
P. Demeester - University of Ghent IMEC
P. Georgatsos - Algosystems S.A.
L. Georgiadis - ICS-FORTH
A. Geyssens - BELGACOM
D. Griffin - UCL - UK
M. Gryseels - University of Ghent IMEC
Y. Harada - NTT
A. Jajszczyk - ITTI
D. Janukowicz - ITTI
E. Makris - NTUA
D. Manikis - NTUA
C. Mas - EPFL
S. Otha - NTT
M. Potts - Ascom
M. Ravera - CSELT
A. G. Rhissa - INTF
G. Signorelli - SIRTI
K Struyve - University of Ghent IMEC
Y. T’Joens - Alcatel
K. Van Doorselaere - University of Ghent IMEC


Contents

1. EXECUTIVE SUMMARY
2. RATIONALE AND OBJECTIVES
3. NETWORK FAILURES AND SURVIVABILITY STRATEGY FORMULATION
3.1. NETWORK FAILURES AND THEIR IMPACT
3.2. SURVIVABILITY STRATEGY FORMULATION
3.3. OBJECTIVES
3.3.1. Operational Goals
3.3.2. Influencing Factors
4. TERMINOLOGY AND TAXONOMY
4.1. SINGLE LAYER NETWORK RECOVERY MECHANISMS
4.2. MULTI-LAYER NETWORK RECOVERY MECHANISMS
5. SURVIVABILITY AT THE IP LAYER
6. SURVIVABILITY AT THE ATM LAYER
6.1. WHY ?
6.2. ISSUES
6.3. RESULTS
7. SURVIVABILITY AT THE SDH LAYER
8. SURVIVABILITY AT THE WDM LAYER
8.1. ISSUES
8.2. RESULTS
9. MULTI-LAYER SURVIVABILITY
9.1. WHY ?
9.2. ISSUES
9.3. RESULTS
10. INTEGRATION OF SURVIVABILITY WITH OTHER NETWORK FUNCTIONS
11. SURVIVABILITY ARCHITECTURES AND MEANS
12. ON-GOING STANDARDISATION
12.1. ATM FORUM
12.2. ITU-T
13. CONCLUSIONS

1. Executive Summary

The introduction of optical, SDH and ATM broadband technology in transport networks concentrates more traffic on fewer network elements and makes service support increasingly vulnerable to network failures. Network resilience (survivability) is a crucial ‘must have’ feature of future network infrastructure in order to ensure the integrity of the different services which it supports. Network resilience refers to the ability of a network to recover services affected by failures that may be encountered during its operation. The issue has started to play a critical role in the design of modern telecommunications networks in recent years. The growing awareness of this issue is reflected in the number of studies reported in the literature and the work currently being undertaken in various standardisation organisations (ITU-T, ANSI) and international activities (ATMF). The guideline reviews the work of the ACTS programme on network survivability. It starts by classifying the various types of network failure and their impact on services and users, identifying the resilience requirements of different types of service and user. It then presents the elements of a survivability strategy and a classification of survivability schemes. Individual sections then discuss how to implement such schemes in the IP, ATM, SDH and WDM layers of a multi-layer network. A framework for multi-layer survivability is presented, together with recommendations on how to use it. The final sections consider the integration of survivability with other network functions and QoS based survivability. The analysis and recommendations draw on practical experience gained through implementation and trials. A comprehensive state-of-the-art review of relevant research and standardisation efforts is also provided.

2. Rationale and Objectives

Successful deployment of telecommunication services in a competitive nationwide context requires cost-effective means for ensuring the integrity of the services under fault conditions. Cost-effective schemes for coping with network failures have been given close attention by all major players in the telecommunications world.

The introduction of optical, SDH and ATM broadband technology in transport networks concentrates more traffic on fewer network elements and makes service support increasingly vulnerable to network failures. A break in a cable equipped with terabit/s optical transmission systems can disrupt the equivalent of 250 million telephone calls at once. The risk is greatest in long-distance transport networks, where the probability of cable cuts is not negligible (for example, in a 25000 km long Pan-European network a cable cut is statistically likely every four days). Furthermore, the connection-oriented nature of various network technologies (e.g. ATM), used for provisioning services of guaranteed quality, makes finding cost-effective solutions to withstand network failures more urgent. Once a failure has occurred along the set-up route of a connection, the subsequent traffic flow within the connection will inevitably be lost.

Considering the ever-increasing economic and social significance of communications services to society, the basis of the information infrastructure must be robust. Operators have to assure a high availability of their services, under both normal and fault conditions, particularly for customers whose communications are of vital importance. Preventive actions such as fire safety plans, armoured cables, etc. are able to reduce the failure occurrence frequency. Carefully planned repair scenarios, using state-of-the-art measurement techniques, detailed and accurate databases about equipment and cable locations and well-trained technicians, can ensure a fast repair (minimising the time communications would be disrupted). However, such measures are insufficient for operators (and their customers) for whom even the rarest disruption of communications may cause a substantial loss of revenue. As a result, in recent years, there has been an increasing need for automated reactive actions for healing the network (i.e. recovering the affected communications services) after failures.

In this context, network resilience (survivability) is a crucial ‘must have’ feature of future network infrastructure in order to ensure the integrity of the different services supported by the network. Network resilience refers to the ability of a network to recover the services affected by failures that may be encountered during its operation. The issue has started to play a critical role in the design of modern telecommunications networks in recent years. The growing awareness of this issue is reflected in the number of studies reported in the literature and the work currently being undertaken in various standardisation organisations (ITU-T, ANSI) and international activities (ATMF). The research and standardisation efforts mainly concentrate on the design and analysis of suitable network architectures and related schemes/mechanisms for recovering network services when faults occur, for different network technologies (see [1], [52] for a survey).

The main theme of this guideline is network survivability. It consolidates the relevant work in ACTS projects, analyses the key issues and presents recommendations on how to implement survivability. The analysis and recommendations draw on practical experience gained through implementation and trials. A comprehensive state-of-the-art review of relevant research and standardisation efforts is also provided. The guideline is aimed at network operators and equipment manufacturers interested in the realisation of survivable networks.

3. Network Failures and Survivability Strategy Formulation

3.1. Network Failures and their Impact

Network failures can be classified according to certain criteria. The first distinguishes between network entities, that is to say between link and node failures. The second distinguishes between hard and soft failures. Hard failures occur when the signal is totally lost; this typically includes catastrophic failures. Soft failures occur when the signal is degraded (e.g., QoS contracts are no longer met). These individual failures can occur as single failures or multiple failures. A single failure means that, once the failure has been detected, no other failure occurs whilst the network tries to recover its services. Multiple failures mean that a number of failures occur simultaneously, or that further failures occur whilst the network is still trying to recover. Although these terms are often used in standardisation and technical papers, no explicit definition could be found.

When failures occur in the network, the effects are:
• loss of existing connections on the failed part of the network,
• blocking of call set-up attempts due to lack of network resources,
• high call re-attempt rate by users who have lost their services,
• loss of goodwill from the network users.

Depending on the nature of the service level agreement, the network operator generally offers users compensation for these effects. Note that the perception of network failures and their economic and/or social impact is not an objective measure of the network technology, but depends more on the individual service and service user category. The aim of a survivable network is to be able to cope with failures and recover the active affected services.
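The classification above can be captured in a small data model. The following is a minimal sketch only: the class and function names are illustrative, and the fixed `recovery_time` window is an assumption (the guideline itself notes that no explicit definition of single and multiple failures could be found).

```python
from dataclasses import dataclass
from enum import Enum


class Entity(Enum):
    LINK = "link"
    NODE = "node"


class Severity(Enum):
    HARD = "hard"  # signal totally lost
    SOFT = "soft"  # signal degraded, e.g. QoS contract no longer met


@dataclass(frozen=True)
class Failure:
    entity: Entity
    severity: Severity
    detected_at: float  # seconds; hypothetical detection timestamp


def classify_scenario(failures, recovery_time):
    """Return "multiple" if any failure is detected while recovery from
    the previous one is still assumed to be running, else "single".

    Assumption: recovery takes a fixed recovery_time (seconds) after
    each detection.
    """
    times = sorted(f.detected_at for f in failures)
    for earlier, later in zip(times, times[1:]):
        if later - earlier < recovery_time:
            return "multiple"
    return "single"
```

With a recovery window of 2 seconds, two failures detected 0.5 seconds apart would be treated as a multiple-failure scenario, whereas failures 10 seconds apart would be treated as two independent single failures.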

3.2. Survivability Strategy Formulation

Throughout this Guideline, the term survivability strategy refers to the policy adopted by the network operator and the resulting set of functions that should be incorporated into network elements to withstand a selected set of network failures. The term survivability or recovery or restoration mechanism refers to the particular technology and (protocol) operation employed in the network to actually restore the integrity of the existing services, once a failure has been encountered. The survivability strategy has several aspects [37]. In general, the objective, operational goals and influencing factors of a survivability strategy are independent of network technology (e.g. WDM/SDH/ATM/IP).

3.3. Objectives

The main objective of any survivability strategy is to withstand, in a cost-effective manner, failures in transmission systems, and to reconfigure user connectivity without the active participation of the end system. The key aim is that the user should not be bothered with reconfiguration of the network topology, whether this is due to congestion within the network or failure of transmission elements.

3.3.1. Operational Goals

The restoration speed, i.e. the speed of the restoration mechanisms employed by the network to recover the affected services, is the most relevant performance measure from the end-user’s point of view. The factors contributing to the restoration speed are:

1. Alarm detection/notification: this is the elapsed time between the actual occurrence of the failure and the notification of the alarm to the entity that will take the decision on how to react to the alarm. This ‘entity’ can either be centralised or distributed (see section 4).
2. Alarm correlation/triggering of restoration: the entity that collects the alarm notifications has to decide in what way the network will react to the failure. This phase may include a time-out, during which all alarm notifications from different parts of the network (or network elements) that need to be correlated can be received. Based on the outcome of this decision process (one of) the restoration mechanism(s) will be triggered. It is also possible that one instance of the restoration process gets launched for each entity that has to be restored.
3. Execution of restoration mechanism: this can be either the retrieval of a pre-loaded scenario, the calculation of an alternative route based on available data, or the execution of a distributed process based on the exchange of protocol messages.
4. Route validation: in some cases, the restoration route will first be validated before it is activated.
5. Reconfiguration of network elements: when everything is processed, decided and validated, the final step in the restoration process is the actual “switching-over” of the interrupted traffic flows. The time required depends on the number of individual actions that need to be undertaken and the design of the network elements.

The survivability strategy must be designed to deliver defined levels of survivability in response to a set of anticipated failure scenarios.
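The five contributions listed above add up to the restoration time seen by the end user. The sketch below illustrates that budget; it assumes the phases run strictly one after another, and all names and figures are hypothetical rather than measured values.

```python
# Phases in the order given in section 3.3.1.
PHASES = ("detection", "correlation", "execution",
          "validation", "reconfiguration")


def reconfiguration_time(actions, per_action_ms):
    """Phase 5 grows with the number of individual switching actions.

    per_action_ms stands in for the network element design (assumed
    constant per action here).
    """
    return actions * per_action_ms


def restoration_time(phase_ms):
    """Total restoration time in ms, assuming sequential phases."""
    missing = [p for p in PHASES if p not in phase_ms]
    if missing:
        raise ValueError(f"missing phases: {missing}")
    return sum(phase_ms[p] for p in PHASES)


# Hypothetical budget, all values in milliseconds.
example = {
    "detection": 10,
    "correlation": 20,   # includes any alarm-correlation time-out
    "execution": 5,      # e.g. retrieval of a pre-loaded scenario
    "validation": 0,     # route validation is optional
    "reconfiguration": reconfiguration_time(actions=4, per_action_ms=3),
}
total_ms = restoration_time(example)
```

A budget of this kind makes it easy to see which phase dominates: a long correlation time-out or a large number of switching actions can outweigh a fast route calculation.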
The anticipated failure scenarios can include single or multiple coexisting soft (Signal Degrade) and hard (Signal Loss) failures (see section 3.1).

The cost of a particular survivability strategy can be expressed in terms of the resources needed to recover the traffic affected. These resources, called restoration or protection resources, include: restoration capacity (capacity that may have been reserved in the network for restoration purposes), processing, memory and administrative overhead.

The restoration robustness can be defined as the ability to react in a predictable way to unpredictable events. Most restoration schemes are designed with a certain range of failure scenarios in mind. When failures occur outside that range, or when other anomalies occur, such as loss or corruption of data, the restoration mechanisms operate “out of specification”. A minimum requirement for robustness is that they should not make things even worse by taking wrong decisions. A good measure of robustness is that the mechanisms still manage to minimise the impact of failures under such conditions.

The manageability of a survivability strategy refers to the ability to operate so that its behaviour and results can be comprehensively managed.

3.3.2. Influencing Factors

Determining and prioritising the operational goals is not an exact science; there is no simple recipe for coming up with an ‘optimal’ network survivability strategy. A number of factors have to be taken into account. The primary factor is the type of network services and their users. Network users differ according to the economic, security and social impacts of an outage on their activities. An outage can cause a significant loss of revenue to professional users but will have no economic impact on occasional residential users. Parameters determining the tolerable total outage time are not (yet) included within the traffic contracts for network services. Other factors that may guide the choice of a survivability mechanism include network topology, variations in current load and growth, network technology and the standardisation status of available survivability mechanisms. The definition of a suitable survivability strategy involves a trade-off between loss of revenue and the cost of the strategy. The variety of failure instances depends on the network technology used. Operators should clearly specify the operational goals of their network, taking into account these influencing factors and trade-offs, before selecting a survivability strategy. With QoS-based services, it would be sensible to include tolerance to outages in the list of QoS parameters. Different survivability techniques could then be applied for different classes of service. However, there is currently no standardisation activity on this topic.

4. Terminology and Taxonomy

4.1. Single Layer Network Recovery Mechanisms

The recovery mechanisms used or being studied today generally address a single network layer. Schemes for SDH networks have already been standardised in ITU-T [19] and ETSI [13], and some of them are implemented in existing SDH networks. Schemes for ATM networks are currently being standardised in ITU-T SG13 [12], and work has started in ITU-T SG13 and ETSI TM1 to specify protection schemes for optical networks.

The terms protection resources or spare resources describe resources built into the network for recovering traffic affected by faults. These include network connectivity resources (e.g. VPCs or STM paths) and network bandwidth, called spare or restoration capacity. The term protected resource refers to an entity in a given network layer that is targeted for protection by the survivability strategy. Protected resources are usually the connectivity entities that carry user traffic or bundles of user traffic. For example, in the ATM layer a VPC, a VCC or a link could be a protected resource.

Generally, recovery schemes can be classified as centralised or distributed. In centralised schemes, central network functions calculate the alternative routings for traffic affected by network failures. Management plane activities are usually involved. In distributed schemes, the recovery functions are left to the individual network elements and are executed in parallel. In this case, the restoration phase is typically dominated by control plane activities. Centralised recovery schemes are implemented as proprietary solutions in most of the existing management systems for ATM, SDH and WDM networks. Distributed restoration schemes are still the subject of research studies.

A second level of classification of recovery mechanisms is given in [12], [56]:
• Re-routing is defined as the establishment of appropriate resources to recover affected traffic by network management functions.
• Self-Healing is defined as the establishment of appropriate resources to recover affected traffic by the network itself, without the involvement of network management functions.
• Protection switching is the establishment of pre-assigned replacement resources by means provided by the equipment itself, without the involvement of network control or management functions.

Table 4-1 summarises the main characteristics of the different survivability mechanisms and table 4-2 indicates which recovery mechanisms apply to certain network topologies.
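The three second-level mechanisms defined above differ mainly in who establishes the replacement resources and in whether those resources are pre-assigned. A minimal sketch of that distinction follows; the string labels for the establishing party are illustrative assumptions, not terms taken from the cited standards.

```python
from enum import Enum


class Mechanism(Enum):
    RE_ROUTING = "re-routing"
    SELF_HEALING = "self-healing"
    PROTECTION_SWITCHING = "protection switching"


def classify_mechanism(established_by, pre_assigned):
    """Map the party establishing the replacement resources to a mechanism.

    established_by: "management" (network management functions),
                    "network"    (the network itself, no management), or
                    "equipment"  (the equipment itself, no control or
                                  management involvement).
    pre_assigned:   True if the replacement resources were assigned
                    before the failure.
    """
    if established_by == "management":
        return Mechanism.RE_ROUTING
    if established_by == "network":
        return Mechanism.SELF_HEALING
    if established_by == "equipment" and pre_assigned:
        return Mechanism.PROTECTION_SWITCHING
    raise ValueError("combination not covered by the definitions above")
```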

Recovery Mechanism | Centralised Restoration (Re-configuration) | Distributed Restoration (Self-healing network) | Protection Switching
-------------------+--------------------------------------------+------------------------------------------------+---------------------
Control Mechanism | Centralised | Distributed | Distributed
Alternative Routes Calculation | Real time or Pre-calculated | Real time or Pre-calculated | Pre-calculated
Bandwidth assignment | Real time | Real time | Pre-assigned
Recovery time | Slow (several seconds / few minutes) | Quick (several milliseconds / few seconds) | Very Quick (few milliseconds)
ATM VP layer | Proprietary solutions | Under research | ITU-T I.630
SDH layers | Proprietary solutions | Under research | ITU-T G.841, G.842; ETSI ETS 300 746, DTM 3025, DTR 3041
Optical layers | Proprietary solutions | Under research | ITU-T G.872; ETSI DTR/TM1047

Table 4-1: Survey of recovery mechanisms for ATM, SDH and Optical Networks.

Control mechanism | Centralised | Distributed | Distributed
Recovery mechanism | Re-routing | Self-Healing | Protection Switching
Network topology | Mesh Networks | Mesh Networks | Rings, Point-to-point

Table 4-2: Survivability overview.

Both re-routing and self-healing recovery schemes can be applied in meshed networks. The restoration technique is supported by spare transmission resources available in the mesh. The alternative routes for recovering the affected traffic can be either statically determined, i.e. calculated and stored in advance of the failure(s), or dynamically calculated, i.e. chosen in the light of the actual network conditions at the time of the failure. These techniques require the identification of spare bandwidth on the network, in order to be able to instantiate a replacement path to recover the affected traffic. The main approaches to identifying and activating these spare resources are flooding based and back-up based. In flooding based approaches, restoration is achieved by broadcasting messages (flooding) that search for available resources throughout the network. In back-up based approaches, back-up (pre-determined) protection resources are allocated to active resources. For example, in ATM networks this can include VPCs with pre-assigned bandwidth and/or VPCs with zero bandwidth. Table 4-3 compares the two approaches.

Criterion | Flooding based | Pre-allocated (back-up) based
----------+----------------+------------------------------
Restoration rapidity | Slow (resource searching) | Fast
Algorithm and message transmission protocol | Complex | Simple
Number of generated messages | Large | Small
Restoration segment | Between link terminator (difficult to restore between path terminator) | Between path terminator (any node along path)
Required spare resources | Large | Small
Node failure restoration | Difficult | Easy (except failure of restoration pair node)
Ability of process identification and interruption | Difficult | Easy
Backup path management | Not required (only at failure occurrence) | Required
Spare resource management | Necessary if high ratio of restoration needed | Proper management of spare resource on the backup route is necessary
Flexibility against multiple or unforeseeable failure | High | Low

Table 4-3: Comparison of restoration techniques.

Protection switching can be further subdivided by the topology of the protection scheme, which can be either point-to-point or ring based. In protection switching, the installed back-up resources can be either shared or dedicated. The following classification has been proposed by several authors:
• 1:1: Semi-dedicated (by extra traffic) facility restoration; the signal is only rerouted to the spare resource after the failure has occurred.
• 1+1: Dedicated facility restoration; the signal is sent over both the active and the standby resource, and the receiver has some local algorithm to choose between the two signals.
• m:n: Shared facility restoration; m protection entities are shared amongst n working entities. The simplest example of this kind of protection is 1:n, where one protection entity protects n working entities.

Point-to-point protection can be further classified as either diverse or non-diverse path routed. In non-diverse path routing, the protection resource may follow parts of the same path as the protected resources. In diverse path routing, the protection resource follows a different path from the protected resource.

Ring based protection schemes provide protection to all the nodes within the ring. Add Drop Multiplexers (ADMs), fitted with external switch-over capabilities, offer restoration functionality in a more economic way than their Digital Cross-connect Systems (DCSs) counterparts.

A separate classification is the logical distance over which the affected resources are to be restored. Span restoration typically restores the link connection between the two network entities that detect the failure. Path restoration, on the other hand, restores the end-to-end logical connection.
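The trade-off between dedicated and shared protection can be illustrated with a little arithmetic on the m:n notation. The sketch below uses illustrative function names and deliberately ignores extra traffic on 1:1 protection entities and any priority rules between working entities.

```python
from fractions import Fraction


def spare_ratio(m, n):
    """Protection entities provisioned per working entity in an m:n scheme."""
    if m < 1 or n < 1:
        raise ValueError("m and n must be positive")
    return Fraction(m, n)


def restorable(m, n, failed_working):
    """Number of failed working entities that can be switched to
    protection at the same time: at most one per protection entity."""
    if not 0 <= failed_working <= n:
        raise ValueError("failed_working must be between 0 and n")
    return min(m, failed_working)


dedicated = spare_ratio(1, 1)   # 1:1 (and 1+1): one spare per working entity
shared = spare_ratio(1, 4)      # 1:n with n = 4: one spare for four entities
```

Under these assumptions a 1:4 scheme needs a quarter of the spare entities of a 1:1 scheme, but only one of the four working entities can be recovered if two of them fail together.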

4.2. Multi-Layer Network Recovery Mechanisms

Different network technologies (ATM, SDH, WDM) each provide different functionality. In modern broadband transport networks these technologies are often present at the same time (see figure 3.1). This is mainly due to the evolution of the services and technologies but also because network operators need to exploit past investments. These network technologies inter-work and co-operate by means of standardised adaptation functions, which allow, for instance, the transport of ATM cells inside SDH containers (within STM-n frames) and the transport of SDH frames inside a WDM channel.

Essentially, a broadband transport network can be viewed as a stack of layers (figure 4-1). Each layer is in principle a single technology network that provides transport functionality to the layer above. The transport network therefore becomes a multi-layer network. The layer that provides transport functionality is called the server (or lower) layer and the layer that uses the server layer is called the client (or higher) layer (for instance, the ATM layer is a client of the SDH layer, which is in turn a client of the WDM layer). Figure 4-1 shows some multi-layer network configurations that could support the provision of IP-based services.

[Figure 4-1: stacks of IP, ATM and SDH layers over a WDM optical network]

Figure 4-1: Multi-layer network configurations.

It should be mentioned that each network technology layer can in turn be decomposed into several layers (i.e. ATM VC/VP layers, SDH High/Low Order Path layers, SDH Multiplex/Regenerator Section layers, Optical Channel layer, Optical Multiplex/Transmission section layers). Individual mechanisms to detect and recover from failures may be deployed in each network layer. The overall survivability strategy should harmonise their activities in a cost-effective manner.
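The client/server relationship between layers can be sketched as an ordered stack. The example below assumes one particular configuration (IP over ATM over SDH over WDM) for illustration; the function names are hypothetical, and `potentially_affected` simply expresses that a failure in a server layer can propagate to every client layer above it unless it is recovered first.

```python
# One possible configuration, listed from highest (client) to lowest (server).
STACK = ["IP", "ATM", "SDH", "WDM"]


def server_of(stack, layer):
    """Layer that provides transport to `layer`, or None at the bottom."""
    i = stack.index(layer)
    return stack[i + 1] if i + 1 < len(stack) else None


def client_of(stack, layer):
    """Layer that uses `layer` as its server, or None at the top."""
    i = stack.index(layer)
    return stack[i - 1] if i > 0 else None


def potentially_affected(stack, failed_layer):
    """The failed layer and all its direct and indirect client layers."""
    i = stack.index(failed_layer)
    return stack[: i + 1]
```

For the stack above, `server_of(STACK, "ATM")` is the SDH layer, and a failure in the WDM layer can potentially affect all four layers, which is why the recovery mechanisms of the individual layers need to be harmonised.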

5. Survivability at the IP Layer

Survivability is an inherent feature of the IP protocols, as each packet is individually routed through the network. Damaged links can be avoided by the packet-by-packet routing decisions. However, as a hop-by-hop routing paradigm is generally used, loops may be introduced into the routes during a transient period following the failure, until all routing tables are consistently updated [40]. Loop-free routing information protocols have been proposed [41][46]. The emergence of technologies for QoS support and fast-switching of IP traffic (e.g. RSVP, MPLS) means that survivability schemes are needed for IP networks. These should, for example, handle failures on a statically routed path where RSVP has reserved specific resources to meet the QoS requirements of particular traffic streams, or failures along an LSP (label-switched path) established to ‘short-cut’ IP traffic. Although the issues are understood, they remain open research topics.
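The transient loops mentioned above arise when neighbouring routers hold inconsistent next-hop entries for the same destination. The following sketch traces a packet hop by hop through per-node next-hop tables; the four-node topology and the table contents are hypothetical and serve only to illustrate the effect.

```python
def trace(next_hop, source, dest):
    """Follow per-node next-hop entries for `dest` starting at `source`.

    Returns (path, outcome), where outcome is "delivered", "loop" or
    "unreachable".
    """
    path = [source]
    visited = {source}
    node = source
    while node != dest:
        node = next_hop.get(node)
        if node is None:
            return path, "unreachable"
        path.append(node)
        if node in visited:
            return path, "loop"
        visited.add(node)
    return path, "delivered"


# Hypothetical network with destination D. Link B-D has just failed.
# B has already switched to the detour via A, but A still points to B.
transient = {"A": "B", "B": "A", "C": "D"}

# Once A has also updated its table, the tables are consistent again.
converged = {"A": "C", "B": "A", "C": "D"}
```

With the `transient` tables a packet entering at A bounces between A and B; with the `converged` tables a packet from B reaches D via A and C.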

6. Survivability at the ATM Layer

Survivability mechanisms at the ATM layer are essential in order to recover from ATM specific failures, such as performance degradation on single VPCs or VCCs. There are, however, reasons why other survivability features might preferentially be implemented at the ATM layer.

6.1. Why ?

As described in section 3, service restoration requires fast, cost-effective and manageable solutions. Restoration at the ATM layer offers a number of advantages over restoration at the physical layer and other underlying network technologies. For certain types of failure, such as QoS degradation, survivability at the ATM layer may constitute a viable solution. On the other hand, restoration at the ATM layer may not be scalable, given that a single cable cut can break more than 300 STM-1 connections at once, and the introduction of D-WDM would further increase the number of links affected by a cut. Nevertheless, the main benefits of ATM technology with respect to restoration are:

Flexibility - Independent route establishment and bandwidth assignment

At the ATM layer, the routes that the defined connections can follow are independent of the bandwidth assigned to the connections. In other network technologies, e.g. in STM networks, the routes and the bandwidth of the connections are inseparable. In STM networks, a digital path is established by assigning a time slot of the TDM frame at each cross-connect on the path. Thus, path route establishment and bandwidth assignment are not independent: only fixed bandwidth (not equal to zero) digital paths can be established. In contrast, the route and bandwidth of an ATM VPC are defined independently, because the route is defined in the Cell Forwarding tables of the cross-connect nodes while the bandwidth is logically defined in the database of the VPC terminator and/or cross-connects as needed. Therefore, ATM offers the advantage that multiple candidate alternative routes might already be set up by a restoration system without allocating bandwidth to them. Specifically, zero bandwidth VPCs can be defined to pre-assign restoration routes. This reduces the restoration time whilst limiting network resource usage under normal conditions.
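The separation of route and bandwidth described above can be sketched as follows. This is an illustration under simplifying assumptions: the `VPC` class and the `activate_backup` function are hypothetical names, the node labels are invented, and the check that the back-up links actually have enough spare capacity is deliberately omitted.

```python
from dataclasses import dataclass


@dataclass
class VPC:
    route: tuple             # node sequence, as set in the forwarding tables
    bandwidth: float = 0.0   # Mbit/s, assigned logically and independently


def links(route):
    """Undirected links traversed by a route."""
    return {frozenset(pair) for pair in zip(route, route[1:])}


def activate_backup(working, backups, failed_link):
    """Move the working bandwidth to the first pre-established back-up
    VPC whose route avoids the failed link. Returns that VPC, or None
    if no back-up route avoids the failure."""
    failed = frozenset(failed_link)
    for backup in backups:
        if failed not in links(backup.route):
            backup.bandwidth = working.bandwidth
            working.bandwidth = 0.0
            return backup
    return None


# Hypothetical example: the back-up routes exist before the failure but
# consume no bandwidth (zero bandwidth VPCs).
working = VPC(route=("A", "B", "C"), bandwidth=100.0)
backups = [VPC(route=("A", "B", "D", "C")), VPC(route=("A", "E", "C"))]
chosen = activate_backup(working, backups, failed_link=("A", "B"))
```

Because the back-up routes are already present in the forwarding tables, restoration in this sketch reduces to a bandwidth re-assignment, which is the reason given above for the shorter restoration time.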
Compared with SDH, ATM layer protection switching offers the advantages of faster fault/degradation detection, possible protection for node failures, less required spare capacity, and flexibility of providing service protection based on desired marketing strategies. When high-speed broadband data services and LAN interconnection services are gradually introduced into the network, SDH ring transport begins to show its inability to handle this sort of bursty traffic efficiently. In order to overcome this inefficient use of bandwidth for bursty broadband services, working traffic bandwidth management schemes can be based on the ATM VC and VP layers. In-band signalling Another benefit of ATM is its OAM capabilities. SDH paths offer overhead independent channels that can be used for the management of the resources. The channel assigned to the management of the resources has a fixed bandwidth. In ATM, the OAM [58] cells are mixed with the user plane traffic, and can be assigned a variable bandwidth. This greatly increases the speed with which fault management can perform its functions. As a result, restoration messages can inherently use in-band signalling; unlike SDH systems, where embedded overhead channels with limited throughput have to be used for exchanging restoration messages. The signalling channels are thus better suited to supporting burst message transmission, and consequently the objective of reducing message volume and numbers is not so critical as in SDH networks. Channels in SDH are symmetrical and have a fixed bandwidth, depending upon the hierarchical level of the channel. In contrast, ATM channels can be asymmetrical, allowing an arbitrary bandwidth value, which can depend on service type and usage parameters. This of course comes at the expense of an extra level of complexity. ACTS Guideline NIG-G5

Reliability of Service class support

As ATM has the capability of offering switched services, the standards include specifications for ATM network signalling protocols. It can be expected that implementing reliability of service implies a certain cost to the network provider. For physical layer restoration, there is no way for the user to indicate dynamically to the network provider which services need to be protected. It is therefore very likely that the network provider will over-dimension the network. However, a signalling protocol could be used to allow the individual user to choose (for every single connection) the desired reliability of service. Early standardisation efforts in this direction are under way in the ATM Forum.

6.2. Issues

The main issues involved in survivability at the ATM layer relate to the particular type of recovery mechanism to use (see section 4.1). Current work on survivability at the ATM layer mainly concentrates on distributed schemes (self-healing), exploiting the flexibility of the layer. Section 12 provides an overview of the on-going work in the relevant standardisation bodies. For ATM networks in particular, the self-healing (distributed) recovery mechanisms can further be classified according to whether or not the protection resources have been pre-determined, and according to the policies for sharing the defined restoration capacity. Another dimension could be the amount of bandwidth to be restored (restorable bandwidth) when restoration is considered at the VP layer. It could be the used, the effective or the allocated bandwidth of the VPC. One of the main issues related to survivability at the ATM layer is the choice of the particular sub-layer (VP or VC).

VP or VC layer restoration

Switched VCC (SVC) services are provided on top of a pre-established VP layer. Restoration of SVC services can be achieved by considering either VP or VC layer protection. Providing survivability at the VC layer can theoretically involve the restoration of up to 2^28 individual VCCs following the failure of a single ATM link. This seems to make the restoration process unmanageable. However, one should take into account that only part of the full VC space will be used, and only a limited subset of the user population will require automated restoration from the network. If the user can signal the desired reliability class to the network, the network can handle restoration in two distinct ways. The first option is to restore every individual connection depending on its reliability class. Re-routing at the VC layer offers more flexibility, and thereby the potential for an optimally used ATM network (note that SVC routes are established on an optimally designed VP layer).
The second option is to protect dedicated VPCs to levels matching the priorities of the reliability classes of the traffic they support. However, the segregation of traffic according to different ATM traffic capabilities, QoS classes and survivability classes will require the installation of more VPCs in the network, and a further fragmentation of the resources on the individual physical links. Providing survivability at the VP layer allows very rapid restoration of the whole set of VCCs carried on the VPCs, and it is certainly less complex than the restoration of individual VCCs on the basis of their reliability class. This is also confirmed by the vast amount of literature on VP layer restoration, and the standardisation work within ITU-T [I.630] on VP protection switching methods. Note here that protection at the VC link layer is redundant when a VPC restoration mechanism is in place for this specific set of VCLs. This restriction should be borne in mind by the network survivability designer. Of course, VC layer restoration can be seen as a fall-back solution for failing VP layer restoration.
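The difference in scale between VC layer and VP layer restoration follows from the size of the identifier spaces in the cell header. The short calculation below assumes the network-node interface (NNI) header layout, with a 12-bit VPI and a 16-bit VCI; the variable names are illustrative only.

```python
VPI_BITS_NNI = 12   # assumption: NNI cell header with a 12-bit VPI
VCI_BITS = 16       # 16-bit VCI

max_vpcs_per_link = 2 ** VPI_BITS_NNI
max_vccs_per_link = 2 ** (VPI_BITS_NNI + VCI_BITS)

print(max_vpcs_per_link)   # 4096
print(max_vccs_per_link)   # 268435456
```

Under this assumption a single link can in theory carry 2^28 VCCs but only 2^12 VPCs, so a VP layer scheme has at most a few thousand entities to restore per link.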

6.3. Results

The ACTS project REFORM [39] has studied survivability at the ATM layer and its coupling with other network availability functions.

[Figure 6-1 plot, "VCCs vs Hops": restoration time (msec, banded from 0 to 7000) against test number (1 to 5) and hop number (series S1 to S3).]

Figure 6-1: 1 VPC, relative influence of number of hops and VCCs (test number: 1=1 VCC, 2=5 VCCs, 3=10 VCCs, 4=15 VCCs, 5=20 VCCs).

Figure 6-1 shows how the overall restoration speed is influenced by the load of VCCs on a single VPC and by the number of hops the VPC spans. These results come from a testbed experiment prototyping a static shared resource protection switching protocol. Before any faults occur, this protocol establishes protection VPCs for the working VPCs carrying user traffic, but allocates no bandwidth to them. Bandwidth is then negotiated along the protection VPC path when faults occur. Such a scheme allows over-subscription of protection VPCs on individual links, so long as these protection VPCs do not compete for restoration resources under the anticipated failure scenario (e.g. single link failure).

The figure shows that the number of active VCCs on the VPC has a stronger influence on restoration speed than the number of hops of the protection VPC. This tendency is to be expected, since for VPC protection switching the switches at the end points of the VPC need to be reconfigured internally for each failed VCC (in the prototype environment this was realised through external interactions with the switch, not through embedded software). The figure also indicates that the restoration time grows as the number of hops of the failed VPC increases. At each VP connecting point, only bandwidth modification actions were necessary.

It is important for ATM switch manufacturers to implement appropriate means to cater for fast restoration. These include:
• Reconfiguration of the VC switching matrix, allowing fast ‘switch-over’ of a number of VCCs from one VP termination point to another, so as to avoid a linear increase in restoration time with the number of failed individual connections.
• Reconfiguration of VPC characteristics; the traffic descriptor associated with a VPC during its creation cannot be dynamically modified. VPC traffic characteristics, namely bandwidth, need to be dynamically changed to cater for instantiation of back-up VPCs to recover VCCs from affected VPCs.

To minimise the influence of the hop length on VPC restoration, great attention should be paid to the specification of the protocol used to activate the protection VPC resources in a static shared restoration mechanism.

Operators should be aware of the scaling properties of the selected survivability strategy, that is to say, the threshold on the connection load above which the reconfiguration of the ATM layer could become time critical (see footnote 1). The identification of such thresholds may lead to more efficient survivability strategies, e.g. to recovery at lower layers, at the cost of introducing multi-layer survivability mechanisms.
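The over-subscription allowed by a static shared protection scheme can be illustrated with a small sketch. It assumes each working VPC is described by its bandwidth, the links of its working route and the links of its protection route; this data layout and the function name are hypothetical and are not taken from the REFORM prototype. For each link, the spare capacity needed is the largest total bandwidth that any single link failure would activate on it.

```python
from collections import defaultdict

# Hypothetical working VPCs: (bandwidth, working-route links, protection-route links)
vpcs = [
    (30, {"A-B"}, {"A-C", "C-B"}),
    (20, {"A-D"}, {"A-C", "C-D"}),
    (10, {"A-B"}, {"A-C", "C-B"}),
]


def spare_capacity(vpcs):
    """Spare bandwidth needed per link, assuming only single link failures."""
    # activated[link][failed] = bandwidth switched onto `link` if `failed` breaks
    activated = defaultdict(lambda: defaultdict(int))
    for bandwidth, working_links, protection_links in vpcs:
        for failed in working_links:
            for link in protection_links:
                activated[link][failed] += bandwidth
    return {link: max(per_failure.values())
            for link, per_failure in activated.items()}


shared = spare_capacity(vpcs)
dedicated_on_a_c = sum(bw for bw, _, prot in vpcs if "A-C" in prot)
print(shared["A-C"], dedicated_on_a_c)  # 40 60
```

In this example link A-C carries three protection VPCs totalling 60 units, but no single link failure activates more than 40 of them, so 40 units of spare capacity suffice.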

7. Survivability at the SDH Layer

Reliable survivability techniques in SDH networks have reached a certain level of maturity. They mainly rely on automatic protection switching (PS) restoration schemes and on ring-based topologies. Ring-based SDH networks have been deployed. Restricting network topology to a ring simplifies restoration management and reduces cost. Reliable restoration techniques for SDH networks are reviewed in [54].

Protection switching in standard SDH networks can be based at the multiplex section (MS) layer, using the K1 and K2 bytes for this purpose. An MS PSRing uses uniform routing, so that the working traffic is transported over the bi-directional MS working trails. In the event of a failure, affected traffic is transported over the bi-directional MS protection paths in the opposite direction around the ring. MS protection rings are shared. The total capacity in each STM-N multiplex section is divided equally into N/2 working and N/2 protection channels. Under a protection switching scheme, the working channels numbered 1 to N/2 are switched into the protection channels N/2 + 1 to N. The protection capacity is shared to protect the working traffic carried in the working capacity of any multiplex section in the ring.

Subnetwork connection protection (SNCP) is a path layer protection that can operate at the VC4 level and/or the VC12 level. It is a dedicated protection scheme, which can be used in different network structures: meshed networks, rings, etc. This is dedicated 1+1 or 1:1 protection, in which the working traffic and the protection traffic at the transmit end of an SNC are transmitted over two separate routes. The 1:1 dedicated protection would be able to support secondary (extra) traffic, but is not much used because of the need for protection signalling between both ends. A simpler scheme uses 1+1 dedicated protection and single-ended switching.
In this case, the transmit end is permanently bridged, so that the traffic is transmitted on both the working and the protection subnetwork connections. At the receive end of the subnetwork connection, a protection switch is effected by selecting one of the signals based on purely local information. No PS protocol is required for this protection scheme if it uses single-ended switching. In the case of dual-ended protection switching (1:1 protection switching), a PS protocol is required to co-ordinate the local and remote switch and bridge operations. This may require a sub-layering technique, and needs further study. SNC protection does not limit the number of NEs within the SNC/NC.

The ACTS project MISA [59] has been concerned with integrated fault management of SDH and ATM networks for the purpose of providing connectivity services independently of the underlying network technology across multiple network administrations.
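The channel numbering of the shared MS protection scheme described earlier in this section can be written down directly. The sketch below assumes that working channel m is switched onto protection channel m + N/2; the function name is hypothetical and the K1/K2 protocol itself is not modelled.

```python
def protection_channel(working_channel: int, n: int) -> int:
    """Return the protection channel for a working channel in an STM-N section.

    Working channels are numbered 1..N/2, protection channels N/2+1..N.
    """
    if n % 2 != 0:
        raise ValueError("N must be even to split the section into two halves")
    half = n // 2
    if not 1 <= working_channel <= half:
        raise ValueError("working channels are numbered 1 to N/2")
    return working_channel + half


print([protection_channel(ch, 16) for ch in range(1, 9)])
# [9, 10, 11, 12, 13, 14, 15, 16]
```

For an STM-16 section, working channels 1 to 8 therefore map onto protection channels 9 to 16.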

8. Survivability at the WDM Layer

8.1. Issues

Fault management of optical networks is complex, and this is compounded by the huge quantity of information that can be lost in a second. Survivability at the WDM layer is a topic of current research.

Footnote 1: Imagine 32 fibres being cut at once, each one carrying 64 lambdas, each lambda carrying an OC-192 signal.

So far there is no official ITU-T standard regarding protection switching for optical transparent networks, but the issue is being discussed in the relevant study groups (ITU-T Study Group 13). The proposed protection switching strategies are based on the so-called Optical Layer, not yet standardised by ITU-T. The current draft descriptions of the Optical Layer suggest that its upper layer, the Optical Path layer, is divided into Optical Sections, which are divided into Fibre Sections. Protection switching can be done at any of these sub-layers, but complexity increases as we move upwards in the stack of sub-layers. There is general agreement that switching must be carried out when input power failure is detected, and that this process should only involve the Optical Layer. For example, in a 1+1 Fibre Section protection scheme, input power failure must allow optical switching from the working link to the protection link without the aid of the Regenerator Section layer. The process becomes much more complicated in the Optical Path layer, when an optical path must be rerouted due to a fibre cut.

The OAM issue is common to the upper and lower sub-layers. Some proposals advocate the use of the Regenerator Section layer and some bytes of the Regenerator Section overhead to perform Optical Section OAM functions, but there are also proposals to maintain everything within the Optical Section.

Another issue of concern is the interoperability of the fault management functions, including restoration, at the WDM layer with those of overlying network layers (e.g. SDH, IP). When a fault occurs, the system is overloaded with many alarms; alarm filtering and synchronisation of restoration mechanisms at different layers become critical issues. The issue is discussed further in section 8.2.

8.2. Results

The ACTS [55] projects PHOTON, METON, MEPHISTO, UPGRADE, MOON, COBNET are concerned with aspects of fault management at the WDM layer. UPGRADE has developed OAM functions based on a 1510 nm supervision channel (according to ITU-T recommendations). Any fault along the transmission line can be detected and precisely located immediately. COBNET has looked at management functions at the WDM layer and has specified the information flow between WDM and client layers for ensuring interoperability. MOON has specified information models for WDM networks which include fault management aspects.

9. Multi-Layer Survivability

9.1. Why?

As discussed in section 4.2, the transport networks of telecom operators are moving towards a configuration where different technologies with intrinsic recovery capabilities lie on top of one another. Some of the reasons that may lead operators to deploy multiple single-layer recovery mechanisms in their networks are:
• Recovery schemes implemented in lower layers (e.g. in the SDH or optical layers) often allow more effective recovery from troublesome failures such as cable cuts, but are not able to resolve failures occurring in a higher layer. For instance, an SDH recovery scheme may not restore the ATM connections disrupted by the breakdown of ATM equipment. Such failures demand additional resilience in the higher layers.
• Traffic is generally injected at several network layers (e.g. at the ATM layer to provide ATM services, at the SDH high order or low order path layers to provide leased line services). Differentiation of service reliability requirements (e.g. for different service classes) may result in the deployment of recovery schemes closer to the layer where traffic is actually injected into the transport network.
• The natural evolution of telecommunications networks may result in adding new survivable layers to the existing ones (for example, optical layer survivability).

In a network with multiple recovery schemes, it is still possible to tackle the network resilience issue independently in each layer network (as is often done today). However, this approach may lead to inefficient solutions in terms of network cost or network recovery performance. For instance, if the planning of the protection resources of the SDH layer does not take account of the design of protection resources in the ATM layer, protection resources could be allocated to protect traffic already protected in the ATM layer (figure 9-1). This may lead to the installation of twice the amount of restoration capacity strictly needed to protect the traffic and increase network costs unnecessarily.

[Figure 9-1 diagram: ATM layer above SDH layer, drawn against geographical location; legend: service, unprotected, protected.]

Figure 9-1: Protected ATM layer on protected SDH layer: example of waste of spare capacity in case of separated planning of single-layer networks.

A possible solution could be partitioning the network into different survivable subnetworks, in which an intra-subnetwork failure is resolved within its subnetwork (figure 9-2).

[Figure 9-2 diagram: ATM layer above SDH layer, drawn against geographical location; legend: service, unprotected, protected.]

Figure 9-2: Partitioning in ATM and SDH protected subnetworks.

The separate treatment of network survivability in different layers may also lead to the implementation of non-optimal solutions. For instance, in the case of a cable break inside the SDH core network which affects a lot of ATM connections, it would be much more efficient to perform protection switching in the SDH layer than to trigger a slow recovery action in the ATM layer. The approach to implementing a suitable multi-layer recovery strategy can be improved by considering all the system layers together (ATM, SDH and optical layers). It is unlikely that there is an “optimal strategy” providing the best solution in every network context. Operators have to choose the strategy that best suits their network scenario and its requirements. Factors that may influence the choice between single and multi-layer recovery include:
• the set of failures the network should be able to survive,
• the budget for network survivability,
• the recovery schemes already deployed or familiar to the operator,
• the network topology.

9.2. Issues

To optimise overall recovery performance, the recovery schemes employed in different layers need to be co-ordinated. This co-ordination is known as inter-working or escalation between layer networks [1], [3], [4]. The objectives for an integrated approach to multi-layer survivability include:
• to avoid contention between the different single-layer recovery schemes,
• to promote co-operation and sharing of spare resources,
• to increase the overall availability that can be obtained for a certain investment budget,
• to decrease the investment costs required to ensure a certain survivability target.

Single-layer recovery options have been studied extensively in the context of single-layer survivability [1]. Multi-layer recovery options are not yet widely understood. In addition to the issues discussed in the earlier sections dealing with individual network layers, network operators considering multi-layer recovery have to address the following challenges:
• Multi-layer recovery approach: For each failure, a choice has to be made of which recovery scheme to apply at which technology layer. In a multi-layer network, there may be several schemes and/or layers responsible for overall resilience against a certain type of failure, e.g. a break in a cable carrying both ATM traffic and SDH leased line traffic.
• Clear view of responsibility: When faults occur, many alarms are generated at different layers. As a result, both lower and higher layers detect faults and trigger fault management actions. The responsibility of each survivable layer (i.e. a layer equipped with a recovery scheme) in the overall recovery process has to be defined. This may require special functionality in the equipment. More specifically, it is important to ensure that recovery is not activated at one layer when the failure is supposed to be resolved at another layer. This could result in competition for network resources, leading to network congestion and other unwanted network behaviour, for instance unnecessary loss of extra traffic carried inside the protection path of the client layer.
• Inter-working and co-ordination: When the responsibility for resilience is distributed across network layers (so-called escalation), the recovery actions of the individual layer mechanisms have to be co-ordinated. For instance, a client layer may have to wait until its server layer has restored enough client layer spare capacity before it can start its own recovery. Inter-operation across different layers is required to cope with faults correctly (not reacting to false alarms) and cost-effectively (avoiding duplication), and to reduce the amount of fault management information that is propagated across the different layers and to the network management system.
• Synchronisation and integrity: The fault detection and survivability mechanisms employed in each of the layers need to act in a synchronised way to ensure the integrity of the information kept at each layer and in the management plane, and to ensure timeliness within all the levels of the restoration process.
• Protection Resource Management: Because protection resources are required in every layer with a recovery scheme, several ways exist to combine the protection resource pools of the different layers. More specifically, the way a server layer supports the protection resources of its client layer(s) has been identified as yet another option of multi-layer survivability strategies. Multi-layer recovery schemes need to ensure that the amount of protection resources designed for each layer is sufficient for the resources to be protected within this layer, and to avoid allocating protection resources for resources designed to be protected in the layer above.
• Complexity: The implementation of a multi-layer survivability scheme might add complexity and overhead in each of the layers. At each layer, several real-time routines and related mechanisms (protocol engines) should be present to ensure correct implementation of the survivability scheme.

9.3. Results

The ACTS project PANEL [57] has studied the issue of multi-layer survivability. The project COBNET has studied the inter-operation issues with respect to fault management for the optical layer.

PANEL has developed a framework for multi-layer recovery, covering all the different options with which network operators are confronted and setting out a number of alternatives that may be followed for each option. Figure 9-3 presents this framework for multi-layer survivability. The figure shows that survivability in multi-layer networks may be viewed as a three-dimensional problem, where the dimensions are:
• the definition of the responsibilities of the layer(s) with respect to failures (recovery approach) and the strategy to co-ordinate the inter-working of the recovery mechanisms at each layer (escalation strategy),
• the strategy to design spare resource pools to support the spare resources required by the client layer in the server layer,
• the recovery schemes implemented in each layer.

These axes constitute the different options in the multi-layer survivability framework. Different alternatives for each of the options have been specified in [8], [12], [14], [15], [16]. The determination of a suitable multi-layer recovery strategy involves the evaluation of the combinations of the alternatives for each dimension from the viewpoint of investment cost and recovery performance. More details on the above dimensions and their alternatives, together with evaluation results, can be found in [15], [21], [24]-[26].

[Figure 9-3 diagram: three axes. Recovery inter-working strategies: recovery approach (recovery at lowest layer, at highest layer, at multiple layers); escalation strategy (sequential activation: bottom-up, top-down, diagnostic; parallel activation); recovery layer assignment (pre-planned, dynamic). Support of client layer spare resources: unprotected at server layer; separated from native server layer connections; protected at server layer; shared support; dedicated support. Single layer recovery options: link vs. path based; shared vs. dedicated spare resources; pre-planned vs. dynamic route calculation; centralised vs. distributed.]

Figure 9-3: Framework for multi-layer survivability.

Multi-layer recovery strategies may differ on a case-by-case basis. Different strategies may be defined for different failure types. In general, the set of failures is divided into a set of anticipated failures (i.e. the most probable failures, such as cable cuts), which are covered by the survivability strategy, and unexpected failures, which are not taken into account, e.g. for budget reasons. The spare capacity pools are designed and deployed for the anticipated failures and will not generally be able to provide full recovery from unexpected failures.

Comparisons of possible alternative approaches (see figure 9-3) also need to consider the particular network and traffic model, and the single-layer recovery options that are in place. Survivability schemes currently installed in the network may restrict the possible recovery approaches available to operators. A definitive comparison between different solutions has to be validated by results from a network evaluation process, where the approaches are measured in a quantitative way. More specifically, it is recommended that they should be evaluated in terms of investment cost and recovery performance. This can be done using computer-aided network design and simulation tools. Some results for particular network scenarios can be found in [21]-[25].

Although recovery across multiple layers may seem complex, it may exploit the protection resources available in the network more efficiently. Recovery across multiple layers (and the associated escalation strategy) is thus very well qualified as a “second line of defence” against unexpected, catastrophic failures.

PANEL makes the following observations concerning the alternatives for multi-layer survivability:

Recovery at the lowest layer seems to be the most suitable approach for a fast and effective recovery from the more troublesome failures like cable cuts. This is borne out by the growing interest in optical recovery, which becomes inherently attractive as the network throughput increases [11]. On the other hand, recovery at the highest layer may be better suited when the reliability requirements are highly differentiated and have to be tailored to each client. In fact, when client and server network layers are operated by different companies, the client layer operator may prefer to use its own survivability schemes over unprotected server paths, instead of relying on the higher “availability” of protected server paths.
Recovery at the highest layer is recommended:
• if multiple reliability grades are to be provided to services,
• if recovery inter-working is not implemented (operator policy/equipment constraints),
• if the experience of the operator and/or product maturity (standardisation, recovery schemes, equipment) is higher in the client layer.

This approach is not recommended:
• if the failure scenario is too complex for efficient sharing of protection resources,
• if the failure scenario is too complex for the design of protection resources,
• if it is impossible to share protection resources between client and server demands.

Recovery at the highest layer is not applicable if the server layer cannot ensure that the physical working resources of the client are disjoint from its spare resources.

The lowest layer recovery strategy is recommended:
• if the number of entities to recover is limited/reduced,
• if the experience of the operator and/or product maturity (standardisation, recovery schemes, equipment) is higher in the server layer,
• if the client layer equipment is more expensive than the server layer equipment.

In these cases, the lowest layer recovery approach gets more value out of the client layer equipment and avoids using expensive client layer capacity for the rather “mundane” task of protecting against lower layer failures. When routing policy and/or client demands lead to a high number of client transits, protection selectivity or (better) common pool enables a cost-effective implementation of the lowest layer recovery approach. Lowest layer recovery needs co-ordination of recovery schemes: hold-off times are useful for inter-working of protection schemes, and these times should be provisioned on an individual connection basis. The failure condition should be continuously monitored for the full duration of the hold-off time before switching occurs.
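A hold-off timer of the kind described above can be sketched as follows. The sketch assumes that the client layer samples the failure condition at discrete instants; the function name and the sample format are hypothetical. Switching is triggered only if the failure persists for the whole hold-off period, so a fault that the server layer repairs in time causes no client layer action.

```python
def client_switch_time(samples, hold_off_ms):
    """Return the time at which the client layer switches, or None.

    samples: (time_ms, failure_present) pairs in increasing time order.
    """
    failure_start = None
    for time_ms, failure_present in samples:
        if not failure_present:
            failure_start = None      # condition cleared: restart monitoring
            continue
        if failure_start is None:
            failure_start = time_ms   # failure first observed
        if time_ms - failure_start >= hold_off_ms:
            return time_ms            # persisted for the full hold-off time
    return None


# Server layer recovers within the hold-off time: the client does nothing.
print(client_switch_time([(0, True), (20, True), (40, False)], 50))  # None
# Failure persists: the client switches once 50 ms have elapsed.
print(client_switch_time([(0, True), (25, True), (50, True)], 50))   # 50
```

Because `hold_off_ms` is an argument of each call, the value can differ per connection, in line with the recommendation to provision hold-off times individually.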
Another inter-working scheme, called the recovery token [16], [12], is based upon the transfer of an explicit OAM message between server and client layer. The recovery token speeds up recovery, but it is still an experimental concept. The hold-off time based inter-working mechanism seems to be the best compromise between recovery performance and implementation complexity. In the case of sequential inter-working, the recovery should start in the layer with the fastest recovery scheme and, if they are equally fast, it should start in the lowest layer.

Regarding specific layer network architectures:

In SDH-based-ATM networks:
• flexibility at the VC-4 layer (for connections supporting ATM demands) offers the benefit of maximising fibre utilisation,
• separate support of the ATM layer using STM-1 DFs offers a more cost-effective implementation of the highest layer recovery approach.

In WDM-based-SDH networks where WDM technology is only deployed in a part of the core network (evolutionary scenario):
• survivability should only be shifted towards the optical layer when the WDM layer is well developed and adopted in major parts of the network,
• the disadvantage of leaving the protection completely in the SDH layer is that it is difficult to obtain a good utilisation of the capacity provided by the optical paths,
• the disadvantage of shifting the protection completely to the WDM layer (without any SDH protection in WDM parts) is that the SDH paths are protected in a segmented way,
• multi-layer survivability offers the highest degree of reliability, but may require a substantial amount of (protection) resources when implemented in a traditional way,
• protection selectivity in the SDH layer may offer large cost savings, both in SDH and WDM equipment,
• more savings can be achieved by supporting working and spare SDH resources differently in the WDM layer (protection selectivity or, better, common pool).

With respect to the role of the overlying management system:
• Restoration through management activities is well suited as a second line of defence (see also section 10).
• It is recommended that the management systems of the different layers should be integrated for alarm correlation purposes and for co-ordinating the centralised restoration actions.
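The sequencing rule for sequential inter-working stated earlier in this section (start in the layer with the fastest recovery scheme and, if several are equally fast, in the lowest layer) can be expressed compactly. The layer names, their ordering and the recovery times below are illustrative assumptions only, and the function name is hypothetical.

```python
def first_recovery_layer(recovery_time_ms, layer_order):
    """Pick the layer in which sequential recovery starts.

    recovery_time_ms: mapping of layer name to expected recovery time.
    layer_order: layer names listed from lowest (server) to highest (client).
    """
    # min() returns the first minimum it meets, so scanning the layers from
    # lowest to highest breaks ties in favour of the lowest layer.
    candidates = [layer for layer in layer_order if layer in recovery_time_ms]
    return min(candidates, key=lambda layer: recovery_time_ms[layer])


layers = ["WDM", "SDH", "ATM"]
print(first_recovery_layer({"WDM": 50, "SDH": 50, "ATM": 200}, layers))  # WDM
print(first_recovery_layer({"SDH": 60, "ATM": 20}, layers))              # ATM
```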

10. Integration of Survivability with Other Network Functions

The cost-effective operation of a communication network involves optimising the usage of network resources so as to maximise network throughput whilst meeting the QoS service level agreements with the users. Survivability should not be tackled in isolation from the other network functions. The functionality involved in the deployment of the survivability strategy should be closely integrated with the other network operational functions (e.g. connection admission control, routing) and management functions (e.g. network planning and resource dimensioning), in order to maintain acceptable levels of network throughput under fault conditions. The reconfiguration of the network due to the actions taken by the recovery mechanisms may influence the overall throughput of the network considerably, and hence survivability should influence the design of all other network operation and management functions.

The ACTS project REFORM [39] has developed a functional model for ensuring cost-effective network operation under QoS constraints and changes in topology or traffic conditions [34], [35]. From a functional perspective, survivability needs to be tightly coupled with network routing and resource management functions. Network routing functions should adapt themselves to topology changes, avoiding damaged routes. Resource management functions should take into account the fact that protection resources might be allocated in the network. For instance, dynamic resource reservation and management functions should be constrained by the available protection resources in addition to QoS and physical link capacity constraints. Dynamic management of defined resources should also ensure that the required protection resources can be made available in the network.

The coupling of survivability and dynamic routing and resource management functions should be seen from the following angle. While survivability schemes aim at recovering the service of existing traffic, dynamic routing and resource management functions aim at ensuring desired levels of network availability to subsequent traffic (e.g. by avoiding damaged network areas).

With respect to routing under fault conditions:
• Use of source-node routing is recommended to avoid loops in routing during the period immediately after a fault.
• The activities of the restoration schemes should be transparent to the operation of the routing and resource management functions, to avoid protocol overhead and possible transient errors.
• Restoration activities should avoid altering resource identifiers used for routing.

Equally important from a functional perspective is the relationship between the network survivability mechanisms and the overlying network management functions. Both fault management and configuration management play a role in survivability (e.g. planning and configuration of protection resources required by the network to achieve the desired level of integrity). There needs to be an active interplay between (distributed) survivability schemes and network management functions in the fault and configuration management areas. Network management should not be involved in the restoration process triggered when faults are encountered, as the objective at this stage is fast restoration. The involvement of network management functions should be seen from the point of view of complementing and supporting the actions of the restoration mechanisms triggered to recover existing traffic. In particular:
• Configuration management functions are required for designing and configuring the amount of protection resources required in the network and for operating the network at the desired level with a reduced set of resources.
• Fault alarm surveillance functions are required for network administration reasons and for triggering automatic reconfiguration actions to ensure that the network will be able to withstand future faults (normalisation).
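
The coupling between routing, resource management and protection resources described in this section can be illustrated with a minimal sketch. It is not the REFORM design: the `Link` class, the `source_route` function and the bandwidth figures are hypothetical names and values chosen for illustration. The sketch assumes that the source node knows the state of every link, computes the complete route itself (source-node routing), skips failed links, and never lets new working traffic consume bandwidth reserved for protection.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Link:
    capacity: float           # physical link capacity (illustrative units)
    working: float = 0.0      # bandwidth already allocated to working traffic
    protection: float = 0.0   # bandwidth reserved as protection resources
    failed: bool = False

    def can_admit(self, demand: float) -> bool:
        # New working traffic must fit without touching protection bandwidth.
        return (not self.failed
                and self.working + demand <= self.capacity - self.protection)


def source_route(links, src, dst, demand):
    """Return an explicit node list from src to dst, or None if none exists.

    `links` maps (u, v) node pairs to Link objects, treated as bidirectional.
    The whole route is computed at the source (breadth-first search), so
    intermediate nodes take no hop-by-hop decision that could form a loop.
    """
    adjacency = {}
    for (u, v), link in links.items():
        if link.can_admit(demand):
            adjacency.setdefault(u, []).append(v)
            adjacency.setdefault(v, []).append(u)

    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in adjacency.get(node, []):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical three-node network; all figures are made up.
example_links = {
    ("A", "B"): Link(capacity=155, working=100, protection=50),
    ("A", "C"): Link(capacity=155, working=20, protection=50),
    ("C", "B"): Link(capacity=155, working=20, protection=50),
}
route = source_route(example_links, "A", "B", demand=10)
```

In this example the direct link A-B has only 5 units left once its protection reservation is respected, so a demand of 10 is routed via C. Marking the link C-B as failed leaves no admissible route, and the request would be rejected rather than allowed to consume protection bandwidth.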

11. Survivability Architectures and Means

Survivability mechanisms and related architectures are currently under study by relevant standardisation bodies (section 12). Architectures for fault management have been proposed by several bodies (ITU-T, ISO/OSI TMN, TINA, ATMF). However, the issue of fault recovery is not explicitly considered in these architectures and they need to be extended to cover restoration aspects (tolerance of services, protection resources). To date, the issue of fault management in optical networks has received little attention.

The ACTS project REFORM has realised a system for ATM network survivability and availability, based on the ITU-T, TMN and TINA architectures [34], [35], [39]. The project MISA [59] has addressed integrated fault management across ATM and SDH networks.

Existing architecture-related specifications of MIBs (by ISO/OSI, ITU-T TMN and X.xxx recommendations, ATM M4, TINA NRA) provide information models for general fault management purposes (e.g. alarm reporting); however, they lack support for restoration activities (e.g. protection resources).

ITU-T Recommendation I.610, which specifies standard means for fault detection in IBCNs through the use of OAM (Operations and Maintenance) cells, needs to be revised to cater for the restoration requirements of multi-media services. The current specification results in reliable fault detection in 3.5 seconds, leading to restoration times in the order of seconds, which will probably not meet the out-of-service tolerance of multi-media services. I.610 assumes OAM continuity check cells to be a ‘fallback’ mechanism, for use when the lower level would not adequately handle fault detection and escalation. This should be configurable.


Reliable fault detection mechanisms with configurable fault detection targets would facilitate optimising the trade-off between the benefits of the survivability strategy (tolerable restoration time) and the incurred cost (overhead).
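
The trade-off between detection time and overhead can be made concrete with a small calculation. The sketch below is illustrative only. It assumes that continuity check cells are sent at a fixed interval on an otherwise idle connection and that a fault is declared after a period equal to 3.5 intervals without a cell, which reproduces the 3.5 second figure quoted above for a one-second interval. The function name `cc_tradeoff` and its parameters are hypothetical. The only fixed fact used is the ATM cell size of 53 octets.

```python
ATM_CELL_BITS = 53 * 8  # an ATM cell is 53 octets long


def cc_tradeoff(interval_s, missed_intervals=3.5, connections=1):
    """Return (detection_time_s, overhead_bit_per_s) for continuity checking.

    interval_s        -- assumed spacing of continuity check cells, in seconds
    missed_intervals  -- assumed number of silent intervals before a fault
                         is declared (3.5 matches the figure in the text)
    connections       -- number of monitored connections sharing the link
    """
    detection_time_s = interval_s * missed_intervals
    overhead_bit_per_s = connections * ATM_CELL_BITS / interval_s
    return detection_time_s, overhead_bit_per_s


# Compare a one-second interval with a hypothetical 100 ms interval
# for 1000 monitored connections.
slow = cc_tradeoff(1.0, connections=1000)
fast = cc_tradeoff(0.1, connections=1000)
```

Under these assumptions, shortening the interval by a factor of ten reduces the detection time from 3.5 s to 0.35 s, while the worst-case overhead for 1000 idle connections rises from 424 kbit/s to 4.24 Mbit/s. A configurable detection target lets the operator choose a point on this curve per service class.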

12. On-going Standardisation

12.1. ATM Forum

The ATM Forum view on network architecture (according to the PNNI version 1.0 specification [60]) does not explicitly require the establishment of a VP infrastructure prior to switched VCC provisioning. Therefore, the protection architecture relies either on physical layer protection switching mechanisms (e.g. SDH) or on explicit re-establishment of the service itself (in PNNI, either the switched VC or the switched VP). The latter is currently being specified in the ATM Forum for PNNI version 2. More information can be found in the draft PNNI version 1.0 addendum for fault tolerance in PNNI networks. An overview of the protection architecture in PNNI-based networks is provided in [36].

12.2. ITU-T

The scope of the ITU-T recommendations includes APS (1:1, 1+1, m:n), SHN (Self-healing Networks) and RCN (Re-configurable Networks) in ATM networks. In particular, VP/VC 1:1 APS has been discussed first, as a fundamental scheme in ATM [12], [56]. In the ITU-T view of network architecture, switched VC services may be protected by applying protection switching methods at the underlying VP infrastructure.

Recommendation I.630 addresses the concepts of individual protection for ATM VC and VP connections, and of group protection for a number of connections, applied to bi-directional 1+1 and 1:1 ATM protection architectures. In addition, the recommendation defines a 1-phase protocol, which allows protection switching to be completed using a single transmission direction of the co-ordination message. An overview of ITU-T studies related to survivability is provided in [38].
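
The difference between the 1+1 and 1:1 architectures mentioned above can be sketched as a small state model. This is a generic illustration under simplifying assumptions: it models a single protected connection with one working and one protection entity, and it does not reproduce the I.630 co-ordination protocol or its message formats. The class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProtectedConnection:
    architecture: str            # "1+1" or "1:1"
    working_ok: bool = True
    protection_ok: bool = True

    def selected_entity(self):
        """Entity from which the sink end takes the traffic, or None."""
        if self.working_ok:
            return "working"
        if self.protection_ok:
            return "protection"
        return None

    def transmit_entities(self):
        """Entities on which the source end sends the protected traffic."""
        if self.architecture == "1+1":
            # Traffic is permanently bridged onto both entities.
            return {"working", "protection"}
        # 1:1: traffic is carried on one entity at a time.
        selected = self.selected_entity()
        return {selected} if selected else set()

    def protection_free_for_extra_traffic(self):
        # Only in 1:1 is the protection entity idle while the working
        # entity is healthy, so it may carry pre-emptible extra traffic.
        return (self.architecture == "1:1"
                and self.working_ok and self.protection_ok)


# Hypothetical usage: a 1:1 protected connection before and after a fault.
conn = ProtectedConnection("1:1")
before = conn.transmit_entities()
conn.working_ok = False
after = conn.transmit_entities()
```

In the 1:1 case the source end has to move the traffic when the working entity fails, which is why the two ends need a co-ordination message. In the 1+1 case the traffic is already present on both entities, at the cost of permanently occupying the protection bandwidth.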

13. Conclusions

This Guideline has consolidated relevant work undertaken by ACTS projects on the subject of network survivability. Issues covering the whole spectrum of network survivability have been addressed and recommendations made, notably in the areas of ATM and multi-layer network survivability. These recommendations are based on practical experience gained through implementations and trials undertaken by ACTS projects. Up-to-date information on standardisation activities with respect to the issues addressed has also been provided.

Network survivability is a difficult multi-objective optimisation problem. A number of topics still remain open, as can be seen from the on-going efforts in standards bodies and the extensive literature. Particular topics, on which studies have only recently started, include:
• definition of standards for survivability (architecture, mechanisms),
• optimum specification of survivability strategies,
• interoperation of the fault management functions across different network technology layers and their integration with routing and resource management functions.

It is expected that survivability will grow in importance in the immediate future as the bandwidth-delay product of optical broadband networks increases and service quality is introduced.

14. References

[1] T.-H. Wu, “Emerging Technologies for Fiber Network Survivability”, IEEE Communications Magazine, vol. 33, no. 2, pp. 58-74, 1995.

[2] D. Johnson, “Survivability strategies for broadband networks”, In Proc. of IEEE Globecom’96 conference, London, 1996.

[3] J. Manchester and P. Bonenfant, “Fiber Optic Network Survivability: SONET/Optical Protection Layer Interworking”, In Proc. of NFOEC’96, Denver, CO, 1996.

[4] L. Nederlof et al., “End-to-end survivable broadband Networks”, IEEE Communications Magazine, vol. 33, no. 9, pp. 63-70, 1995.

[5] T. Noh, “End-to-end self-healing SDH/ATM Networks”, In Proc. of IEEE Globecom’96 conference, London, 1996.

[6] P. Veitch, D. Johnson and I. Hawker, “Design of Resilient Core ATM Networks”, In Proc. of IEEE Globecom’97 conference, Phoenix, 1997.

[7] T.-H. Wu and N. Yoshikai, “ATM Transport and Network Integrity”, Academic Press, 1997.

[8] K. Struyve et al., “Design and Evaluation of Multi-layer Survivability for SDH-based ATM Networks”, In Proc. of IEEE Globecom’97 conference, Phoenix, 1997.

[9] K. Van Doorselaere et al., “Fault Propagation in WDM-based SDH/ATM Networks”, DRCN’98, Brugge, Belgium, 1998.

[10] K.-I. Sato, “Advances in Transport Network Technologies, Photonic Networks, ATM and SDH”, Artech House, London, 1996.

[11] O. Gerstel, “Opportunities for Optical Protection and Restoration”, Proc. of OFC’98, San Jose, 1998.

[12] ITU-T Draft Recommendation I.630 (ex I.ps), “ATM Protection Switching”, Geneva, June 1998.

[13] ETSI Document ETS 300 746, “Transmission and Multiplexing; SDH Network Protection schemes; Automatic Protection Switch (APS) Protocols and Operation”, February 1997.

[14] M. Gryseels et al., “Common Pool Survivability in ATM on SDH Ring Networks”, DRCN98 Workshop, Brugge, Belgium, 1998.

[15] M. Gryseels et al., “A Cost Evaluation of Service Protection Strategies in ATM on SDH Transport Networks”, DRCN98 Workshop, Brugge, Belgium, 1998.

[16] A. Autenrieth et al., “Simulation and Evaluation of Multi-single Layer Broadband Networks”, DRCN98 Workshop, Brugge, Belgium, 1998.

[17] A. Jajszczyk, C. Brianza and D. Janukowicz, “TMN-based Management of Multi-layer Communication Networks”, DRCN98 Workshop, Brugge, Belgium, 1998.

[18] ITU-T Recommendation G.805, “Generic Functional Architecture of Transport Networks”, 1995.

[19] ITU-T Recommendation G.841, “Types and Characteristics of SDH Network Protection Architectures”, November 1997.

[20] ITU-T SG13 Contribution, NTT, “Scheme of (1:1) VP/VC Automatic Protection Switching (VP/VC-APS)”, Geneva, Switzerland, July 1995.

[21] M. Gryseels and P. Demeester, “A Multi-layer Planning Approach for Hybrid SDH-based ATM Networks”, 6th International Conference on Telecommunication Systems, Modeling, and Analysis, Nashville, 1998.

[22] R. Kawamura, K.-I. Sato and I. Tokizawa, “Self-healing ATM Networks based on the Virtual Path Concept”, IEEE JSAC, vol. 12, no. 1, January 1994.

[23] R. Kawamura and I. Tokizawa, “Self-healing Virtual Path Architecture in ATM Networks”, IEEE Comm. Mag., vol. 33, no. 2, September 1995.

[24] M. Pickavet, M. Gryseels and P. Demeester, “A Zoom-in Algorithm for the Design of SDH Networks with Multiple Reliability Classes”, 6th International Conference on Telecommunication Systems, Modeling, and Analysis, Nashville, 1998.

[25] K. Struyve et al., “Design and Evaluation of Distributed Link and Path Restoration Algorithms for ATM Meshed Networks”, IZS’96, Zurich, 1996.

[26] K. Struyve et al., “Design and Evaluation of an Accelerated ATM Backup Virtual Path Recovery Protocol”, IEEE ATM’97 Workshop, Lisboa, 1997.

[27] Imai K., Honda T., Kasahara H., Ito T., “ATMR: Ring Architecture for Broadband Networks”, Proceedings of IEEE GLOBECOM’90, December 1990.

[28] Kawamura R., Sato K.-I., Tokizawa I., “Self-Healing ATM Network Techniques Utilizing Virtual Paths”, Networks ’92, Kobe, Japan.

[29] Kawamura R., Hadama H., Tokizawa I., “Implementation of Self-healing Function in ATM Networks Based on Virtual Path Concept”, Proceedings Infocom 95, 3b.1, pp. 303-311.

[30] Nederlof L., Struyve K., O’Shea C., Misser H., Du Y., Tamayo B., “End-to-end Survivable Broadband Networks”, IEEE Communications Magazine, September 1995.

[31] Gareis R., Heywood P., “Tomorrow’s Network Today”, Data Communications, September 1995, pp. 55-65.

[32] Bellcore, “Digital Cross-Connect Systems in Transport Network Survivability”, SR-NWT-002514, Issue 1, January 1993.

[33] Veitch P., Hawker I., Smith G., “Administration of Restorable Virtual Path Concepts”, IEEE Communications Magazine, December 1996.

[34] T’Joens Y., Georgatsos P., Sartzetakis S., Ranc D., “Integrated Dynamic Routing, Load Balancing and Survivability in ATM-based IBCN”, NOC’97, vol. II, pp. 287-295.

[35] T’Joens Y., Georgatsos P., Georgiadis L., “Network Reliability in ATM based IBCN, a Functional Description of the REFORM System”, DRCN’98, May 1998.

[36] T’Joens Y., Sales B., De Neve H., Van Mieghem P., “Fault Tolerance in a Hierarchically Structured Dynamic Routing Environment”, DRCN’98, May 1998.

[37] T’Joens Y., “A Contingency Model for Survivability Strategy Determination at the ATM Layer”, NOC’98, May 1998.

[38] Bonnifait M., Ohta H., Manchester J., “ITU-T Ongoing Studies on ATM Protection Switching”, DRCN’98, Brugge, Belgium, Proceedings, O12.

[39] REFORM, On line information: http://www.algo.com.gr/acts/reform

[40] Vandenhoute M., Ester G., T’Joens Y., “Restoration Alternatives for Optical and SONET/SDH-based IP Networks”, Alcatel Telecom Review, 1999 (to be published).

[41] Jaffe J., Moss M., “A Responsive Distributed Routing Algorithm for Computer Networks”, IEEE Trans. Comm., Vol. 30, No. 7, July 1982.

[42] Merlin P., Segall A., “A Fail-safe Distributed Routing Protocol”, IEEE Trans. Comm., Vol. 27, No. 9, Sept. 1979.

[43] Segall A., “Advances in Verifiable Fail Safe Routing Procedures”, IEEE Trans. Comm., Vol. 29, No. 4, April 1981.

[44] Tajibnapis W., “A Correctness Proof of a Topology Information Maintenance Protocol for a Distributed Computer Network”, Comm. of ACM, Vol. 20, No. 7, July 1977.

[45] Pogunke W., “On the Design of Communication Networks using Restricted Message Routing”, IEEE Trans. Comm., 1993.

[46] Oki E., Yamanaka N., Pitcho F., “Multiple Availability Level ATM Network Architecture”, IEEE Comm. Magazine, Sept. 1995.

[47] Medhi D., “A Unified Approach to Network Survivability for Teletraffic Networks: Models, Algorithms and Analysis”, IEEE Trans. Comm., Vol. 42, No. 2/3/4, Feb./March/April 1994.

[48] Medhi D., Khurana R., “Optimisation and Performance of Network Restoration Schemes for Wide-Area Teletraffic Networks”, Journal of Network and Systems Management, Vol. 3, No. 3, pp. 265-294, Sept. 1995.

[49] Veitch P., Johnson D., “ATM Network Resilience”, IEEE Network, Sept./Oct. 1997.

[50] Ayanoglu E., Gitlin R., “Broadband Network Restoration”, IEEE Comm. Magazine, July 1996.

[51] Anderson J., Doshi B., Dravida S., Harshavardhana P., “Fast Restoration of ATM Networks”, IEEE Journal on Selected Areas in Communications, Vol. 12, No. 1, Jan. 1994.

[52] Kawamura R., “Architectures for ATM Network Survivability”, IEEE Comm. Surveys, 4th quarter 1998, Vol. 1, No. 1, 1998.

[53] Veitch P., Hawker I., “Administration of Restorable Virtual Path Mesh Networks”, IEEE Comm. Magazine, Dec. 1996.

[54] Wu T.-H., “Fiber Network Service Survivability”, Artech House, 1992.

[55] ACTS Programme, On line information: http://www.infowin.org

[56] ITU-T, “Draft Rec. (I.ps): ATM Network Survivability Architecture and Mechanisms”, Feb. 1997.

[57] PANEL, On line information: http://intec.rug.ac.be/www/u/panel

[58] ITU-T, “Recommendation I.610: B-ISDN Operation and Maintenance Principles and Functions”, 1995.

[59] MISA, On line information: http://www.misa.ch

[60] ATM Forum, “Private Network-Node Interface, version 1.0”, af-pnni-055.000.