An Integrated Scheme for Establishing Dependable Real-time Channels in Multihop Networks Sriram Raghavan Dept. of Computer Science Stanford University, Stanford, CA 94305, USA ([email protected]
G. Manimaran Dept. of Electrical and Computer Engineering Iowa State University, Ames, IA 50011, USA ([email protected]
C. Siva Ram Murthy Dept. of Computer Science and Engineering, Indian Institute of Technology, Madras 600036, INDIA ([email protected]
Abstract The issue of providing fault-tolerance in real-time communication has been a problem of growing importance. There are two basic approaches for satisfying fault-tolerant requirements in real-time communication: (i) forward error recovery approach and (ii) detect and recovery approach. The first approach is well-suited for hard real-time communication, whereas the second approach is well-suited for soft real-time communication. Neither of these basic approaches is well-suited for applications which involve both hard and soft real-time communication. In this paper, we propose an integrated scheme that combines the benefits of both the basic approaches. The proposed scheme not only caters to such mixed communication requirements, but also improves the call acceptance rate significantly due to its efficient resource allocation mechanisms such as traffic dispersion and backup multiplexing. The effectiveness of the proposed scheme has been evaluated through extensive simulation studies.
This work was done when the authors were at the Indian Institute of Technology, Madras, India.
1. Introduction
Packet switched data networks are increasingly being utilized for carrying multimedia traffic such as video and audio, which often require stringent quality of service (QoS) in terms of bounded end-to-end delay, delay jitter, and packet loss. For a network to provide performance guarantees for such multimedia applications, real-time channels are to be established with specified traffic characteristics and QoS requirements. There are two phases involved in handling a real-time channel: channel establishment and run-time scheduling of packets. The channel establishment phase involves the selection of a qualified path for the channel (i.e., a path satisfying the traffic characteristics and QoS requirements of the call request) and the reservation of resources along this path. The run-time scheduler, on the other hand, schedules packets at run-time, adhering to the guarantees provided during channel establishment. Real-time channels provide QoS guarantees as long as the traffic sources obey their traffic specifications and there are no component (link and/or node) faults. The regulation of traffic is usually achieved by traffic shaping at the source nodes, and faults are usually handled either by rerouting the channel around the faulty components or by employing redundant channels. Fault tolerance in real-time communication is a problem of growing importance because the data involved are critical and an outage of even a small interval of time can result in the loss of a large volume of data. This motivates the need for networks that can tolerate faults, or detect and recover from them. The rest of the paper is organized as follows: Section 2 briefly reviews dependable real-time networks. In Section 3, the proposed dependable real-time channel establishment scheme is presented. In Section 4, the performance study of the proposed scheme is presented. Finally, in Section 5, some concluding remarks are made.
2. Dependable Real-time Networks
Dependable real-time networks can be classified into two broad categories: (i) protection based networks and (ii) restoration based networks. In protection based networks, dedicated protection mechanisms are provided to cope with faults. These networks use the forward recovery approach. In restoration based networks, on the occurrence of a fault, an attempt is made to acquire the resources necessary for restoring the channel from the fault. These networks use the detect and recovery approach.
2.1. Forward Recovery Approach
The forward recovery approach is an example of a protection mechanism in which multiple copies of a packet are sent simultaneously over disjoint paths to mask the effect of faults. The advantage of this approach is that it handles faults transparently, without service disruption. However, it incurs extra cost in terms of bandwidth usage.
Traffic Dispersion in Forward Recovery Approach: Traffic dispersion is a mechanism in which a channel is split into multiple sub-channels and the traffic is dispersed along these sub-channels in parallel. The sub-channels of a channel all have the same bandwidth, and the sum of their bandwidths equals the bandwidth of the channel. Since the bandwidth requested by each sub-channel is low compared to the channel bandwidth, the chances of establishing many small bandwidth sub-channels are much higher than the chances of establishing a single high bandwidth channel, which improves the call acceptance rate. When dispersity routing is combined with the forward recovery approach, extra sub-channels are used for fault tolerance, and the paths of the sub-channels should be disjoint in order to tolerate faults. The dispersity based forward recovery approach, called the (N, K, S) system, is formally stated as follows:
Dispersity (N): A channel is split into N sub-channels. A packet is split into N sub-packets and all sub-packets are sent in parallel, one per sub-channel.
Redundancy (N-K): N-K sub-channels carry redundant information to achieve fault recovery. A certain number of sub-packets can be corrupted or lost without affecting the decoding of the message. The number depends on N, K, and the error-correcting code used. It can be no larger than N-K.
Disjointness (S): S is the maximum number of sub-channels of a channel that can share a given link.
When it becomes impossible to determine enough disjoint paths to meet the dispersity requirements, sharing of links by sub-channels proves to be very useful in improving the call acceptance rate. A channel with attributes (N, K, S) can tolerate up to $\lfloor (N-K)/S \rfloor$ faults transparently, if maximum distance separable codes are used. The bandwidth used by an (N, K, S) system is approximately $N/K$ times the bandwidth required by an equivalent non-fault-tolerant channel. The redundant sub-channels of a dispersity system can be operated in hot or cold standby mode. In the case of hot standby, extra sub-channels carry forward error correction (FEC) information during normal operation. For cold standby, extra sub-channels are used only in the event of a fault.
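The two quantities stated for an (N, K, S) system can be checked with a few lines of arithmetic. The sketch below is illustrative only: it reads the fault bound as $\lfloor (N-K)/S \rfloor$ and the bandwidth factor as $N/K$, and the function names `tolerable_faults` and `bandwidth_overhead` are hypothetical, not taken from the cited work.

```python
def tolerable_faults(n: int, k: int, s: int) -> int:
    """Faults masked transparently by an (N, K, S) system: floor((N - K) / S).

    Assumes a maximum distance separable code, as stated in the text.
    """
    if not (1 <= k <= n) or s < 1:
        raise ValueError("require 1 <= K <= N and S >= 1")
    return (n - k) // s


def bandwidth_overhead(n: int, k: int) -> float:
    """Approximate bandwidth relative to a non-fault-tolerant channel: N / K."""
    if not (1 <= k <= n):
        raise ValueError("require 1 <= K <= N")
    return n / k


if __name__ == "__main__":
    # Five sub-channels, two of which carry the message, fully disjoint paths.
    print(tolerable_faults(5, 2, 1))
    # Same system, but up to two sub-channels may share a link.
    print(tolerable_faults(5, 2, 2))
    print(bandwidth_overhead(5, 2))
```

Allowing link sharing (larger S) makes path selection easier but lowers the number of faults that can be masked, since one link failure can now take out several sub-channels at once.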
2.2. Detect and Recovery Approach
In the case of restoration based networks, on the occurrence of a fault, an attempt is made to acquire the necessary resources, such as bandwidth and buffers, for restoring the channel around the faulty components with minimal disruption in service. This approach is useful when occasional packet losses due to transient faults are tolerable and restoration is required only when a permanent fault is detected. The primary-backup scheme is one of the most important schemes in this category.
Primary-backup Channel Scheme: A real-time channel is dependable if it has the desired number of backup channels, i.e., a dependable channel (D-connection or D-channel) consists of one primary channel and one or more backup channels acting as cold standby. The bandwidth of each backup channel is the same as that of its primary channel. A backup channel is activated on detection of a fault in any network component along the path of the primary channel. The two main issues here are: (i) backup channel establishment, before the occurrence of faults, which involves selecting routes for the backup channels (backup route selection) and assigning bandwidth to them (spare resource allocation), and (ii) fault detection and backup activation. The D-connections are set up and torn down dynamically. The aim of any primary-backup channel scheme is to establish dependable channels that minimize the spare resource reservation, thereby improving the call acceptance rate.
(i) Backup Route Selection: The problem of routing backup channels so as to optimize spare resource allocation is known to be NP-complete. Therefore, heuristic cost functions and least-cost routing are used for setting up paths. A cost function that takes backup multiplexing into account gives better performance, but evaluating it requires information about the routing paths of all the primary channels in the entire network. Therefore, the backup route selection is centralized and is usually done at the source node.
(ii) Spare Resource Allocation with Backup Multiplexing: This is a resource sharing technique used in the primary-backup channel scheme wherein, on each link, only a very small fraction of the resources, such as bandwidth, are allocated for all backup channels using that link. Such a link is said to be multiplexed. Typically, the bandwidth assigned to a multiplexed link is the maximum among the bandwidth requirements of all the backup channels multiplexed onto it. The conditions for two or more backup channels to be multiplexed on a link are: (i) the paths of the primary channels of the multiplexed backup channels must be disjoint and (ii) at most one of the primary channels of the multiplexed backup channels can fail at any point of time.
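The spare bandwidth rule for a multiplexed link can be sketched as follows. This is a minimal illustration under a single-link fault model, not the algorithm of the cited primary-backup work: for every link whose failure would activate some of the backups, it sums the bandwidth of those backups and reserves the worst case. When all primary paths are link-disjoint, this reduces to the maximum backup bandwidth, as described above. The function name `spare_bandwidth` and the string link identifiers are assumptions made for the example.

```python
from collections import defaultdict


def spare_bandwidth(backups):
    """Spare bandwidth to reserve on one link shared by several backup channels.

    backups: list of (primary_links, bandwidth) pairs, one per backup channel
    routed over this link; primary_links is the set of links on the path of
    the corresponding primary channel.

    Assumption: at most one link fails at a time (single-link fault model).
    """
    demand = defaultdict(float)  # failed link -> backup bandwidth it activates
    for primary_links, bandwidth in backups:
        for link in set(primary_links):
            demand[link] += bandwidth
    return max(demand.values(), default=0.0)


if __name__ == "__main__":
    # Disjoint primaries: the two backups can never be active together.
    disjoint = [({"a-b", "b-c"}, 2.0), ({"d-e"}, 3.0)]
    print(spare_bandwidth(disjoint))

    # Primaries sharing link a-b: one fault can activate both backups.
    overlapping = [({"a-b"}, 2.0), ({"a-b", "x-y"}, 3.0)]
    print(spare_bandwidth(overlapping))
```

Without multiplexing, both cases would reserve the full sum of 5.0 units, so the saving comes entirely from backups whose primaries cannot fail together.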
2.3. Related Work
Dispersity routing coupled with error correction coding schemes reported in  is an example of the forward recovery approach. It is too expensive for soft real-time applications such as multimedia applications, wherein occasional packet losses can be tolerated. The methods proposed in - belong to the detect and recovery approach. The method proposed in  does not allocate resources in advance for the channels. This results in a larger restoration time, but less resource overhead in the absence of faults. The approach proposed in  provides guaranteed fault recovery under a single fault model. This method guarantees faster restoration, but has more resource overhead. The primary-backup scheme reported in  uses backup multiplexing to minimize the spare resources. The approaches proposed in - are virtual path (VP) restoration methods in ATM networks based on the backup VP concept. In these approaches, the fault-tolerance level of each connection is not individually controllable.
2.4. Motivation and Objectives
The requirement of supporting both hard and soft real-time communication with fault-tolerance is quite common in many real-world applications such as industrial process control and air traffic control. Real-time channels with hard guarantees cannot tolerate the loss of even a single packet, whereas real-time channels with soft guarantees can tolerate occasional packet losses. Hard real-time communication is supported by deterministic channels, which guarantee an absolute delay bound, whereas soft real-time communication is supported by statistical channels. The forward recovery approach is inefficient for statistical channels because it establishes backup channels in hot standby mode, which is unnecessary in soft real-time communication. This leads to a poor call acceptance rate. Similarly, the detect and recovery approach is unsuitable for deterministic channels since it does not provide zero restoration time and therefore loses packets, which is not acceptable in hard real-time communication. This motivates the need for integrating the forward error recovery approach and the detect and recovery approach in a single fault-tolerant real-time channel establishment scheme in order to support both hard and soft real-time communication requirements. The objective of our work is to propose a dependable real-time channel establishment scheme for dynamically establishing D-channels in multihop real-time networks which (i) supports both hard and soft real-time communication and (ii) provides control over the fault-tolerance level of each D-channel. This involves route selection and bandwidth assignment for both primary and backup channels.
3. The Proposed Scheme
To realize the above objective, we propose a scheme that combines the bandwidth splitting and FEC advantages of the dispersity based forward recovery approach with the efficient spare resource allocation achieved by backup multiplexing used in the detect and recovery approach. The proposed scheme is characterized by a 3-tuple (Nm, Nfec, Nb), where Nm denotes the number of message sub-channels (without redundancy). This means that a given packet is split into Nm sub-packets and transmitted in parallel on these Nm sub-channels. Nfec denotes the number of (redundant) FEC sub-channels. Nfec is a function of the dispersity level (Nm) and the fault model used (i.e., whether single or multiple simultaneous faults are to be handled). In the case of the single fault model, it can be shown that for successful error recovery, Nm and Nfec must obey the inequality $(N_m + N_{fec} + 1) \le 2^{N_{fec}}$. This restriction places a lower bound on the value of Nfec for a given value of Nm. For example, Nm = 2 gives a value of Nfec = 3. Error-correcting codes that achieve this lower bound for all values of Nm are known. For tolerating more than one fault, a similar inequality exists. Nb denotes the number of backup sub-channels. The methods used for calculating Nb and for backup multiplexing are discussed below. In the proposed scheme, the backup sub-channels operate in cold standby mode to make multiplexing possible, and the FEC sub-channels operate in hot standby mode to provide error correction capability. The message and FEC sub-channels are together referred to as primary sub-channels (Nm + Nfec). The bandwidth to be allocated for each sub-channel (referred to as sub-channel bandwidth), whether primary or backup, is $1/N_m$ of the total bandwidth specified by the channel establishment request. The proposed scheme has the flexibility of being reduced to either of the two basic approaches by a suitable choice of Nm, Nfec, and Nb. For example, when Nb = 0, the proposed scheme reduces to the dispersity based forward recovery approach. Similarly, when Nm = 1 and Nfec = 0, the proposed scheme reduces to the primary-backup channel approach.
Choice of Number of Backup Sub-channels (Nb): The choice of Nb in the proposed scheme depends on the following two factors: (i) the nature of the fault model used to characterize permanent faults and (ii) whether the redundancy, usually used to counter transient faults, is also used to overcome permanent faults. For example, consider a certain system that is capable of handling one transient fault through the use of FEC sub-channels. Suppose the fault model used for permanent faults
is the double-link fault model. If a single permanent fault occurs in one of the primary sub-channels of this connection, a backup sub-channel can be used to counter this fault and the system can continue to enjoy single transient fault recovery. Now, suppose one more permanent fault occurs in one of the primary sub-channels. There are two possible ways of countering this fault: (i) use the inherent FEC capability of the system to overcome this fault and do not use one more backup sub-channel (this means that the system will no longer be capable of overcoming transient faults) or (ii) to retain the capability for transient fault recovery, use one more backup sub-channel to overcome this second permanent fault. Clearly, depending on the importance of recovering from transient faults (which is dictated by the application), the number of backup sub-channels could be either equal to, or less than, the maximum number of permanent faults allowed by the fault model. This gives added flexibility to the system and can be used to choose the number of multiplexed backup sub-channels efficiently. Note that in the pure primary-backup scheme, the number of backup channels is equal to the maximum number of permanent faults allowed by the fault model.
Backup Multiplexing in the Proposed Scheme: The proposed scheme uses the backup multiplexing technique of  to reduce the amount of spare resources reserved for the backup sub-channels. Backup multiplexing is based on the idea that it is possible to determine whether two backup channels will be brought into action simultaneously. Given the paths of the primary channels, if it can be shown that, under a given fault model, two particular backup channels will never be activated together, then the two can share the bandwidth on their common links.
In the proposed scheme, we have used a modified version of the backup multiplexing algorithm proposed in , to calculate the amount of spare resources to be allocated for the backup sub-channels of the various connections. In addition to the functionalities of the original backup multiplexing algorithm , our modified algorithm takes traffic dispersion also into account.
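The parameter relationships of an (Nm, Nfec, Nb) system can be made concrete with a short sketch. It assumes that the single-fault condition on Nfec is the Hamming-style bound $2^{N_{fec}} \ge N_m + N_{fec} + 1$, which agrees with the example Nm = 2, Nfec = 3 given in this section, and it takes the sub-channel bandwidth to be $1/N_m$ of the requested bandwidth. The function names and the dictionary layout are hypothetical, and the backup figure is the reservation before any multiplexing is applied.

```python
def min_fec_subchannels(n_m: int) -> int:
    """Smallest Nfec with 2**Nfec >= Nm + Nfec + 1 (assumed single-fault bound)."""
    if n_m < 1:
        raise ValueError("Nm must be at least 1")
    n_fec = 1
    while 2 ** n_fec < n_m + n_fec + 1:
        n_fec += 1
    return n_fec


def reserved_bandwidth(channel_bw: float, n_m: int, n_fec: int, n_b: int) -> dict:
    """Bandwidth reserved by an (Nm, Nfec, Nb) system, before backup multiplexing."""
    if n_m < 1 or n_fec < 0 or n_b < 0:
        raise ValueError("require Nm >= 1, Nfec >= 0, Nb >= 0")
    sub_bw = channel_bw / n_m  # every sub-channel gets 1/Nm of the request
    return {
        "sub_channel": sub_bw,
        "primary": (n_m + n_fec) * sub_bw,
        "backup_unmultiplexed": n_b * sub_bw,
    }


if __name__ == "__main__":
    print(min_fec_subchannels(2))
    # A (2, 3, 1) system for a request of 10 bandwidth units.
    print(reserved_bandwidth(10.0, 2, 3, 1))
    # Nm = 1, Nfec = 0, Nb = 1 is the plain primary-backup case.
    print(reserved_bandwidth(10.0, 1, 0, 1))
```

For the (2, 3, 1) case the primary sub-channels together reserve 2.5 times the requested bandwidth, which matches the factor quoted for the (2, 3, 0) system in the performance discussion.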
4. Performance Studies
In this section, we present the results of the simulation experiments that were conducted to analyze the performance of the proposed scheme. In our simulation experiments, the traffic characteristics and QoS requirements of real-time channel requests are generated based on the models used in . The simulation experiments were conducted on randomly generated networks to avoid any influence of network topology features on the performance results. Channel establishment requests with fault-tolerance requirements were supplied one after another to the various simulation networks. Both transient transmission errors and permanent link faults were introduced into the network.
Losses due to transmission errors were simulated by dropping each packet with a given probability¹ on each link. Permanent faults were introduced by randomly choosing a link and disconnecting it from the network. The simulations were performed for a single-link/double-link fault model, at various probabilities of packet loss on the links, and with various background loads in the network. These background loads were generated by setting up varying numbers of simple unicast real-time channels before starting the observation of the fault-tolerant systems. Since all these non-fault-tolerant channels were assigned identical bandwidths, the load was measured simply by the number of such channels set up. To analyze the performance of various fault-tolerant channel establishment schemes, we used the following metrics:
Average Call Acceptance Rate (ACAR): This metric measures the ability of a (fault-tolerant) channel establishment scheme to utilize the network resources efficiently and successfully set up a large number of channels. ACAR for a given scheme is defined as the ratio of the number of calls successfully set up to the total number of call requests received by the network.
Average Recovery Time (ART): The ability of a scheme to react quickly to faults and to activate the necessary recovery mechanisms is an important factor that determines its usefulness. Consider a real-time channel C which encounters a fault on one of its links at time $T_1$. After the fault, let $T_2$ denote the time at which the next packet was successfully received by the destination. Then, the recovery time for this instance of the fault is given by $T_2 - T_1$. The ART metric is defined as the average (over all faults and channels) of all such recovery times.
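Both metrics are simple ratios and averages, as the following sketch shows. It is only an illustration of the definitions above; the function names `acar` and `art`, and the sample numbers in the usage section, are made up for the example and are not simulation results.

```python
def acar(calls_accepted: int, calls_requested: int) -> float:
    """Average call acceptance rate: accepted calls / total call requests."""
    if calls_requested <= 0:
        raise ValueError("at least one call request is required")
    return calls_accepted / calls_requested


def art(fault_events) -> float:
    """Average recovery time over (t1, t2) pairs.

    t1 is the time of the fault; t2 is the time at which the next packet was
    successfully received by the destination.
    """
    recovery_times = [t2 - t1 for t1, t2 in fault_events]
    if not recovery_times:
        raise ValueError("at least one fault event is required")
    return sum(recovery_times) / len(recovery_times)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(acar(45, 60))
    print(art([(0.0, 1.0), (2.0, 2.5)]))
```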
4.1. Results and Discussions
The results of the simulation experiments are summarized in Figures 1 to 4. Figure 1 deals with an interesting feature of dispersity, Figure 2 studies the effect of the backup multiplexing technique, and Figures 3 and 4 analyze the performance of the proposed scheme in terms of the two performance metrics. Unless otherwise mentioned, all sub-channels of a given connection were routed on link-disjoint paths. Similarly, unless specified to the contrary, the fault model is assumed to be the single-link fault model.
Algorithm Labels: The plots are labeled with tuples of the form (x, y, z) that respectively denote the number of message sub-channels, the number of FEC sub-channels, and the number of backup sub-channels. A * suffixed to the tuple means that the scheme used backup multiplexing to allocate resources for the backup sub-channels. For example, the label (1,0,1)* denotes the primary-backup channel scheme with backup multiplexing proposed in . Similarly, the label (2,3,0) denotes the dispersity based forward recovery approach as used in .
1 This probability is directly proportional to the error rate chosen for the simulation run.
Figure 1. Effect of Nm on ACAR for (Nm, 0, 0). [Plot: average call acceptance rate versus number of message sub-channels, with one curve each for load = 70, 90, 110, and 130.]
Figure 2. Effect of backup multiplexing on ACAR. [Plot: average call acceptance rate for (1,0,1), (1,0,1)*, (2,0,1), and (2,0,1)*.]
Figure 3. ACAR of the various schemes. [Plot: average call acceptance rate for (2,3,1)*, (1,0,1)*, (2,3,0), and (2,0,1)*.]
(a) Effect of Number of Message Sub-channels: Figure 1 studies the variation of the ACAR metric with Nm, the number of message sub-channels used by the proposed scheme. The figure plots this variation under different load conditions. For all the experiments that contributed to Figure 1, the proposed scheme was run for the parameters (Nm, 0, 0) with Nm taking integral values in the range [1,6]. Two key observations can be made from the plots in Figure 1. The first is that, as expected, the acceptance rate of the scheme decreases as the background load increases. The second is that, for each load condition, the acceptance rate initially improves with Nm (up to a value of around 4) and then decreases sharply. The initial increase is due to the fact that each sub-channel in a dispersity-based system has a smaller bandwidth requirement than the original channel establishment request, and the requirement gets smaller as Nm increases. This effect contributes to an improvement in ACAR in two ways. First, the level of external fragmentation of bandwidth on the links decreases, since smaller bandwidth channels can be packed more efficiently on the links. Consequently, there is a better chance for a call to be accommodated with the given network resources. Second, the fact that the bandwidth requirement of each individual sub-channel is smaller than the total requirement of the call also has an important effect on the ability of the network to accept future requests. Since the amount of resources used on each path is lower, the effect of the request is spread out over the entire network, so more capacity remains on each individual path as compared to a real-time channel that places all its resource requirements on a single path.
Once again, this helps to improve the chance that a future request will be successfully accommodated. The subsequent decrease in ACAR with increasing Nm is attributable to the limited number of disjoint paths available in the network. The maximum number of link-disjoint paths between nodes in the networks used for these studies was 5, with a majority having only 4 such paths. Consequently, the ACAR dropped steeply once Nm crossed 4.
(b) Effect of Backup Multiplexing: Figure 2 illustrates the effect of backup multiplexing on ACAR. The figure shows two non-multiplexed versions of the proposed scheme, (1,0,1) and (2,0,1), along with the corresponding multiplexed versions, (1,0,1)* and (2,0,1)*. As expected, the acceptance rate of all four versions drops with increasing load. The higher dispersity systems (2,0,1) and (2,0,1)* also outperform the other two versions due to the reasons outlined previously in result (a). More importantly, the application of backup multiplexing produces a marked increase in ACAR at all values of load and for both types of dispersity systems. Clearly, the figure shows that a combination of both dispersity and backup multiplexing (represented by (2,0,1)*) substantially outperforms either pure dispersity (represented by (2,0,1)) or pure backup multiplexing ((1,0,1)*).
(c) ACAR of the Various Schemes: Figure 3 shows the acceptance rates of four versions of the proposed scheme: (1,0,1)* representing pure backup multiplexing, (2,3,0) representing pure dispersity based forward recovery, and (2,0,1)* and (2,3,1)* representing a combination of both. Of the four schemes, (2,0,1)* was by far the best in terms of ACAR. Its better acceptance rate compared with (1,0,1)* is due to the reasons outlined in result (a). The (2,3,0) system, despite using 2.5 times as much bandwidth as a single primary channel², is able to almost match the (1,0,1)* system in performance. This indicates that the bandwidth splitting and load distribution inherent in the dispersity system (with forward recovery) almost offset the increased bandwidth used by the system. The (2,3,1)* system showed slightly lower acceptance rates than the (2,3,0) system because of the bandwidth reserved for the backup sub-channels, but it has better fault-tolerance than (2,3,0).
Figure 4. Recovery time distributions. [Plot: cumulative distribution versus recovery time (ms) for (1,0,2)*, (2,0,2)*, and (2,3,1)*.]
(d) Average Recovery Time Distributions: Figure 4 plots the cumulative distribution³ of the ART metric for three schemes: (1,0,2)*, (2,0,2)*, and (2,3,1)*. All the experiments for this figure were run for the following permanent fault specification: 75% of the faults were single-link faults whereas 25% were double-link faults. To counter the double-link faults, the first two schemes were equipped with two backups each. Since the (2,3,1)* scheme could counter a single fault through the inherent redundancy⁴ of its FEC sub-channels, it was equipped with only one backup sub-channel. Clearly, the (2,3,1)* scheme gives better performance compared to the other two schemes, as it occupies the more favorable upper portion of the plot.
This observation can be explained as follows. For the (1,0,2)* and (2,0,2)* schemes, there is a marked time gap between the detection of a link fault and the activation of the backup sub-channel. This includes the time taken for the fault notification to travel from the faulty component to the source node of the connection, and then the time taken for the source to switch to the backup channel and start sending packets through it. On the other hand, because of its redundancy, the (2,3,1)* scheme can continue to transmit packets successfully even as the notification is being forwarded. The destination node will merely treat the missing packets from the affected sub-channel as a transmission error and apply the necessary error-correction mechanisms. Once the notification message reaches the source, it can switch to the backup channel and continue transmission. Clearly, the recovery time for this scheme is much smaller than that for the other two schemes.
2 Total bandwidth required for the five sub-channels of (2,3,0) = $\frac{2+3}{2}$ times the bandwidth used by a single channel.
3 The term cumulative distribution is to be interpreted as follows: if a point in the graph is represented by the coordinates (x, y), it indicates that the recovery time for y% of the total number of faults was less than x ms.
4 Note that using this redundancy to counter link faults does take away the ability of the system to counter transmission errors, once a fault has actually taken place.
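The reading of the cumulative distribution given in the footnotes (a point (x, y) means that y% of the faults had a recovery time below x ms) can be computed from a list of per-fault recovery times as sketched below. The function name `cumulative_percent` is hypothetical and the sample values are invented purely to show the calculation; they are not taken from Figure 4.

```python
def cumulative_percent(recovery_times_ms, x_ms: float) -> float:
    """Percentage of faults whose recovery time was less than x_ms."""
    if not recovery_times_ms:
        raise ValueError("at least one recovery time is required")
    below = sum(1 for t in recovery_times_ms if t < x_ms)
    return 100.0 * below / len(recovery_times_ms)


if __name__ == "__main__":
    # Hypothetical recovery times in milliseconds, for illustration only.
    samples = [0.5, 0.7, 0.9, 1.1]
    for x in (0.6, 0.8, 1.0, 1.2):
        print(x, cumulative_percent(samples, x))
```

A scheme whose curve lies higher at a given x has recovered from a larger share of faults within x ms, which is why the upper portion of such a plot is the favorable one.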
5. Conclusions
In this paper, we have proposed an integrated scheme for establishing dependable real-time channels in multihop networks that combines the attractive features of the basic fault-tolerant communication approaches. We have also studied the effectiveness of the proposed scheme, through simulation studies, by comparing it with two well known algorithms based on the basic approaches. From our simulation studies, we note that the proposed scheme, which combines dispersity and backup multiplexing, successfully brings together the advantages of both of these basic approaches. From dispersity, the proposed scheme inherits better acceptance rates (due to bandwidth splitting and load balancing) and the ability to counter transmission errors (through FEC techniques). From backup multiplexing, the proposed scheme inherits the ability to reduce resource usage through effective bandwidth sharing among backup channels of different connections.
References
D. Ferrari and D.C. Verma, “A scheme for real-time channel establishment in wide-area networks,” IEEE Journal on Selected Areas in Communications, vol.8, no.3, pp.368-379, Apr. 1990.
H. Zhang, “Service disciplines for guaranteed performance service in packet-switching networks,” Proc. IEEE, vol.83, no.10, pp.1374-1396, Oct. 1995.
E. Gustafsson and G. Karlsson, “A literature survey on traffic dispersion,” IEEE Network, pp.28-36, March/April 1997.
A. Banerjea, “Simulation study of the capacity effects of dispersity routing for fault-tolerant real-time channels,” ACM SIGCOMM, pp.194-205, 1996.
S. Han and K.G. Shin, “A primary-backup channel approach to dependable real-time communication in multihop networks,” IEEE Trans. Computers, vol.47, no.1, pp.46-61, Jan. 1998.
A. Banerjea, C. Parris, and D. Ferrari, “Recovering guaranteed performance service connections from single and multiple faults,” Tech. Rep. TR-93-066, Univ. of California, Berkeley, 1993.
Q. Zheng and K.G. Shin, “Fault-tolerant real-time communication in distributed computing systems,” IEEE Fault-Tolerant Computing Symposium, pp.86-93, 1992.
J. Anderson, B. Doshi, S. Dravida, and P. Harshavardhana, “Fast restoration of ATM networks,” IEEE Journal on Selected Areas in Communications, vol.12, no.1, pp.128-138, Jan. 1994.
R. Kawamura, K. Sato, and I. Tokizawa, “Self-healing ATM networks based on virtual path concept,” IEEE Journal on Selected Areas in Communications, vol.12, no.1, pp.120-127, Jan. 1994.
K. Murakami and H.S. Kim, “Near-optimal virtual path routing for survivable ATM networks,” IEEE INFOCOM, pp.208-215, 1994.
S. Han and K.G. Shin, “Efficient spare-resource allocation for fast restoration of real-time channels from network component failures,” IEEE Real-Time Systems Symposium, 1997.