IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 16, NO. 5, SEPTEMBER 2005
Neural and Fuzzy Computation Techniques for Playout Delay Adaptation in VoIP Networks Mohan Krishna Ranganathan and Liam Kilmartin, Member, IEEE
Abstract—Playout delay adaptation algorithms are often used in real-time voice communication over packet-switched networks to counteract the effects of network jitter at the receiver. Whilst the conventional algorithms developed for silence-suppressed speech transmission focused on preserving the relative temporal structure of speech frames/packets within a talkspurt (intertalkspurt adaptation), more recently developed algorithms strive to achieve better quality by allowing for playout delay adaptation within a talkspurt (intratalkspurt adaptation). The adaptation algorithms, both intertalkspurt and intratalkspurt based, rely on short-term estimations of the characteristics of network delay that would be experienced by upcoming voice packets. The use of novel neural networks and fuzzy systems as estimators of network delay characteristics is presented in this paper. Their performance is analyzed in comparison with a number of traditional techniques for both inter and intratalkspurt adaptation paradigms. The design of a novel fuzzy trend analyzer system (FTAS) for network delay trend analysis and its usage in intratalkspurt playout delay adaptation are presented in greater detail. The performance of the proposed mechanism is analyzed based on measured Internet delays.
Index Terms—Fuzzy delay trend analysis, intertalkspurt, intratalkspurt, multilayer perceptrons (MLPs), network delay estimation, playout buffering, playout delay adaptation, time delay neural networks (TDNNs), voice over Internet protocol (VoIP).
I. INTRODUCTION
THE viability of employing packet switched networks as a transport medium for real-time voice communications has drawn wide interest among both research and commercial communities. Whilst the recent advances in speech codecs and high speed data communication technologies promote the interest in technologies such as voice over Internet protocol (VoIP), widespread commercial deployment of VoIP, for example, has been restricted due to a number of technical challenges posed by the nature of IP networks [1], [2]. The statistical nature of data traffic and the dynamic routing techniques employed in packet-switched networks result in a varying network delay (jitter) experienced by IP packets. As a result, voice packets generated at successive and periodic intervals at the source will typically arrive at the receiver at irregular intervals. One of the compelling challenges is to reconstruct a continuous stream of speech at the destination in the face of a stochastically varying network delay.
Manuscript received November 1, 2003; revised March 22, 2005. This work was supported by Enterprise Ireland and Nortel Networks, Galway under Enterprise Ireland’s Applied Research Programme. M. K. Ranganathan is with Sasken Communication Technologies Limited, Bangalore 560071, India (e-mail:
[email protected]). L. Kilmartin is with the Department of Electronic Engineering, National University of Ireland, Galway, Ireland (e-mail:
[email protected]). Digital Object Identifier 10.1109/TNN.2005.853418
The generally accepted solution to smooth out the effects of jitter is to buffer the received audio packets before playing them out in their temporal sequence of generation [3]. The playout of received audio packets from this buffer is postponed by a “certain” amount of time, to allow subsequent longer delayed packets to arrive at the receiver ahead of their scheduled playout times. Those packets which still do not arrive within their delayed playout schedules are deemed lost and are discarded. Loss of heavily delayed packets along with the loss of any packets while in transit through the network result in discontinuities in the resultant voice stream, thus, adversely affecting the perceived quality of the output speech. While minimizing network packet loss could prove to be an arduous task, a judiciously chosen buffering time, or the additional delay introduced before playout, can significantly reduce the late packet loss rate, thereby improving the perceived voice quality. However, if the selected buffering time also results in a considerable increase in one-way delay, it will have an unwanted detrimental effect on the two-way conversation and the overall communication could eventually be effectively forced into a half-duplex mode. The accepted one-way delay for approximate toll quality speech is no more than 150 ms [4] but an end-to-end one way delay of 150 ms to 400 ms is generally acceptable for long-haul voice communications over packet-switched networks. In general, the objective of any buffer management algorithm would be to achieve an optimal balance in the tradeoff between the additional delays introduced versus the late packet loss rate. The playout delay of the voice packets needs to be continuously adapted in order to maintain the desired balance between late packet loss and tolerable additional delay over the entire duration of the voice call. Two different paradigms for adapting the playout delay exist. 
The first (and most commonly implemented) solution is suited for use in silence suppressed speech transmission scenarios, where the playout delay is set for individual talkspurts. Using an estimate of the network delay of upcoming voice packets, the playout delay is varied only at the beginning of a new talkspurt, resulting in either compression or expansion of silent periods while the temporal structure of packets within a talkspurt is maintained intact [5]. Such mechanisms will henceforth be referred to as “intertalkspurt” playout delay adaptation methods. Recently, there has been a shift in focus from the conventional talkspurt level playout delay adaptation methods to more advanced adaptation “within-talkspurts” (i.e., intratalkspurt) mechanisms as demonstrated in [6] and [7]. The methods presented in [6] and [7] make use of time-scale modification algorithms to alter the playout length of individual speech
1045-9227/$20.00 © 2005 IEEE
RANGANATHAN AND KILMARTIN: NEURAL AND FUZZY COMPUTATION TECHNIQUES
packets so as to adapt these speech segments to the varying network delays within a talkspurt while still maintaining a relatively smooth voice playout. The efficiency of the playout delay adaptation process, of both intertalkspurt and intratalkspurt-based paradigms, relies heavily on the quality of estimation of the network delay to be experienced by future voice packets. A wide variety of network delay estimation techniques have been proposed and Laoutaris and Stavrakakis [8] have provided a comprehensive survey of these existing mechanisms. Soft computing (SC) technology, encompassing information processing mechanisms such as neural networks, fuzzy computing, neuro-fuzzy computing, genetic algorithms, expert systems, etc., has found wide applicability in solving various real-world engineering and control problems and has generally been shown to perform better than traditional solutions. For example, the use of neural networks in econometric estimations and general time series prediction [9] has been well explored. In relation to playout delay adaptation, the problem of network delay estimation is quintessentially a time series estimation application and, thus, warrants the use of soft-computing techniques. Intratalkspurt-based playout delay adaptation operates on a more closely bound “additional delay versus packet loss” tradeoff compared to the intertalkspurt case. Hence, the intratalkspurt scenario requires more accurate estimates of network delay, for example network delay estimates as provided by Concord algorithms [10]. This paper presents a novel fuzzy trend analyzer system (FTAS)-based playout adaptation process specifically for the intratalkspurt scenario, and simulation results have shown promising improvement in the “delay-versus-loss” tradeoff of this new mechanism when compared to the Concord algorithms.
The design of the FTAS is presented in greater detail in this paper along with a comprehensive perceptual-based analysis comparing the performance of FTAS and Concord mechanisms for intratalkspurt playout delay adaptation. Speech samples covering a wide range of speakers and languages extracted from the speech sample database [11] provided by the International Telecommunication Union (ITU) were used in the analysis which was performed using measured Internet delay traces. The formulation and notations of the playout delay adaptation problem are presented in Section II. Section III focuses on the application of neural network techniques for the intertalkspurt playout adaptation scenario, wherein the use of multilayer perceptron (MLP) neural networks and time delay neural networks (TDNNs) is presented followed by a comparative simulation-based analysis of the discussed methods. The concept of intratalkspurt playout delay adaptation is detailed in Section IV. Section V is devoted to the detailed description of the novel FTAS designed specifically for use in the intratalkspurt playout delay adaptation scenario. The proposed use of FTAS is supported by extensive simulation-based analysis. The algorithm evaluation methodology and the results comparing the performance of FTAS to that of Concord algorithms are described in Section VI. The merits of employing soft-computing techniques and their drawbacks are discussed in Section VII.
II. PLAYOUT DELAY ADAPTATION CONCEPT
The media processing at the source terminal of a VoIP conversation begins with the real-time segmentation of the digitised voice signal into fixed length frames suitable for transmission over IP (packet-switched) networks. Voice codecs such as G.711, G.723, G.729, etc. are generally used to compress the speech frames prior to transmission so as to conserve network bandwidth. Thus, speech frames are generated periodically at intervals equivalent to the duration of speech held in each frame. The real-time transport protocol (RTP) [12] is used to append timing and sequence information to individual speech frames and the RTP encapsulated packets are ultimately dispatched to their destination as UDP payloads. Another network bandwidth conservation technique employed in VoIP networks is silence suppression. Conversational speech is composed of durations of active voice (talkspurts) interleaved with silence intervals. Packets are transmitted from the voice source during talkspurts only and the transmission is suspended during any silent periods. The timing information appended to the transmitted packets is sufficient to reconstruct the original talkspurt-silence structure of the utterance at the receiver terminal. En route to the destination, the voice packets are multiplexed with other data packets on the network and will typically encounter varying queueing delays at network entities. The packets could also be dropped while traversing heavily congested networks. Dynamic routing techniques used in modern network routers could force the voice packets pertaining to the same voice stream to travel by different network paths to the destination. Depending upon the nature of background data traffic on the network, the delays experienced by voice packets could show larger variations, either gradual or abrupt.
The variation in network delays (jitter) and the amount of packet loss are generally unpredictable and depend on the short-term network conditions. Consequently, the significant challenge is to reconstruct a continuous and smooth voice stream at the receiver from the irregularly received voice packets. The voice stream reconstruction is generally achieved by employing a buffering mechanism at the receiver. The received voice packets are held in a buffer for a certain amount of time before playout so as to absorb the effects of network jitter. The voice packets are then played out from the buffer sequentially at the same regular intervals at which they were generated at the source. As a result, the duration between the instance at which a packet is generated at the source and its instance of playout from the receiver, termed playout delay, would be the same for all voice packets. Buffered playout of voice packets can also be regarded as a mechanism that translates a varying network delay experienced by voice packets to a constant end-to-end playout delay. The additional delay, introduced by the buffering process, allows any packets which experience longer delays to arrive in time for their playout from the buffer. Heavily delayed packets that do not arrive even within their delayed playout schedules are deemed lost and discarded. Suitable packet reconstruction techniques [13] have to be employed to counter the effect of packet loss in order to produce a relatively smooth voice playout. The playout delay of voice packets is a highly critical factor for improving quality of service (QoS) since larger playout
delays may minimize the late packet loss rate but conversely would also prove detrimental to the two-way conversational quality. The goal of the buffering mechanism is essentially to achieve an optimal balance between the acceptable late packet loss rate and the acceptable additional delay over the entire duration of the voice transmission. A fixed playout delay would suffice for the entire call duration if the network delay characteristics were known a priori. However, in practice, network delays are observed to be highly nondeterministic and are expected to vary significantly over a call’s duration. The more efficient buffering mechanisms used in such cases continuously monitor the network delay and respond to significant changes in the observed network delay by adapting the playout delay. A number of traditional solutions performed the adaptation of playout delay only at the beginning of a talkspurt in silence suppressed speech transmission scenarios (intertalkspurt-based methods). Once adapted, the playout delay is maintained the same for all packets within an individual talkspurt. The resulting expansion or compression of silence intervals in the reconstructed speech is considered to be more tolerable [3]. A number of algorithms, and improvements thereupon, have been proposed [8] and one of the seminal solutions is the linear recursive filter (LRF)-based solution [5]. In describing the intertalkspurt playout delay adaptation algorithms, the notations depicted in Fig. 1 are used for the various time instances and durations associated with playout delay adaptation of voice packets. Though the time axis shown in Fig. 1 suggests that the sender and receiver clocks are synchronised, no such assumptions are made by any of the playout delay adaptation algorithms reviewed in this paper. The algorithms presented are concerned only with the variation in network delay for their operation and the absolute timings are not significant.
Fig. 1. Timing diagram demonstrating intertalkspurt playout delay adaptation. The $i$th packet of the $k$th talkspurt, generated at the source at instance $t_i^k$, is shown to arrive at the receiver at instance $a_i^k$ after experiencing a network delay of $n_i^k$. The time instance at which the packet is played out from the buffer is given as $p_i^k$. The playout delay, $pd^k = p_i^k - t_i^k$, is fixed the same for all packets within the $k$th talkspurt. The playout delay is adapted at the beginning of the talkspurts, as required, depending on the current network delay conditions. As an indication of the playout delay adaptation, the playout delay of one talkspurt is shown to be larger than that of the other.
The linear filtering process is applied on the network delays experienced by received voice packets in order to obtain a continuous smoothed estimate of network delay and an associated estimate of the variation in network delay. The estimate of network delay $\hat{d}_i$ is computed as shown in (1) and the computation procedure of the associated variation in network delay $\hat{v}_i$ is given in (2)
$$\hat{d}_i = \alpha\,\hat{d}_{i-1} + (1-\alpha)\,n_i \tag{1}$$
$$\hat{v}_i = \alpha\,\hat{v}_{i-1} + (1-\alpha)\,\left|\hat{d}_i - n_i\right| \tag{2}$$
where $n_i$ is the network delay experienced by the $i$th received packet and $\alpha$ is the filter coefficient.
The filter coefficient $\alpha$ controls the responsiveness of the filter and different values for $\alpha$ are suggested in [5]. The playout time $p_1^k$ of the first packet of the $k$th talkspurt is computed as shown in (3). The rest of the packets of that talkspurt are played out in their temporal sequence with the same playout delay, as determined in (4). The value of the constant $\beta$ in (3) is again algorithm dependent
$$p_1^k = t_1^k + \hat{d} + \beta\,\hat{v} \tag{3}$$
$$p_j^k = p_1^k + \left(t_j^k - t_1^k\right) \tag{4}$$
where $\hat{d}$ and $\hat{v}$ are the most recent estimates of the network delay and of its variation obtained from (1) and (2).
In [5], four different algorithms are proposed, of which only the first two are considered here for discussion. Algorithm 1 is the simplest, and uses (1) to (4) directly. A value of 0.998002 is suggested for the filter coefficient $\alpha$ and a value of 4 for the constant $\beta$ in (3).
Algorithm 2 is a suggested improvement over algorithm 1 and it suggests the use of different filter coefficients for the increasing and decreasing network delay cases. It is desirable for the filter to respond quickly to increases in the network delay whereas responding more slowly to decreases in the network delay. To this end, a filter coefficient of $\alpha = 0.75$ is suggested for the case when the network delay is found to be greater than the previous estimate, while the original value of $\alpha = 0.998002$ is retained for the case where the network delay input to the filter is less than the previous estimate. The constant $\beta$ in (3) is set equal to 4. Algorithm 2 is identified as the LRF-based algorithm in the rest of this paper.
III. NEURAL NETWORK BASED INTERTALKSPURT PLAYOUT DELAY ADAPTATION SOLUTIONS
Following the discussions presented in the previous section, the network delay estimator can be identified as the fundamental component of an intertalkspurt playout delay adaptation system. The issue of network delay estimation can be distinguished as an application of general time series analysis techniques. Time series analysis usually involves predicting or estimating the future values of the series based on knowledge of past values (history).
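As an illustration of the LRF-based algorithm of Section II, the following Python sketch applies the recursive filter to a stream of network delays and derives talkspurt playout times. It is a minimal sketch under stated assumptions rather than the implementation of [5]: the class and function names are hypothetical, the filter is seeded with the first observed delay, and the variation estimate is assumed to always use the slower coefficient.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LrfEstimator:
    """Hypothetical sketch of the LRF-based delay estimator (Algorithm 2)."""
    alpha: float = 0.998002      # coefficient used when the delay does not exceed the estimate
    alpha_up: float = 0.75       # coefficient used when the delay exceeds the estimate
    beta: float = 4.0            # multiplier of the variation estimate
    d_hat: Optional[float] = None  # smoothed network delay estimate
    v_hat: float = 0.0           # smoothed estimate of the delay variation

    def update(self, n: float) -> None:
        """Feed the network delay n of one received packet into the filter."""
        if self.d_hat is None:
            # Assumption: seed the filter with the first observed delay.
            self.d_hat = n
            return
        a = self.alpha_up if n > self.d_hat else self.alpha
        self.d_hat = a * self.d_hat + (1.0 - a) * n
        # Assumption: the variation estimate always uses the slow coefficient.
        self.v_hat = self.alpha * self.v_hat + (1.0 - self.alpha) * abs(self.d_hat - n)

    def playout_delay(self) -> float:
        """Playout delay to apply to the next talkspurt."""
        if self.d_hat is None:
            raise ValueError("no delay samples observed yet")
        return self.d_hat + self.beta * self.v_hat


def talkspurt_playout_times(send_times: List[float], playout_delay: float) -> List[float]:
    """Shift the first packet by the playout delay and keep the original spacing."""
    first = send_times[0] + playout_delay
    return [first + (t - send_times[0]) for t in send_times]
```

The estimator is updated on every received packet, whereas `playout_delay()` is read only at the start of a talkspurt, which mirrors the intertalkspurt constraint that packet spacing inside a talkspurt is left untouched.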
It is well acknowledged that the future values of a time series can be modeled as a function of the current value and past values of the same series, and that a close approximation of the function can be realized using a set of examples of past and future values of the time series [14]. However, arriving at a reliable approximation of the function which governs the relationships between the future and past values of a time series by analytical means can prove to be a challenging task. The capabilities of self-learning systems, such as neural networks, are widely acknowledged as being very applicable to this class of problem. Tien and Yuang [15] have demonstrated the use of neural networks in adapting the playout of voice frames in ATM networks, wherein an MLP network was employed to predict the frame characteristics of an upcoming voice talkspurt. In contrast to the MLP architecture, the TDNN is a more recently proposed form of neural network suited for temporal-based applications. This paper proposes the use of a time delay neural network for intertalkspurt delay estimation and compares its performance against that of the MLP based and conventional LRF-based techniques for playout delay adaptation outlined in [16].
TABLE I PLAYOUT DELAY ADAPTATION TIMING NOTATIONS
Fig. 2. Procedure for estimating the mean and standard deviation of network delay for the next block of packets. The “estimator” block is implemented using either the MLP neural network or a TDNN.
In the proposed predictive playout delay adaptation method, the intertalkspurt playout delay adaptation is performed based on generating a running estimate of the mean and standard deviation of network delay experienced by voice packets. The received packets are conceptually grouped into blocks and the mean and standard deviation of the network delay are computed for each block of packets. In order to keep the effect of noisy variations to a minimum and to obtain useful measures, the block size is kept long and blocks are allowed to overlap. A block size of 100 packets with an overlap of 75 packets was used in the analysis of these mechanisms. Different overlap sizes were evaluated, but no visible impact on the performance of the estimation methods was observed. The notations used in describing the playout delay adaptation procedure are given in Table I. A suitable estimation mechanism is used to predict the mean and standard deviation of network delay (as shown in Fig. 2) based on the same data obtained for the past blocks. The playout delay of the first packet of the $k$th talkspurt is set by
$$pd^k = \lambda\,\hat{\mu} + \beta\,\hat{\sigma} \tag{5}$$
where $\lambda$ and $\beta$ are appropriately chosen constants and $\hat{\mu}$ and $\hat{\sigma}$ are the most recent estimates of the mean and standard deviation of network delay computed over blocks of packets. $\lambda$ is typically set to 1 whilst the value for $\beta$ is varied to suit particular scenarios. The rest of the packets within the talkspurt are replayed according to their relative temporal position within the talkspurt.
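The block-based statistics and the playout delay rule described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function and parameter names are hypothetical, the population standard deviation is assumed (the text does not state which form was used), and the estimates passed to the playout rule would in practice come from the MLP or TDNN estimator rather than from the raw block statistics.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple


def block_statistics(delays: Sequence[float],
                     block_size: int = 100,
                     overlap: int = 75) -> List[Tuple[float, float]]:
    """Mean and standard deviation of network delay over overlapping blocks."""
    step = block_size - overlap  # a new block starts every 25 packets by default
    stats = []
    for start in range(0, len(delays) - block_size + 1, step):
        block = delays[start:start + block_size]
        # Assumption: population standard deviation of the block.
        stats.append((fmean(block), pstdev(block)))
    return stats


def playout_delay(mean_est: float, std_est: float,
                  mean_weight: float = 1.0, beta: float = 4.0) -> float:
    """Weighted sum of the estimated mean and standard deviation of delay."""
    return mean_weight * mean_est + beta * std_est
```

Sweeping `beta` while holding `mean_weight` at 1 traces out the delay-versus-loss tradeoff for a given estimator.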
The different alternatives proposed for the implementation of the estimator of Fig. 2 are the MLP neural network and the TDNN.
A. Neural Network Estimation Architectures
1) MLP Neural Network Solution: The configuration of the MLP network used as a network delay estimator for the purpose of the playout delay adaptation of voice packets is shown in Fig. 3. The MLP neural network is used to estimate the mean and standard deviation of network delays that would be experienced by the “next” block of voice packets, based on the same parameters computed for a specified number of previous blocks of packets. The size of the history, i.e., the number of past blocks used for the estimation process, needs to be suitably configured. If a large pool of previous statistics is used, the estimations might be unable to follow short-term variations, whereas a short history would result in the estimations following short-term variations in network delay characteristics more closely than is suitable for playout delay adaptation. A history size of 3 to 5 blocks was determined to be a suitable range after much experimentation. A three layered MLP architecture with 12, 12, and 2 neurons in each layer, respectively, was used in the analysis experiments. The inputs to the network were obtained from the current and past three blocks of packets (i.e., a history size of 4). At the beginning of each block, prior to estimating the network delay statistics of the upcoming block, the network is trained with information from the directly previous block being considered as the desired network output. The standard error back-propagation algorithm [17] is used in sequential mode for training the network with a learning rate of 0.01. Delay information from the four directly previous blocks is used as the input for training. A concern with neural network-based methods is that, initially, until the network trains and converges to the desired signal within some error limit, the output estimates deviate largely and are not suitable for playout adaptation purposes. In such a scenario, a fixed playout delay is suggested for initial talkspurts, until the neural network converges to the desired prediction range.
Fig. 3. MLP-NN-based network delay statistic estimation architecture.
2) TDNN Solution: TDNNs were developed recently for speech processing applications [18], [19]. The main advantage of the TDNN has been its ability to interpret temporal relationships. TDNNs are in essence feedforward networks, similar to MLP networks, but are distinguished by the presence of tapped delay lines on every synaptic input of each neuron. The
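architecture of a single TDNN neuron unit is shown in Fig. 4. Each of the delayed input lines has an individually associated weight which needs to be adapted while training the neural network. The output of the TDNN neuron is, thus, given by
$$y(n) = \varphi\!\left(\sum_{i=1}^{M}\sum_{d=0}^{D} w_{i,d}\,x_i(n-d)\right) \tag{6}$$
where $x_i(n-d)$ is the $i$th synaptic input delayed by $d$ time steps, $w_{i,d}$ is the weight assigned to that delay tap, $M$ is the number of synaptic inputs, $D$ is the number of delay taps, and $\varphi(\cdot)$ is the activation function of the neuron.
Fig. 4. Architecture of TDNN neuron unit showing the input delay taps. An individual synaptic weight is assigned to each of the input delay taps.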
As an effect of the presence of tapped input lines, the outputs of a layer of neurons in a TDNN depend not just on the current output of the previous layer, but also on the $D$ past outputs, where $D$ is the number of delay taps. In essence, TDNNs appear to be more suitable for time series analysis, and in particular TDNNs are expected to provide more accurate estimations of the network delay statistics. The training process of TDNNs is more complex compared to the inherently static MLP networks. Much of the complexity is due to the presence of tapped-delay lines on the inputs of neurons in all layers. An elegant solution for training the TDNN is the temporal back-propagation algorithm which is suggested in [9] for training a finite impulse response (FIR) network. The FIR network developed by Wan [9] is another functionally equivalent structure to the TDNN. The temporal back-propagation algorithm has been published in different forms [20], [21] and the procedure for adapting synaptic weights used in this research is given in [21] (labeled algorithm IC-1). A learning rate of 0.01 is used in the training procedure. The TDNN architecture used for estimating network delay characteristics is similar to the MLP architecture described in Section III-A.1. However, as the TDNN inherently has a memory of the previous input values, it is not required to supply inputs from a window over past blocks of packets. Thus, the mean and standard deviation of network delay computed over only one block of packets are fed as the inputs to the TDNN estimator shown in Fig. 5. The network structure used in the performance analysis consisted of three layers with 12, 12, and 2 neurons in each layer. In addition, the individual layers incorporated 3, 4, and 3 input time delays, respectively.
B. Algorithm Comparison Methodology
1) Voice Packet Trace Simulation: In order to analyze and compare the performance of the neural network-based playout
Fig. 5. Architecture of TDNN-based network delay statistic estimator.
delay adaptation algorithms and the traditional LRF-based approach, an offline simulation scheme is used. The MLP, TDNN, and LRF approaches are used to process packets from the same RTP voice packet trace. A set of four different RTP voice packet traces that are constructed based on packet network delays measured on different Internet transmission paths and an additional RTP trace constructed based on simulated network delays are used in the analysis of the adaptation algorithms. Internet packet delays were measured by transmitting packet streams from a host located at National University of Ireland, Galway (NUIG), Ireland to two other hosts, the first of which is located at University of New South Wales (UNSW), Sydney, Australia, and the other at Dublin City University (DCU), Ireland. The details of the four different measured network delay traces obtained are given in Table II. While performing the delay measurements, the sender and receiver clocks were not synchronised and, hence, the measured network delays are relative. However, the measured delays reflect the true variation in network delay. In addition, a simulated network delay trace with known statistical parameters is also used in the performance analysis of the intertalkspurt buffering algorithms. A shifted gamma distribution function is chosen for simulating individual packet network delays following the discussion on Internet delay statistics presented in [22]. The employed procedure for simulating packet network delays is given by
$$n_i = d_{\mathrm{fixed}} + g_i \tag{7}$$
where $d_{\mathrm{fixed}}$ is a fixed value reflecting the fixed component of the network delay. $g_i$ is a random value generated using the gamma probability distribution function with the shape parameters $a$ and $b$. The random values represent the variable component of delay experienced in real networks. In the analysis presented, the network delay generation parameters used are $d_{\mathrm{fixed}} = 50$ ms, $a = 25.0$, and $b = 0.004$.
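A shifted gamma delay generator of this kind can be sketched with the Python standard library. The sketch below is an illustration only, and it rests on an assumption that the text does not settle: the two gamma parameters are interpreted as a shape of 25.0 and a scale of 0.004 s, so that the variable component has a mean of 100 ms. The function name and the `seed` argument are hypothetical.

```python
import random
from typing import List, Optional


def simulate_network_delays(count: int,
                            fixed_ms: float = 50.0,
                            shape: float = 25.0,
                            scale_s: float = 0.004,
                            seed: Optional[int] = None) -> List[float]:
    """Per-packet network delays in ms: fixed component plus a gamma variate."""
    rng = random.Random(seed)
    # Assumption: the gamma variate is drawn in seconds (shape * scale = 0.1 s)
    # and converted to milliseconds before the fixed delay is added.
    return [fixed_ms + 1000.0 * rng.gammavariate(shape, scale_s)
            for _ in range(count)]
```

Passing a fixed `seed` makes a simulated trace reproducible, which is convenient when the same trace has to be replayed through several adaptation algorithms.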
RTP voice packet traces are simulated based on the talkspurt-silence model suggested by Sriram and Whitt [23], wherein the talkspurt lengths are exponentially distributed with a mean of 352 ms and the silence durations are exponentially distributed with a mean of 650 ms. RTP packet traces are constructed by combining a simulated sequence of talkspurt
Fig. 6. Results from the performance analysis of the proposed intertalkspurt adaptation mechanisms, based on the simulated network delay trace. Note that the buffering delay curves, shown in (a), for MLP and TDNN-based methods coincide with each other. (a) Additional buffering delay comparison. (b) Packet loss rate comparison. (c) Additional delay versus packet loss tradeoff.
TABLE II INTERNET DELAY TRACES
and silence lengths with the measured and the simulated network delay traces. It is to be noted that no silence suppression was assumed during the network delay measurement process. The traces described in Table II were obtained for packets transmitted continuously every 20 or 40 ms. When constructing the RTP packet trace based on a particular measured network delay trace, the recorded delays from that network delay trace corresponding to the silent periods of the RTP packet trace are
ignored. All five constructed RTP packet traces comprised 300 talkspurts.
2) Performance Analysis Metrics: The efficiency of a playout delay adaptation algorithm is mainly quantified in terms of the tradeoff between the average additional delay introduced due to buffering and the reduction in late packet loss rates achievable.
1) Additional Buffering Delay: The additional delay introduced due to buffering, computed for every successfully played packet, is defined as the duration between the instance of arrival of the packet at the receiver and its playout, i.e., $p_i^k - a_i^k$. The mean buffering delays within each talkspurt are averaged over all the talkspurts and represent the average additional buffering delay metric.
2) Late Packet Loss Rate: The late packet loss rate within each talkspurt is computed as the ratio of the number of packets arriving late for playout (i.e., packets for which $a_i^k > p_i^k$) to the total number of packets in the talkspurt. Thus, computed talkspurt loss rates are averaged
Fig. 7. Results from the performance analysis of the proposed intertalkspurt adaptation mechanisms, based on the NUIG-DCU (20 ms) recorded network delay trace. Note that the buffering delay curves shown in (a) for MLP and TDNN-based methods coincide with each other. (a) Additional buffering delay comparison. (b) Packet loss rate comparison. (c) Additional delay versus packet loss tradeoff.
over all the talkspurts in the trace and reflect the average late packet loss rate for the trace.
The average values of the performance metrics are computed over only those talkspurts that are adapted after the neural network has converged within a desired prediction error limit, so as to restrict the analysis to the estimation capabilities of the neural network-based methods.
C. Analysis Results and Discussion
Fig. 6 illustrates the performance of the three intertalkspurt playout delay adaptation methods, i.e., the LRF-based method, the MLP network-based method and the TDNN-based method, for the RTP voice packet trace based on simulated network delays. The graphs shown in Fig. 6(a) and (b) are obtained by varying the buffering coefficient, $\beta$, of (3) and (5). As seen from Fig. 6(a), the LRF-based method introduces very high buffering delay compared to both the neural network-based methods, which interestingly show very close performance to each other. Fig. 6(b) compares the performance of the neural network-based methods with that of the conventional LRF-based method for the late packet loss rate performance
metric. Though the neural network methods demonstrate higher packet loss rates (for the lower values of buffering coefficient) the loss rates offered by the neural network methods are lower than those which are experienced when using the LRF method (for large values of the buffering coefficient). The TDNN-based method shows better performance than the MLP-based method. The graphs of Fig. 6(a) and (b) are combined together and represented in Fig. 6(c). The graph of Fig. 6(c) illustrates more clearly, the tradeoff between additional delay and packet loss rates offered by the different estimation methods. It is evident from this illustration that the TDNN method provides the best relative performance among the three intertalkspurt playout delay adaptation methods. Figs. 7 to 10 illustrate the results obtained through a similar comparison of the performance of the intertalkspurt playout delay adaptation algorithms for the four different RTP voice packet traces based on measured Internet packet delays. The performance behavior of the neural network and LRF-based intertalkspurt playout delay adaptation algorithms for the measured network delays is highly similar to their behavior demonstrated by the analysis based on simulated network delay trace.
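The two performance metrics used in this comparison can be computed from the arrival and playout instances of each packet. The following Python sketch shows one way of doing so; the function names are hypothetical, a packet is assumed to be late only when it arrives strictly after its playout instance, and talkspurts in which every packet is late are assumed to contribute to the loss rate but not to the buffering delay average.

```python
from typing import List, Optional, Sequence, Tuple


def talkspurt_metrics(arrivals: Sequence[float],
                      playouts: Sequence[float]) -> Tuple[Optional[float], float]:
    """Mean buffering delay of played packets and late loss rate of one talkspurt."""
    # Assumption: a packet arriving exactly at its playout instance is played.
    played = [(a, p) for a, p in zip(arrivals, playouts) if a <= p]
    late = len(arrivals) - len(played)
    mean_buffering = (sum(p - a for a, p in played) / len(played)
                      if played else None)
    return mean_buffering, late / len(arrivals)


def trace_metrics(talkspurts: Sequence[Tuple[Sequence[float], Sequence[float]]]
                  ) -> Tuple[float, float]:
    """Average the per-talkspurt metrics over all talkspurts of a trace."""
    results = [talkspurt_metrics(a, p) for a, p in talkspurts]
    buffering = [b for b, _ in results if b is not None]
    losses = [loss for _, loss in results]
    return sum(buffering) / len(buffering), sum(losses) / len(losses)
```

Evaluating `trace_metrics` for a range of buffering coefficients yields the points of a delay-versus-loss tradeoff curve such as that of Fig. 6(c).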
Fig. 8. Results from the performance analysis of the proposed intertalkspurt adaptation mechanisms, based on the NUIG-DCU (40 ms) recorded network delay trace. Note that the buffering delay curves shown in (a) for MLP and TDNN-based methods coincide with each other. (a) Additional buffering delay comparison. (b) Packet loss rate comparison. (c) Additional delay versus packet loss tradeoff.
The graphs of Figs. 7(c), 8(c), 9(c), and 10(c) comprehensively illustrate that the TDNN-based method provides the best relative performance among the three discussed intertalkspurt playout delay adaptation algorithms.

IV. INTRATALKSPURT PLAYOUT DELAY ADAPTATION

The intertalkspurt playout delay adaptation methods, reviewed in previous sections, focused entirely on preserving the original uniform spacing and periodic nature of voice packets within a talkspurt. The performance of such mechanisms is heavily dependent on the reliability of short term network delay estimations. For example, efficient intertalkspurt adaptation requires an estimate of delay which would be slightly greater than the network delays to be experienced by all (or most of) the packets belonging to the upcoming talkspurt. Arriving at such estimates could prove to be impractical, especially in the case of large jitter being experienced by voice packets within a talkspurt. As a result, the buffering mechanism is rendered incapable of recovering from an inferior value chosen for the playout delay at the beginning of a talkspurt.
More recently, mechanisms have been developed for adapting the playout delay within talkspurts (intratalkspurt methods) [6], [7]. The intratalkspurt playout delay adaptation mechanisms aim at achieving a better additional delay versus late packet loss tradeoff and, hence, an improvement in voice reconstruction quality. The adaptation of playout delay within talkspurts is made feasible by the use of mechanisms for time-scale modification of speech. Individual voice packets are time-scaled (i.e., expanded or compressed in duration) to either increase or decrease the playout delay within the lifetime of a talkspurt. Liang et al. [6] have employed the waveform similarity overlap-add (WSOLA) algorithm [24] of time-scale modification for the purpose of intratalkspurt playout delay adaptation.

The notations used in describing the intratalkspurt playout delay adaptation scenario are given in Table III. Note that no reference is made to the talkspurt index, as the playout of each packet is adapted individually. A graphical illustration of the intratalkspurt playout delay adaptation procedure is given in Fig. 11. An example network delay pattern is used to illustrate the process. Several consecutive packets are shown to experience gradually increasing network delays. A subsequent packet is shown to undergo a sudden drop in network delay, and a further slightly reduced network delay is shown to be experienced by the packet that follows it. The prominent feature of the intratalkspurt method is the adaptation of the playout delay for every voice packet. The increase or decrease in playout delay is achieved by either time-expanding or time-compressing the packets, respectively. In Fig. 11, the packets that experience the underlying increase in network delay are shown to be expanded before playout, and the packets that experience the decrease in network delay are shown to be compressed in time. As an effect of the compression and expansion of packets, the final playout lengths of the packets are distinctly different from the original constant lengths with which they were generated at the source end of the communication.

Fig. 9. Results from the performance analysis of the proposed intertalkspurt adaptation mechanisms, based on the NUIG-UNSW (20 ms) recorded network delay trace. (a) Additional buffering delay comparison. (b) Packet loss rate comparison. (c) Additional delay versus packet loss tradeoff.

The decision to expand or compress the voice packets in the intratalkspurt scheme is determined by the playout delay adaptation mechanism in use. The playout delay adaptation is based on a reliable estimate of the network delay that would be experienced by the immediate next packet. Such an estimate could be obtained, for example, by the use of the Concord mechanism [10], a review of which follows.
The Concord mechanism [10] was developed as an efficient solution for playout delay adaptation to operate under user-specified QoS constraints. The utility of Concord for intratalkspurt playout delay adaptation is reviewed by the authors in [25]. The salient feature of the Concord algorithm is the use of a statistical approximation of the network delay distribution in computing the playout delay. The notations used in describing the Concord algorithm are given in Table IV. Fig. 12 outlines the total end-to-end delay (ted) threshold computation from an example packet delay distribution (PDD). For any chosen ted, all packets experiencing network delay greater than the ted threshold would be late for playout and, hence, discarded. The late packet loss ratio is computed from the delay distribution curve as the area under the PDD curve to the right of the ted line. For playout delay adaptation, the ted threshold is chosen as the smallest value of delay for which this late packet loss ratio is less than or equal to mlp. Concord also allows for a limit (mad) on the maximum value for ted.

Fig. 10. Results from the performance analysis of the proposed intertalkspurt adaptation mechanisms, based on the NUIG-UNSW (40 ms) recorded network delay trace. Note that the least additional buffering delay offered by the LRF method is considerably greater than the maximum additional buffering delay introduced by either of the neural network-based mechanisms. As a result, the delay-loss tradeoff curve for the LRF-based method does not appear in (c).

TABLE III INTRATALKSPURT PLAYOUT DELAY ADAPTATION TIMING NOTATIONS

For every voice stream, an approximate PDD curve is determined using a histogram approach. A histogram of packet network delays is constructed, with each histogram bin representing a range of network delays identified by the bin width. On every packet arrival, the value of the histogram bin corresponding to the network delay experienced by the received voice packet is incremented. Over time, the histogram grows to represent a statistical approximation of the nature of the underlying network delay distribution. The PDD curve is computed from the histogram bin values using (8)
PDD
(8)
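The threshold selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PDD is taken to be the histogram normalised to unit sum (the exact form of (8) is not reproduced here), the threshold is reported as the upper edge of a bin, and the function names, the sample histogram and the tolerance are all hypothetical.

```python
from typing import List, Optional

def pdd_from_histogram(bins: List[float]) -> List[float]:
    """Normalise bin counts to fractions (an assumed discrete stand-in for the PDD)."""
    total = sum(bins)
    return [b / total for b in bins] if total > 0 else [0.0] * len(bins)

def ted_threshold(bins: List[float], bin_width: float, mlp: float,
                  mad: Optional[float] = None) -> float:
    """Smallest upper bin edge whose right-hand tail does not exceed mlp."""
    pdd = pdd_from_histogram(bins)
    tail = 1.0                       # fraction of packets later than the current edge
    ted = len(bins) * bin_width      # fall back to the top of the histogram range
    for k, p in enumerate(pdd):
        tail -= p
        if tail <= mlp + 1e-12:      # small tolerance for floating-point rounding
            ted = (k + 1) * bin_width
            break
    # Optional cap on the threshold, mirroring the mad limit
    return min(ted, mad) if mad is not None else ted

histogram = [10, 60, 20, 8, 2]       # packet counts per 10 ms bin (made-up data)
print(ted_threshold(histogram, bin_width=0.010, mlp=0.05))
```

Scanning from the lowest bin upward and stopping at the first edge whose tail satisfies the loss target gives the smallest qualifying delay, which matches the "move the ted line left or right" picture of Fig. 12.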
Fig. 11. Example intratalkspurt playout delay adaptation schema with packet timings.
As the distribution of the network delays could vary considerably over the lifetime of a voice stream, the PDD curve must be continuously updated so as to maintain a more realistic estimate of network delay. In Concord, this is achieved by using aging algorithms to gradually discard or retire any old information from
Fig. 12. ted computation using an example PDD curve. The area under the PDD curve to the right of the ted line indicates the late packet loss ratio. The ted line is moved left or right in order to obtain a packet loss ratio less than or equal to mlp. The figure is reproduced from [10].
TABLE IV NOTATIONS USED IN DESCRIBING THE CONCORD ALGORITHM
the network delay histogram. The process of aging of the network delay histogram is dependent on two control parameters.
• The aging coefficient, which can be specified by the user/application and is limited to the range [0.0, 1.0].
• The aging frequency, specified in terms of a number of packets.
The histogram aging process is performed once every aging-frequency packets by multiplying each histogram bin value by the aging factor. Every aging instance is followed by the recomputation of the PDD curve. In [10], three different algorithms are presented for computing the aging factor. Aging algorithm 2 of [10] is chosen here for analysis. Equation (9) represents the corresponding procedure for computing this aging factor
(9)
In all of the previous aging algorithm equations, although a continuous integral is used to represent the aggregation of the histogram bin contents, in practice a discrete summation of the histogram bin contents is performed to determine the equivalent result.

V. DEVELOPMENT OF THE FTAS FOR INTRATALKSPURT PLAYOUT DELAY ADAPTATION

In playout delay adaptation, it is desirable for the playout delay to follow the underlying level or trend of the network delay rather than the absolute fluctuations in network delay experienced by individual voice packets. Adapting to the noisy and smaller variations in network delay, in an intratalkspurt playout delay adaptation scenario, could result in unnecessary compression and expansion of voice packets. The desired features of a decision mechanism for intratalkspurt playout delay adaptation are the following.
1) The playout delay should follow the underlying trend in network delays rather than absolute fluctuations of network delay.
2) The adaptation system should be capable of selectively ignoring small variations in network delay while responding to the more evident larger changes (both increases and decreases in network delay). The responsiveness of the system to small variations needs to be controllable.
3) The system needs to be more responsive to increases in network delay than to decreases in network delay.
Though the existing mechanisms for playout delay adaptation collectively satisfy the previously discussed features, there is a lack of a single mechanism which is particularly suited for the intratalkspurt playout delay adaptation paradigm. It is perceived that the advantages of intratalkspurt playout delay adaptation using time-scale modification of voice packets could be better exploited with a more suitable decision mechanism for intratalkspurt playout delay adaptation. A major contribution of this research is the development of such a decision mechanism. The developed adaptation mechanism relies on a measure of the network delay trend.

A. Network Delay Trend Analysis

The desired output of network delay trend analysis would be a measure, or numerical representation, of linguistic observations such as “the network delays are relatively constant” or “the network delays are displaying a steep increase.” A viable mechanism for identifying the current network delay trend would be to observe the change in network delay experienced by the past few voice packets and arrive at a measure based on these observations. Consequently, the trend analysis procedure can be represented as the function (10), where tm is the trend measure output, produced by a function mapping the relation between delay change samples and the trend measure; the delay change samples are derived from the network delays experienced by the received packets; and the size of the input sliding window is the number of past samples of delay changes on which the trend measure is to be based.

Consequently, a novel fuzzy inference architecture called the FTAS has been developed, tailored particularly for the network
delay trend analysis application (i.e., to implement the operation specified by (10)). The development and architectural details of the FTAS component are described in the following subsection. The proposed decision mechanism for intratalkspurt playout delay adaptation using the FTAS component is presented in Section V-C.

Fig. 13. Architecture of the novel fuzzy network delay trend analyzer system (FTAS).

B. FTAS

The architecture of the FTAS developed is shown in Fig. 13. A salient feature of FTAS is the internal representation of the effect of past inputs on the trend measure output. The system requires only one external input, i.e., the current network delay change sample (the difference between the network delays experienced by the current and the previous received voice packets).

Layer 1 of the analyzer consists of a series of fuzzy membership functions, shown as blocks in Fig. 13, which divide the entire possible input range into a specified number of fuzzy partitions. An example partitioning, composed of nine bell-shaped fuzzy membership functions, is shown in Fig. 14(a). The output of Layer 1 is a vector which represents the grades or degrees to which the current delay change input belongs to each of the fuzzy partitions. Any appropriate fuzzy membership function can be used for the input partitioning. However, the widely used generalized bell function (11) is employed in the network delay trend analyzer
(11)
One parameter of (11) controls the position (i.e., the center) of the membership function, a second defines its width, and a third controls the slope of its rising and falling sections.

Fig. 14. (a) Example fuzzy partitioning of the input range. (b) Nonlinear function describing the desired output trend measure. The shape of the function is controlled by the sensitivity parameter.

Layer 2: The membership function output values of Layer 1 are accumulated over time and stored in a bank of vectors, which form Layer 2 of the analyzer. Layer 2 can be viewed as a shift register that holds the membership grades for a window of past delay change inputs. As a result of the delayed store of membership values, the final trend measure output of the system will be based on the window of past delay change inputs.

Layer 3: The accumulated membership function output grades in Layer 2 are aggregated over time by Layer 3 nodes to obtain a set of weights, one per fuzzy partition. An aggregation by exponential weighting, as given by (12), is suggested to progressively diminish the effect of older inputs as they are shifted through the delayed store of vectors
(12)
The absolute values of the factors are derived from an exponentially decaying function, such that (i) the sum of all factors is close to 1 and (ii) the factors decay exponentially from a maximum value to a minimum value. The plot of an example sequence of twenty such exponentially varying factors is given in Fig. 15. The output of Layer 3, i.e., the vector of weights, represents a measure of the frequency, and also the recency, of each fuzzy network delay partition being represented during the past delay changes.

Fig. 15. Example of the sequence of 20 exponential factors used for aggregation by Layer 3 nodes (as described by (12)).

Layer 4: The output weights of Layer 3 are normalized by Layer 4 in order to produce a vector which represents the distribution of the past delay change inputs over the possible input range. The normalization procedure, applied to each weight in turn, is given by (13)
(13)
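Layers 1 to 4 can be pictured with the following Python sketch. It is an illustration rather than the authors' implementation: the standard generalized bell form 1/(1 + |(x - c)/a|^(2b)) is assumed for the membership function, the decay rate of the exponential factors is a guess, and the class name, parameter names and sample inputs are hypothetical.

```python
import math
from collections import deque
from typing import List

def bell(x: float, a: float, b: float, c: float) -> float:
    """Generalised bell membership: centre c, width a, slope b."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))

def exponential_factors(n: int, decay: float = 0.2) -> List[float]:
    """n decaying factors, newest first, scaled to sum to 1 (decay rate is a guess)."""
    raw = [math.exp(-decay * k) for k in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

class FtasFrontEnd:
    """Layers 1 to 4: fuzzify, store, aggregate, normalise."""

    def __init__(self, centres, width, slope, window):
        self.centres = list(centres)
        self.width, self.slope = width, slope
        self.store = deque(maxlen=window)          # Layer 2: oldest grades fall off the end
        self.factors = exponential_factors(window)

    def step(self, delay_change: float) -> List[float]:
        # Layer 1: membership grade of the input in every fuzzy partition
        grades = [bell(delay_change, self.width, self.slope, c) for c in self.centres]
        self.store.appendleft(grades)              # newest vector sits at index 0
        # Layer 3: exponentially weighted sum over the stored window, per partition
        weights = [sum(f * g[j] for f, g in zip(self.factors, self.store))
                   for j in range(len(self.centres))]
        # Layer 4: normalise so the weights form a distribution over partitions
        total = sum(weights)
        return [w / total for w in weights]

front = FtasFrontEnd(centres=[-0.02, -0.01, 0.0, 0.01, 0.02], width=0.005, slope=2, window=20)
for change in (0.0, 0.002, 0.009, 0.011):          # delay changes in seconds (made-up data)
    distribution = front.step(change)
print([round(w, 3) for w in distribution])
```

A bounded `deque` is a convenient stand-in for the shift register: pushing a new grade vector on the left silently drops the oldest one once the window is full.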
Layer 5: The prominent feature of the trend analysis is the mapping between the input network delay change and the corresponding trend measure output. The behavior of the desired output trend measure is implemented by Layer 5 nodes. The Layer 5 functionality of the trend analyzer is analogous to the layer of the Takagi-Sugeno fuzzy inference model [26] which implements the crisp functions corresponding to the fuzzy rule consequents. The crisp functions of the network delay trend analyzer are set as simple constants (as in the case of a zeroth-order Takagi-Sugeno fuzzy inference system). The desired output trend measure is defined as a nonlinear function over the delay change input range. The parameters of Layer 5 are the values of this function sampled at the center of every fuzzy partition. Each node in Layer 5 outputs the node's function value scaled by the corresponding normalized weight from Layer 4, i.e., the node output is the product of the two.

A nonlinear function as shown in Fig. 14(b) is suggested for the desired output trend measure, as the system is required to selectively ignore small variations in network delay, while at the same time highlighting larger network delay changes. The function in analytical form is given by (14)
(14)
where the arguments are the delay change input, the maximum value of the delay change input, and a parameter which controls the shape of the function. In essence, this parameter emerges as a sensitivity parameter to control the responsiveness of the trend analyzer system to low variations in network delay. The nonlinear structure of the desired trend function as described by (14) is inspired by the μ-law companding equations used in PCM (pulse code modulation) quantization [27] of analog signals. The trend output function is based on similar principles to nonuniform signal expansion in PCM, i.e., enhancing larger values (changes) while diminishing the smaller values (changes).
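To make the shape of such a trend function concrete, the sketch below uses the classical μ-law expander, rescaled to the delay change range. This is an assumed stand-in chosen because the text cites μ-law companding; it is not claimed to be the exact form of (14). The names `trend_function`, `layer5_outputs`, `x_max` and `s` are hypothetical.

```python
import math
from typing import List

def trend_function(x: float, x_max: float, s: float) -> float:
    """Odd mapping of [-x_max, x_max] onto [-1, 1] that shrinks small inputs.

    Assumed form (mu-law expander): sign(x) * ((1 + s)**(|x| / x_max) - 1) / s,
    with s > 0 acting as the sensitivity parameter.
    """
    u = min(abs(x) / x_max, 1.0)
    return math.copysign(((1.0 + s) ** u - 1.0) / s, x)

def layer5_outputs(norm_weights: List[float], centres: List[float],
                   x_max: float, s: float) -> List[float]:
    """Each node: function value at the partition centre times its normalised weight."""
    return [w * trend_function(c, x_max, s) for w, c in zip(norm_weights, centres)]

x_max, s = 0.02, 50.0                      # made-up range (seconds) and sensitivity
for x in (0.002, 0.01, 0.02):
    print(x, round(trend_function(x, x_max, s), 4))
```

With a larger `s`, inputs that are a small fraction of `x_max` map to values very close to zero, so the analyzer effectively ignores them, while inputs near `x_max` still map close to the extreme value.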
Layer 6: The last layer of the network delay trend analyzer system consists of a single node, which computes the trend measure, tm, as a summation of all incoming values as shown in (15)
(15)
The value of tm is bounded: the lower limit indicates an extreme decreasing trend in the observed network delays and the upper limit represents the extreme increasing trend.

C. Decision Mechanism for Intratalkspurt Playout Delay Adaptation

1) Trend Measure to Playout Delay Change Decision: The trend measure from the fuzzy network delay trend analysis
Fig. 16. “Negative trend” to playout delay adaptation decision procedure. Even though the nonlinear trend function is used for trend computation, a linear trend function is assumed for the (negative) trend to delay change decision inversion. The figure marks both the real equivalent of the aggregate decrease in network delay represented by the measured trend and its diminished equivalent.
system (FTAS), computed on every packet arrival, forms the basis for the decision on the amount by which the playout delay needs to be adapted. In the case of the computed trend measure being negative, i.e., in the case of network delays showing a decreasing tendency, the playout delay change is obtained by inverting the computed trend measure output. However, the inversion is performed as though a linear trend function was employed in FTAS during trend computation; the inversion procedure is illustrated in Fig. 16. The playout delay change decision computation, in analytical form, for the negative trend case is given by (16)
(16)
In the case of the computed trend being positive, the playout delay change is again obtained by inverting the trend measure output, but the original nonlinear trend function is employed for the inversion process. Equation (17) gives the analytical representation of the playout delay change decision computation for the positive trend case
(17)
Such a skewed procedure for computing the playout delay change in the two (negative and positive trend) cases results in a playout delay adaptation system that responds slowly to decreasing network delays while adapting quickly to increasing network delays. The responsiveness of the system for decreasing network delays can be controlled by appropriately selecting the value for the parameter of (17).

Equations (16) and (17) are reliant on the maximum value for the network delay change input. It is infeasible to set the value for the maximum possible change in network delay in advance. Instead of a fixed value, the maximum is allowed to grow as required over the duration of stream reception. The Layer 1 fuzzy partitions are recalculated and the stored values in Layer 2 of FTAS are correspondingly adjusted whenever the maximum value is increased.
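The asymmetric inversion can be illustrated as follows. The sketch assumes that tm lies in [-1, 1] and that the nonlinear trend function has the μ-law expander form sign(x)((1 + s)^(|x|/x_max) - 1)/s; both are assumptions made for illustration, not the exact forms of (14), (16) and (17). All function and parameter names are hypothetical.

```python
import math
from typing import Iterable

def trend_measure(layer5_values: Iterable[float]) -> float:
    """Layer 6: the trend measure is the sum of all Layer 5 node outputs."""
    return sum(layer5_values)

def invert_nonlinear(tm: float, x_max: float, s: float) -> float:
    """Inverse of the assumed expander-style trend function."""
    u = math.log(1.0 + s * abs(tm)) / math.log(1.0 + s)
    return math.copysign(u * x_max, tm)

def playout_delay_change(tm: float, x_max: float, s: float) -> float:
    """Map a trend measure to a suggested playout delay change (seconds)."""
    tm = max(-1.0, min(1.0, tm))           # assumed bounds on tm
    if tm < 0:
        # Negative trend: invert as if the trend function were linear,
        # which yields a diminished decrease.
        return tm * x_max
    # Positive trend: invert through the nonlinear function itself.
    return invert_nonlinear(tm, x_max, s)

x_max, s = 0.02, 50.0                      # made-up values
print(playout_delay_change(0.5, x_max, s), playout_delay_change(-0.5, x_max, s))
```

For the same magnitude of tm, the positive branch returns a larger change than the negative branch, which is the skew that makes the playout delay rise quickly and fall slowly.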
2) The Feedback Mechanism for Continuous Playout Delay Adaptation: The playout delay adaptation is performed using a continuous update procedure as given by (18)
(18)
In (18), the playout delay computed for the next (upcoming) packet is obtained from the playout delay set for the current packet and the suggested change in playout delay as computed by (16) and (17). The initial value for the playout delay is not critical, as the system is able to recover from an inferior choice and adapt itself to the current network delays.

It is to be noted that the input to the trend analyzer system is the change in network delay observed between successive packets, and not the actual network delay itself. As a result, and due to the inherent operation of the trend analyzer, the playout delay change in (18) would be a reflection of the aggregate change observed in the network delays. However, it is a true reflection of the aggregate change only for increasing network delays, while it represents a diminished equivalent of the aggregate change observed for the case of decreasing network delay. Also, the playout delay change computation is based only on changes observed in network delay, and the current level of playout delay is not considered while determining its value. As a result, the playout delay as computed by (18) has a tendency to display some instability, i.e., to undergo uncontrolled increases. A mechanism to stabilise the playout delay computed by (18) is required. For example, though the trend analyzer might correctly detect an increasing trend in the network delay and suggest an equivalent increase in playout delay (through a positive value for the playout delay change), the stabilising condition should negate the suggestion to increase the playout delay if the current playout delay is already greater than the recent network delay levels. In the solution employed for stabilising the playout delay, the current playout delay level is also taken into account while computing the playout delay change. This is achieved through a feedback mechanism, and the trend analyzer system is accordingly modified in order to complete the decision process.

As detailed in Section V-B, the second layer of the FTAS stores the membership function output values for the past delay change inputs. To accommodate the current playout delay level into the decision process, an additional column is introduced into the second layer, as shown in Fig. 17, which holds the membership function output values for an additional input. This input is computed as given by (19)
(19)
In (19), bias is a small positive value, typically set equal to a fraction of the packetization interval. The FTAS processing within the remaining layers remains the same, except for Layer 3, where a new set of exponential weighting factors is calculated to take into account the additional column of Layer 2. If the current level of playout delay is greater than the most recent network delay, then the additional input is a negative value, which will induce the trend measure tm to be more negative. A negative trend measure will result in a negative playout delay change and, hence, the playout delay for the next packet will be decreased toward the current network delay. Similarly, if the most recent network delay is much greater than the current playout delay, then the additional input is positive, which will induce an increase in the playout delay.

It should be noted that the additional input behaves as a noncausal input. The membership output grades for it are not shifted through the delayed store, and its contents are replaced at every iteration. Only the membership grades of the delay change inputs are shifted through the delayed store of Layer 2 at each iteration, with the contents at column 0 being replaced with the membership grades for the causal input, i.e., the current delay change.

The described feedback mechanism serves another purpose, wherein the FTAS mechanism is enabled to consider the limitations of the employed time-scale modification technique. In case the time-scale modification technique is unable to achieve the desired amount of compression or expansion for a particular voice packet, the feedback mechanism would enable this information (through the resulting playout delay) to be considered by FTAS while determining the value of the playout delay change for the subsequent packet.

3) Final Intratalkspurt Playout Delay Adaptation Algorithm: Algorithm 1 describes the final intratalkspurt playout delay adaptation mechanism employing the modified fuzzy network delay trend analyzer. Algorithm 1 is essentially similar to the procedure described in [6], with extensions for incorporating the use of the fuzzy delay trend analyzer. The notations used in Algorithm 1 are as described in Section IV. The time-scale modification of individual packets (in steps 9 and 11 of Algorithm 1) is performed using the waveform similarity overlap-add technique [24]. As detailed in [6], to avoid extreme scaling of packets, limits are set on the allowable maximum and minimum lengths for time-scaled packets. Due to the inherent nature of the time-scaling procedure, the resulting length of the packet might not be exactly the same as the desired length, hence the need for steps 14 and 15 in Algorithm 1.

Fig. 17. Modified Layer 2 of FTAS for playout delay adaptation.

Algorithm 1 The intratalkspurt playout delay adaptation process
1: Receive packet;
2: Compute the network delay change;
3: Compute the feedback input (using bias, as in (19));
4: Compute the trend measure, tm, using FTAS based on the two inputs computed in steps 2 and 3;
5: Invert computed tm to obtain the playout delay change using (16) and (17);
6: Set playout delay for next packet as in (18). Consequently the playout time for next packet is also determined;
7: Calculate desired length of packet;
8: if (condition on the desired length) then
9: Time-scale modify packet with the corresponding target length;
10: else
11: Time-scale modify packet with the alternative target length;
12: end if
13: Output packet with actual length;
14: Update the playout time for next packet, based on the actual length;
15: Update playout delay of next packet, based on the actual length;

TABLE V DISTRIBUTION OF SAMPLES IN THE CONSTRUCTED SPEECH DATABASE

Fig. 18. Block diagram showing the performance analysis methodology.
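The per-packet loop of the algorithm, including the feedback input, can be sketched in Python as below. Several details are assumptions made for illustration only: the feedback input is taken as network delay plus bias minus current playout delay, the desired packet length is the nominal length plus the suggested change, and the length limits are hypothetical ratios of the nominal length. The FTAS decision and the time-scaler are passed in as callables (`suggest_change`, `time_scale`), both hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class PlayoutState:
    playout_delay: float        # playout delay set for the current packet (s)
    prev_network_delay: float   # network delay of the previous packet (s)

def adapt_packet(state: PlayoutState, network_delay: float, packet_len: float,
                 suggest_change: Callable[[float, float], float],
                 time_scale: Callable[[float], float],
                 bias: float = 0.001, min_ratio: float = 0.5,
                 max_ratio: float = 2.0) -> Tuple[float, PlayoutState]:
    """Process one received packet; returns (actual length, state for next packet)."""
    delay_change = network_delay - state.prev_network_delay        # step 2
    feedback = network_delay + bias - state.playout_delay          # step 3 (assumed form)
    change = suggest_change(delay_change, feedback)                # steps 4 and 5
    desired_len = packet_len + change                              # steps 6 and 7
    # Steps 8 to 12: keep the target within the allowed length limits
    target_len = min(max(desired_len, min_ratio * packet_len), max_ratio * packet_len)
    actual_len = time_scale(target_len)                            # step 13
    # Steps 14 and 15: the next playout delay follows the length actually achieved
    next_delay = state.playout_delay + (actual_len - packet_len)
    return actual_len, PlayoutState(next_delay, network_delay)

# Toy stand-ins: a decision rule driven only by the feedback input, and an ideal scaler
state = PlayoutState(playout_delay=0.05, prev_network_delay=0.04)
length, state = adapt_packet(state, network_delay=0.06, packet_len=0.02,
                             suggest_change=lambda dc, fb: 0.5 * fb,
                             time_scale=lambda target: target)
print(length, state.playout_delay)
```

Deriving the next playout delay from the achieved length, rather than from the requested one, is what lets a shortfall in time-scaling feed back into the next decision.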
VI. PERCEPTUAL BASED ANALYSIS OF FTAS FOR INTRATALKSPURT PLAYOUT DELAY ADAPTATION

An offline simulation scheme based on recorded network delay traces is used for analysing the performance of intratalkspurt playout adaptation using the FTAS and Concord algorithms. A block diagram of the analysis methodology is shown in Fig. 18.

A. Core Components of Analysis

1) Recorded Speech Database Block: A database of speech samples to act as the voice source in the analysis was constructed by selecting 100 samples from the ITU-T coded speech database [11]. Each speech sample in the constructed database is 16 s¹ in duration. The database was created such that samples were included from three different languages, with samples from two different speakers (i.e., a female and a male speaker) for each language. The final distribution of the speech samples in the database is shown in Table V. The talkspurt-silence determination for each speech sample was obtained by using the voice activity detection mechanism described in [29].

¹The duration of 16 s was used so as to enable the use of perceptual evaluation of speech quality (PESQ) for measuring voice quality. The PESQ algorithm provides reliable results for speech signal durations up to a maximum of 20 s only [28].

2) Network Delay Information Block: The four different Internet delay traces listed in Table II were used in the analysis of the FTAS and Concord mechanisms. The procedure employed to obtain these measured network delay traces was described in Section III-B.1.

3) Concord/FTAS Estimation and Playout Buffering: Implementations of the Concord and the FTAS mechanisms are used to compute the packet playout schedules for each input network delay trace. Each adaptation mechanism produces a set of packet playout times, and any late packets are marked for deletion. An offline simulation of the playout of packets at the receiver is performed in order to obtain the reconstructed voice stream at the simulated receiver. The receiver buffering and playout is simulated by recording the speech packets according to the playout schedule presented by Concord or FTAS processing. Any required time-scaling of voice packets is performed prior to this recording, with packets determined as lost being replaced with Gaussian noise (with the mean signal power being maintained).

4) Performance Analysis Metrics: The efficiency of intratalkspurt playout delay adaptation is measured in terms of the following performance metrics.
1) Additional Buffering Delay: The additional delay introduced due to buffering is computed for each successfully played packet. The additional buffering delays computed for all the packets in the trace are averaged to obtain the value for the additional buffering delay metric.
2) Late Packet Loss Rate: The late packet loss metric for the trace is computed as the ratio of the number of late packets (late for playout) to the number of packets received.
3) PESQ MOS: PESQ is an objective technique from the ITU-T for measuring voice quality. PESQ compares the degraded speech signal, in this case the receiver reconstructed speech signal, to the original speech signal and gives a MOS (mean opinion score) value representing the quality of the degraded signal. The reference implementation of PESQ, which is included as part of the standard [28], is used in this analysis. The PESQ technique is inherently capable of handling speech artefacts such as distortions due to coding, packet loss, delay, and variable delay that would occur in typical VoIP communication scenarios.
PESQ is known to perform time alignment of speech segments from the degraded speech with similar segments from the original speech prior to computing the MOS score. As a result, the MOS score reported by PESQ predominantly reflects any distortion due to packet loss and distortions due to variation in delay (e.g., distortion due to compression or expansion of silent periods in speech) are ignored. For the case of time-scale modified speech, contiguous speech segments are either compressed or expanded. Such distortions would directly impact the time-alignment procedure of PESQ, if PESQ were employed to evaluate the perceptual quality of time-scale modified speech. Liu, Kim, and Kuo demonstrated in [30] that PESQ is sensitive to distortions due
Fig. 19. Illustration of the extent of time-scaling performed on speech packets in an example case of intratalkspurt playout delay adaptation using FTAS and Concord mechanisms. The time-scaling ratios depicted are the average of the packet time-scaling ratios within individual talkspurts.
to time-scale modification of speech. PESQ was shown (in [30]) to report a lower MOS score for the case where the only distortion present in the speech sample being evaluated was due to time-scale modification of the speech (either expansion or compression). The intratalkspurt playout delay adaptation process based on the FTAS and Concord mechanisms would result in time-scale modification of every individual speech frame/packet. Fig. 19 illustrates the extent of time-scaling ratios that are applied on speech frames for an example case of intratalkspurt playout delay adaptation using the Concord and FTAS mechanisms. Though the average time-scaling ratios appear to be close to 1.0, the variation in the extent of time-scaling is significantly large and would have an impact on the MOS score reported by PESQ. An increasing extent of time-scale modification of speech, either expansion or compression, was shown in [30] to result in a monotonically decreasing MOS score being reported by PESQ. It is, thus, justified to use PESQ-MOS as an additional metric for evaluating the performance of the FTAS and Concord intratalkspurt adaptation mechanisms, as the PESQ-MOS not only reflects the distortion due to packet loss, but also the distortion due to time-scale modification. B. End-to-End Analysis Process The analysis of intratalkspurt playout delay adaptation is performed based on different network delay traces. For a given trace, a single iteration of end to end processing consists of simulating the receiver playout buffering for one individual source voice sample selected from the database. The iteration results in a single set of values for the performance metrics. The end to end analysis is repeated for all the voice samples in the database, and the resulting performance measures are averaged and presented as a single set of performance measures for the network delay trace. 
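The two delay-based metrics and their averaging over the speech database can be sketched as follows. The sketch assumes that a packet is late when its network delay exceeds its playout delay and that the additional buffering delay of a played packet is the difference between the two; the function names and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

def playout_metrics(network_delays: Sequence[float],
                    playout_delays: Sequence[float]) -> Tuple[float, float]:
    """Return (average additional buffering delay, late packet loss rate) for one run."""
    played = [(n, p) for n, p in zip(network_delays, playout_delays) if n <= p]
    late = len(network_delays) - len(played)
    avg_buffering = sum(p - n for n, p in played) / len(played) if played else 0.0
    return avg_buffering, late / len(network_delays)

def average_over_samples(per_sample: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Average each metric over the per-sample results for one network delay trace."""
    count = len(per_sample)
    return (sum(m[0] for m in per_sample) / count,
            sum(m[1] for m in per_sample) / count)

# Two made-up runs (delays in seconds), standing in for two voice samples
run_a = playout_metrics([0.04, 0.05, 0.09, 0.06], [0.06, 0.06, 0.07, 0.07])
run_b = playout_metrics([0.03, 0.03, 0.04, 0.05], [0.05, 0.05, 0.05, 0.05])
print(average_over_samples([run_a, run_b]))
```

Only successfully played packets contribute to the buffering delay average, while the loss rate is taken over all received packets, mirroring the two metric definitions.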
The analysis of the trace is repeated for different values of the configurable parameters of Concord (i.e., the aging coefficient c and the aging frequency f) and FTAS (i.e., the history size and the sensitivity parameter). The analysis of Concord presented here is limited to the aging algorithm 2 [10], and for
the analysis of FTAS, the bias parameter of (19) is fixed as 0.001 s for all network delay traces.

Apart from the aging coefficient and the aging frequency, the mlp value is another notable parameter which greatly influences the performance of the Concord mechanism. For example, the Concord mechanism configured with a very low value for mlp is expected to achieve lower late packet loss rates at the expense of introducing higher additional buffering delay. However, the late packet loss rates achieved by Concord do not necessarily vary linearly with the configured value for mlp. As seen from the experimental results illustrated in Fig. 20, the late packet loss rates achieved by Concord show very little change for mlp values less than 0.001 for all four network delay traces. A more prominent variation in late packet loss rate is observed only where mlp is in the range 0.01 to 0.05. Additionally, the results illustrated in Fig. 20 can be employed to determine the least late packet loss rate achievable by the Concord mechanism for each of the network delay traces considered. A value of 0.001 was chosen as the baseline value of mlp for the further analysis of the Concord mechanism which is presented in Section VI-D. As seen from the graphs in Fig. 20, selecting mlp values that are less than 0.001 does not result in any further notable reduction in the achievable late packet loss rate.

C. Concord/FTAS Playout Delay Adaptation Example

The playout delay adaptation operation achieved by the Concord and FTAS mechanisms for an example network delay trace is illustrated in Fig. 21. The network delay trace employed in this example is based on a series of simulated network delays. The employed procedure for simulating packet network delays is similar to the procedure described in Section III-B.1, i.e., individual packet network delays are simulated using the following equation:

$$\text{network delay} = \text{trend} + \text{gamma random value} \tag{20}$$
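The simulation procedure around (20) can be sketched in Python as follows. The sketch assumes that each packet delay is the sum of the trend level and a gamma-distributed random value, which is how the surrounding text describes the two components; the exact form of (20) is not legible in this copy. The function name, the segment layout, and all numeric values are hypothetical and chosen only for illustration.

```python
import random


def simulate_delay_trace(segments, seed=0):
    """Build a delay trace from piecewise-constant trend segments.

    Each segment is (trend_s, packet_count, gamma_shape, gamma_scale).
    Assumption: delay = trend + gamma noise (additive form), with the
    gamma parameters allowed to differ from one segment to the next.
    """
    rng = random.Random(seed)  # seeded so the trace is reproducible
    delays = []
    for trend_s, packet_count, gamma_shape, gamma_scale in segments:
        for _ in range(packet_count):
            # random.gammavariate(alpha, beta): alpha is shape, beta is scale.
            delays.append(trend_s + rng.gammavariate(gamma_shape, gamma_scale))
    return delays


if __name__ == "__main__":
    # Hypothetical trace: constant level, sudden increase, sudden decrease.
    trace = simulate_delay_trace([
        (0.050, 200, 2.0, 0.002),
        (0.120, 40, 3.0, 0.001),
        (0.050, 80, 2.0, 0.002),
    ])
    print(len(trace), min(trace), max(trace))
```

Keeping the trend constant within a segment and changing it abruptly between segments gives the induced discrete jumps used in the example trace.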
In (20), the gamma random values represent the noisy variations in network delay and the trend parameter represents the underlying trend (slowly varying component) of the network delay. The network delay trace used in this example is constructed such that the trend parameter undergoes spiked (sudden) increases and decreases with intermittent periods where the trend is set to a constant level. The shape parameters of the gamma distribution function are varied for the different groups of network delays where the trend parameter is constant. The playout delay adaptation procedure is performed on such a constructed network delay trace using the Concord/FTAS mechanisms to obtain the results illustrated in Fig. 21.

The salient, and advantageous, behavior of the FTAS mechanism can be observed from the results shown in Fig. 21. The FTAS mechanism is far more responsive to the increases in network delay than the Concord mechanism (as evident for the voice packets with sequence numbers around 200). The voice packet with sequence number 241 is shown to experience a sudden decrease in the network delay. The FTAS mechanism starts responding immediately to this change and, importantly,
Fig. 20. Analysis of the variation in late packet loss rate metric achieved by Concord mechanism (aging algorithm 2) for varying values of mlp. The value of aging coefficient c employed to obtain these curves is 0.5 and the value of aging frequency f used is 100 packets. (a) For the NUIG-UNSW (20 ms) trace. (b) For the NUIG-UNSW (40 ms) trace. (c) For the NUIG-DCU (20 ms) trace. (d) For the NUIG-DCU (40 ms) trace.
Fig. 21. Comparison of the playout delays as achieved by the FTAS and Concord mechanisms for a simulated delay trace with induced discrete jumps, based on a single source speech signal. The values for the configurable parameters of the FTAS mechanism used were: {History size = 20 packets, sensitivity parameter = 10.0, and bias = 0.0 s}. The values used for the Concord parameters were: {Aging algorithm 2, Aging frequency = 10, Aging coefficient = 0.5, and mlp = 0.001}.
the playout delay is gradually decreased over a number of subsequent packets. In contrast, the playout delay adaptations by the Concord mechanism occur mainly at the instances of the histogram aging operation. The Concord mechanism responds to the sudden decrease in network delay (occurring at sequence number 241) by decreasing the playout delay for voice packets that are received much later (i.e., for packets with sequence numbers around 275 to 280). Considering that a packetization interval of 0.04 s is employed in the simulation, the Concord
mechanism requires a time duration of 1.4 to 1.6 s (in this example) to start responding to the sudden decrease in network delay (compared with the duration of 560 ms required by FTAS to completely respond to the decrease in network delay).

An interesting comparison in the adaptation behavior of the FTAS and Concord mechanisms can be observed for the packets with sequence numbers in the range 260 to 320. The adaptation process using the FTAS mechanism results in a steady-state value for the playout delay which is slightly higher than the underlying network delays. This behavior of FTAS is due to the fact that the network delays experienced by these packets are fairly constant and do not exhibit any notable differences in the delay between subsequent packets. The FTAS mechanism attempts to follow the constant trend existing in the network delay and, hence, does not result in frequent adaptations of the playout delay. In contrast, the adaptation process using the Concord mechanism results in a playout delay which follows the absolute network delays more closely (for the packets with sequence numbers in the range 290 to 320). Though the FTAS mechanism introduces a higher additional buffering delay compared to Concord for the packets with sequence numbers in the range 290 to 320, the reduction in buffering delay achieved by FTAS due to a quicker response time for the previous spiked decrease in network delay (for the packet with sequence number around 280) more than compensates for the extra buffering delay introduced by FTAS.

D. Results From End-to-End Analysis

In the following graphical illustration of the results from the analysis of FTAS, the values obtained for the different performance metrics are plotted against varying values for the configurable parameters, i.e., the size of history (in terms of number of packets) and the sensitivity parameter. Similarly, the analysis
Fig. 22. Analysis results corresponding to the NUIG-UNSW trace with 20-ms packetization interval. (a) FTAS, PESQ MOS. (b) FTAS, Addn. Buff. Del. (c) FTAS, Late Pkt. Loss. (d) Concord, PESQ MOS. (e) Concord, Addn. Buff. Del. (f) Concord, Late Pkt. Loss.
Fig. 23. Analysis results corresponding to the NUIG-UNSW trace with 40-ms packetization interval. (a) FTAS, PESQ MOS. (b) FTAS, Addn. Buff. Del. (c) FTAS, Late Pkt. Loss. (d) Concord, PESQ MOS. (e) Concord, Addn. Buff. Del. (f) Concord, Late Pkt. Loss.
results for the Concord mechanism are plotted against different values for the aging frequency (in terms of number of packets) and the aging coefficient.

Fig. 22 compares the performance of the FTAS and Concord mechanisms for the Internet delay trace measured on the NUIG-UNSW path with a packetization interval of 20 ms. The FTAS mechanism is seen to achieve much lower late packet loss rates than the Concord mechanism at the expense of introducing additional buffering delays of the order of 10 ms. The performance of FTAS, in terms of the late packet loss rate and the
additional buffering delay metrics, is seen to converge with that of the Concord mechanism only in the region where FTAS is configured with higher values for the history size and the sensitivity parameter. A similar behavior is observed in the performance in terms of the PESQ MOS metric, wherein the FTAS mechanism is able to achieve much higher MOS values when compared to Concord for most of the values of the configurable parameters. In the cases where FTAS is configured with higher values of the history size and the sensitivity parameter, the performances of FTAS and Concord in terms of the PESQ MOS metric tend to converge. It is to be noted that the reported late
Fig. 24. Analysis results corresponding to the NUIG-DCU trace with 20-ms packetization interval. (a) FTAS, PESQ MOS. (b) FTAS, Addn. Buff. Del. (c) FTAS, Late Pkt. Loss. (d) Concord, PESQ MOS. (e) Concord, Addn. Buff. Del. (f) Concord, Late Pkt. Loss.
Fig. 25. Analysis results corresponding to the NUIG-DCU trace with 40-ms packetization interval. (a) FTAS, PESQ MOS. (b) FTAS, Addn. Buff. Del. (c) FTAS, Late Pkt. Loss. (d) Concord, PESQ MOS. (e) Concord, Addn. Buff. Del. (f) Concord, Late Pkt. Loss.
packet loss rates (around 0.075) achieved by Concord are the lowest achievable by Concord for this particular trace. As was noted in Fig. 20(a), further tuning of the mlp parameter does not result in any further reduction of the late packet loss rate.

Fig. 23 compares the performance of the FTAS and Concord mechanisms for the Internet delay trace measured on the NUIG-UNSW path with a packetization interval of 40 ms. The FTAS mechanism performs best for this particular trace, showing almost a 1.0 point improvement in the PESQ MOS rating compared to the Concord method for all values
of simulation parameters. As in the case of the NUIG-UNSW (20 ms) trace, the FTAS mechanism introduces slightly higher buffering delay (around 5 ms to 10 ms), but demonstrates much lower packet loss rates than the Concord mechanism. As established in Fig. 20(b), the lowest late packet loss rate achievable by Concord is around 0.06 for the entire region of the simulation parameters considered. Interestingly, both the FTAS and Concord mechanisms show an improvement in their own performance (in terms of PESQ MOS) for the 40-ms trace when compared to the 20-ms NUIG-UNSW trace. A plausible reason for such a behavior is that the time-scale modification algorithms used to expand or compress the voice packets are known to perform better for longer packet lengths. For good quality time-scale modification of speech, the packet length needs to be a multiple of the speech pitch period.

The performance of the FTAS mechanism for the NUIG-DCU trace with 20-ms packetization interval, as shown in Fig. 24, though comparable, is slightly inferior to the performance of Concord in terms of the voice quality rating (PESQ MOS). The already observed behavior of FTAS providing lower late packet loss rates but introducing higher buffering delays is repeated for this trace.

Fig. 25 compares the performance of the FTAS and Concord mechanisms for the final Internet delay trace, measured on the NUIG-DCU path with a packetization interval of 40 ms. As observed from the illustrations of the performance metrics, the FTAS and Concord mechanisms tend to exhibit similar performances for this particular network delay trace. As evident from Figs. 20(d) and 25(f), the lowest late packet loss rate achievable by the Concord mechanism is around 0.03, whereas the FTAS mechanism is seen to achieve lower late packet loss rates [as illustrated in Fig. 25(c)].

VII. DISCUSSION AND CONCLUSION

1) Intertalkspurt Playout Delay Adaptation: In Section III the utility of neural network-based estimation techniques for intertalkspurt playout delay adaptation was reviewed. The demonstrated improvement in performance by the use of neural network-based techniques comes at a higher cost in terms of the computational requirements for training the network and the overall system complexity. The complexity of neural network architectures and their memory requirements increase with the size of the network, i.e., the number of layers in the network and the number of neurons in each layer. When compared to the MLP network, the TDNN is much more complex, due to the presence of internal input tapped delay lines in every layer.
Also, the training process of the TDNN is much slower than that of the MLP due to the large increase in the number of synaptic weights that need to be updated during training. For example, the number of synaptic weights in the MLP structure described in Section III-A.1 is 264, whereas the TDNN structure of Section III-A.2 incorporated 1080 adaptable synaptic weights. However, when analyzed for dynamic response [16], the MLP and TDNN architectures showed better performance. The TDNN estimator, especially, is inherently capable of adapting to any large spikes in the network delay. The traditional approaches, as described in [5] and [31], attempt to counter any network delay spikes through the use of an additional spike detector component. On the detection of a spike in the network delay, the playout delay adaptation control is transferred to the spike detector algorithm, and the normal recursive filter-based adaptation is resumed on the detection of the end of the network delay spike. The TDNN estimator is capable of inherently performing the network delay spike detection.

2) Intratalkspurt Playout Delay Adaptation: The process of intratalkspurt playout delay adaptation, first introduced in [6] and [7], is more complex than the intertalkspurt paradigm, but
is an efficient way of minimizing late packet losses while introducing lower additional buffering delays. The efficiency of adaptation relies heavily on the performance of the time-scale modification technique. The waveform similarity overlap-add (WSOLA) algorithm [24] is well suited for this task in the playout adaptation application. However, for efficient per-packet time-scale modification, WSOLA requires reasonably long packets. As seen in Section VI-D, the resulting voice quality is better for the network delay traces measured with 40-ms packetization intervals when compared with the corresponding delay traces for a 20-ms packetization interval.

The major contributor to the efficiency of intratalkspurt playout delay adaptation is the network delay estimator. Though the Concord approach is intuitive and well suited for this application, a better performance is provided by the novel fuzzy delay trend analyzer-based technique proposed in this paper. The better performance of FTAS compared to Concord was also confirmed based on extensive analysis using simulated network delay traces with various forms of induced delay trends. Another important advantage of the FTAS mechanism is that the improvement in performance comes at no extra cost, such as the training requirements of neural networks. The proposed fuzzy trend analyzer architecture can also be employed in various other time series engineering applications.

REFERENCES

[1] T. J. Kostas, M. S. Borella, I. Sidhu, G. M. Schuster, J. Grabiec, and J. Mahler, “Real-time voice over packet-switched networks,” IEEE Network, vol. 12, no. 1, pp. 18–27, Jan./Feb. 1998.
[2] M. Hassan, A. Nayandoro, and M. Atiquzzaman, “Internet telephony: Services, technical challenges, and products,” IEEE Commun. Mag., vol. 38, no. 4, pp. 96–103, Apr. 2000.
[3] W. A. Montgomery, “Techniques for packet voice synchronization,” IEEE J. Sel. Areas Commun., vol. JSAC–1, no. 6, pp. 1022–1028, Dec. 1983.
[4] ITU-T Recommendation G.114: One-Way Transmission Time, 2000. International Telecommunication Union.
[5] R. Ramjee, J. Kurose, D. Towsley, and H. Schulzrinne, “Adaptive playout mechanisms for packetized audio applications in wide-area networks,” in Proc. IEEE INFOCOM, Toronto, ON, Canada, Jun. 1994, pp. 680–688.
[6] Y. J. Liang, N. Farber, and B. Girod, “Adaptive playout scheduling using time-scale modification in packet voice communications,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Salt Lake City, UT, May 2001, pp. 1445–1448.
[7] F. Liu, J. Kim, and C.-C. J. Kuo, “Adaptive delay concealment for internet voice applications with packet based time-scale modification,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, Salt Lake City, UT, May 2001, pp. 1461–1464.
[8] N. Laoutaris and I. Stavrakakis, “Intrastream synchronization for continuous media streams: A survey of playout schedulers,” IEEE Network, vol. 16, no. 3, pp. 30–40, May/Jun. 2002.
[9] E. Wan, Time Series Prediction Using a Neural Network With Embedded Tapped Delay-Lines, ser. SFI Studies in the Science of Complexity, A. Weigend and N. Gershenfeld, Eds. Reading, MA: Addison Wesley, 1993.
[10] C. J. Sreenan, J.-C. Chen, P. Agrawal, and B. Narendran, “Delay reduction techniques for playout buffering,” IEEE Trans. Multimedia, vol. 2, no. 2, pp. 88–100, Jun. 2000.
[11] ITU-T Coded-Speech Database, 1998. Supplement 23 to ITU-T P-Series Recommendations, International Telecommunication Union.
[12] S. Casner, R. Frederick, V. Jacobson, and H. Schulzrinne, RTP: A transport protocol for real-time applications, IETF RFC 1889, Jan. 1996. Available:.
[13] C. Perkins, O. Hodson, and V. Hardman, “A survey of packet loss recovery techniques for streaming audio,” IEEE Network, vol. 12, no. 5, pp. 40–48, Sept./Oct. 1998.
[14] G. Box and G. Jenkins, Time Series Analysis: Forecasting and Control. New York: Holden-Day, 1976.
[15] P. L. Tien and M. C. Yuang, “Intelligent voice smoother for silence-suppressed voice over internet,” IEEE J. Sel. Areas Commun., vol. 17, no. 1, pp. 29–41, Jan. 1999.
[16] M. K. Ranganathan and L. Kilmartin, “On the use of neural network based estimation techniques for playout delay adaptation in VoIP networks,” in Proc. Int. Conf. Computational Intelligence for Modeling, Control, and Automation (CIMCA’03), M. Mohammadian, Ed., Vienna, Austria, Feb. 2003, pp. 497–508.
[17] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Englewood Cliffs, NJ: Prentice Hall, 1999, ch. 4, pp. 161–168.
[18] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, “Phoneme recognition using time-delay neural networks,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 3, pp. 328–339, Mar. 1989.
[19] K. J. Lang, A. Waibel, and G. Hinton, “A time-delay neural network architecture for isolated word recognition,” Neural Netw., vol. 3, no. 1, pp. 23–43, 1990.
[20] A. Back and A. Tsoi, “FIR and IIR synapses, a new neural network architecture for time series modeling,” Neural Computat., vol. 3, no. 3, pp. 375–385, 1991.
[21] A. Back, E. Wan, S. Lawrence, and A. Tsoi, “A unifying view of some training algorithms for multilayer perceptrons with FIR filter synapses,” in Proc. IEEE Workshop Neural Networks for Signal Processing 4 (NNSP’94), J. Vlontzos, J. Hwang, and E. Wilson, Eds. New York, NY, 1994, pp. 146–154.
[22] A. Corlett, D. Pullin, and S. Sargood, Statistics of one-way internet packet delays, in Internet Draft (Work in Progress), 2002. Available:.
[23] K. Sriram and W. Whitt, “Characterizing superposition arrival processes in packet multiplexers for voice and data,” IEEE J. Sel. Areas Commun., vol. 4, no. 6, pp. 833–846, Sep. 1986.
[24] W. Verhelst and M. Roelands, “An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Minneapolis, MN, Apr. 1993, pp. 554–557.
[25] M. K. Ranganathan and L. Kilmartin, “Perceptual based analysis of the concord algorithms for intratalkspurt playout delay adaptation,” in Proc. Irish Signals and Systems Conf. (ISSC’04), Belfast, Northern Ireland, Jun. 2004, pp. 195–200.
[26] J.-S. R. Jang, C.-T. Sun, and E. Mizutani, Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, 1st ed. Englewood Cliffs, NJ: Prentice Hall, 1997, pp. 81–84.
[27] S. Haykin, Communication Systems, 3rd ed. New York: Wiley, 1995, ch. 6, pp. 378–381.
[28] Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrow-band Telephone Networks and Speech Codecs, 2001. P. 862, International Telecommunication Union.
[29] A Silence Compression Scheme for G.729 Optimized for Terminals Conforming to Recommendation V.70, 1996. ITU-T Recommendation G.729—Annex B, International Telecommunication Union.
[30] F. Liu, J. Kim, and C.-C. J. Kuo, “Quality enhancement of packet audio with time-scale modification,” in Proc. SPIE ITCOM: Multimedia Systems and Applications V, Boston, MA, Jul. 2002, pp. 163–173.
[31] S. B. Moon, J. Kurose, and D. Towsley, “Packet audio playout delay adjustment: Performance bounds and algorithms,” Multimedia Syst., vol. 6, no. 1, pp. 17–28, Feb. 1998.
Mohan Krishna Ranganathan received the B.E. degree in electronics and communication engineering from Bangalore University, Bangalore, India, in 1998 and the Ph.D. degree in electronic engineering from National University of Ireland, Galway, in 2005. He is currently working as a Senior Software Engineer at Sasken Communication Technologies Limited, Bangalore, India. His research interests include VoIP, network security, communication network modeling, mobile communication systems and applications of soft computing techniques in prediction and control.
Liam Kilmartin (M’97) received the B.E. and M.Eng.Sc. degrees in electronic engineering from University College Galway, Galway, Ireland, in 1990 and 1994, respectively. He has been a Lecturer in the Department of Electronic Engineering, National University of Ireland, Galway, since 1994. His current research interests include advanced communication networks, mobile networking technologies and the application of speech processing and neural network techniques in communication networks.