
A NEW HYBRID LONG-TERM AND SHORT-TERM PREDICTION ALGORITHM FOR PACKET LOSS ERASURE OVER IP-NETWORKS

Maha Elsabrouty, Martin Bouchard, Tyseer Aboulnasr
School of Information Technology and Engineering, University of Ottawa, 161 Louis-Pasteur, Ottawa (Ontario), K1N 6N5, email: [email protected]

ABSTRACT

Packet loss is a common problem in Internet Protocol (IP) networks. Delayed, misrouted or corrupted packets all introduce a gap in the information stream being transmitted. This gap is even more critical in the case of real-time voice transmission, which does not tolerate delay. The receiver in this case is obliged to generate a signal to play instead of the missing speech segment. This paper introduces a high-performance speech concealment algorithm for PCM coded speech. The proposed algorithm combines a Linear Prediction (LP) model with the Reverse Order Replicated Pitch Period (RORPP) technique, implemented as in ITU-T G.711 Appendix A [1]. The new algorithm produced better objective MOS scores than both the commercial packet repetition technique and the above-mentioned ITU-T long-term prediction standard.

1. INTRODUCTION

Voice-over-IP (VoIP), the transmission of packetized voice over IP networks, is gaining much attention as a possible alternative to conventional Public Switched Telephone Networks (PSTN). However, impairments present on IP networks, namely jitter, delay and channel errors, can lead to the loss of packets at the receiving end. This packet loss degrades the speech quality.

Model-based coders, especially the G.729-A [2] and G.723.1 [3] International Telecommunication Union (ITU-T) standards, have been extensively used for speech coding over IP networks because of their inherent ability to recover from erasure and their small bandwidth. Their built-in packet loss concealment makes their quality drop slowly as the amount of packet loss increases. However, their memory requires a few frames for the transition from a concealed state to a correct state. Thus they actually tend to corrupt a few good packets before recovery, as a result of a phenomenon known as "State Error" [4].

On the other hand, Pulse Code Modulation (PCM) [5], although it scores higher than G.729-A and G.723.1 in periods of normal operation, does not have the ability to conceal erasure. This results in a dramatic drop in the quality of speech during loss periods. Yet PCM-based coders can recover from packet loss more rapidly than model-based coders, since the first speech sample in the first good packet restores speech to its original quality. The low complexity of PCM and its good performance in tandem coding make it a viable alternative to G.729-A or G.723.1 for VoIP.

Several approaches have been implemented to address the frame erasure problem in PCM streams. The simplest approach is to play a mute (silence) packet in the erasure period. This method, however, introduces annoying voice clipping, and most subjective tests showed that it deteriorates the speech quality even at very low packet loss rates [6],[7]. Many other concealment algorithms depend on the quasi-stationary property of speech (i.e. not much new information is delivered in the duration of a 10 ms to 30 ms lost packet). One of the popular commercial concealment algorithms repeats the speech signal received in the last speech packet. This method performs better than silence substitution, but its quality is still not satisfactory for high-quality applications.

ITU-T has recently standardized (in G.711 Appendix A [1]) a high-quality, low-complexity concealment method for PCM coded speech. This method depends on waveform substitution. The Packet Loss Concealment (PLC) algorithm first performs pitch detection on a sufficient length of speech samples kept in the history buffer (390 samples of 8 kHz-sampled speech). The concealment unit then places the pointer one pitch period backward and copies a speech signal of the duration of the lost packet. This pitch-predicted replica is played in the gap resulting from the missing speech segment. The algorithm also performs an overlap-and-add operation at the transition between the last received good samples and the concealed ones, to ensure a smooth and natural transition and a higher quality for the resulting concealment. However, this results in an added algorithmic delay of 3.75 ms [1]. The algorithm has a low complexity of 0.5 MIPS.

Another standard method is presented in the ANSI standard T1.521-2000 (Annex B) [8]. This method depends on the well-known linear prediction (LP) model to estimate the missing speech waveform. This standard adopts the approach of model-based codecs: it implements a complete analysis/synthesis model to extract the short-term and long-term excitation from the previous correctly received speech. The synthesis unit then uses these parameters along with the most recently received speech samples (as initial conditions for the inverse LP filter) to synthesize an approximation of the missing speech segment. This method introduces an algorithmic delay of 5 ms (half of a 10 ms correct packet) to perform the smoothing transition between the last good speech segment and the beginning of the concealed one. It also has a much higher complexity (2.3 MIPS for a 10 ms packet), which is around 5 times the complexity of ITU-T G.711 Appendix A [6],[8]. The resulting concealment quality of this method is comparable to that of ITU-T G.711 Appendix A [6],[8].

In this paper we present a new receiver-based PLC algorithm for packetized PCM coded speech. It is designed to work with the conventional sampling rate of 8 kHz and a frame size of 10 ms. The proposed algorithm does not require any delay and has an affordable complexity of 1.85 MIPS. The rest of this paper is organized as follows. In Section 2, the concealment model is described. Section 3 presents the quality assessment test for the new method as well as simulation results confirming the improved performance of the proposed algorithm. We then conclude the paper in Section 4.

This work was supported by the Natural Sciences and Engineering Research Council (NSERC), Canada.

2. THE NEW PACKET LOSS CONCEALMENT ALGORITHM

2.1 Prediction Equation

The new LP-based concealment technique is based on prediction with a filter of sufficiently large order to model the speech accurately:

$$S(n) = \sum_{i=1}^{P} a(i)\, S(n-i) + b(n) \qquad (1)$$

where S(n) is the nth speech sample, P is the prediction order (set to 50, as will be explained later), a(i) are the LP coefficients and b(n) is the residual signal. As can be seen from (1), the current speech sample S(n) is composed of two components. The first component is the predictable part, carrying the information of the vocal tract along with the correlation between the current sample and the previous ones. The second component is the residual signal b(n), which contains the current unpredictable excitation. The ideal case is when the LPC filter is capable of accounting for the whole correlation between the current sample and the past samples. In this case, the prediction error is a random excitation signal reflecting the unpredictability of b(n). However, if the LPC filter fails to extract the complete correlation between the successive samples, the residual signal is coloured (i.e. it has some correlation with the original speech signal).
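The decomposition in (1) can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `lp_predict` and `lp_residual` and the toy first-order signal are assumptions chosen only to show how the predictable part and the residual b(n) separate.

```python
def lp_predict(history, a):
    """Predictable part of eq. (1): sum over i = 1..P of a(i) * S(n - i).

    history[-1] is S(n-1), history[-2] is S(n-2), and so on.
    """
    return sum(a_i * history[-i] for i, a_i in enumerate(a, start=1))


def lp_residual(s, a):
    """Residual b(n) = S(n) - prediction, for every n that has P past samples."""
    p = len(a)
    return [s[n] - lp_predict(s[:n], a) for n in range(p, len(s))]


# Toy signal (hypothetical) that obeys S(n) = 0.5 * S(n-1) exactly, so a
# first-order predictor with a(1) = 0.5 leaves no residual.
signal = [8.0, 4.0, 2.0, 1.0, 0.5]
residual = lp_residual(signal, [0.5])

# With a mismatched coefficient the residual stays correlated with the signal,
# which is the "coloured" case described in the text.
coloured = lp_residual(signal, [0.25])
```

When the coefficients capture all of the correlation, `residual` is zero everywhere; with the mismatched predictor, `coloured` is a scaled copy of the signal itself.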

In the case of lost packets, the previous correct speech samples are available and thus the predictable term in (1) can be extracted by linear prediction analysis. However, the input residual signal is unknown at the receiver side. In this case a good choice is to use a small percentage of the pitch-predicted signal as the input excitation for the system. Here, the pitch-predicted signal of the lost frame refers to a Reverse Order Replicated Pitch Period (RORPP) replica of the lost frame, estimated in a manner similar to the concealment algorithm implemented in the ITU-T standard G.711 Appendix A [1]. The added residual signal will then be coloured and have the same waveform as the pitch-predicted signal, and most probably also as the missing speech segment. Thus, using a small percentage of the pitch-predicted signal, we can rewrite (1) as:

$$S(n) = \sum_{i=1}^{P} a(i)\, S(n-i) + \hat{S}(n)\, G \qquad (2)$$

where S(n) denotes the LPC prediction and Ŝ(n) is the pitch-predicted signal obtained from the ITU-T G.711 Appendix A (RORPP) concealment standard. G = 0.01 was found to give the best results in practice.
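As a rough illustration of the recursion in (2), the following Python sketch runs the all-pole filter sample by sample over a lost frame, feeding each concealed sample back as a past sample for the next one. The function name `conceal_lp` and its argument names are hypothetical, and the pitch replica is assumed to be supplied by a separate pitch-prediction step that is not shown.

```python
G = 0.01  # excitation gain reported in the text


def conceal_lp(past, a, pitch_replica, gain=G):
    """Evaluate eq. (2) over one lost frame.

    past:          last correctly received samples (at least len(a) of them)
    a:             LP coefficients a(1)..a(P), with P >= 1
    pitch_replica: pitch-predicted samples, one per sample to conceal
    """
    p = len(a)
    state = list(past[-p:])   # initial conditions S(n-P)..S(n-1)
    out = []
    for s_hat in pitch_replica:
        predicted = sum(a_i * state[-i] for i, a_i in enumerate(a, start=1))
        sample = predicted + gain * s_hat
        out.append(sample)
        state.append(sample)  # concealed samples feed later predictions
        state.pop(0)
    return out


# Toy usage with a first-order filter: with a zero replica the output is the
# free decay of the filter; a non-zero replica adds a small excitation.
free_decay = conceal_lp([4.0], [0.5], [0.0, 0.0, 0.0])
excited = conceal_lp([4.0], [0.5], [100.0, 0.0, 0.0])
```

Because the gain is small, the replica only nudges the output: the LP filter and its initial conditions dominate the shape of the concealed frame.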

Next, we propose to modify the algorithm by using a weighted summation of the short-term prediction $\sum_{i=1}^{P} a(i)\, S(n-i) + \hat{S}(n)\, G$ from (2) and the pitch-based prediction Ŝ(n) to provide a better approximation of the original signal. Thus the final form of the prediction algorithm is:

$$S_1(n) = \sum_{i=1}^{P} a(i)\, S_1(n-i) + \hat{S}(n)\, G \qquad (3)$$

$$S(n) = \alpha\, S_1(n) + \beta\, \hat{S}(n) \qquad (4)$$

where S(n) is the final form of the concealed signal to be played instead of the missing speech frame, and α and β are summation weights that add up to unity. The best results were obtained with α = 0.7 and β = 0.3.

2.2 How the algorithm works

During the normal operation of the PCM decoder (period of no loss), the receiver decodes the received packets and sends the output to the audio port. Meanwhile, in order to support the concealment algorithm, a copy of the decoded output is saved in a history buffer that is 390 samples long. The history buffer is used to calculate the autocorrelation function, estimate both the pitch and the LP coefficients, extract the pitch replica and provide the past samples S(n − i), 1 ≤ i ≤ P, where P is the order of the prediction filter.

A lost speech segment contains at least one lost packet but may contain more. The majority of the computational load of the concealment algorithm is in the first 10 ms of erasure (the first lost frame). Figure 1 shows a block diagram of the principal blocks of the concealment algorithm.

At the start of the erasure period, the pitch detection unit estimates the current value of the pitch by searching among the peaks of the autocorrelation coefficients, calculated as in the ITU-T concealment standard G.711 Appendix A. The samples Ŝ(n) found by this pitch-prediction method are used twice. They are first multiplied by the gain G, which is equal to 0.01. This re-scaled signal is used as the short-term excitation of the speech production model (3). The same signal is weighted by a factor of 0.3 and then added to the output of the synthesis LP filter S1(n), weighted by a factor of 0.7, as in (4).

Meanwhile, the first 50 coefficients of the autocorrelation function of the last 20 ms (160 samples) of speech are calculated. The LP coefficients are calculated in the LP-analysis block, which implements the Levinson-Durbin algorithm for LP estimation. The LP prediction order was chosen to be 50 to cover at least one pitch period in female speech, whose quality has been shown to deteriorate more severely than that of male speech when both are subject to the same loss rates. These 50 coefficients define the poles of the LP-synthesis filter, which is the model of the speech production.

Typically one frame has 80 samples. However, we have modified the model to produce 90 samples per lost frame instead of 80, to allow for a smooth transition between packets. The last 10 samples are the predicted values of the packet following the lost packet. If the next packet is also lost, these values are played as the first concealed samples of that lost frame.
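The LP analysis step described above (autocorrelation followed by the Levinson-Durbin recursion) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, no analysis window or lag window is applied because the text does not mention one, and the sign convention is chosen to match (1).

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for a(1)..a(order) so that
    S(n) is approximated by sum_i a(i) * S(n - i), as in eq. (1)."""
    a = []        # a[i - 1] holds a(i)
    err = r[0]    # prediction error energy
    for m in range(1, order + 1):
        if err <= 0.0:
            a.append(0.0)  # degenerate input: leave higher orders at zero
            continue
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err      # reflection coefficient of order m
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        err *= 1.0 - k * k
    return a


# Toy check: the autocorrelation sequence 1, 0.5, 0.25 belongs to a
# first-order process, so the second coefficient should vanish.
coeffs = levinson_durbin([1.0, 0.5, 0.25], 2)
```

Under these assumptions, the settings in the text would correspond to something like `levinson_durbin(autocorrelation(history[-160:], 50), 50)`, where `history` stands for the 390-sample buffer. Note that an order-50 recursion needs the autocorrelation at lags 0 through 50.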
If instead the next packet is received correctly, these 10 samples are multiplied by a decaying ramp and added to the corresponding first 10 samples of the new correct speech sequence, which are multiplied by a rising ramp. The output of the addition is played instead of the first 10 good samples after the erasure. This cross-fading process guarantees a smooth transition from the concealed speech segment to the good speech packets.

If the erasure lasts more than 10 ms (one packet period), no new parameters are calculated. We re-use the parameters obtained for the concealment of the first lost packet, with the slight modification of changing the long-term estimated period samples, as in ITU-T G.711 Appendix A. In the case of consecutive lost packets, the pitch-predicted replica is multiplied by a decaying ramp starting at the initial value 1 and decaying at a rate of 0.2 per 10 ms. This ramp multiplication introduces a smooth attenuation that increases along the loss period. Eventually, at 60 ms of continuous erasure, the pitch replica and the input residual signal are zero and (3) turns into a no-input LP model that eventually decays due to its stability.
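The two ramps described above can be sketched as below. Several details are assumptions, since the text does not fix them: the attenuation is taken to be linear and to start with the second lost frame (so that it reaches zero at 60 ms, as stated), and the cross-fade ramps are taken to be linear and complementary. The function names are hypothetical.

```python
FRAME = 80  # samples per 10 ms frame at 8 kHz


def replica_gain(n):
    """Gain applied to the pitch replica at sample n of the erasure
    (n = 0 is the first lost sample).

    Assumed shape: 1.0 over the first lost frame, then a linear decay of
    0.2 per frame, clipped at zero (reached at n = 480, i.e. 60 ms).
    """
    if n < FRAME:
        return 1.0
    return max(0.0, 1.0 - 0.2 * (n - FRAME) / FRAME)


def crossfade(predicted_tail, good_head):
    """Blend the extra predicted samples into the first samples of the next
    good packet: falling linear ramp on the predicted samples, rising
    linear ramp on the good ones (the two weights always sum to one)."""
    n = len(predicted_tail)
    out = []
    for i, (old, new) in enumerate(zip(predicted_tail, good_head)):
        up = (i + 1) / (n + 1)
        out.append((1.0 - up) * old + up * new)
    return out


# Toy usage: fade from a constant concealed level of 10.0 to silence
# over the 10 overlap samples.
faded = crossfade([10.0] * 10, [0.0] * 10)
```

Because the weights sum to one, cross-fading two identical segments returns that segment unchanged, which is the behaviour wanted when the prediction happens to match the received speech.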

3. PERFORMANCE OF THE PROPOSED ALGORITHM

The new algorithm is compared to the ITU-T standard concealment tool G.711 Appendix A and to the packet repetition method. The test was performed on a set of speech files from four speakers: two males and two females. Each speaker has 10 speech files, each containing two English sentences with a duration of eight seconds. The format of the files was linear PCM.

The assessment tool used to evaluate the results of the concealment techniques is the Perceptual Evaluation of Speech Quality (PESQ) standard P.862 developed by the ITU-T [9]. It is the newest and most accurate tool [10] among the perceptual-based standards and has been shown to give a reliable estimate of subjective quality tests. The score is given in the range [-0.5, 4.5], similar to the standard Mean Opinion Score (MOS) scale.

A random loss test was performed at loss rates of 5%, 10% and 25%. Figures 2-4 summarize the average results for the three loss rates. We can see from these figures that the performance of the new algorithm is superior to both the existing ITU-T standard and packet repetition. The performance of the packet repetition method is much worse than that of both the new algorithm and the ITU-T concealment standard. A small but significant and almost steady margin separates the new algorithm from the ITU-T standard. This margin represents the performance gain of incorporating the LP model with the plain long-term pitch-repetition-based concealment standard. Extensive tests with periodic loss patterns were also performed and produced nearly identical results. When comparing the concealment algorithms, it should also be noted that the proposed new algorithm does not introduce any delay, as opposed to the ITU-T standard (3.75 ms algorithmic delay, as noted in Section 1).

4. CONCLUSION

In this paper we introduced a new concealment algorithm for PCM packetized speech with a 10 ms packet length. The model implemented in (3) and (4) provides very encouraging results for the idea of combining pitch prediction with LP-based prediction to produce the concealed speech segments. The PESQ-MOS scores obtained for the random loss tests showed that the algorithm outperforms both the existing commercial packet repetition technique and the ITU-T standardized concealment technique in all the cases tested.

Fig. 1. Block diagram of the new algorithm for the first lost packet. [Figure not reproduced. Block labels: 240 samples from speech buffer (30 ms); pitch detector; pitch period long prediction; overlap buffer; gain G; weights 0.7 and 0.3; autocorrelation unit; LP analysis; inverse LP filter; initial conditions (last 50 samples); reconstructed signal (last 10 samples, other 80 samples); speaker.]

Fig. 2. The average results for 5% random packet loss (new algorithm: ♦, G.711-A: ■, packet repetition: ▲). [Plot not reproduced. Vertical axis: PESQ-MOS; horizontal axis: Male1, Male2, Female1, Female2.]

Fig. 3. The average results for 10% random packet loss (new algorithm: ♦, G.711-A: ■, packet repetition: ▲). [Plot not reproduced. Vertical axis: PESQ-MOS; horizontal axis: Male1, Male2, Female1, Female2.]

Fig. 4. The average results for 25% random packet loss (new algorithm: ♦, G.711-A: ■, packet repetition: ▲). [Plot not reproduced. Vertical axis: PESQ-MOS; horizontal axis: Male1, Male2, Female1, Female2.]

5. REFERENCES

[1] ITU-T Recommend. G.711 Appendix A, "A High Quality Low-Complexity Algorithm for Packet Loss Concealment with G.711", Nov. 2000.
[2] ITU-T Recommend. G.729, "Coding of Speech at 8 kb/s Using Conjugate-Structure Algebraic-Code-Excited Linear-Prediction (CS-ACELP)", Mar. 1996.
[3] ITU-T Recommend. G.723.1, "Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 and 6.3 kb/s", Mar. 1996.
[4] Montminy C. and Aboulnasr T., "Improving the Performance of ITU-T G.729A for VoIP", International Conference on Multimedia Exposition 2000 (ICME 2000), 30 July - 2 Aug. 2000, New York, NY, USA, vol. 1, pp. 433-436.
[5] ITU-T Recommend. G.711, "Pulse Code Modulation (PCM) of Voice Frequencies", Nov. 1998.
[6] Gunduzhan E. and Momtahan K., "A Linear Prediction Based Packet Loss Concealment Algorithm for PCM Coded Speech", IEEE Transactions on Speech and Audio Processing, vol. 9, no. 8, Nov. 2001, pp. 778-785.
[7] Hassan M. and Nayandoro A., "Internet Telephony: Services, Technical Challenges, and Products", IEEE Communication Magazine, April 2000, pp. 96-103.
[8] ANSI Recommend. T1.521-2000 (Annex B), "Packet Loss Concealment Algorithm for Use with ITU-T Recommendation G.711", July 2000.
[9] ITU-T Recommend. P.862, "Perceptual Evaluation of Speech Quality (PESQ), an Objective Method for End-to-End Speech Quality Assessment of Narrowband Telephone Network and Speech Codecs", May 2000.
[10] A.W. Rix et al., "Perceptual Evaluation of Speech Quality (PESQ) - A New Method for Speech Quality Assessment of Telephone Networks and Codecs", ICASSP 2001, May 7-11 2001, Salt Lake City, UT, USA, vol. 2, pp. 749-752.
