An E-Model Implementation for Speech Quality Evaluation in VoIP Systems Leandro Carvalho, Edjair Mota, Regeane Aguiar Federal University of Amazonas (UFAM) Av. Rodrigo Octavio, 3000 - 69077-000 - Brazil [email protected], {edjair,rba}@dcc.fua.br

Ana F. Lima, José Neuman de Souza Federal University of Ceará (UFC) Av. Universidade, 2853 - 60020-181 - Brazil [email protected], [email protected]

Anderson Barreto Nokia Institute of Technology (INdT) Rod. Torquato Tapajós, 7200 - 69048-660 - Brazil [email protected]

Abstract

This article presents a voice quality measurement tool based on the ITU-T E-Model. First, the ITU-T and ETSI specifications of the E-Model are briefly reviewed and some errors found in these documents are pointed out. Then, a measurement tool that incorporates the corrections is described. VoIP calls through the backbone of the Brazilian National Education and Research Network (RNP) were used to verify the operation of the tool.

1. Introduction

In recent years, voice over Internet Protocol (VoIP) has become an important application running over TCP/IP networks. Real-time voice applications such as VoIP require low delay and low packet loss rates, so that neither the interaction between talkers nor speech intelligibility is impaired. However, IP networks offer a best-effort service and do not guarantee that these requirements are met. Thus, impairments such as packet loss, delay and jitter affect the end-to-end speech quality. QoS mechanisms and application-level control have been developed to overcome these problems and maximize call quality. However, across different and heterogeneous network scenarios, how can we measure their effectiveness under given conditions?

The earliest approach to this issue was to apply subjective tests to evaluate the perceived voice quality. The Mean Opinion Score (MOS) test is a widely accepted standard for speech quality rating. The MOS test procedures are presented in ITU-T Rec. P.800 [9], by which users rate the speech quality on a scale from 1 (poor quality) to 5 (excellent). The number of listeners has to be large enough that individual ratings do not skew the mean score. Thus, subjective MOS tests are time-consuming and expensive, and they do not permit real-time measurements [1,3].

In recent years, some methods for estimating MOS objectively have been developed. ITU-T Rec. G.107 [8] defines the E-Model, a computational model that combines all impairment parameters affecting a voice call into a single factor, which can be converted to the MOS scale. The recent RTP Control Protocol Extended Reports (RTCP-XR), defined in RFC 3611 [7], propose a scheme to exchange voice quality information given by the E-Model calculation in order to enable feedback responses. Further, ITU-T is working on a new VoIP voice quality measurement standard, the P.VTQ, mainly based on the E-Model and expected to be approved in early 2005.

Thus, the objective of this work is to implement a measurement tool based on the E-Model, in order to support research projects aimed at improving speech quality. We found some errors in the E-Model specifications and suggest corrections for them.

This paper is organized as follows. Section 2 briefly describes the E-Model. Section 3 lists the errors that were found during the E-Model implementation and suggests corrections for them. Section 4 describes the measurement tool and the scenario used to observe it in operation. Section 5 presents the results obtained by using the measurement tool over the Brazilian National Education and Research Network (RNP). Finally, Section 6 presents our conclusions.

2. The E-Model

2.1. The ETSI Extended E-Model

The E-Model is based on the concept that psychological factors on the psychological scale are additive [8]; that is, each impairment factor affecting a voice call can be computed separately. This does not imply that such factors are uncorrelated, only that their contributions to the estimated impairment are separable [2]. The resulting score is the transmission rating factor R, a scalar measure that ranges from 0 (poor) to 100 (excellent). R factor values below 60 are not recommended [2]. According to [8], the R factor is related to MOS as follows:

$$
\mathrm{MOS} =
\begin{cases}
1, & R < 0 \\
1 + 0.035\,R + 7 \times 10^{-6}\,R\,(R - 60)(100 - R), & 0 \le R \le 100 \\
4.5, & R > 100
\end{cases}
\tag{1}
$$

The R factor can be obtained through the following expression [8]:

$$ R = R_o - I_s - I_d - I_e + A \tag{2} $$

where Ro represents the basic signal-to-noise ratio (SNR); Is represents the combination of all impairments which occur more or less simultaneously with the voice signal; Id represents the impairments caused by delay; Ie represents the impairments caused by low bit rate codecs; and A is the advantage factor, which corresponds to the user's allowance for the convenience of using a given technology.

Since the codec in use is accounted for through the Ie factor, we only need to capture network statistics (delay and loss) and application statistics (dejitter buffer delay, codec in use) to estimate the speech quality by means of the R factor expression (eq. 2). This is the main reason for adopting the E-Model as a measurement tool. The E-Model not only takes into account the transmission statistics (transport delay and network packet loss), but also considers the characteristics of the voice application, such as the codec quality, the codec robustness against packet loss and the discard of late packets. According to [1,2,10], eq. 2 can be reduced to the following expression:

$$ R = 93.4 - I_d(T_a) - I_e(\mathrm{codec}, \mathrm{loss}) \tag{3} $$

where Id is a function of the absolute one-way delay and Ie is, in short, a function of the codec type and the packet loss rate.

A new approach for determining the equipment impairment factor Ie in eq. 3 was presented in [1]. This work was adopted by the ETSI TIPHON group in [4,5,6] as an extended version of the E-Model originally presented in [8]. RFC 3611 specifies an RTCP extended report, the RTCP-XR, containing instantaneous values of R based on the extended E-Model.

The Ie factor depends on the packet loss rate, and in [8] packet loss is assumed to be uniformly distributed over the whole call. This may hold for circuit-switched telephony, but not for packet-switched networks such as TCP/IP. Therefore, some new concepts are introduced in [1]:

Instantaneous quality. This is the measured voice quality due to packet loss, delay, codec type and other impairments at some moment during the call.

Perceived quality. This corresponds to the voice quality that would be reported by the user at some time during the call.

Time-varying packet loss behavior. IP packet loss is bursty in nature and, according to [1], it oscillates between burst and gap states. A burst is defined as a period of time bounded by lost and/or discarded packets with a high rate of losses and/or discards. A gap, in turn, is a period of time between two bursts [7]. Packet loss concealment (PLC) mechanisms work well in periods of low packet loss, but not in a burst state.

Recency effect. In MOS tests carried out by telecom operators, it was noticed that the perceived quality varies according to the position of the packet loss within the conversation [1]. Hence, impairments that occur at the end of a call have a more negative psychological effect on the listener than impairments that occur at the beginning of a call [1,6]. According to [1], if the instantaneous voice quality changes from good to bad at some moment during the call, the user would initially not be too concerned. After some time, however, the listener would become annoyed with the degradation. In [1] this recency effect is modeled by an exponential curve with a time constant of 5 s in the transition from good to bad and 15 s in the transition from bad to good.

To represent the alternating behavior of packet loss, [1,11] use a 4-state Markov chain, which represents the conditions of receiving or losing a packet within high packet loss (burst) or low packet loss (gap) conditions. Thus, the Ie factor can be obtained separately for burst (Ieb) and gap (Ieg) periods. Ieb and Ieg are both known as instantaneous Ie factors. The perceived Ie factor, as said before, is modeled by an exponential curve, as follows (see Fig. 1):

$$ I_1 = I_{eb} - (I_{eg} - I_2)\, e^{-b/t_1} \tag{4} $$

$$ I_2 = I_{eg} + (I_1 - I_{eg})\, e^{-g/t_2} \tag{5} $$

where I1 is the quality level at the change from burst condition to gap condition (Ieb to Ieg); I2 is the quality level at the change from gap condition to burst condition (Ieg to Ieb); b is the burst duration (in seconds); g is the gap duration (in seconds); and, typically, t1 = 5 s and t2 = 15 s.

Figure 1 - Recency effect representation.

According to [1] and observing Fig. 1, the recency effect model can be explained as follows: starting from an exit value I2, the Ie impairment factor rises exponentially toward Ieb, with time constant t1, in the transition from a gap period to a burst period; starting from an exit value I1, the Ie impairment factor decays exponentially toward Ieg, with time constant t2, in the transition from a burst period to a gap period. Clark integrates the I1 and I2 expressions in [1] to obtain an average Ie (Ie(av)), which corresponds to Ie(codec, loss) in eq. 3. Note that there can be many Ie(av) values during a call, since there is one Ie(av) value for each pair of gap and burst periods.

2.2. The end of call quality

Since delay and loss rates can vary during the call, we can have many values of Id and Ie(av) and, consequently, of R and MOS for the same call. Sometimes, however, we need a single score that describes the whole call. Thus, the final Id is the weighted average of the instantaneous Id values, where the weight of each Id value is the time it lasted during the call. Similarly, we can calculate a weighted value of Ie(av). However, the Ie value at the end of the call (Ie(end_of_call)) is calculated by considering the recency effect once more. According to [1], Ie(end_of_call) is given by:

$$ I_e(\text{end\_of\_call}) = I_e(\text{weig}) + k\,\bigl(I_1 - I_e(\text{weig})\bigr)\, e^{-y/t_3} \tag{6} $$

where Ie(weig) is the weighted value of Ie(av), I1 represents the exit value from the last significant burst of packet loss, t3 is a time constant of typically 30 s, y represents the time elapsed since the last burst period and k is a constant set to a nominal value of 0.7 [1,3,6]. The R factor at the end of the call is obtained using eq. 2 and the final MOS score is given by eq. 1.

3. Corrections on E-Model implementation

While implementing the measurement tool, we found some mistakes in the ITU-T and ETSI specifications of the E-Model [3,10]. We briefly review them here. In general, these errors do not affect the essence of the E-Model, but they represent inconsistencies between some E-Model statements and the way they are recommended to be implemented.

3.1. R to MOS conversion

The meticulous reader will notice that eq. 1 yields MOS < 1 for 0 < R < 6.5. However, MOS is defined in [9] as a value between 1 and 5. So, we proposed in [3] the following correction to eq. 1:

$$
\mathrm{MOS} =
\begin{cases}
1, & R < 6.5 \\
1 + 0.035\,R + 7 \times 10^{-6}\,R\,(R - 60)(100 - R), & 6.5 \le R \le 100 \\
4.5, & R > 100
\end{cases}
\tag{7}
$$

As noted before, R factor values below 60 are not recommended. Thus, in practice, a call with R = 8 is not much better than a call with R = 1. However, a MOS score below 1 does not exist by definition, and the measurement tool must not produce such results.
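The corrected conversion can be sketched in a few lines of Python. This is only an illustration of eq. 7, not the code of the measurement tool, and the function name `r_to_mos` is hypothetical. Because the polynomial actually crosses MOS = 1 slightly above R = 6.5 (near R = 6.52), the sketch also clamps the polynomial at 1; that clamp is an extra safeguard assumed here, not part of eq. 7.

```python
def r_to_mos(r: float) -> float:
    """Convert an E-Model R factor to MOS following the corrected eq. 7."""
    if r < 6.5:
        return 1.0   # below this threshold the polynomial would fall under 1
    if r > 100:
        return 4.5   # upper bound of the estimated MOS
    mos = 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)
    # Assumed safeguard: the polynomial is marginally below 1 just above 6.5.
    return max(1.0, mos)


if __name__ == "__main__":
    for r in (0, 6.5, 60, 80, 93.4, 100):
        print(f"R = {r:5.1f} -> MOS = {r_to_mos(r):.2f}")
```

With this form, every R value maps to a score in the range from 1 to 4.5, so a plot of instantaneous MOS never dips below the lowest score defined in [9].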

3.2. I1 transition impairment miscalculation

As seen before, according to the Extended E-Model [1], the quality transition between two instantaneous values of Ie is modeled by an exponential curve. During a burst, Ie starts at I2 and approaches Ieb, so the exponential term must act on the difference between Ieb and I2. Thus, if we derive the expression for I1 from Fig. 1, the correct formula is:

$$ I_1 = I_{eb} - (I_{eb} - I_2)\, e^{-b/t_1} \tag{8} $$

The wrong expression in eq. 4 can alter the calculation of Ie(av) and, as a consequence, of R. Nevertheless, this mistake was carried over into [4] and its later revisions [5,6]. It was first reported in [10].
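To make the corrected transition concrete, the following Python sketch solves eq. 8 together with eq. 5 for I1 and I2 and then averages the resulting curve over one burst and one gap. It rests on two assumptions that are ours: the same burst/gap pair repeats (so the two equations can be solved simultaneously), and the average is our own integration of the two exponential segments described above, which may differ in form from the expression given in [1]. The function names are hypothetical.

```python
import math
from typing import Tuple

T1 = 5.0   # s, time constant for the gap-to-burst transition (typical value)
T2 = 15.0  # s, time constant for the burst-to-gap transition (typical value)


def recency_levels(ieb: float, ieg: float, b: float, g: float,
                   t1: float = T1, t2: float = T2) -> Tuple[float, float]:
    """Solve eq. 8 and eq. 5 simultaneously for (I1, I2)."""
    if b <= 0 or g <= 0:
        raise ValueError("burst and gap durations must be positive")
    a = math.exp(-b / t1)
    c = math.exp(-g / t2)
    # Substituting eq. 5 into eq. 8 gives a linear equation in I1.
    i1 = (ieb * (1 - a) + a * ieg * (1 - c)) / (1 - a * c)
    i2 = ieg + (i1 - ieg) * c
    return i1, i2


def average_ie(ieb: float, ieg: float, b: float, g: float,
               t1: float = T1, t2: float = T2) -> float:
    """Time average of the perceived Ie over one burst plus one gap."""
    i1, i2 = recency_levels(ieb, ieg, b, g, t1, t2)
    # Burst: Ie(t) = Ieb - (Ieb - I2) * exp(-t / t1), integrated over 0..b.
    burst_area = b * ieb - t1 * (ieb - i2) * (1 - math.exp(-b / t1))
    # Gap: Ie(t) = Ieg + (I1 - Ieg) * exp(-t / t2), integrated over 0..g.
    gap_area = g * ieg + t2 * (i1 - ieg) * (1 - math.exp(-g / t2))
    return (burst_area + gap_area) / (b + g)


if __name__ == "__main__":
    i1, i2 = recency_levels(ieb=40.0, ieg=10.0, b=5.0, g=15.0)
    print(f"I1 = {i1:.2f}, I2 = {i2:.2f}")
    print(f"Ie(av) = {average_ie(40.0, 10.0, 5.0, 15.0):.2f}")
```

A quick sanity check of the corrected form is that both exit values stay between Ieg and Ieb, and that Ie(av) equals the common value when Ieb and Ieg are equal.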

3.3. Burst period packet-loss probability

Basically, the packet loss rate is the ratio between the number of lost packets and the total number of expected packets. By listening to the network interface, we can measure the overall packet loss rate. By using the Markov chain model, we can separate the packet loss rate under gap conditions from that under burst conditions. In [4,5,6], counters are used to determine the transition probabilities of the Markov model. These counters have to be combined to give the packet loss rate for both gap and burst periods. However, according to [3], the specification [4] and its later revisions [5,6] go astray in these calculations: one of the counters needed to determine the packet loss rate in burst periods is counted twice, leading to an erroneous value of Ieb. We refer the reader to [3] for further details; by manipulating the counters presented in [4,5,6] so as to obtain the packet loss rate as defined above, the reader will readily arrive at the correct result.
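The definition of loss rate used above (lost packets over expected packets, taken separately per state) can be illustrated with a short Python sketch. It does not reproduce the counters of [4,5,6]; instead it labels packets directly, under assumptions of ours: a burst runs from its first to its last lost packet and contains fewer than `GMIN` consecutive received packets, an isolated loss counts as a gap loss, and `GMIN = 16` is borrowed from the recommended gap threshold in RFC 3611 [7]. The function name and the boolean trace format are hypothetical.

```python
from typing import List, Tuple

GMIN = 16  # assumed gap threshold, the value recommended in RFC 3611


def burst_gap_loss_rates(lost: List[bool],
                         gmin: int = GMIN) -> Tuple[float, float]:
    """Return (burst_loss_rate, gap_loss_rate) for a per-packet loss trace.

    lost[i] is True when packet i was lost or discarded.
    """
    loss_idx = [i for i, is_lost in enumerate(lost) if is_lost]

    # Group losses separated by fewer than gmin received packets.
    groups: List[List[int]] = []
    for i in loss_idx:
        if groups and i - groups[-1][-1] - 1 < gmin:
            groups[-1].append(i)
        else:
            groups.append([i])

    burst_pkts = 0
    burst_lost = 0
    for group in groups:
        if len(group) >= 2:  # an isolated loss stays in the gap state
            burst_pkts += group[-1] - group[0] + 1
            burst_lost += len(group)

    gap_pkts = len(lost) - burst_pkts
    gap_lost = len(loss_idx) - burst_lost
    burst_rate = burst_lost / burst_pkts if burst_pkts else 0.0
    gap_rate = gap_lost / gap_pkts if gap_pkts else 0.0
    return burst_rate, gap_rate


if __name__ == "__main__":
    trace = ([False] * 20 + [True, False, True, True]
             + [False] * 20 + [True] + [False] * 20)
    print(burst_gap_loss_rates(trace))
```

Because every lost packet is assigned to exactly one state, no loss can be counted twice, which is the property the corrected counter manipulation has to preserve.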

4. Voice Quality Measurements

VoIP researchers and developers are constantly creating and improving mechanisms that conceal or compensate for voice quality problems in VoIP networks. Measurement tools are needed to verify how well these mechanisms work. In this section, we describe how our measurement tool works and the scenarios in which it was tested.

4.1. The measurement tool

The measurement tool is open source and written in C. It takes as input a call trace file containing delay, loss, codec type and frame duration information for each packet exchanged between two endpoints. The output is another file containing Id, Ie, delay, loss, R and instantaneous MOS values. These values can be plotted to give a visual picture of how the voice quality varies along the call. It is not a real-time measurement tool, but it can be modified to work as one.

The input trace file must be taken at the VoIP client after the dejitter buffer, not at the network interface. The dejitter buffer adds some delay before the packets can be played, after the voice stream has passed the network interface. Besides, late packets that arrive at the network interface can be discarded at the dejitter buffer. So, only at the application level can we be sure about the real delay and packet loss that would be perceived by the listener.

Although we have employed an application from the OpenH323 Project (callgen323) [13], other H.323 clients or even SIP clients can be evaluated with our measurement tool. The trace file generated by the VoIP client has to be processed by a script before it can be passed as input to the tool, so that it follows a standard structure. In this way, we ensure that the measurement tool is portable.
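The difference between loss seen at the network interface and loss seen by the application can be illustrated with a small Python sketch. It is not part of the tool: it assumes a fixed playout deadline (real dejitter buffers are often adaptive), and the trace representation, where `None` marks a packet lost in the network, is hypothetical.

```python
from typing import List, Optional, Tuple


def application_level_loss(net_delay_ms: List[Optional[float]],
                           playout_deadline_ms: float
                           ) -> Tuple[float, float, float]:
    """Return (network_loss, late_discard, effective_loss) as fractions.

    net_delay_ms[i] is the network delay of packet i, or None if the
    packet never arrived. A packet arriving after the (assumed fixed)
    playout deadline is discarded by the dejitter buffer.
    """
    total = len(net_delay_ms)
    if total == 0:
        raise ValueError("empty trace")
    network_lost = sum(1 for d in net_delay_ms if d is None)
    late = sum(1 for d in net_delay_ms
               if d is not None and d > playout_deadline_ms)
    return network_lost / total, late / total, (network_lost + late) / total


if __name__ == "__main__":
    delays = [20.0, None, 80.0, 35.0, 120.0]
    print(application_level_loss(delays, playout_deadline_ms=100.0))
```

In this toy trace, one packet is lost in the network and one more is discarded for arriving late, so the loss that matters for the Ie factor is twice what a capture at the network interface would show.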

4.2. Scenario under measurement

In order to verify the operation of the measurement tool, we have to generate VoIP calls under known QoS conditions and evaluate their voice quality using the tool. A significant amount of delay and packet loss has to be present in the call, in alternating burst and gap conditions; otherwise we would always obtain excellent MOS scores (about 4.5) and the measurement tool would not be completely tested. Thus, we used the scenario illustrated in Fig. 2 to generate some calls that could be analyzed by the measurement tool.

Figure 2 - Test bed scenario.

The scenario is composed of two H.323 endpoints, each one registered with a Gatekeeper. One endpoint (092-647-4612) is located at UFAM, in the city of Manaus, and the other (085-288-4371) is located at UFC, in the city of Fortaleza. The Gatekeepers are interconnected through the RNP2 backbone. UFC (at Fortaleza) and UFAM (at Manaus) are not directly interconnected. Not only does the physical link pass through the RNP Management Center at Rio de Janeiro, but a Directory Gatekeeper (DGK) is also located there. The connections are not symmetric. UFAM has four 2 Mbps ATM/Frame Relay links with RNP at Rio. UFC has a 34 Mbps PDH link with RNP at Rio, as shown in Fig. 2.

The tests were carried out by playing a pre-recorded .wav audio file from one endpoint to the other. This file is a speech report, available at [12]. It contains human speech, with pauses among the regular voice activity; it does not contain background music or noise, and its duration is above 5 min, like a normal phone call. The voice extracted from the .wav file was coded by callgen323 with the G.711 µ-law codec, with a 30 ms frame per packet. The G.711 bit rate is 64 kbps, without headers [3]. The RTP, UDP and IP headers sum to a total of 40 bytes [3]. Each 30 ms frame therefore carries 240 bytes of payload, so each packet has 280 bytes at the IP level and the bandwidth required by each voice channel was about 75 kbps.

In the first part of the experiment, we generated calls from UFAM to UFC. We started with a voice stream containing a single call and then repeated the procedure using 5, 10, 20, 30 and 40 simultaneous calls per stream. In the second part, we did the same in the other direction, from UFC to UFAM, in order to compare the quality scores. In this part of the experiment, all the calls were 3 min long.

It is important to note that the media stream went through the network backbone in only one direction in each situation. Considering that in a real conversation one person normally speaks while the other listens, this approach is not far from reality.

5. Numerical Results

The output file of the measurement tool contains the instantaneous MOS values of a single call. For a stream of n simultaneous calls, there are n output files. In Fig. 3 we plot the MOS variation during two of 10 simultaneous calls from UFC to UFAM.

Figure 3 - MOS scores for 10 simultaneous calls from UFC to UFAM.

Although calls #5 and #6 belong to the same voice stream of 10 simultaneous calls, their MOS curves are not the same, because of the random conditions in the network. We chose these two calls because they show perceptible MOS variation and because there is not enough space to show all ten. The last line of the output file contains the average MOS value for the whole measured call. For each stream of simultaneous calls, we took the mean and the standard deviation of the average MOS over all calls in the stream. This result is plotted in Fig. 4.

Figure 4 - Average MOS scores for calls flowing from UFC to UFAM.

As can be seen from Fig. 4, the mean MOS value decreases as the number of simultaneous calls from UFC to UFAM increases. Besides, the error bars grow as the number of simultaneous calls increases, indicating that the calls in the same stream are not equally affected. Indeed, some calls have much better quality scores than others. In the case of 40 simultaneous calls, the network backbone was so stressed that the voice quality scores of all calls are close to 1, the lowest possible value by definition. Thus, the standard deviation is not as large as expected. The MOS variation during two of 10 simultaneous calls established from UFAM to UFC is plotted in Fig. 5.

Figure 5 - MOS scores for 10 simultaneous calls from UFAM to UFC.

As can be seen by comparing Figs. 3 and 5, not only did the two calls from UFAM to UFC (Fig. 5) maintain the same quality level throughout their duration, but this MOS level was also much better than the average MOS obtained in the voice calls in the opposite direction. The different average MOS scores for opposite flows were expected because the IP network is not symmetric: downlink and uplink throughput are different. In addition, other applications were using the RNP backbone at the time of the experiments, the RNP network core at Rio de Janeiro handles many other nodes besides those located at UFAM and UFC, and the routers do not have QoS schemes enabled, which causes high packet discard at traffic peaks. To complete our analysis, we plotted in Fig. 6 the average MOS scores for the calls flowing from UFAM to UFC, the opposite direction to Fig. 4.

Figure 6 - Average MOS scores for calls flowing from UFAM to UFC.

Again, the average MOS decreases as the number of simultaneous calls from UFAM to UFC increases. However, the voice quality levels are higher than those presented in Fig. 4 for the same number of simultaneous calls, showing again the asymmetric characteristic of the network backbone.

6. Conclusions

We have implemented a measurement tool based on the ITU-T E-Model and its extension proposed by ETSI. We found three mistakes in the E-Model calculation, as defined in the ITU-T and ETSI specifications, and corrected them in the tool. These errors do not affect the essence of the E-Model, but they represent inconsistencies between some E-Model statements and the way they are recommended to be implemented. We tested the measurement tool on calls generated through the RNP backbone, between two endpoints located in different Brazilian cities. The tool was able to capture the asymmetric behavior of the backbone. As future work, we plan to use the measurements produced by the tool (MOS, packet loss, delay, etc.) as input to an intelligent agent that can make decisions on traffic management (voice and data) in order to improve the voice quality.

Acknowledgments

The authors would like to acknowledge the collaboration of Wellington Albano and Francisco Dalla, at PoP-CE, and Eduardo Bezerra, at PoP-AM, in configuring the equipment and running the experiments. In particular, we are grateful to the Nokia Institute of Technology (INdT) for supporting our research activities.

References

[1] A. D. Clark. Modeling the Effects of Burst Packet Loss and Recency on Subjective Voice Quality. Columbia University IP Telephony Workshop, 2001.
[2] R. G. Cole and J. H. Rosenbluth. Voice over IP Performance Monitoring. ACM SIGCOMM, 2001.
[3] L. S. G. de Carvalho. An E-Model Implementation for Objective Speech Quality Evaluation of VoIP Communication Networks. Master Thesis, 2004.
[4] ETSI TS 101 329-5 v1.1.1. Quality of Service (QoS) Measurement Methodologies. 2000.
[5] ETSI TS 101 329-5 v1.1.2. Quality of Service (QoS) Measurement Methodologies. 2002.
[6] ETSI TS 102 024-5 v4.1.1. Quality of Service (QoS) Measurement Methodologies. 2003.
[7] T. Friedman, R. Caceres, A. Clark, K. Almeroth, R. G. Cole, and N. Duffield. RTP Control Protocol Extended Reports (RTCP XR). IETF RFC 3611, 2003.
[8] ITU-T Rec. G.107. The E-Model, A Computational Model For Use in Transmission Planning. 2003.
[9] ITU-T Rec. P.800. Methods For Subjective Determination of Transmission Quality. 1996.
[10] L. C. G. Lustosa, L. S. G. Carvalho, P. H. A. Rodrigues, and E. S. Mota. E-Model Utilization For Speech Quality Evaluation Over VoIP-Based Communication Systems. 22nd SBRC, 2004.
[11] A. P. Markopoulou, F. A. Tobagi, and M. J. Karam. Assessment of VoIP Quality over Internet Backbones. IEEE INFOCOM, 2002.
[12] NASA. Haughton-Mars Project Home Page. http://www.marsonearth.org/interactive/reports/2001.
[13] OpenH323 Project. http://www.openh323.org/.
