Derivation of Speech Activity Parameter Values in the Context of Speech Quality Testing
Peter Počta1, Miroslava Mrvová1
1 Dept. of Telecommunications and Multimedia, FEE, University of Žilina, Univerzitná 1, SK-01026, Žilina, Slovakia, {pocta, mrvova}@fel.uniza.sk
Abstract
As shown by many scientific papers, time-varying impairments play a crucial role in VoIP applications. On the other hand, the reference signals used for speech quality assessment are characterized by the following parameters: the length of the signal and the speech activity parameter. Despite the fact that the activity parameter is one of the important characteristics of reference signals for objective speech quality measurements defined in Section 7 of ITU-T Recommendation P.862.3 (also considered in the brand new ITU-T Recommendation P.863) and has been proven to be a crucial input parameter for subjective and objective speech quality assessment in the presence of packet loss, exact values of this parameter for different conversation scenarios are still missing. This study addresses this shortcoming by deriving a formula for computing the activity parameter of an arbitrary conversation scenario. The applicability of the proposed formula is demonstrated. Finally, other issues related to creating reference speech samples for speech quality assessment (number of speech utterances, sample pattern, etc.) and potential application areas of the derived formula are pointed out.
Keywords
Speech quality assessment, Time-varying impairments, Packet loss, Reference signal characteristic, Speech activity parameter.
1 Introduction
Voice over Internet Protocol (VoIP), the transmission of packetized voice over IP networks, has gained much attention in recent years. It is expected to carry more and more voice traffic because of its cost-effectiveness. However, the present-day Internet, which was originally designed for data communications, provides best-effort service only, which poses several technical challenges for real-time VoIP applications. In this case, speech quality is mainly impaired by packet loss and jitter. Packet loss is the most crucial of the possible time-varying distortions, since it degrades speech quality considerably and is directly linked to the packet-based transmission technique. It was found that this type of time-varying distortion (packet loss) differs from stationary degradations in at least one perceptual dimension (referred to as ‘continuity’ or ‘smoothness of speech’; [1]). Moreover, the instantaneous perception of time-varying quality was also investigated for two different underlying quality elements (signal-correlated noise and packet loss, respectively; [2, 3]). The study by Gros and Chateau [2] and similar approaches have related the integral speech quality of longer passages to speech quality ratings obtained instantaneously or obtained for shorter passages [4-6]. Additional results regarding time-varying speech quality and distortions are presented in [7, 8], respectively.

In general, telecommunication networks, regardless of the deployed transmission technique, were designed to transfer any type of speech communication. In particular, the main characteristics of speech communication can vary according to the purpose (getting information or solving problems) and the type of communication (free conversation or expressive communication). For instance, a call aimed at getting travel information (e.g. a rail travel information service) can be characterized by moderate or low interactivity (depending on the issue), mainly a few longer speech utterances and a short duration. In contrast, a business call (two partners expressively discussing business issues) can be described by high interactivity, a high number of shorter speech passages and a longer call than in the previous case. The interactivity issue was well studied by Hammer et al., for instance in [9]. They defined the notion of “conversational temperature” as an intuitive scalar metric for the interactivity of communication.

The amount of active speech contained in a call can be defined as speech activity. In some literature, speech activity is also called the Active-Speech-Ratio or the speech activity parameter. Both are defined as the ratio, in percent, between the amount of active speech and the length of the communication or sample. Despite the fact that the speech activity parameter is one of the important characteristics of reference signals for objective speech quality measurements defined in Section 7 of ITU-T Recommendation P.862.3 [10] (also considered in the brand new ITU-T Recommendation P.863) and has been proven to be a crucial input parameter for subjective and objective speech quality assessment in the presence of packet loss [11-13], no available work has specified this parameter with regard to different conversation scenarios or types of communication. On the basis of the facts mentioned above, it seems important to define this parameter for all types of communication carried in current telecommunication networks. In doing so, we might also refine the broadly recommended range of Active-Speech-Ratios (40% - 80%) defined by ITU-T Recommendation P.862.3 and enable more reliable speech quality assessment by subjective tests and objective models such as P.862, P.563 and P.863, especially for time-varying impairments.

In this paper, we start with a parametric analysis of conversations. On the basis of this analysis, we derive a formula for computing the speech activity parameter of an arbitrary conversation scenario. In order to demonstrate the applicability of this formula, conversation tests are carried out for conversation scenarios that frequently occur at present. Finally, other issues with respect to producing reference samples for speech quality assessment (number of speech utterances, sample pattern, etc.) and application areas of the derived formula are discussed.

The rest of the paper is organized as follows. Section 2 introduces the parametric analysis of conversations. Section 3 presents the derivation of the formula for calculating the activity parameter of an arbitrary conversation scenario. Sections 4 and 5 describe the conversation tests and their results, respectively. Section 6 concludes the paper.
2 Parametric analysis of conversations
2.1 Conversation model
According to [14, 15], a two-way conversation can be divided into four different states, as illustrated in Figure 1.
Fig. 1 Illustration of conversation states (adopted from Hammer et al. [9])

States A and B denote that either person A or person B is talking while the other person does not speak. State M (mutual silence) reflects the situation in which both persons are silent, and state D (double talk) represents the case in which both persons talk simultaneously.

2.2 Parameters of conversation
In principle, an arbitrary conversation can be described by the following parameters:
• Type of communication (free communication, task oriented communication, expressive communication, etc.),
• Length of conversation,
• Number of talkspurts in conversation,
• Sojourn times in corresponding states (active speech of A or B or both (double talk), mutual silence),
• Number of state changes (see Figure 1),
• Importance of direction of communication.

An analysis of the parameters listed above revealed that only one of them, the number of state changes, has no impact on the speech activity parameter. This parameter is related to the interactivity of the conversation and can also be characterized by the conversational temperature defined by Hammer et al. in [9]. The remaining parameters are involved in deriving the formula for computing the activity parameter of an arbitrary conversation scenario.
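The four-state model and the sojourn times listed above can be illustrated with a minimal Python sketch. It assumes that per-frame voice-activity flags for both talkers are already available (the paper does not specify how they are obtained); the function names, the 0.5 s frame length and the example flags are hypothetical and serve illustration only.

```python
from itertools import groupby


def conversation_states(active_a, active_b):
    """Map per-frame activity flags of talkers A and B to the states A, B, M, D."""
    labels = {
        (True, False): "A",   # only A talks
        (False, True): "B",   # only B talks
        (False, False): "M",  # mutual silence
        (True, True): "D",    # double talk
    }
    return [labels[(bool(a), bool(b))] for a, b in zip(active_a, active_b)]


def sojourn_times(states, frame_s):
    """Return (state, duration in seconds) for each uninterrupted run of a state."""
    return [(state, sum(1 for _ in run) * frame_s) for state, run in groupby(states)]


# Hypothetical voice-activity flags, one per 0.5 s frame (not measured data).
a = [1, 1, 1, 0, 0, 0, 1, 1]
b = [0, 0, 1, 1, 1, 0, 0, 0]

states = conversation_states(a, b)
runs = sojourn_times(states, frame_s=0.5)
state_changes = len(runs) - 1  # number of transitions between consecutive runs

print(states)
print(runs)
print(state_changes)
```

The number of state changes is computed here only for completeness, since it is the one listed parameter that does not enter the activity parameter.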
3 Derivation of a formula for computing speech activity parameter of arbitrary conversation scenario
In this section, we define the equation for calculating the speech activity parameter. As pointed out above, this parameter is a function of the following parameters: type of communication ($T$), length of conversation ($L_C$), number of talkspurts in conversation ($N$), sojourn times in the corresponding active states ($STA$) and importance of direction of communication ($I$). Generally, the simplified formula for the activity parameter ($AP$) can be defined as follows:

$$AP(T, L_C, N, STA, I) = \alpha(T) \cdot A(T, L_C, N_A, STA_A) + \beta(T) \cdot B(T, L_C, N_B, STA_B), \qquad (1)$$
where $A$ and $B$ denote the speech activity produced by person A or B, respectively, and $\alpha$ and $\beta$ represent the importance of the respective direction of communication. Naturally, the definitions of the parameters $A$, $B$ and $\alpha$, $\beta$ are symmetrical by construction. For this reason, only the equations for $A$ and $\alpha$ are presented in the following text. We start the derivation with the unnormalized version of parameter $A$, which is given by the following equation:

$$A_{UN} = 100 \cdot \sum_{i=1}^{N_A} STA_{Ai} / L_C, \quad [\%, -, s, s] \qquad (2)$$
where $N_A$ represents the number of talkspurts generated by person A, $STA_{Ai}$ denotes the sojourn time in the $i$-th active state generated by person A (see state A in Figure 1) and $L_C$ is the length of the corresponding scenario in seconds. For our purpose, we need only take into account the sojourn times of active states, because the remaining sojourn times are silence periods and correspond either to talkspurts generated by person B or to mutual silence. Generally, we can say that:

$$\sum_{i=1}^{N_A} STA_{Ai} + \sum_{i=1}^{N_B} STA_{Bi} \cong L_C, \qquad (3)$$

and

$$N_A + N_B \cong N. \qquad (4)$$
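As a small numerical illustration of Eq. (2) and the approximate relations (3) and (4), the following Python sketch computes the unnormalized activities from per-talkspurt sojourn times. The talkspurt durations and the conversation length are hypothetical values chosen for readability, not data from the conversation tests, and the helper name is an assumption.

```python
def unnormalized_activity(sojourn_times_s, conversation_length_s):
    """Eq. (2): share of the conversation (in %) spent in a talker's active state."""
    if conversation_length_s <= 0:
        raise ValueError("conversation length must be positive")
    return 100.0 * sum(sojourn_times_s) / conversation_length_s


# Hypothetical talkspurt durations in seconds (illustrative only).
sta_a = [4.0, 6.5, 3.5, 6.0]  # N_A = 4 talkspurts of person A
sta_b = [5.0, 7.0, 6.0]       # N_B = 3 talkspurts of person B
l_c = 40.0                    # conversation length L_C in seconds

a_un = unnormalized_activity(sta_a, l_c)
b_un = unnormalized_activity(sta_b, l_c)

# Eq. (3) holds only approximately: the remainder is time not attributed
# to either talker's active state, such as mutual silence.
residual_s = l_c - (sum(sta_a) + sum(sta_b))

# Eq. (4): total number of talkspurts in the conversation.
n_total = len(sta_a) + len(sta_b)

print(a_un, b_un, residual_s, n_total)
```

In this example the two activities add up to 95 %, which mirrors the approximate nature of Eq. (3).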
A further parameter characterizing the conversation is its length. This parameter depends heavily on the type of communication realized over the phone and also influences the sojourn times in the corresponding states and other parameters such as the number of talkspurts. In general, each conversation has a different length depending on its purpose and type. For this model, it would be useful to have the average duration of a conversation (call) across all possible conversation scenarios at hand, so that the duration of a given communication can be compared in the final formula with the average duration of communications realized in current networks. Unfortunately, to the best of our knowledge, such a value is not currently available, so we decided to define it ourselves. In principle, this parameter can easily be specified by averaging the durations of currently frequent conversation scenarios, assuming that those scenarios cover all important types of communication realized in current networks. Some examples of such scenarios are described in [16, 17]. Let us denote by $L$ the average duration of a conversation across frequently occurring scenarios in current networks. Having such a value at hand allows us to define the normalization parameter as follows:

$$n = \frac{L_C}{L}. \quad [-, s, s] \qquad (5)$$
After applying the normalization parameter, the equation for the normalized version of parameter $A$ is as follows:

$$A = 100 \cdot n \sum_{i=1}^{N_A} STA_{Ai} / L_C. \quad [\%, -, -, s, s] \qquad (6)$$

After a trivial mathematical operation, the final formula for $A$ has this form:

$$A = \frac{100}{L} \sum_{i=1}^{N_A} STA_{Ai}. \quad [\%, s, -, s] \qquad (7)$$
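The normalization in Eqs. (5) to (7) can be checked numerically with a short Python sketch. The scenario lengths and the average duration L = 108 s are the values reported in Table 1; the sojourn times used to compare Eq. (6) with Eq. (7) are hypothetical, and the function names are assumptions made for this illustration.

```python
L_AVG_S = 108.0  # average conversation duration L as reported in Table 1

# Scenario lengths in seconds as reported in Table 1.
SCENARIO_LENGTH_S = {
    "Ordering pizza": 57.0,
    "Getting travel information": 176.0,
    "Getting weather information": 58.0,
    "Family conversation": 100.0,
    "Project conversation": 150.0,
    "Dispute": 108.0,
}


def normalization(l_c, l_avg=L_AVG_S):
    """Eq. (5): n = L_C / L."""
    return l_c / l_avg


def activity_eq6(sojourn_times_s, l_c, l_avg=L_AVG_S):
    """Eq. (6): A = 100 * n * sum(STA_Ai) / L_C."""
    return 100.0 * normalization(l_c, l_avg) * sum(sojourn_times_s) / l_c


def activity_eq7(sojourn_times_s, l_avg=L_AVG_S):
    """Eq. (7): A = (100 / L) * sum(STA_Ai); L_C cancels out."""
    return 100.0 * sum(sojourn_times_s) / l_avg


n_values = {name: round(normalization(l_c), 2) for name, l_c in SCENARIO_LENGTH_S.items()}

# Hypothetical sojourn times of person A in a 57 s conversation.
sta_a = [10.0, 8.0, 9.0]
a6 = activity_eq6(sta_a, l_c=57.0)
a7 = activity_eq7(sta_a)

print(n_values)
print(a6, a7)
```

Because $L_C$ cancels between Eq. (5) and Eq. (6), the normalized activity depends only on the total active time and on $L$, so it can exceed 100 % for scenarios that are much longer than the average.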
The remaining parameters in the simplified formula are $\alpha$ and $\beta$. As mentioned above, these parameters describe the importance of the direction of communication. In other words, they express which of the directions transfers the more important information. By more important information we mean information dedicated to the purpose of the call or of the conversation enabled by that call. It is easy to identify such a direction for task oriented communication, like getting travel information. On the other hand, it is a much more demanding task for free conversation, because the purpose of the call is not defined as precisely as in the previous case. For our purpose, we simplify the derivation of this parameter by assuming that the more important direction is the one transferring the higher number of sentences. In principle, each sentence provides additional or new relevant information, and the loss of such information can negatively influence the communication (the purpose of the call is only partly fulfilled) and eventually its quality. Naturally, the number of sentences in a conversation depends mainly on the type of communication. Moreover, using sentences to derive the importance of the direction of communication makes this parameter independent of the language used. If we used, for instance, the number of transferred words, this would not hold, because information with a similar meaning can be expressed by a different number of words in different languages. In order to obtain a relative value of this parameter, the number of sentences in each direction is normalized by the total number of sentences in the given communication. Finally, the equation representing the discussed parameter can be defined as follows:

$$\alpha = \frac{S_A}{S_C}, \quad [-, -, -] \qquad (8)$$
where $S_A$ denotes the number of sentences said by person A (transferred from A to B) and $S_C$ is the total number of sentences of the corresponding communication spoken by both persons. Through this normalization, the parameter takes values in the range from 0 to 1, and the sum of the two parameters ($\alpha$ and $\beta$) equals 1. After substituting equations (7) and (8) and their symmetrical versions for the parameters $B$ and $\beta$, the final formula for computing the speech activity parameter is given as follows:

$$AP = \frac{100}{S_C \cdot L} \left( S_A \sum_{i=1}^{N_A} STA_{Ai} + S_B \sum_{i=1}^{N_B} STA_{Bi} \right). \qquad (9)$$

4 Description of conversation tests
4.1 Experimental scenario
To demonstrate the applicability of the derived formula, conversation tests for currently frequent conversation scenarios were carried out according to [16]. The main aim of those conversation tests was to determine the required input parameters of the formula for the investigated scenarios. The language used was Slovak. To obtain a sufficient amount of statistical data, altogether 48 untrained subjects (22 male, 26 female, 21-29 years, mean 23.6 years) participated in the tests. During the selection of the participants, the following aspects were taken into account in order to cover all possible combinations (impacts): gender, region of origin and other socio-economic characteristics such as education. In total, 144 conversations (6 types of conversation scenarios (see details below in Section 4.2) and 24 conversations per scenario (48 subjects participated in the tests)) were performed and recorded (the microphone signals of both speakers). One test person was seated in a regular office environment with basic acoustic measures such as sound absorption and furniture and with background noise well below 20 dB SPL (A). The second test person was seated in a similar office environment, where the background noise was below 20 dB SPL (A) as well. The conversation tests consisted of VoIP connections using the G.711 codec with no transmission impairments (isolated network). The transmission conditions were precisely controlled during all tests in this experiment. At the beginning of the test session, the participants were instructed according to [16]. In addition, before the start of the test and of each conversation, the test subjects were given a controlled period of time to familiarize themselves with each other and with the scenario, respectively. They were also instructed to behave as naturally as possible (as during a real telephone interaction) in order to obtain authentic results.
Finally, the required parameters of the conversations were determined and used as input parameters of the derived formula.

4.2 Conversation scenarios
The conversation scenarios were chosen to cover all kinds of communication carried in current telecommunication networks. In principle, they can be divided into three groups:
• Task oriented conversation,
• Free conversation,
• Expressive communication.

The first group, “Task oriented conversation”, consists of three types of conversations, namely ordering pizza, getting travel information and getting weather information. The project and family conversations belong to the second group (Free conversation). The third group, “Expressive communication”, is represented only by a dispute between two project partners. Most of the conversations listed above were created according to the conversation scenarios described in [16, 17]. In some cases, like ordering pizza, the English, German or Czech scenarios were only translated.
5 Test results
After recording all scenarios involving 48 subjects (i.e. 24 conversations per scenario), the recordings were processed to determine the required input parameters of the formula. First of all, the length of a particular scenario, denoted $L_{SC}$, the average duration of a conversation across frequently occurring scenarios in current networks, denoted above as $L$, and the normalization parameter $n$ were determined. The average values of these parameters (averaged over 24 conversations per scenario, or over all 144 conversations (6 scenarios * 24 conversations) in the case of $L$) are given in Table 1. Moreover, Tables 2 and 3 provide the average values of the parameters $A_{UN}$, $A$, $B_{UN}$, $B$ and $S_A$, $S_B$, $S_C$, $\alpha$, $\beta$, respectively. Those values were likewise averaged over 24 conversations per scenario.

Table 1 Average values of the parameters L, LSC and n.

| Scenario | LSC [s] | n [-] | L [s] |
|---|---|---|---|
| Ordering pizza | 57 | 0.53 | 108 |
| Getting travel information | 176 | 1.63 | 108 |
| Getting weather information | 58 | 0.54 | 108 |
| Family conversation | 100 | 0.93 | 108 |
| Project conversation | 150 | 1.39 | 108 |
| Dispute | 108 | 1 | 108 |

After substituting the parameters specified above into the derived formula (9), the values of the speech activity parameter for the investigated conversation scenarios in Slovak were obtained, see Table 4. It should be noted here that a very complex task was used in the scenario called “getting travel information” (the subjects were asked to find a rather complicated international rail connection), which seems to have had a strong impact on the final value of the activity parameter. This scenario is normally expected to yield a lower activity parameter than the one obtained in this study. In general, the deployed scenarios can also characterize other scenarios occurring in real-life situations with similar characteristics (e.g. interactivity, number and length of speech utterances).

Table 2 Average values of the parameters AUN, BUN, A and B.

| Scenario | AUN [%] | A [%] | BUN [%] | B [%] |
|---|---|---|---|---|
| Ordering pizza | 45.8 | 24.3 | 51.1 | 27.1 |
| Getting travel information | 78.4 | 127.8 | 20.8 | 33.9 |
| Getting weather information | 69 | 37.3 | 29 | 15.7 |
| Family conversation | 33 | 30.7 | 64 | 59.5 |
| Project conversation | 39.6 | 55 | 58 | 80.6 |
| Dispute | 38 | 38 | 59.6 | 59.6 |

Table 3 Average values of the parameters SA, SB, SC, α and β.

| Scenario | SA [-] | SB [-] | SC [-] | α [-] | β [-] |
|---|---|---|---|---|---|
| Ordering pizza | 14 | 13 | 27 | 0.52 | 0.48 |
| Getting travel information | 12 | 25 | 37 | 0.32 | 0.68 |
| Getting weather information | 12 | 8 | 20 | 0.4 | 0.6 |
| Family conversation | 44 | 25 | 69 | 0.64 | 0.36 |
| Project conversation | 23 | 20 | 43 | 0.54 | 0.46 |
| Dispute | 37 | 24 | 61 | 0.61 | 0.39 |

As described above in Section 1, in addition to the activity parameter, a conversation can also be characterized by the number and length of its speech utterances. In principle, the activity parameter only gives information about the relative amount of speech in the corresponding conversation. Nevertheless, both parameters mentioned above (number and length of speech utterances in the conversation) are explicitly included in the activity parameter through the parameters $N_A$, $N_B$ and $STA_A$, $STA_B$, respectively. To distinguish between scenarios from the utterance-length perspective, the average utterance duration ($UD$) in percentage points is defined as follows:

$$UD = \frac{\frac{A}{N_A} + \frac{B}{N_B}}{2}. \qquad (10)$$
In essence, the average utterance duration ($UD$) can also be called the “average sojourn time in active state”. Table 5 shows the average values of the $UD$ parameter and of the parameters used in the formula defined above for computing it. Table 5 shows, for instance, that a dispute, as expressive communication, can be described by a high number of relatively short speech utterances (high values of $N_A$ and $N_B$, low value of $UD$). On the other hand, a project conversation can be characterized by a lower number of relatively longer speech utterances (low values of $N_A$ and $N_B$, high value of $UD$). These facts should be kept in mind, together with the computed speech activity parameters (see Table 4), when creating reference speech samples for speech quality assessment. In other words, the type of communication should be reflected in the created reference sample. For instance, a high number of shorter sentences (reflecting the computed activity parameter) could be included in a reference sample modelling expressive communication. In contrast, a few longer sentences could be used for emulating free conversation. Moreover, it is desirable to keep the same length of reference sample for all investigated communications and to vary only the activity parameter in order to obtain stable and reliable results in speech quality assessment.

Table 4 Final values of activity parameter for the investigated conversation scenarios.

| Scenario | Activity parameter [%] |
|---|---|
| Ordering pizza | 25.6 |
| Getting travel information | 63.9 |
| Getting weather information | 24.3 |
| Family conversation | 41.1 |
| Project conversation | 66.7 |
| Dispute | 46.4 |

Another issue related to creating reference speech samples for speech quality assessment is the sample pattern, in other words, how the sentences are placed within the reference sample. This issue is very complex, but its effect can be averaged out statistically by using several reference samples with different patterns. The use of several speech samples (at least 4; 2 male and 2 female samples) is suggested by current ITU-T Recommendations such as P.800 and P.862.
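The values in Table 4 can be reproduced from Tables 2 and 3 by evaluating the activity parameter as $AP = \alpha A + \beta B$, which is what Eq. (9) expands to. The Python sketch below is a minimal illustration under that reading; the dictionary and function names are assumptions, and the inputs are the rounded table values, so the results can differ from Table 4 in the last digit.

```python
# (alpha, A [%], beta, B [%]) per scenario, as read from Tables 2 and 3.
# For "Getting weather information", alpha = 0.4 and beta = 0.6 is the reading
# of the Table 3 row that is consistent with the value in Table 4.
TABLE_VALUES = {
    "Ordering pizza": (0.52, 24.3, 0.48, 27.1),
    "Getting travel information": (0.32, 127.8, 0.68, 33.9),
    "Getting weather information": (0.40, 37.3, 0.60, 15.7),
    "Family conversation": (0.64, 30.7, 0.36, 59.5),
    "Project conversation": (0.54, 55.0, 0.46, 80.6),
    "Dispute": (0.61, 38.0, 0.39, 59.6),
}


def direction_importance(s_a, s_b):
    """Eq. (8) and its symmetric counterpart: alpha = S_A / S_C, beta = S_B / S_C."""
    s_c = s_a + s_b
    if s_c == 0:
        raise ValueError("at least one sentence is required")
    return s_a / s_c, s_b / s_c


def activity_parameter(alpha, a, beta, b):
    """Eqs. (1)/(9): AP = alpha * A + beta * B, with A and B in percent."""
    return alpha * a + beta * b


# Sentence counts of the "Ordering pizza" scenario from Table 3.
alpha_pizza, beta_pizza = direction_importance(14, 13)

ap = {name: activity_parameter(*values) for name, values in TABLE_VALUES.items()}
for name, value in ap.items():
    print(f"{name}: {value:.1f} %")
```

With these rounded inputs, the computed values agree with Table 4 to within about 0.1 percentage points.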
Table 5 Average values of NA, NB, A/NA, B/NB and UD for the investigated conversation scenarios.

| Scenario | NA [-] | A/NA [%] | NB [-] | B/NB [%] | UD [%] |
|---|---|---|---|---|---|
| Ordering pizza | 10 | 2.4 | 10 | 2.7 | 2.55 |
| Getting travel information | 9 | 14.2 | 8 | 4.2 | 9.2 |
| Getting weather information | 5 | 7.5 | 4 | 3.9 | 5.7 |
| Family conversation | 17 | 1.8 | 16 | 3.7 | 5.5 |
| Project conversation | 11 | 5 | 10 | 8.06 | 6.53 |
| Dispute | 18 | 2.1 | 18 | 3.3 | 2.7 |
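Eq. (10) can be evaluated directly from the normalized activities in Table 2 and the talkspurt counts in Table 5. The Python sketch below is only an illustration; the names are assumptions and the inputs are rounded table values. For five scenarios the result matches the UD column of Table 5 to within rounding. For the family conversation, Eq. (10) gives about 2.8, whereas Table 5 lists 5.5, which equals the sum of the two per-talker terms (1.8 + 3.7) rather than their mean.

```python
# (A [%], N_A, B [%], N_B) per scenario, taken from Tables 2 and 5.
UD_INPUTS = {
    "Ordering pizza": (24.3, 10, 27.1, 10),
    "Getting travel information": (127.8, 9, 33.9, 8),
    "Getting weather information": (37.3, 5, 15.7, 4),
    "Family conversation": (30.7, 17, 59.5, 16),
    "Project conversation": (55.0, 11, 80.6, 10),
    "Dispute": (38.0, 18, 59.6, 18),
}


def utterance_duration(a, n_a, b, n_b):
    """Eq. (10): UD = (A / N_A + B / N_B) / 2, in percentage points."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("talkspurt counts must be positive")
    return (a / n_a + b / n_b) / 2.0


ud = {name: utterance_duration(*values) for name, values in UD_INPUTS.items()}
for name, value in ud.items():
    print(f"{name}: {value:.2f} %")
```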
6 Conclusion
In the present paper, a formula for computing the speech activity parameter of an arbitrary conversation scenario was derived, and its applicability to currently frequent conversation scenarios was demonstrated. The final values of the activity parameter for the investigated conversation scenarios in Slovak were obtained. The values for other languages can easily be calculated with the proposed formula. In addition, we discussed other issues with regard to creating reference speech samples, namely the number of speech utterances, the sample pattern, etc. The derived formula can be useful for network operators and service providers that are directly interested in the speech quality of particular kinds of communication or of the dominant types of communication carried by their networks (for instance, network operators or service providers offering special types of services, like automatic services based on spoken-dialogue systems). Moreover, the information presented in this paper could also help to achieve more reliable speech quality assessment, especially with regard to time-varying impairments.
7 Acknowledgement
This work has been supported by the Centre of excellence for systems and services of intelligent transport II., ITMS 26220120050, supported by the Research & Development Operational Programme funded by the ERDF.
"We support research activities in Slovakia / The project is co-financed from EU funds"
8 References
[1] V. V. Mattila: Perceptual analysis of speech quality in mobile communication, Doctoral dissertation, vol. 340, Tampere University of Technology, Tampere (Finland), 2001.
[2] L. Gros, N. Chateau: Instantaneous and overall judgements for time-varying speech quality: assessments and relationships, In Acta Acustica, vol. 87, No. 3, pp. 367-377, 2001.
[3] M. Hansen, B. Kollmeier: Continuous assessment of time-varying speech quality, In Journal of Acoustical Society of America, vol. 106, No. 5, pp. 2888-2899, 1999.
[4] ITU-T COM 12-21: An experimental investigation of the accumulation of perceived errors in time-varying speech distortions, Source: British Telecom, UK (M. Hollier), International Telecommunication Union, Geneva (Switzerland), 1997.
[5] ITU-T Delayed Contribution D.064: Testing the Quality of Connections Having Time Varying Impairments, Source: AT&T, USA (J. H. Rosenbluth), International Telecommunication Union, Geneva (Switzerland), 1998.
[6] U. Jekosch: Sprache hoeren und beurteilen: Ein Ansatz zur Grundlegung der Sprachqualitaetbeurteilung, Habilitation thesis, Essen (Germany), 2000.
[7] S. Voran: A basic experiment on time-varying speech quality, In Proceedings of MESAQIN 2005, Prague (Czech Republic), pp. 51-65, 2005, ISBN 80-01-03262-0.
[8] A. Raake: Speech quality of VoIP: Assessment and prediction, John Wiley & Sons, Chichester (United Kingdom), Chapter 4, pp. 111-173, 2006, ISBN 0470-03060-7.
[9] F. Hammer, P. Reichl, A. Raake: The Well-Tempered Conversation. Interactivity, Delay and Perceptual VoIP Quality, In Proceedings of IEEE ICC 2005, Seoul (South Korea), May 2005.
[10] ITU-T Rec. P.862.3: Application guide for objective quality measurement based on Recommendations P.862, P.862.1 and P.862.2, International Telecommunication Union, Geneva (Switzerland), 2005.
[11] P. Počta, M. Mrvová, P. Kortiš, P. Palúch, M. Vaculík: A systematic study of PESQ’s behavior in simulated VoIP environment (from reference signal characteristics perspective), In Proceedings of MESAQIN 2008, Prague (Czech Republic), 2008, pp. 13-21, ISBN 978-80-01-04193-2.
[12] P. Počta, M. Mrvová, J. Holub: Impact of Different Active-Speech-Ratios on PESQ’s Predictions in Simulated VoIP Environment, In Acta Acustica united with Acustica, vol. 95, No. 5, pp. 950-957, ISSN 1610-1928.
[13] P. Počta, H. Vlčková, Z. Polková: Impact of Different Active-Speech-Ratios on PESQ’s Predictions in Case of Independent and Dependent Losses, In Proceedings of MESAQIN 2009, Prague (Czech Republic), 2009, pp. 13-20, ISBN 978-8001-04361-5.
[14] ITU-T Rec. P.59: Artificial conversational speech, International Telecommunication Union, Geneva (Switzerland), 1993.
[15] F. Hammer, P. Reichl, A. Raake: Elements of Interactivity in Telephone Conversations, In Proceedings of ICSLP 2004, Jeju Island (South Korea), pp. 1741-1744, 2004.
[16] ITU-T Recommendation P.805: Subjective evaluation of conversational quality, International Telecommunication Union, Geneva (Switzerland), 2007.
[17] J. Holub, M. Kastner, O. Tomiska: Delay effect on conversational quality in telecommunication networks: Do we mind, In Proceedings of WTS 2007, Pomona (USA), 2007.