Comparison of Subjective and Objective Speech Quality Assessment for Different Degradation/Noise Conditions

Rajesh Kumar Dubey
Department of Electronics and Communication Engg.
Jaypee Institute of Information Technology, Noida, India
rajesh.dubey@jiit.ac.in

Arun Kumar
Center for Applied Research in Electronics
Indian Institute of Technology Delhi, India
[email protected]

Abstract—Objective speech quality assessment is used to replace the time-consuming and cumbersome subjective listening test for assessing the quality of degraded speech processed by different speech processing algorithms. For performance evaluation, all objective speech quality assessment algorithms require the Mean Opinion Score-Listening Quality Subjective (MOS-LQS), or subjective MOS, obtained from the subjective listening test, for comparison with the Mean Opinion Score-Listening Quality Objective (MOS-LQO), or objective MOS. This paper gives details of the subjective database created for different applications. In this work, subjective speech quality assessment has been carried out for speech utterances under different degradation/noise conditions to assign the MOS-LQS (subjective MOS), which is compared, for illustration, with the MOS-LQO (objective MOS) obtained from an algorithm based on relevant auditory perception features. Two databases, NOIZEUS-960 and NOIZEUS-2240, together containing 3200 speech utterances processed by different speech enhancement algorithms under different types of real-life noise conditions, have been used for subjective listening tests by real-life users. A statistical analysis of the MOS-LQS assigned by different subjects to the different degraded speech utterances has been carried out and its details are reported in the paper. Further, correlation coefficients between the subjective MOS and the objective MOS are computed for both databases for illustration, and inferences are drawn regarding the usefulness of the subjective MOS in speech quality assessment algorithms.


Keywords—Subjective Listening Test; Objective Speech Quality; Mean Opinion Score; Speech Degradations.

I. INTRODUCTION


The objective evaluation of speech quality is essential in speech processing algorithms for monitoring and maintaining the quality of service (QoS) and for system automation. A speech quality enhancement measure can be adopted in a system involving speech processing algorithms, or feedback can be given to a mobile base station for adaptive bandwidth allocation in the communication system, if the quality of speech is not up to the threshold level for customer satisfaction. The ideal way of assessing speech quality is the subjective listening test. The absolute category rating (ACR) method is used for subjective speech quality assessment, as given in ITU-T Recommendation P.800 [1]. In this method, speech material is played to a group of human listeners and their opinions about the quality of each degraded speech utterance are gathered. The average of the opinion scores is called the Mean Opinion Score-Listening Quality Subjective (MOS-LQS), or subjective MOS. However, the subjective listening test is considered time-consuming and cumbersome for speech quality assessment, and it is not suitable for the automation of speech processing algorithms.

Thus, to supplement the subjective listening test, objective speech quality assessment methods are becoming increasingly important. Objective methods, based on mathematical algorithms, measure speech quality and give the Mean Opinion Score-Listening Quality Objective (MOS-LQO), or objective MOS. The performance of objective speech quality assessment methods is measured in terms of the correlation between the subjective MOS and the objective MOS. Thus, for performance evaluation, all objective speech quality assessment methods require the subjective MOS obtained from the subjective listening test. In this work, a subjective listening test has been conducted to create a subjective-MOS-labelled speech database for different applications. The subjective and objective speech quality assessment methods are shown in Fig. 1. Objective speech quality assessment methods may be intrusive (double-ended) or non-intrusive (single-ended). Intrusive methods require the original clean speech as a reference, while non-intrusive methods do not. There are International Telecommunication Union-Telecommunication Standardization Sector (ITU-T) standards for speech quality assessment algorithms: ITU-T Rec. P.863 for intrusive methods and ITU-T Rec. P.563 for non-intrusive methods.


Fig. 1: Subjective and objective speech quality assessment. (Block diagram: the input speech passes through the system under test (communication network/codecs); the received speech is evaluated by subjective assessment (experience, expectation, semantics), giving the subjective MOS, and by intrusive and non-intrusive objective assessment, giving the objective MOS.)


A low-complexity non-intrusive (single-ended) objective speech quality assessment method using different local and global features is explained in [2]. The objective MOS, or MOS-LQO, is obtained by Gaussian Mixture Model (GMM) mapping of different local and global features extracted from speech coders, without considering any degradation model, using the Expectation-Maximization (EM) algorithm [3]. Mel-frequency cepstral coefficients (MFCC), one of the most widely used feature representations of a speech frame, which capture the frequency-dependent response of the basilar membrane (BM) of the human ear over its critical bands, have been used for speech quality assessment with support vector regression [4]. A combination of MFCC, perceptual linear prediction (PLP) and line spectral frequency (LSF) features used for non-intrusive speech quality assessment is reported in [5]. In this work, a subjective listening test has been conducted for different degraded speech utterances to obtain the subjective MOS, or MOS-LQS. To examine the performance of objective speech quality assessment and the usefulness of the subjective MOS in performance evaluation, an objective speech quality assessment method has been implemented for illustration: it computes the objective MOS, or MOS-LQO, for the same degraded speech utterances by joint GMM mapping of auditory perception features, with the GMM trained by the EM algorithm. The parameters of the joint GMM and the auditory perception features are used to estimate the objective MOS of the speech utterances. The correlation between the subjective and the objective MOS is computed and compared with that of ITU-T Recommendation P.563 [6].

II. SUBJECTIVE LISTENING TEST ON NOIZEUS-2240 AND NOIZEUS-960 DATABASES

The noisy speech corpus NOIZEUS-2240, having 2240 speech utterances and obtained from the University of Texas at Dallas, USA, and NOIZEUS-960, consisting of 960 speech utterances [7], were used for the subjective listening tests for speech quality assessment. The subjective listening tests were conducted in our laboratory (Speech & Audio Laboratory, CARE, IIT Delhi, India) to obtain the subjective MOS score, or MOS-LQS. Twenty-one subjects listened to these speech utterances and scored them on a scale of one to five: 5-excellent, 4-good, 3-fair, 2-poor and 1-bad. The average of their opinion scores was computed to obtain the subjective MOS for each degraded speech utterance. The NOIZEUS-2240 database contains 20 clean speech utterances processed by different noise suppression schemes. In total, 14 different speech enhancement and noise suppression algorithms, namely MMSE-STSA (6 algorithms), spectral subtraction (3 algorithms), subspace approach (2 algorithms) and Wiener filtering (3 algorithms), are used to process each speech sentence. The sentences are degraded by 4 different types of additive noise, namely babble, car, street, and train noise. These conditions have been applied at 2 SNR levels of +5 dB and +10 dB to obtain a total of 2240 speech utterances at 112 different conditions, similar to [8] and [9], where 13 speech processing algorithms were used. The NOIZEUS-2240 speech database, organized according to the degradations of additive noise at the two SNRs, is described in TABLE 1.

TABLE 1: NOIZEUS-2240 speech database according to degradations.

Database       Types of additive noise      SNR             Speech processing algorithm   No. of algorithms   No. of speech utterances
NOIZEUS-2240   Babble, Car, Street, Train   +5 dB, +10 dB   MMSE-STSA                     6                   560
                                                            Spectral subtraction          3                   560
                                                            Subspace approach             2                   560
                                                            Wiener filtering              3                   560
                                                            Total                         14                  2240

The NOIZEUS-960 database is a noisy speech corpus of 960 speech utterances and was also obtained from the NOIZEUS database [7]. It contains 30 clean speech utterances which have been subjected to 32 different degradation conditions of noise at several SNR values. The noise signals used to degrade the speech utterances were taken from the AURORA database [10] and include: airport, babble, car, exhibition hall, restaurant, train station, street, and suburban train. The noise signals were added to the speech signals at four SNRs of 0 dB, +5 dB, +10 dB, and +15 dB to obtain a total of 960 speech files; a minimal sketch of this noise-mixing step is given after TABLE 2. The NOIZEUS-960 database, organized according to the degradations of additive noise at the four SNRs, is described in TABLE 2.

TABLE 2: NOIZEUS-960 speech database according to degradations.

Database      Type of additive noise   SNR                           No. of speech utterances
NOIZEUS-960   Airport                  0 dB, +5 dB, +10 dB, +15 dB   120
              Babble                                                  120
              Car                                                     120
              Exhibition                                              120
              Restaurant                                              120
              Station                                                 120
              Street                                                  120
              Train                                                   120
              Total                                                   960
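The noise-mixing step used to build such corpora scales the noise segment so that the clean-to-noise power ratio matches the target value before adding it to the clean utterance. The following is a minimal sketch of that idea, assuming the clean speech and noise are available as NumPy arrays at the same sampling rate; it is an illustration, not the exact procedure used to generate the NOIZEUS files.

```python
import numpy as np

def add_noise_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then mix."""
    # Loop (or trim) the noise so it covers the whole utterance.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]

    p_clean = np.mean(clean.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2)

    # Gain that makes 10*log10(p_clean / (gain**2 * p_noise)) == snr_db.
    gain = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example: degrade one stand-in utterance at the four SNRs used for NOIZEUS-960.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(8000)    # stand-in for a clean utterance
    noise = rng.standard_normal(16000)   # stand-in for an AURORA noise file
    for snr in (0, 5, 10, 15):
        noisy = add_noise_at_snr(clean, noise, snr)
        print(snr, "dB ->", len(noisy), "samples")
```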

A statistical analysis of the subjective ratings has been carried out, similar to [8], for the NOIZEUS-2240 speech utterances to ensure a high degree of inter- and intra-rater reliability. The mean and standard deviation of the MOS are computed for the different speech processing algorithms under the different degradations/noises, and the results are shown in the form of bar charts in Fig. 2 and Fig. 3 alongside the results given in [8] (a minimal sketch of this aggregation is given after the next paragraph). In Fig. 2, the mean values of MOS are shown for babble noise at +5 dB SNR and +10 dB SNR for illustration. Similarly, in Fig. 3 the standard deviation values of MOS are shown for babble noise at +5 dB SNR and +10 dB SNR. The precise values of the means and standard deviations are given in TABLE 3 for babble noise at +5 dB SNR and in TABLE 4 for babble noise at +10 dB SNR for illustration. In Hu and Loizou's work, the average subjective MOS is 2.03 with a standard deviation of 0.76 for babble noise at +5 dB SNR; the corresponding values from our subjective ratings are 2.31 and 0.78. Similarly, in Hu and Loizou's work the average subjective MOS is 2.61 with a standard deviation of 0.78 for babble noise at +10 dB SNR; the corresponding values from our subjective ratings are 2.64 and 0.87. A similar analysis has also been done for all speech processing algorithms under the other noises, such as car, street, and train.

A similar statistical analysis of the subjective ratings has also been carried out for the NOIZEUS-960 speech utterances, following [8]. The mean and standard deviation of the MOS are computed for the different speech degradations/noises and are shown in Fig. 4 at the different SNRs: (i) 0 dB, (ii) +5 dB, (iii) +10 dB, and (iv) +15 dB.
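As a concrete illustration of how these statistics are produced, the sketch below groups per-utterance MOS values by speech processing algorithm and reports the mean and standard deviation for one noise/SNR condition. The column names and the tiny in-line table are hypothetical; they only stand in for the MOS-labelled listing of the NOIZEUS files.

```python
import pandas as pd

# Hypothetical MOS-labelled listing: one row per degraded utterance.
ratings = pd.DataFrame({
    "algorithm": ["klt", "klt", "stsa", "stsa", "logstsa", "logstsa"],
    "noise":     ["babble"] * 6,
    "snr_db":    [5, 5, 5, 5, 5, 5],
    "mos":       [2.4, 2.2, 2.1, 2.0, 2.5, 2.3],   # averaged listener scores
})

# Mean and standard deviation of MOS per algorithm for babble noise at +5 dB,
# i.e. the quantities tabulated in TABLE 3 and plotted in Fig. 2 and Fig. 3.
condition = (ratings["noise"] == "babble") & (ratings["snr_db"] == 5)
summary = (ratings[condition]
           .groupby("algorithm")["mos"]
           .agg(["mean", "std"])
           .round(2))
print(summary)
```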

TABLE 3: Mean and standard deviations of subjective MOS for babble noise at +5 dB SNR.

Speech Processing    Hu & Loizou Work      Our Subj. Rating
Algorithm            Mean     Std. Dev     Mean     Std. Dev
klt                  1.93     0.83         2.31     0.79
klt_jabloun          1.68     0.75         2.40     0.88
stsa                 2.46     0.88         2.10     0.80
logstsa              2.53     0.71         2.35     0.82
logstsa_nest         2.29     0.74         2.42     0.76
logstsa_sap_q        1.83     0.80         2.38     0.74
lap                  1.72     0.71         2.41     0.74
weuclid              2.34     0.74         2.20     0.74
rdc                  1.99     0.74         2.35     0.77
rdc_nest             1.69     0.67         2.31     0.81
mb                   2.48     0.85         2.57     0.71
wavthr               1.74     0.78         2.18     0.74
scalart              2.24     0.77         2.17     0.79
tsoukalas            1.55     0.72         2.18     0.80

TABLE 4: Mean and standard deviations of subjective MOS for babble noise at +10 dB SNR.

Speech Processing    Hu & Loizou Work      Our Subj. Rating
Algorithm            Mean     Std. Dev     Mean     Std. Dev
klt                  2.47     0.65         2.71     0.88
klt_jabloun          2.36     0.87         2.72     0.91
stsa                 2.88     0.71         2.59     0.80
logstsa              3.03     0.76         2.75     0.90
logstsa_nest         2.92     0.82         2.82     0.96
logstsa_sap_q        2.54     0.77         2.67     0.88
lap                  2.37     0.85         2.80     0.98
weuclid              2.90     0.80         2.62     0.78
rdc                  2.57     0.80         2.58     0.87
rdc_nest             2.00     0.79         2.49     0.84
mb                   3.10     0.74         2.70     0.90
wavthr               2.49     0.86         2.39     0.77
scalart              2.74     0.76         2.57     0.86
tsoukalas            2.20     0.83         2.51     0.86

III. OBJECTIVE SPEECH QUALITY ASSESSMENT USING AUDITORY FEATURES

Objective speech quality assessment has been carried out and the results are compared in terms of the correlation between the subjective MOS and the objective MOS. The objective MOS is computed from a 14-dimensional reduced-size Lyon's auditory feature vector using a GMM-based probabilistic approach. The feature vector obtained from the 64-channel Lyon's auditory model [11] is 256-dimensional: the first 64 elements are the means, the next 64 the variances, the next 64 the skewness values, and the last 64 the kurtosis values of the 64-channel Lyon's auditory model output. The dimension of the feature vector is reduced from 256 to 14 using principal component analysis (PCA); a minimal sketch of this feature construction is given below. The subjective MOS score θ_j from the MOS-labelled speech databases is appended to the auditory feature vector Ψ_j and used to train a joint Gaussian Mixture Model (GMM) with the Expectation-Maximization algorithm [3], yielding the joint GMM parameters Π(μ^(k), ω^(k), Σ^(k)), k = 1, 2, ..., K mixture components, where μ^(k), ω^(k), and Σ^(k) are the mean, mixture weight, and covariance matrix of the k-th mixture component, respectively. Thus, [Ψ_j, θ_j] is the 15-dimensional feature vector for the j-th training utterance, consisting of the 14-dimensional auditory feature vector Ψ_j and the subjective MOS score θ_j of that utterance, where j = 1, 2, ..., J and J is the number of speech utterances used to train the joint GMM.

The aim is to obtain an objective estimator θ̂ of the quality of a speech utterance as a function of the reduced-size feature vector ψ, i.e., θ̂ = θ̂(ψ). Given the trained joint GMM parameters Π(μ^(k), ω^(k), Σ^(k)), the objective MOS estimate θ̂ is obtained using the MMSE criterion [2]:

$$\hat{\theta} = \hat{\theta}(\psi) = \arg\min_{\hat{\theta}(\psi)} E\left\{\left(\theta - \hat{\theta}(\psi)\right)^{2}\right\} = E\{\theta \mid \psi\} \tag{1}$$

If N(ψ | μ^(k), Σ^(k)) denotes the multivariate Gaussian density with mean vector μ^(k) and covariance matrix Σ^(k) of the k-th mixture component, then the objective MOS estimate θ̂ is given by

$$E\{\theta \mid \psi\} = \sum_{k=1}^{K} X^{(k)}(\psi)\, \mu_{\theta/\psi}^{(k)} \tag{2}$$

where

$$X^{(k)}(\psi) = \frac{\omega^{(k)}\, N\!\left(\psi \mid \mu_{\psi}^{(k)}, \Sigma_{\psi\psi}^{(k)}\right)}{\sum_{k'=1}^{K} \omega^{(k')}\, N\!\left(\psi \mid \mu_{\psi}^{(k')}, \Sigma_{\psi\psi}^{(k')}\right)} \tag{3}$$

and

$$\mu_{\theta/\psi}^{(k)} = \mu_{\theta}^{(k)} + \Sigma_{\theta\psi}^{(k)} \left(\Sigma_{\psi\psi}^{(k)}\right)^{-1} \left(\psi - \mu_{\psi}^{(k)}\right) \tag{4}$$

where μ_ψ^(k) is the mean of the feature vector ψ, μ_θ^(k) is the mean of the subjective MOS θ, Σ_ψψ^(k) is the covariance matrix of ψ, and Σ_θψ^(k) is the cross-covariance between θ and ψ, all for the k-th mixture component.
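A minimal sketch of this estimator is given below. It assumes a joint GMM has already been fitted on the 15-dimensional vectors [Ψ_j, θ_j] (here with scikit-learn's GaussianMixture using full covariances); the function then evaluates Eqs. (2)-(4) for a new 14-dimensional feature vector. Variable names and the toy data are illustrative, not the paper's actual implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def estimate_mos(gmm: GaussianMixture, psi: np.ndarray) -> float:
    """MMSE estimate of MOS from a joint GMM over [psi (14-dim), theta (1-dim)].

    Implements E{theta | psi} = sum_k X_k(psi) * mu_{theta|psi}^{(k)} (Eqs. 2-4).
    """
    d = psi.shape[0]                      # feature dimension (14)
    log_resp = np.empty(gmm.n_components)
    cond_mean = np.empty(gmm.n_components)

    for k in range(gmm.n_components):
        mu = gmm.means_[k]                # shape (15,)
        cov = gmm.covariances_[k]         # shape (15, 15)
        mu_psi, mu_theta = mu[:d], mu[d]
        c_pp = cov[:d, :d]                # Sigma_psi_psi
        c_pt = cov[:d, d]                 # cross-covariance, shape (14,)

        # Unnormalized posterior weight X^{(k)}(psi) (Eq. 3), in the log domain.
        log_resp[k] = (np.log(gmm.weights_[k])
                       + multivariate_normal.logpdf(psi, mean=mu_psi, cov=c_pp))
        # Conditional mean mu_{theta|psi}^{(k)} (Eq. 4).
        cond_mean[k] = mu_theta + c_pt @ np.linalg.solve(c_pp, psi - mu_psi)

    resp = np.exp(log_resp - np.max(log_resp))
    resp /= resp.sum()                    # normalize (denominator of Eq. 3)
    return float(resp @ cond_mean)        # Eq. 2

# Toy usage: fit a joint GMM on random [psi, theta] data and estimate one MOS.
rng = np.random.default_rng(0)
joint = np.hstack([rng.random((200, 14)), rng.uniform(1, 5, (200, 1))])
gmm = GaussianMixture(n_components=12, covariance_type="full",
                      random_state=0).fit(joint)
print(estimate_mos(gmm, rng.random(14)))
```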

In this investigation, K = 12 mixture components have been used in the GMM for modelling the joint probability density function used in the auditory feature vector mapping. For GMM training and objective MOS computation from the auditory feature vectors and the GMM parameters, a "leave-out" hold-out procedure is used by randomizing the speech utterances: training and testing are done in a 9:1 ratio and repeated ten times, i.e., a ten-fold cross-validation process [12]; a minimal sketch of this procedure is given below. The subjective MOS and the estimated objective MOS for two sample speech utterances are given and compared, along with the ITU-T Rec. P.563 estimates, in TABLE 5 for illustration.
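A minimal sketch of one reading of that ten-fold protocol is given below, reusing the estimate_mos function sketched above; the feature and MOS arrays are placeholders for the PCA-reduced NOIZEUS features and their subjective labels, not the actual experimental data.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

# Placeholder data: PCA-reduced features (J, 14) and subjective MOS labels (J,).
rng = np.random.default_rng(0)
features = rng.random((300, 14))
subj_mos = rng.uniform(1, 5, 300)

predicted = np.empty_like(subj_mos)
for train_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                 random_state=0).split(features):
    # Train the joint GMM on [psi, theta] of the training fold only.
    joint = np.hstack([features[train_idx], subj_mos[train_idx, None]])
    gmm = GaussianMixture(n_components=12, covariance_type="full",
                          random_state=0).fit(joint)
    # Estimate the objective MOS for each held-out utterance (Eqs. 2-4).
    for i in test_idx:
        predicted[i] = estimate_mos(gmm, features[i])   # from the sketch above

print(np.corrcoef(subj_mos, predicted)[0, 1])
```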

TABLE 5: The subjective MOS and the computed objective MOS for two sample speech utterances; the absolute errors between the subjective MOS and the estimated objective MOS are given in brackets.

Speech Utterance       Speaker   Subj. MOS   Obj. MOS by ITU-T Rec. P.563   Obj. MOS by Lyon's Model
sp22_babble_sn5.wav    Male      2.45        2.04 (0.41)                    2.36 (0.09)
sp30_babble_sn5.wav    Female    2.30        1.82 (0.48)                    2.77 (0.47)

The Karl Pearson correlation coefficient between the estimated objective MOS score θ̂ and the subjective MOS score θ is used in the literature as a figure of merit for the performance evaluation of different non-intrusive speech quality assessment techniques. The correlation coefficient ρ between the subjective MOS θ and the estimated objective MOS θ̂ is given by

$$\rho = \frac{\sum_{n=1}^{P} (\theta_{n} - \mu_{\theta})\,(\hat{\theta}_{n} - \mu_{\hat{\theta}})}{\sqrt{\sum_{n=1}^{P} (\theta_{n} - \mu_{\theta})^{2}}\,\sqrt{\sum_{n=1}^{P} (\hat{\theta}_{n} - \mu_{\hat{\theta}})^{2}}} \tag{5}$$

where μ_θ and μ_θ̂ are the mean values of the subjective MOS score θ and the estimated objective MOS score θ̂, respectively, and the summation runs over the P conditions. The correlation coefficients between the subjective MOS and the estimated objective MOS are computed for the speech databases NOIZEUS-960 and NOIZEUS-2240 and are given in TABLE 6 for comparison; a minimal numerical sketch of this computation follows the table. From TABLE 5 and TABLE 6, it is inferred that the Lyon's auditory model features used for non-intrusive speech quality assessment perform better than ITU-T Rec. P.563.

TABLE 6: Correlation coefficients between the subjective MOS and the estimated objective MOS.

Data of Different Experiments   No. of Speech Files   ITU-T Rec. P.563   Lyon's Model
NOIZEUS-960                     960                   0.717              0.778
NOIZEUS-2240                    2240                  0.306              0.674
Weighted Average                                      0.529              0.691
Std. Deviation                                        0.155              0.076
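The sketch below shows how such a comparison table could be assembled from per-utterance subjective and objective MOS values; the arrays are placeholders, not the actual NOIZEUS scores, and the file-count weighting is only one plausible reading of the "Weighted Average" row.

```python
import numpy as np

# Placeholder per-utterance (subjective, objective) MOS scores per database.
rng = np.random.default_rng(0)
databases = {
    "NOIZEUS-960":  (rng.uniform(1, 5, 960),  rng.uniform(1, 5, 960)),
    "NOIZEUS-2240": (rng.uniform(1, 5, 2240), rng.uniform(1, 5, 2240)),
}

rhos, weights = [], []
for name, (subj_mos, obj_mos) in databases.items():
    rho = np.corrcoef(subj_mos, obj_mos)[0, 1]      # Pearson correlation, Eq. (5)
    rhos.append(rho)
    weights.append(len(subj_mos))
    print(f"{name}: rho = {rho:.3f}")

# Average weighted by number of speech files, and spread across databases.
rhos, weights = np.array(rhos), np.array(weights)
print("Weighted average:", np.average(rhos, weights=weights).round(3))
print("Std. deviation:  ", rhos.std(ddof=1).round(3))
```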

IV. CONCLUSIONS

Subjective listening tests for speech quality assessment have been carried out on two speech databases, NOIZEUS-2240 and NOIZEUS-960, covering 3200 degraded speech utterances, and the importance of the subjective listening test for the performance evaluation of objective speech quality assessment has been examined. An objective speech quality assessment method using auditory perception features has been implemented, and the objective MOS thus obtained is compared with the respective subjective MOS for illustration by computing the correlation between the subjective MOS and the objective MOS.

ACKNOWLEDGEMENT

The authors would like to thank Mr. Yi Hu and Dr. Philipos C. Loizou of the Department of Electrical Engineering, The University of Texas at Dallas, Richardson, TX 75083-0688, USA, for providing the NOIZEUS-2240 database of 2240 speech utterances.

REFERENCES

[1] "Methods for subjective determination of transmission quality," ITU-T Rec. P.800, 1996.
[2] V. Grancharov, D.Y. Zhao, J. Lindblom, and W.B. Kleijn, "Low complexity non-intrusive speech quality assessment," IEEE Trans. on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 1948-1956, Nov. 2006.
[3] A.P. Dempster, N. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, Series B (Methodological), vol. 39, no. 1, pp. 1-38, 1977.
[4] M. Narwaria, W. Lin, I.V. McLoughlin, S. Emmanuel, and C.L. Tien, "Non-intrusive speech quality assessment with support vector regression," in MMM 2010, LNCS 5916, Springer-Verlag Berlin Heidelberg, 2010, pp. 325-335.
[5] R.K. Dubey and A. Kumar, "Non-intrusive objective speech quality assessment using a combination of MFCC, PLP and LSF features," in Proc. IEEE Int. Conf. on Signal Processing and Communication, Noida, India, 2013, pp. 297-302.
[6] "Single ended method for objective speech quality assessment in narrow-band telephony applications," ITU-T Rec. P.563, 2004.
[7] http://www.utdallas.edu/~loizou/speech/noizeus, last accessed Feb. 2009.
[8] Y. Hu and P.C. Loizou, "Subjective comparison and evaluation of speech enhancement algorithms," Speech Communication, Elsevier, vol. 49, no. 7, pp. 588-601, July 2007.
[9] Y. Hu and P.C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. on Audio, Speech and Language Processing, vol. 16, no. 1, pp. 229-238, 2008.
[10] H.G. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in 6th Int. Conf. on Spoken Language Processing, Beijing, China, 2000, pp. 181-188.
[11] R.F. Lyon, "A computational model of filtering, detection, and compression in the cochlea," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Palo Alto, CA, 1982, pp. 1282-1285.
[12] R.K. Dubey, "Non-intrusive objective speech quality assessment using features at single and multiple time-scales," Ph.D. thesis, IIT Delhi, 2014.


Fig. 2: Bar chart for the mean values of MOS for babble noise at (a) +5 dB SNR and (b) +10 dB SNR (Hu & Loizou's work vs. our subjective ratings for each speech processing algorithm).

Fig. 3: Bar chart for the standard deviation values of MOS for babble noise at (a) +5 dB SNR and (b) +10 dB SNR (Hu & Loizou's work vs. our subjective ratings for each speech processing algorithm).


Fig. 4: Bar chart for the mean and standard deviation values of MOS for speech utterances of the NOIZEUS-960 database at (i) 0 dB, (ii) +5 dB, (iii) +10 dB, and (iv) +15 dB SNR.
