SUBJECTIVE EVALUATION OF SPEECH ENHANCEMENT ALGORITHMS USING ITU-T P.835 STANDARD

Teddy Surya Gunawan, Eliathamby Ambikairajah
School of Electrical Engineering and Telecommunications
The University of New South Wales, Sydney, NSW 2052, Australia
[email protected], [email protected]

ABSTRACT

We report on the subjective comparison of speech enhancement algorithms using the ITU-T P.835 standard. A Matlab toolbox, P835Tool, has been developed to comply with the ITU-T P.835 standard. The ITU-T P.835 methodology was designed to evaluate speech quality along three dimensions: signal distortion, noise distortion and overall quality. A subjective comparison of five speech enhancement algorithms is presented. The results show the effectiveness of the developed toolbox for the subjective comparison of speech enhancement algorithms. Furthermore, a new performance index based on subjective (ITU-T P.835) and objective (PESQ, ITU-T P.862) evaluations is proposed.

1. INTRODUCTION
The problem of enhancing speech degraded by uncorrelated additive noise, when only the noisy speech is available, has been widely studied and remains an active field of research. Although various speech enhancement algorithms have been proposed, it remains unclear which algorithm performs well in real-world listening situations, where the noise level and characteristics are constantly changing. Various objective measures have been developed to evaluate the performance of speech processing systems [1], e.g. Itakura-Saito distortion, the Articulation Index, segmental SNR, and SNR, which correlate with subjective tests at 59%, 67%, 77%, and 24%, respectively [1]. On the other hand, PESQ (Perceptual Evaluation of Speech Quality, ITU-T P.862), which has a 93.5% correlation with subjective tests [2], has also been used for the objective evaluation of speech enhancement algorithms [3-5]. Despite the advances in objective measures, subjective evaluation is still required, as no objective measure has yet matched the accuracy of the human auditory system. In 2003, a subjective test methodology for evaluating noise suppression algorithms was standardised as ITU-T P.835 [6]. The methodology uses separate rating scales to
1-4244-0411-8/06/$20.00 ©2006 IEEE.
independently estimate the subjective quality of the speech signal alone, the background noise alone, and the overall quality. Using these three dimensions of subjective speech quality reduces the listener's uncertainty and increases reliability.

In this paper, we report on the development of a Matlab toolbox, P835Tool, suitable for the subjective evaluation of speech enhancement algorithms. The main design goals of the toolbox are as follows:
• The toolbox should adhere to the ITU-T P.835 standard as closely as possible.
• The toolbox should be simple to learn and easy to use.
• The toolbox should be useful for both research and education.

The developed toolbox is then employed to perform subjective tests on the enhanced speech presented in [4]. Twelve subjects took part in the listening tests. The subjective test results were then correlated with the objective results reported in [4]. Furthermore, a new performance index that combines subjective and objective evaluations is proposed to fairly evaluate speech enhancement algorithms.

2. DESCRIPTION OF ITU-T P.835 STANDARD
ITU-T P.835 [6] specifies the speech material, the preprocessing of the speech and noise files, the listening level, the speech-sample presentation, and the data analysis. In this paper, we concentrate on the speech-sample presentation part, i.e. how to design the trial for the subjective evaluation of speech only, noise only, and overall quality. Consider a speech signal s(n) corrupted by additive stationary background noise v(n). The noisy speech can be expressed as follows: y(n) = s(n) + v(n)
(1)
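The noise scaling implied by Eq. (1) can be sketched as follows. This is an illustrative Python snippet, not the paper's Matlab code; the function name and interface are hypothetical, and it simply scales the noise variance so that the mixture reaches a requested SNR in dB.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    # Hypothetical helper: scale the noise so that speech + noise has the
    # requested SNR in dB. The paper adjusts the noise variance this way;
    # the actual Matlab implementation is not shown in the paper.
    n = min(len(speech), len(noise))
    p_s = sum(x * x for x in speech[:n]) / n   # speech power
    p_v = sum(x * x for x in noise[:n]) / n    # noise power
    gain = math.sqrt(p_s / (p_v * 10 ** (snr_db / 10.0)))
    # Eq. (1): y(n) = s(n) + v(n), with v pre-scaled to the target SNR.
    return [speech[i] + gain * noise[i] for i in range(n)]
```

A speech enhancement algorithm then receives only the mixture y(n) and attempts to recover s(n).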
The Speech Enhancement Algorithm (SEA) attempts to estimate s(n) by removing the noise v(n) from the noisy speech y(n). The enhanced speech ŝ(n) then passes through the P835Tool as illustrated in Fig. 1. The scores for
speech only, noise only and overall quality are then recorded.

Figure 1. The usage of P835Tool to evaluate the enhanced speech

Each trial, according to ITU-T P.835, contains a three-sentence speech sample used to measure speech, noise, and overall quality. Each sample is followed by a silent voting period. The ITU-T P.835 trial configuration is illustrated in Fig. 2. For the first two sub-samples, listeners rate either the signal or the background, depending on the rating-scale order specified for that trial. To control for the effects of rating-scale order, the order should be balanced across the experiment, i.e., the scale order should be "Signal, Background, Overall Effect" (Configuration 1) for half of the trials, and "Background, Signal, Overall Effect" (Configuration 2) for the other half.

Figure 2. Two possible configurations for a P.835 trial

For rating the signal, subjects are instructed to attend only to the speech signal and rate the speech on the five-category distortion scale: (1) very distorted, (2) fairly distorted, (3) somewhat distorted, (4) slightly distorted, and (5) not distorted. For rating the background, subjects are instructed to attend only to the background and rate it on the five-category intrusiveness scale: (1) very intrusive, (2) somewhat intrusive, (3) noticeable but not intrusive, (4) slightly noticeable, and (5) not noticeable. For the third sub-sample in each trial, subjects listen to the speech plus background noise and rate it on the five-category Mean Opinion Score (MOS) scale used with the Absolute Category Rating (ACR) [7]: (1) bad, (2) poor, (3) fair, (4) good, and (5) excellent. Listeners are required to register their responses before the next stimulus is presented. The scale to be used by subjects, either Configuration 1 (SIG-BAK-OVRL) or Configuration 2 (BAK-SIG-OVRL), should be made apparent for each sub-sample presentation.

3. STRUCTURE OF THE P835TOOL TOOLBOX

The P835Tool toolkit consists of two main parts: a configuration part and a listening-test part. The configuration part is implemented in the p835config.m file, while the listening-test part is implemented in four Matlab files: p835.m, tmodule.m, tts.m and volume.m.

Figure 3. Initialisation of P835Tool

Fig. 3 shows the initialisation process of P835Tool, performed by executing p835config.m. The input parameters to the function are NMethods and NFiles. Suppose we evaluate five speech enhancement algorithms on two speech files with four noise files and three SNR conditions. In this case, the parameter NMethods = 5 and the parameter NFiles = 2 × 4 × 3 = 24. The function then prompts for the names and locations of the processed speech files and stores them in the p835config.txt file.

Figure 4. Structure of P835Tool actual listening test

Fig. 4 shows the actual listening test of P835Tool, performed by executing p835.m. The two main Matlab functions are p835.m and tmodule.m. Two optional files are tts.m, which provides text-to-speech playback of the listening-test instructions, and volume.m, which controls the listening volume. There are three modes of operation: "training", "evaluation", and "all" (a combination of the training and evaluation phases). The "training" mode runs the training required before the subject takes the actual listening test, familiarising the user with the toolbox and its grading system. The "evaluation" mode performs the actual listening test. Finally, the "all" mode runs both the training and evaluation phases. The input WAV files are taken from the preconfigured file, p835config.txt. The toolbox then randomises the order of the methods and the files. Moreover, as required by the standard, half of the trials use Configuration 1 and the other half use Configuration 2. The results of the listening test are stored in p835result.txt.
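The randomisation and configuration balancing described above can be sketched as follows. This is an illustrative Python snippet under assumed names (the toolbox itself performs this in p835.m, whose code is not shown in the paper): every (method, file) pair becomes one trial, the trial order is shuffled, and half of the trials are assigned Configuration 1 and half Configuration 2.

```python
import random

def build_trial_list(methods, files, seed=0):
    # Hypothetical sketch of the trial randomisation required by ITU-T P.835.
    # Each (method, file) pair is one trial; trial order is shuffled, and the
    # rating-scale order is balanced: half Configuration 1 (SIG-BAK-OVRL),
    # half Configuration 2 (BAK-SIG-OVRL).
    trials = [(m, f) for m in methods for f in files]
    rng = random.Random(seed)
    rng.shuffle(trials)
    half = len(trials) // 2
    configs = [1] * half + [2] * (len(trials) - half)
    rng.shuffle(configs)  # randomise which trials get which configuration
    return [(m, f, c) for (m, f), c in zip(trials, configs)]
```

For the example above (NMethods = 5 and NFiles = 24), this yields 120 trials, 60 in each configuration.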
Figure 5. Screenshot of Signal Score

Fig. 5 shows the listening-test screen for scoring the signal. The same layout is used for scoring the noise distortion and the overall enhanced speech, with the appropriate scoring scales. The Play/Stop button plays and stops the speech file, and listeners may play or stop the file as often as they like. The score is then entered by the listener using either the slider or the textbox. Note that the slider and the textbox only become active once the listener has played the file. Finally, the score is saved to the result file. The listening test is then repeated for the other files and/or methods until all the speech files have been graded.

4. SUBJECTIVE EVALUATION

In [4], five speech enhancement algorithms were evaluated using PESQ: spectral subtraction, SS [8]; spectral subtraction with minimum statistics, SSMS [9]; speech boosting, SB [10]; speech boosting using forward masking model 1, SBFM1 [3]; and speech boosting using forward masking model 2, SBFM2 [4]. Six speech files were taken from the EBU SQAM data set [11]. Twelve noise files were taken from the NOISEX-92 [12] and AURORA [13] databases. Moreover, the noise variance was adjusted to obtain SNRs of -5, 0, 5 and 10 dB, producing 1440 processed speech files (5 algorithms × 6 speech files × 12 noise files × 4 SNRs). PESQ [2] was used for the objective evaluation of the five speech enhancement algorithms.

To complement and verify the objective evaluation reported in [4], a subjective evaluation using the developed P835Tool was performed. A subset of the 1440 files was selected to reduce the length of the subjective evaluations, and high-quality headphones (Sony MDR-V700DJ) were used for the listening tests. A sentence spoken by an English male speaker was corrupted by three background noises (car, factory, and train) at two SNR levels (5 dB and 10 dB), and the files were processed using the five speech enhancement algorithms [4]. A total of 30 processed files were presented to twelve listeners for evaluation; hence, each subject was required to give 90 ratings. The process of rating the signal (SIG) and background (BAK) of noisy speech in a P.835 trial was designed to lead the listener to integrate the effects of both the signal and the background in making their ratings of overall (OVRL) quality.

5. RESULTS AND DISCUSSION
Fig. 7 shows the mean opinion scores (MOS) on the SIG, BAK, and OVRL scales for the five speech enhancement algorithms and three background noises at two SNR levels. The mean scores for the noisy (unprocessed) speech files are also shown for reference. Of the five speech enhancement algorithms examined, the speech boosting technique [10] and the speech boosting technique exploiting forward masking model 2 [4] performed comparably better on the OVRL scale across both SNR conditions and all three types of noise. These overall subjective results were further compared with the objective results obtained using PESQ.

Lower signal distortion, i.e. higher SIG scores, was observed for the algorithms that use speech boosting: SB, SBFM1, and SBFM2. Many listeners found the speech signal only slightly distorted by these algorithms compared to the other speech enhancement algorithms, SS and SSMS. Lower noise intrusiveness, i.e. higher BAK scores, was again obtained with the speech boosting algorithms. In the speech boosting technique, the speech is enhanced while the noise is left unchanged, and human listeners evidently prefer natural noise to the reduced but distorted noise produced by the SS and SSMS algorithms.

Fig. 6 shows the comparison of the subjective and objective scores of the five speech enhancement algorithms, averaged over the two SNR conditions and three types of noise. The correlation between the subjective and objective tests was found to be 81.9%. Note that this correlation is lower than the 93.5% reported for PESQ [2]. This could be due to the evaluation of speech files at low SNRs, i.e. 5 and 10 dB.
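The 81.9% figure above is a plain correlation between the per-algorithm subjective (OVRL) and objective (PESQ) scores. As a reference, a Pearson correlation can be computed as in this short sketch (illustrative Python; the paper does not show its own computation):

```python
import math

def pearson(xs, ys):
    # Sample Pearson correlation coefficient between two score lists,
    # e.g. per-algorithm subjective OVRL scores vs. objective PESQ scores.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Multiplying the result by 100 gives the percentage correlation quoted in the text.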
Figure 6. Subjective Scores, Objective Scores and Combined Scores for various speech enhancement algorithms
In the objective scores reported in [4], the SBFM2 algorithm was found to perform better than the other algorithms. In the subjective (OVRL) scores, however, SB and SBFM2 have comparable scores. Therefore, to combine the objective and subjective scores fairly, we propose the new combined score: CS = ϑ · SS + (1 − ϑ) · OS
(2)
where CS is the combined score, SS is the subjective score, OS is the objective score, and ϑ is the subjective-test trusting factor, with 0.2 ≤ ϑ ≤ 0.8. For example, if the number of subjects taking the listening test is equal to or greater than the 32 defined by ITU-T P.835, then we can set ϑ = 0.8; the more strictly we comply with the ITU-T P.835 standard, the higher the value of ϑ. The inherent problems with subjective measurement are that it is time consuming and expensive, and that it lacks repeatability. Equation (2) therefore provides a fair combination of the subjective and objective measurements. Fig. 6 shows the combined scores for ϑ = 0.5. From the figure, SBFM2 outperforms the other speech enhancement algorithms, which further confirms the experimental results reported in [4].
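Equation (2) is a simple convex combination, shown here as a minimal Python sketch (the function name is ours; note that the equation combines the raw MOS and PESQ values directly, and any rescaling of the two ranges is left to the experimenter):

```python
def combined_score(ss, os_score, trust=0.5):
    # Eq. (2): CS = θ·SS + (1 − θ)·OS, where θ (the subjective-test trusting
    # factor) is constrained to [0.2, 0.8] as stated in the text.
    if not 0.2 <= trust <= 0.8:
        raise ValueError("trusting factor must lie in [0.2, 0.8]")
    return trust * ss + (1 - trust) * os_score
```

For example, with a subjective OVRL score of 4.0, a PESQ score of 3.0, and ϑ = 0.5, the combined score is 3.5.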
6. CONCLUSIONS
A Matlab toolbox for the subjective evaluation of speech enhancement algorithms was introduced in this paper. A total of 30 enhanced speech files from five algorithms were presented to twelve listeners for evaluation. We proposed a new combined score to fairly integrate the subjective and objective scores. Using the new score, it was found that the SBFM2 algorithm outperforms the other speech enhancement algorithms across the types and intensities of noise tested.

7. REFERENCES
[1] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements, Objective Measures of Speech Quality. Englewood Cliffs, NJ: Prentice Hall, 1988.
[2] ITU, "ITU-T P.862: Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," International Telecommunication Union, Geneva, 2001.
[3] T. S. Gunawan and E. Ambikairajah, "Speech enhancement using temporal masking and fractional Bark gammatone filters," in 10th International Conference on Speech Science & Technology, Sydney, pp. 420-425, 2004.
[4] T. S. Gunawan and E. Ambikairajah, "A new forward masking model for speech enhancement," in IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, pp. 149-152, 2006.
[5] T. S. Gunawan and E. Ambikairajah, "Single channel speech enhancement using temporal masking," in 9th IEEE International Conference on Communication Systems (ICCS 2004), Singapore, pp. 250-254, 2004.
[6] ITU, "ITU-T P.835: Subjective test methodology for evaluating speech communication systems that include noise suppression algorithm," International Telecommunication Union, 2003.
[7] ITU, "ITU-T P.800: Methods for subjective determination of transmission quality," International Telecommunication Union, 1996.
[8] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, pp. 113-120, 1979.
[9] R. Martin, "Spectral subtraction based on minimum statistics," in European Signal Processing Conference, Edinburgh, Scotland, pp. 1182-1185, 1994.
[10] N. Westerlund, Applied Speech Enhancement for Personal Communication, PhD thesis, Blekinge Institute of Technology, Ronneby, Sweden, 2003.
[11] EBU, "Sound quality assessment material: recordings for subjective tests," European Broadcasting Union, April 1988.
[12] A. P. Varga, H. J. M. Steeneken, M. Tomlinson, and D. Jones, "The NOISEX-92 study on the effect of additive noise on automatic speech recognition," Technical Report, Speech Research Unit, Defence Research Agency, Malvern, United Kingdom, 1992.
[13] H. Hirsch and D. Pearce, "The AURORA experimental framework for the performance evaluation of speech recognition systems under noisy conditions," in ISCA ITRW ASR2000, Paris, France, 2000.
Figure 7. Subjective test results (SIG = Signal, BAK = Background Noise, OVRL = Overall) for SNR 5 dB (panels a, b, and c) and 10 dB (panels d, e, and f)