Journal of Medical and Biological Engineering, 32(3): 189-194
Predicting the Intelligibility of Cochlear-implant Vocoded Speech from Objective Quality Measure

Fei Chen*
Department of Electrical Engineering, The University of Texas at Dallas, Richardson, Texas 75083, USA

Received 8 Jan 2011; Accepted 27 May 2011; doi: 10.5405/jmbe.885

* Corresponding author: Fei Chen, Tel: +1-972-883-4650, E-mail: [email protected]
Abstract

A reliable measure for predicting the intelligibility of cochlear-implant vocoded speech is desirable, as it can be used to guide the development of new speech coding algorithms for cochlear implants. The present study assesses the performance of several speech quality measures in predicting the intelligibility of cochlear-implant vocoded speech (vocoded English and vocoded Mandarin Chinese). The quality measures are analyzed and correlated with the intelligibility scores of vocoded speech obtained from normal-hearing listeners in three experiments. The perceptual evaluation of speech quality measure (r = 0.91) and the weighted spectral slope measure (r = -0.87) are well correlated with the intelligibility score. Language affects the performance of objective quality measures in predicting the intelligibility of vocoded speech.

Keywords: Vocoded speech, Speech quality measure, Intelligibility prediction
1. Introduction

Cochlear implants (CIs) are used to restore partial hearing to patients with severe to profound deafness [1,2]. The electrodes of a CI device are inserted into the scala tympani of the cochlea, bypassing the hair cells and directly stimulating the residual auditory nerves. The amplitude of the electric stimulation is modulated by the envelope of the speech signal processed by the speech processor. There is large variability in performance among implant users. A number of factors (e.g., electrode insertion depth and placement) may affect performance in quiet and noisy conditions. Unfortunately, it is difficult to evaluate the impact of a given factor on speech perception due to the interaction among factors. Vocoder simulation is widely used for assessing the effects of some of these factors in the absence of patient-specific confounds [3]. In a vocoder simulation, speech is normally processed in a manner similar to that used by the CI speech processor (i.e., temporal envelope information is delivered and fine-structure information is removed) and presented to normal-hearing (NH) listeners for identification. A recent CI development is electric-acoustic stimulation (EAS), in which an electrode array is implanted only partially into the cochlea so as to preserve the residual acoustic hearing (20-60 dB hearing loss (HL) up to 750 Hz and severe-to-profound hearing loss at 1000 Hz and above) that many patients still have
at low frequencies. The benefit of EAS in terms of better speech recognition in noisy environments has been well documented [4]. Vocoder simulation has also been used to assess factors that influence the performance of EAS users [5]. Considering the large algorithmic parameter space and the large number of signal-to-noise ratio (SNR) levels needed to construct psychometric functions in noisy conditions, a large number of listening tests with vocoded speech are often needed to reach reliable conclusions. In a study by Xu et al., a total of 80 test conditions were examined, requiring about 32 hours of testing per listener [6]. Alternatively, a speech intelligibility index could be used to predict the intelligibility of vocoded speech. Such an index can be used to guide the development of new speech processing strategies for cochlear implants. Although a number of such indices (e.g., the articulation index in [7]) are available for predicting (wide-band) speech intelligibility by NH and hearing-impaired listeners, only a few studies have considered indices for vocoded speech [8-10]. Recent research reported that the intelligibility of vocoded speech is highly correlated with the perceptual evaluation of speech quality (PESQ) measure, which was originally designed for predicting subjective speech quality [11]. The vocoding mechanism primarily delivers envelope information and eliminates the fine-structure information contained in the speech signal. The envelope information affects loudness perception by CI users. Most measures originally designed for evaluating speech quality (e.g., PESQ in [11]) are modeled in terms of differences in loudness or spectral envelopes. The present study thus assesses the performance of speech quality measures in predicting the intelligibility of vocoded speech. The underlying hypothesis is that a measure
that reliably predicts speech distortion (and overall quality) is also highly correlated with the intelligibility of vocoded speech. This is based on the premise that distortion (e.g., that introduced by the vocoding algorithm) degrades speech intelligibility. The aims of the present work are (1) to assess the performance of conventional objective measures (originally designed for predicting speech quality) when applied to predicting the intelligibility of vocoded speech, and (2) to investigate whether their predictive power varies with the language under analysis (i.e., a non-tonal language vs. a tonal language). Intelligibility scores of vocoded (English and Mandarin Chinese) speech were first collected from listening experiments, and subsequently correlated with a number of quality measures to examine their performance in predicting the intelligibility of vocoded speech.
2. Speech intelligibility data
2.1 Subjects

Speech (i.e., vocoded English and vocoded Mandarin Chinese) intelligibility data were collected from three listening experiments using NH listeners as subjects. Experiments 1 and 2 were taken from a study by Chen and Loizou that assessed the contribution of weak consonants to vocoded English speech intelligibility in noisy environments [9]. Experiment 3 was taken from a study by Chen and Loizou on predicting the intelligibility of vocoded Mandarin sentences [12]. All subjects were native speakers of either American English or Mandarin Chinese, and were paid for their participation. Details of the subjects and test conditions for the three experiments are given in Table 1.

Table 1. Summary of subjects and test conditions involved in the correlation analysis. "M" and "F" denote male and female, respectively.

Exp. | Vocoded speech    | No. of subjects | Age         | Maskers        | No. of conditions (tone-vocoder / EAS-vocoder)
1    | English           | 7 (4 M, 3 F)    | 24 ± 3 yrs  | SSN, 2-talker  | 6 / 24
2    | English           | 6 (4 M, 2 F)    | 26 ± 8 yrs  | SSN, 2-talker  | 4 / 20
3    | Mandarin Chinese  | 9 (5 M, 4 F)    | 28 ± 7 yrs  | SSN, 2-talker  | 10 / 10
2.2 Stimuli

The speech material for Experiments 1 and 2 consisted of phonetically balanced English sentences taken from the IEEE database [13,14], and that for Experiment 3 consisted of Mandarin sentences taken from the Sound Express database, a self-paced software program designed for CI and hearing aid users to practice and develop their listening skills [15]. Each English and Chinese sentence contained 8 and 7 words on average, respectively. Two types of masker were used to corrupt the sentences. The first was continuous steady-state noise (SSN), whose long-term spectrum matched that of the test sentences, and the second was two equal-level interfering female talkers (2-talker). For the test of vocoded English, the sentences were corrupted by the SSN and 2-talker maskers at -5, 0, and 5 dB SNR. For the test of vocoded Mandarin Chinese, the sentences in the tone-vocoder test condition (implementation described in Section 2.3) were corrupted by the SSN and 2-talker maskers at -4, 0, 4, 8, and 12 dB SNR, whereas those in the EAS-vocoder test condition were corrupted by the same maskers at -4, -2, 0, 2, and 4 dB SNR. These SNR levels were selected to avoid ceiling/floor effects in the speech intelligibility data. More information on the stimuli, and on the mixing procedure sketched below, can be found in [9] and [12].
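As an illustration of the masker mixing just described, the following is a minimal sketch (the function name and frame handling are hypothetical; the study's actual mixing code is not published):

```python
import numpy as np

def mix_at_snr(speech, masker, snr_db):
    """Scale `masker` so that the speech-to-masker power ratio equals
    `snr_db` (in dB), then return the noisy mixture."""
    # Trim or tile the masker to the length of the speech signal.
    masker = np.resize(masker, speech.shape)
    speech_power = np.mean(speech ** 2)
    masker_power = np.mean(masker ** 2)
    # Gain that brings the masker to the desired SNR.
    gain = np.sqrt(speech_power / (masker_power * 10 ** (snr_db / 10)))
    return speech + gain * masker
```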
2.3 Signal processing

The stimuli were presented in two signal processing conditions, namely tone-vocoder and EAS-vocoder. The first processing condition (tone-vocoder) was designed to simulate eight-channel electrical stimulation, and used an eight-channel sinewave-excited vocoder. Signals were first processed through a pre-emphasis filter (2000-Hz cutoff) with a 3 dB/octave roll-off, and then band-passed into eight frequency bands between 80 and 6000 Hz (see Table 2) using sixth-order Butterworth filters [5]. The equivalent rectangular bandwidth (ERB) scale was used to allocate the bandwidth of the eight channels. The envelope of the signal was extracted by full-wave rectification and low-pass filtering using a second-order Butterworth filter (400-Hz cutoff). Sinusoids were generated with amplitudes equal to the root-mean-square (RMS) energy of the envelopes (computed every 4 ms) and frequencies equal to the center frequencies of the bandpass filters. The sinusoids of each band were summed, and the level of the synthesized speech segment was adjusted to have an RMS value equal to that of the original speech segment.

The second processing condition (EAS-vocoder) simulated combined electric-acoustic stimulation. The signal was first low-pass (LP)-filtered to 600 Hz using a sixth-order Butterworth filter. To simulate the effects of EAS for patients with residual hearing below 600 Hz, the LP stimulus was combined with the upper five channels of the eight-channel tone-vocoder (see Table 2).
Table 2. Filter cutoff (-3 dB) frequencies used for the tone-vocoder and EAS-vocoder processing.

Channel | Tone-vocoder Low (Hz) | Tone-vocoder High (Hz) | EAS-vocoder Low (Hz) | EAS-vocoder High (Hz)
1       | 80    | 221   | Unprocessed (80-600) |
2       | 221   | 426   |                      |
3       | 426   | 724   |                      |
4       | 724   | 1158  | 724   | 1158
5       | 1158  | 1790  | 1158  | 1790
6       | 1790  | 2710  | 1790  | 2710
7       | 2710  | 4050  | 2710  | 4050
8       | 4050  | 6000  | 4050  | 6000
These two vocoders (tone-vocoder and EAS-vocoder) have been widely used as simulation tools in CI studies, and their results predict well the performance patterns or trends observed in CI listeners [1-5]. Henceforth, CI vocoded speech denotes vocoded speech generated by the tone-vocoder and the EAS-vocoder.
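For readers who wish to experiment, a simplified sketch of the eight-channel tone-vocoder described in Section 2.3 follows. It is an illustration under stated assumptions, not the implementation used in the study: the pre-emphasis filter is omitted, the carrier is placed at the geometric center of each band (the paper does not specify how the center frequency is computed), and the output level is matched globally rather than per 4-ms segment.

```python
import numpy as np
from scipy.signal import butter, lfilter

def tone_vocoder(x, fs, edges):
    """Eight-channel sinewave-excited vocoder.
    `edges` holds the nine band-edge frequencies (Hz), e.g.
    [80, 221, 426, 724, 1158, 1790, 2710, 4050, 6000] from Table 2."""
    out = np.zeros(len(x))
    t = np.arange(len(x)) / fs
    b_env, a_env = butter(2, 400 / (fs / 2))  # 400-Hz second-order envelope LP filter
    for lo, hi in zip(edges[:-1], edges[1:]):
        # butter(3, ..., btype='band') yields a sixth-order bandpass overall.
        b, a = butter(3, [lo / (fs / 2), hi / (fs / 2)], btype='band')
        band = lfilter(b, a, x)
        env = lfilter(b_env, a_env, np.abs(band))  # full-wave rectify + LP filter
        fc = np.sqrt(lo * hi)                      # assumed: geometric band center
        out += env * np.sin(2 * np.pi * fc * t)
    # Match the RMS level of the original signal (global, not per segment).
    out *= np.sqrt(np.mean(x ** 2) / np.mean(out ** 2))
    return out
```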
Table 3. Correlation coefficients (r) and standard deviations of the prediction error (σe) between sentence recognition scores and various quality measures for vocoded English and vocoded Mandarin Chinese. "T" and "EAS" denote tone-vocoder and EAS-vocoder, respectively.

Vocoded English:

Quality measure | T + EAS r | T + EAS σe (%) | T r   | T σe (%) | EAS r | EAS σe (%)
PESQ            | 0.91      | 10.0           | 0.92  | 8.0      | 0.90  | 10.1
LLR             | -0.56     | 19.9           | -0.49 | 17.6     | -0.53 | 19.5
IS              | 0.69      | 17.4           | 0.79  | 12.5     | 0.71  | 16.2
CEP             | -0.57     | 19.7           | -0.42 | 18.4     | -0.54 | 19.3
SNRseg          | 0.63      | 18.8           | 0.75  | 13.5     | 0.61  | 18.3
WSS             | -0.48     | 21.1           | -0.85 | 10.5     | -0.54 | 19.3
fwSNRseg        | 0.69      | 17.5           | 0.22  | 19.7     | 0.68  | 16.8

Vocoded Mandarin Chinese:

Quality measure | T + EAS r | T + EAS σe (%) | T r   | T σe (%) | EAS r | EAS σe (%)
PESQ            | 0.48      | 19.0           | 0.18  | 24.7     | 0.68  | 12.3
LLR             | -0.36     | 20.2           | -0.10 | 25.0     | -0.47 | 14.8
IS              | 0.31      | 20.6           | -0.09 | 25.0     | 0.33  | 15.8
CEP             | -0.40     | 19.8           | -0.27 | 24.2     | -0.38 | 15.5
SNRseg          | 0.42      | 19.6           | 0.87  | 12.3     | 0.79  | 10.2
WSS             | -0.87     | 10.7           | -0.90 | 10.9     | -0.92 | 6.7
fwSNRseg        | 0.24      | 21.0           | 0.02  | 25.1     | 0.10  | 16.7
2.4 Procedure

The experiments were performed in a sound-proof room (Acoustic Systems, Inc.) using a PC connected to a Tucker-Davis System 3 workstation (Tucker-Davis Technologies, Alachua, FL, USA). Stimuli were played to listeners monaurally through Sennheiser HD 250 Linear II circumaural headphones at a comfortable listening level. Prior to the test, all subjects participated in a 10-minute training session and listened to a set of tone-vocoded and EAS-vocoded stimuli to familiarize themselves with the testing procedure. During the testing session, the subjects were asked to write down all the words they had heard. Each subject in Experiments 1 and 2 listened to vocoded English stimuli in a total of 30 and 24 test conditions, respectively. Each subject in Experiment 3 listened to vocoded Mandarin Chinese stimuli in a total of 20 test conditions. Twenty sentences were used per condition, and none of the sentences were repeated across conditions. The order of the test conditions was randomized across subjects. Subjects were given a 5-minute break every 30 minutes during the testing session.
3. Speech quality measures

A number of widely used objective speech quality measures were examined in this study for predicting the intelligibility of vocoded speech (vocoded English and vocoded Mandarin Chinese) in noisy conditions. This study investigated the performance of the PESQ measure [11] and of the linear predictive coding (LPC)-based objective measures, including the log-likelihood ratio (LLR), Itakura-Saito (IS), and cepstrum (CEP) distance measures [16]. In addition, this work evaluated the performance of the time-domain segmental SNR measure (SNRseg) [17], the frequency-weighted segmental SNR measure (fwSNRseg) [18], and the weighted spectral slope (WSS) measure [19]. The definitions of these measures are summarized in [20] and the Appendix. The above measures are primarily based on the premise that speech quality can be modeled in terms of differences in loudness between the original and processed signals [21] or in terms of differences in the spectral envelopes (e.g., as computed using an LPC model) between the original and processed signals. The PESQ measure, for instance,
assesses speech quality by estimating the overall loudness difference between the noise-free and processed signals [11,22].
4. Results

Two statistical measures were used to assess the performance of the tested speech quality measures, namely Pearson's correlation coefficient (r) and an estimate of the standard deviation of the prediction error (σe). The average intelligibility scores obtained by NH listeners were subjected to correlation analysis with the corresponding values obtained by the objective measures described in the previous section. Scores obtained for a total of 54 and 20 test conditions were included in the correlation analysis for vocoded English and vocoded Mandarin Chinese, respectively (see Table 1). A larger correlation coefficient r (or a smaller standard deviation of the prediction error σe) indicates better applicability of the examined measure for predicting the intelligibility of vocoded speech. Figure 1 shows sentence intelligibility scores for the 54 vocoded test conditions of English sentences and the 20 vocoded test conditions of Mandarin Chinese sentences. Table 3 shows the resulting correlation coefficients and prediction errors for vocoded English and vocoded Mandarin Chinese. For vocoded English, the PESQ measure had the highest correlation (r = 0.91) (see scatter plot in Fig. 1(a)), which confirms the results of Chen and Loizou [9]. The performance of the other conventional objective measures (IS, SNRseg, and fwSNRseg) was good (r = 0.63~0.69), whereas that of the LLR, CEP, and WSS measures was quite poor (r = -0.48~-0.57).
Figure 1. Scatter plots of intelligibility scores (percent correct) versus the PESQ measure for tone-vocoded and EAS-vocoded stimuli: (a) vocoded English (r = 0.91) and (b) vocoded Mandarin Chinese (r = 0.48).
For vocoded Mandarin Chinese, the objective quality measures analyzed in this work performed differently. The PESQ measure yielded a very low correlation (r = 0.48) (see scatter plot in Fig. 1(b)). The WSS measure surprisingly gave a high intelligibility prediction, with a correlation of r = -0.87. The other conventional objective measures poorly predicted the intelligibility of vocoded Mandarin Chinese (r = -0.40~0.42). Table 3 also shows the correlation coefficients and prediction errors for each vocoder condition (i.e., tone-vocoder (T in the table) vs. EAS-vocoder (EAS in the table)). The PESQ measure consistently predicted well the intelligibility of both tone-vocoded (r = 0.92) and EAS-vocoded (r = 0.90) English sentences. The WSS measure yielded a high correlation with the intelligibility of tone-vocoded English sentences (r = -0.85). This might be due to the relatively small number of tone-vocoder conditions (i.e., 10) for English in this study. The WSS measure was highly correlated with the intelligibility of Mandarin Chinese sentences processed by the tone-vocoder (r = -0.90) and the EAS-vocoder (r = -0.92).
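For reference, the two statistics reported in Table 3 can be computed as sketched below. The error estimator σe = σd·sqrt(1 − r²) is an assumption here, following the form commonly used in comparable correlation studies [20]; the paper does not spell out its estimator.

```python
import numpy as np

def correlation_stats(scores, measure_values):
    """Pearson's r between listener scores (%) and an objective measure,
    plus an estimate of the standard deviation of the prediction error."""
    r = np.corrcoef(scores, measure_values)[0, 1]
    sigma_d = np.std(scores)               # std of the intelligibility scores
    sigma_e = sigma_d * np.sqrt(1 - r**2)  # assumed estimator, as in [20]-style analyses
    return r, sigma_e
```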
5. Discussion

The results in this study are similar to those obtained with wideband (non-vocoded) speech by Ma et al. in [20]. The PESQ measure correlated well (r = 0.91) with the intelligibility score of vocoded English sentences [9]. The PESQ measure, originally designed for predicting the quality of speech transmitted over IP networks [11,22], has been shown to correlate well (r = 0.81) with subjective ratings of speech (i.e., English) distortion introduced by noise-suppression algorithms [18], and performed modestly well (r = 0.77~0.79) in predicting the intelligibility of consonants and sentences in noisy environments [20].

Unlike measures that treat positive and negative loudness differences the same (by squaring the difference), the PESQ measure is based on an elaborate loudness model which treats positive and negative differences differently, as they affect the perceived quality differently. A positive difference indicates that a component has been added to the spectrum, whereas a negative difference indicates that a spectral component has been omitted or heavily attenuated. Compared to added components, omitted components are not as easily perceived due to masking effects, leading to a less objectionable form of distortion. This loudness model seems to be quite robust in terms of reliably predicting the effects of degradations in the signal (such as vocoding distortion) on intelligibility, which might explain why the PESQ measure was the most reliable for predicting speech quality [18] and the speech intelligibility of English in this study.

Although the PESQ measure well predicted the intelligibility of vocoded English, it was poorly correlated with the intelligibility score of vocoded Mandarin Chinese sentences. Unlike English, Mandarin Chinese is a tonal language, which uses four tones to express the lexical meaning of words. The fundamental frequency contour is the dominant cue used for tone recognition in Mandarin. Thus, it is hypothesized that in order to predict the intelligibility of Mandarin well, the
measure should also reliably capture the distortion that degrades tone recognition. The WSS measure captures the difference between spectral slopes in each frequency band, which might make it suitable for capturing the degradation of tone recognition. Nevertheless, the performance of the WSS measure varied substantially between vocoded English and vocoded Mandarin Chinese.

The above finding implies that there is a language effect on the performance of a quality measure in predicting the intelligibility of vocoded speech. For tonal languages, a measure that reliably predicts speech distortion and overall quality (e.g., PESQ) might not be sufficient for assessing speech intelligibility. Further research is required to determine the relation between speech quality and speech intelligibility for tonal languages (e.g., Mandarin Chinese).

In investigating the language effect on intelligibility prediction, many factors can lead to differences between the test materials for different languages (i.e., English and Mandarin Chinese in this study), including syntactic difficulty, lexical familiarity, loudness level, etc. The experiments conducted in this study attempted to avoid possible interference from these factors on sentence intelligibility. For instance, the test materials (both English and Mandarin Chinese sentences) consisted of phonetically balanced sentences used in daily life, pronounced at a medium rate. The loudness levels were set to a comfortable level for speech perception. Subjects were given a 5-minute break every 30 minutes to minimize fatigue during the listening experiments.
6. Conclusion

This study assessed the performance of several widely used objective measures in predicting the intelligibility of vocoded speech. Although some indices (e.g., PESQ) were highly correlated with the intelligibility score, most objective measures performed only modestly well or poorly in predicting the intelligibility of vocoded speech. Language was found to affect how well objective quality measures predict the intelligibility of vocoded speech.
Appendix

The LPC-based measures (the LLR, IS, and CEP distance measures) assess the difference between the spectral envelopes, as computed by the LPC model, of the input clean signal and the processed signal. The LLR measure is defined as [16]:

$$d_{\mathrm{LLR}}(a_p, a_c) = \log\left(\frac{a_p R_c a_p^T}{a_c R_c a_c^T}\right) \qquad (1)$$

where $a_c$ is the LPC vector of the clean speech signal, $a_p$ is the LPC vector of the processed speech signal, and $R_c$ is the autocorrelation matrix of the clean speech signal. The IS measure is defined as [16]:

$$d_{\mathrm{IS}}(a_p, a_c) = \frac{\sigma_c^2}{\sigma_p^2}\left(\frac{a_p R_c a_p^T}{a_c R_c a_c^T}\right) + \log\left(\frac{\sigma_c^2}{\sigma_p^2}\right) - 1 \qquad (2)$$

where $\sigma_c^2$ and $\sigma_p^2$ are the LPC gains of the clean and processed speech signals, respectively.
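For illustration, a minimal per-frame sketch of the LLR computation in Eq. (1) follows; frame extraction, windowing, and averaging of frame-level values are omitted, and the LPC order of 10 is an arbitrary choice here.

```python
import numpy as np
from scipy.linalg import toeplitz, solve

def lpc(frame, order):
    """LPC coefficients [1, -a1, ..., -ap] via the autocorrelation method."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve(toeplitz(r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def llr(clean_frame, proc_frame, order=10):
    """Log-likelihood ratio between a clean and a processed frame, Eq. (1)."""
    r = np.correlate(clean_frame, clean_frame, mode='full')[len(clean_frame) - 1:]
    Rc = toeplitz(r[:order + 1])        # autocorrelation matrix of the clean frame
    ac = lpc(clean_frame, order)
    ap = lpc(proc_frame, order)
    return np.log((ap @ Rc @ ap) / (ac @ Rc @ ac))
```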
The CEP distance provides an estimate of the log spectral distance between two spectra and is computed as [16]:

$$d_{\mathrm{CEP}}(c_c, c_p) = \frac{10}{\log 10}\sqrt{2\sum_{k=1}^{p}\left[c_c(k) - c_p(k)\right]^2} \qquad (3)$$

where $c_c$ and $c_p$ are the CEP coefficient vectors of the clean and processed speech signals, respectively.

The time-domain segmental SNR (SNRseg) measure is computed as [17]:

$$\mathrm{SNRseg} = \frac{10}{M}\sum_{m=0}^{M-1}\log_{10}\frac{\displaystyle\sum_{n=Nm}^{Nm+N-1} x(n)^2}{\displaystyle\sum_{n=Nm}^{Nm+N-1}\left(x(n) - \hat{x}(n)\right)^2} \qquad (4)$$

where $x(n)$ is the input clean signal, $\hat{x}(n)$ is the processed signal, $N$ is the frame length, and $M$ is the number of frames in the signal.
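A compact sketch of Eq. (4) is given below; clipping frame-level values to [-10, 35] dB before averaging is the common practice from [17], applied here as an assumption.

```python
import numpy as np

def snr_seg(x, x_hat, frame_len=256, limits=(-10, 35)):
    """Time-domain segmental SNR, Eq. (4), averaged over whole frames."""
    n_frames = len(x) // frame_len
    vals = []
    for m in range(n_frames):
        seg = slice(m * frame_len, (m + 1) * frame_len)
        num = np.sum(x[seg] ** 2)
        den = np.sum((x[seg] - x_hat[seg]) ** 2) + 1e-12  # avoid divide-by-zero
        vals.append(10 * np.log10(num / den))
    # Clip frame-level SNRs to the usual dynamic range before averaging.
    return float(np.mean(np.clip(vals, *limits)))
```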
The frequency-weighted segmental SNR (fwSNRseg) measure is computed using [18]:

$$\mathrm{fwSNRseg} = \frac{10}{M}\sum_{m=0}^{M-1}\frac{\displaystyle\sum_{j=1}^{K} W(j,m)\,\log_{10}\frac{X(j,m)^2}{\left(X(j,m) - \hat{X}(j,m)\right)^2}}{\displaystyle\sum_{j=1}^{K} W(j,m)} \qquad (5)$$
where $W(j,m)$ is the weight placed on the $j$-th frequency band, $K$ is the number of bands, $M$ is the total number of frames in the signal, $X(j,m)$ is the critical-band magnitude (excitation spectrum) of the clean signal in the $j$-th frequency band at the $m$-th frame, and $\hat{X}(j,m)$ is the corresponding spectral magnitude of the enhanced signal in the same band.

The WSS measure is defined as:

$$d_{\mathrm{WSS}} = \frac{1}{M}\sum_{m=0}^{M-1}\frac{\displaystyle\sum_{j=1}^{K} W_{\mathrm{WSS}}(j,m)\left(S_c(j,m) - S_p(j,m)\right)^2}{\displaystyle\sum_{j=1}^{K} W_{\mathrm{WSS}}(j,m)} \qquad (6)$$
where $W_{\mathrm{WSS}}(j,m)$ are the weights computed as described in [19], $K = 25$, $M$ is the number of data segments, and $S_c(j,m)$ and $S_p(j,m)$ are the spectral slopes for the $j$-th frequency band of the clean and processed speech signals, respectively.

In computing the PESQ measure, the clean and processed signals are first level-equalized to a standard listening level, and then filtered by a filter whose response is similar to that of a standard telephone handset. The signals are aligned in time to correct for any time delays, and then processed through an auditory transform to obtain the loudness spectra. The absolute difference between the processed and original loudness spectra is used as a measure of audible error in the next stage of the PESQ computation. Unlike most objective measures, which treat positive and negative loudness differences the same (by squaring the difference), the PESQ measure treats them differently, as positive and negative loudness differences affect the perceived quality differently. Consequently, different weights are applied to positive and negative differences. The differences, termed disturbances, between the loudness spectra are computed and averaged over time and frequency to produce the prediction of speech quality. The final PESQ score is computed as a linear combination of the average disturbance value ($d_{\mathrm{sym}}$) and the average asymmetrical disturbance value ($d_{\mathrm{asym}}$) as [11,22]:

$$\mathrm{PESQ} = a_0 + a_1 d_{\mathrm{sym}} + a_2 d_{\mathrm{asym}} \qquad (7)$$

where $a_0 = 4.5$, $a_1 = -0.1$, and $a_2 = -0.0309$. The range of the PESQ score is $-0.5$ to $4.5$.
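Similarly, Eq. (5) can be sketched as below, assuming the critical-band magnitudes $X$ and $\hat{X}$ (K bands by M frames) have already been computed upstream, and taking $W(j,m) = X(j,m)^{0.2}$, one weighting considered in [18]; both the band decomposition and the weight exponent are assumptions of this sketch.

```python
import numpy as np

def fw_snr_seg(X, X_hat, gamma=0.2):
    """Frequency-weighted segmental SNR, Eq. (5), from K x M matrices of
    band magnitudes for the clean (X) and processed (X_hat) signals."""
    W = X ** gamma                                   # assumed weight: X^0.2, cf. [18]
    snr = 10 * np.log10(X ** 2 / ((X - X_hat) ** 2 + 1e-12))
    snr = np.clip(snr, -10, 35)                      # usual dynamic-range limits
    # Weighted average over bands, then average over frames.
    return float(np.mean(np.sum(W * snr, axis=0) / np.sum(W, axis=0)))
```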
References

[1] P. C. Loizou, "Introduction to cochlear implants," IEEE Eng. Med. Biol. Mag., 18: 32-42, 1999.
[2] F. G. Zeng, "Trends in cochlear implants," Trends Amplif., 8: 1-34, 2004.
[3] R. V. Shannon, F. G. Zeng, V. Kamath, J. Wygonski and M. Ekelid, "Speech recognition with primarily temporal cues," Science, 270: 303-304, 1995.
[4] B. J. Gantz and C. Turner, "Combining acoustic and electric hearing," Laryngoscope, 113: 1726-1730, 2003.
[5] M. K. Qin and A. J. Oxenham, "Effects of introducing unprocessed low-frequency information on the reception of the envelope-vocoder processed speech," J. Acoust. Soc. Am., 119: 2417-2426, 2006.
[6] L. Xu, C. S. Thompson and B. E. Pfingst, "Relative contributions of spectral and temporal cues for phoneme recognition," J. Acoust. Soc. Am., 117: 3255-3267, 2005.
[7] N. R. French and J. C. Steinberg, "Factors governing the intelligibility of speech sounds," J. Acoust. Soc. Am., 19: 90-119, 1947.
[8] R. L. Goldsworthy and J. E. Greenberg, "Analysis of speech-based speech transmission index methods with implications for nonlinear operations," J. Acoust. Soc. Am., 116: 3679-3689, 2004.
[9] F. Chen and P. C. Loizou, "Contribution of consonant landmarks to speech recognition in simulated acoustic-electric hearing," Ear Hear., 31: 259-267, 2010.
[10] F. Chen and P. C. Loizou, "Predicting the intelligibility of vocoded speech," Ear Hear., 32: 331-338, 2011.
[11] A. W. Rix, J. G. Beerends, M. P. Hollier and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ): a new method for speech quality assessment of telephone networks and codecs," Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2: 749-752, 2001.
[12] F. Chen and P. C. Loizou, "Predicting the intelligibility of vocoded and wideband Mandarin Chinese," J. Acoust. Soc. Am., 129: 3281-3290, 2011.
[13] IEEE Subcommittee, "IEEE recommended practice for speech quality measurements," IEEE Trans. Audio Electroacoust., 17: 225-246, 1969.
[14] P. C. Loizou (Ed.), Speech Enhancement: Theory and Practice, Boca Raton, FL: CRC Press, 2007.
[15] TigerSpeech Technology, Innovative Speech Software, available: http://www.tigerspeech.com/
[16] S. R. Quackenbush, T. P. Barnwell and M. A. Clements (Eds.), Objective Measures of Speech Quality, Englewood Cliffs, NJ: Prentice-Hall, 1988.
[17] J. H. L. Hansen and B. L. Pellom, "An effective quality evaluation protocol for speech enhancement algorithms," Proc. Int. Conf. Spoken Language Process., 7: 2819-2822, 1998.
[18] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE Trans. Audio Speech Lang. Process., 16: 229-238, 2008.
[19] D. H. Klatt, "Prediction of perceived phonetic distance from critical-band spectra: a first step," Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2: 1278-1281, 1982.
[20] J. Ma, Y. Hu and P. C. Loizou, "Objective measures for predicting speech intelligibility in noisy conditions based on new band-importance functions," J. Acoust. Soc. Am., 125: 3387-3405, 2009.
[21] R. A. Bladon and B. Lindblom, "Modeling the judgment of vowel quality differences," J. Acoust. Soc. Am., 69: 1414-1422, 1981.
[22] ITU-T, "Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrowband telephone networks and speech codecs," ITU-T Recommendation 862, 2000.