Signal-to-noise ratio adaptive post-filtering method for intelligibility enhancement of telephone speech

Emma Jokinen,a) Santeri Yrttiaho, and Hannu Pulakka
Department of Signal Processing and Acoustics, Aalto University, P.O. Box 13000, Fi-00076 Aalto, Finland

Martti Vainio
Institute of Behavioural Sciences, University of Helsinki, P.O. Box 9, Fi-00014 University of Helsinki, Finland

Paavo Alku
Department of Signal Processing and Acoustics, Aalto University, P.O. Box 13000, Fi-00076 Aalto, Finland

(Received 18 March 2012; revised 28 August 2012; accepted 16 October 2012)

Post-filtering can be utilized to improve the quality and intelligibility of telephone speech. Previous studies have shown that energy reallocation with a high-pass type filter works effectively in improving the intelligibility of speech in difficult noise conditions. The present study introduces a signal-to-noise ratio adaptive post-filtering method that utilizes energy reallocation to transfer energy from the first formant to higher frequencies. The proposed method adapts to the level of the background noise so that, in favorable noise conditions, the post-filter has a flat frequency response and the effect of the post-filtering is increased as the level of the ambient noise increases. The performance of the proposed method is compared with a similar post-filtering algorithm and unprocessed speech in subjective listening tests which evaluate both intelligibility and listener preference. The results indicate that both of the post-filtering methods maintain the quality of speech in negligible noise conditions and are able to provide intelligibility improvement over unprocessed speech in adverse noise conditions. Furthermore, the proposed post-filtering algorithm performs better than the other post-filtering method under evaluation in moderate to difficult noise conditions, where intelligibility improvement is mostly required.

© 2012 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4765074]

PACS number(s): 43.72.Kb, 43.60.Dh, 43.60.Mn [MAH]

I. INTRODUCTION

The quality and intelligibility¹ of the speech signal can be degraded in many ways in mobile communications; the transmission through the radio channel and the low bit-rate coding are sources of disturbance. Often there is also some level of environmental noise present at the sending or receiving side of the communication channel, referred to as the far end and the near end, respectively. To combat the deterioration of the quality and intelligibility of speech at the near end, post-processing can be applied to the voice signal received from the communication channel. The developed post-processing methods can be broadly classified into two categories: (i) noise reduction methods and (ii) post-processing techniques for clean speech signals. Whereas noise reduction algorithms are normally applied when the speech signal is corrupted by far-end environmental noise, post-processing can also be applied to speech signals that contain negligible noise. In this case, relevant acoustical cues of the received speech signal are emphasized in order to improve speech quality and intelligibility when the signal is listened to in noisy near-end conditions. Post-filtering is a traditional example of this type of post-processing that is used in mobile phones to improve the quality of the speech signal by reducing the perceptual degradation caused by low bit-rate speech coders. The prefix post is used to indicate that the processing is done after the communication channel, whereas pre-filtering refers to filtering that is done at the sending device before transmitting the encoded speech into the channel.

In the following, known methods from noise reduction and post-filtering for quality enhancement are first briefly described before moving on to hand-tuned and automated post-processing and post-filtering approaches aimed at intelligibility enhancement.

A prevalent type of post-processing is single-channel noise reduction, which includes algorithms based on spectral subtraction (Gustafsson et al., 2001), subspace division (Hu and Loizou, 2003), statistical models (Ephraim and Malah, 1985), and Wiener filtering (Hu and Loizou, 2004). These methods have primarily been used for quality enhancement, while their performance in intelligibility improvement has remained limited (Hu and Loizou, 2007). Recently, however, Loizou and Kim (2011), as well as Kim and Loizou (2011), have studied reasons why noise reduction techniques demonstrate poor performance in intelligibility enhancement. Their studies suggest that by controlling the amount of amplification distortion caused by the over-estimation of the speech spectrum, large gains in intelligibility can be achieved. Their approach, however, is problematic without access to the clean speech spectra (Kim and Loizou, 2011). This in turn limits the usefulness of their technique in mobile device implementations.

Post-filtering is typically achieved by utilizing an adaptive filter to emphasize the peaks in the spectrum and to de-emphasize the spectral valleys, which effectively attenuates the noise components. Chen and Gersho (1995) introduced a post-filtering method that has a straightforward filter structure and has shown good performance in improving the subjective quality of speech. The algorithm includes a short-term formant filter, a long-term pitch filter, a tilt correction filter, and automatic gain control. This basic post-filter form, or parts of it, has been adopted widely and several improvements to its structure have been proposed. Grancharov et al. (2008), for instance, modified the basic post-filter to enhance its performance in the presence of background noise. They replaced the post-filter parameters controlling the formant and pitch emphasis with new parameters that were adapted to the statistics of the background noise. The resulting generalized post-filter was shown to improve the performance of the basic post-filter with different noise types and varying signal-to-noise ratios (SNRs). However, the generalized post-filter, which was specifically developed to perform in the presence of acoustic background noise, was evaluated by Grancharov et al. (2008) with relatively high SNRs where the quality of the speech signal is the main focus.

Several post-processing methods aimed particularly at intelligibility enhancement of speech have also been introduced. For instance, companding has been shown to improve intelligibility perceived by hearing-impaired listeners (Oxenham et al., 2007; Bhattacharya and Zeng, 2007) and the enhancement of the transient components in speech has also provided promising results (Yoo et al., 2007). However, most of the developed intelligibility enhancement methods are not applicable for post-processing in mobile phones because they require, for instance, annotation of speech by hand or a large amount of computation which cannot be done in real-time.

a) Author to whom correspondence should be addressed. Electronic mail: [email protected]

J. Acoust. Soc. Am. 132 (6), December 2012, Pages: 3990–4001

0001-4966/2012/132(6)/3990/12/$30.00 © 2012 Acoustical Society of America

Downloaded 17 Dec 2012 to 128.214.76.18. Redistribution subject to ASA license or copyright; see http://asadl.org/terms
Nevertheless, speech intelligibility is one of the most important factors affecting communication between mobile phone users. For instance, in the presence of severe environmental noise, the quality of the speech signal is no longer the main concern and, therefore, suitable post-processing methods are needed. Some algorithms have been introduced that fulfill the requirements set by the mobile communication framework and have been shown to work with listeners with normal hearing. For instance, Tang and Cooke (2010, 2011) utilized selective boosting of mid-frequency regions of speech with moderate prior SNR under the realistic constraints that the energies and durations of the input and output signals are the same. They found that this approach improved the intelligibility of speech in keyword recognition tests, but the effects of the processing on subjective quality were not studied. Sauert and coworkers have presented several slightly modified algorithms (e.g., Sauert and Vary, 2006), which try to find optimal gains for the sub-bands of the unprocessed speech signal in terms of maximizing the speech intelligibility index (SII). The proposed methods have been shown to improve intelligibility as measured with the SII, but the authors have not conducted any subjective tests on the algorithms. Skowronski and Harris (2006) investigated the idea of energy reallocation by transferring energy from voiced sounds to unvoiced sounds while preserving the overall energy. The performance of their method was compared to high-pass filtering and unprocessed speech in a word recognition test which contained multiple vocabulary test sets and two SNR levels. They were able to show that, on average, the two post-processing methods improved intelligibility compared to unprocessed speech.

In post-filtering, energy reallocation can be achieved by using a high-pass filter to transfer energy from a low-frequency region to higher frequencies. Similar energy reallocation can be observed in Lombard speech, which is produced naturally by speakers in noisy conditions (Summers et al., 1988). Niederjohn and Grotelueschen (1976) used high-pass filtering combined with amplitude compression and showed that this approach produced intelligibility gains in noisy conditions. Hall and Flanagan (2010) studied the effects of differentiation and formant equalization (both methods containing a high-pass type filter) on the intelligibility of telephone speech and concluded that both approaches provided speech that was easier to understand. Even though this kind of energy reallocation has been previously shown to produce intelligibility improvement over unprocessed speech in subjective listening tests, the results have been obtained with limited speech material consisting of single words. Additionally, these kinds of post-filtering methods have not been tested in high SNR conditions, and it is therefore unclear how the processing affects the quality and intelligibility of speech in negligible noise conditions.

The present paper introduces an SNR-adaptive post-filtering algorithm for the enhancement of speech in mobile phones in the presence of near-end background noise. Single-channel noise reduction algorithms are not applicable to this scenario because there is negligible noise present in the received signal.
Furthermore, the proposed method is designed for difficult background noise conditions, where the quality of the speech signal is no longer the main concern and traditional post-filtering is ineffective. The method uses an adaptive high-pass filter and adaptive gain control to transfer energy from low-frequency regions to higher frequencies. In favorable noise conditions, the post-filter has a nearly flat amplitude response, and the effect of the processing is increased as the SNR decreases. The parameters of the filter structure have been optimized in terms of both subjective quality and intelligibility. In addition, the proposed algorithm has been designed to work in real-time with minimal computational and memory requirements and can thus be implemented in mobile devices.

The post-filtering method was compared with unprocessed speech and the formant equalizing post-filter introduced recently by Hall and Flanagan (2010) in comprehensive subjective listening tests which measured both intelligibility and listener preference. The tests were conducted with narrowband speech using two realistic noise types and three SNRs, ranging from negligible background noise to extremely difficult noise conditions. Since narrowband speech is still prevalent in mobile communications, the main focus of this study was on narrowband speech. However, wideband speech is becoming more popular; therefore, the performance of the algorithms with wideband mobile speech was also evaluated with three SNRs using one of the noise conditions.

II. METHODS

The proposed SNR-adaptive (SA) post-filtering algorithm is compared to a formant equalizing (FE) post-filter that was recently introduced by Hall and Flanagan (2010). Both methods utilize a high-pass type filter to transfer energy from the region of the first formant to higher frequencies. In addition, both algorithms can be implemented in mobile devices because they do not induce long delays or require much computation. A major difference between the methods is that the SA method adapts to the level of the background noise; in negligible noise, where processing is not required, the post-filter has a flat frequency response whereas in severe noise conditions the energy reallocation is increased to make the processed speech more intelligible. The frequency response of the FE post-filter remains constant in varying background noise conditions.

A. SA post-filter

1. General description

The flowchart of the post-filtering algorithm is presented in Fig. 1. The incoming speech signal, $s_{\mathrm{IN}}$, is processed in 20-ms frames without overlap. The sampling frequency is 8 kHz throughout the processing chain for narrowband speech. In addition to the incoming speech frame, the algorithm also requires an estimate of the near-end background noise level, which is used for the adaptation of the post-filter. The level of the background noise is estimated by computing the energy of each frame from the near-end microphone signal and updating the noise estimate when the voice activity detector indicates that the frame does not contain speech. The update is done by computing the weighted mean of the previous noise estimate and the energy of the frame. The obtained noise estimate along with the energy of the speech frame is used to determine an estimate of the SNR.

In order to prevent large changes in the coefficients of the post-filter between consecutive frames in varying noise conditions, the adaptation is done in small steps. If the difference between the SNR value utilized for adaptation in the previous frame and the estimate obtained for the current frame is large, the new post-filter parameters are computed based on a smoothed SNR value. There are two different methods for the post-filter parameter adaptation, one for adverse background noise conditions and another for moderate noise conditions. The method of adaptation is different for the two cases to ensure the stability of the post-filter in all conditions. The appropriate adaptation method is determined by comparing the smoothed SNR value to a threshold value. Next, the post-filter, a fifth-order infinite impulse response filter, is constructed and the narrowband speech frame is filtered. After filtering, the energy of the processed speech frame is equalized to the energy level of the unprocessed speech frame with the adaptive gain control (AGC) algorithm used in the adaptive multi-rate (AMR) narrow-band codec (3GPP, TS 26.090, 2008) as a part of the post-processing chain. The AGC scales the processed speech frame sample-by-sample based on a smoothed estimate of the gain scaling factor for the frame.

2. Structure of the post-filter

The post-filter consists of three cascaded filters described by

$$H_{\mathrm{SA}}(z) = H_1\!\left(\frac{z}{\alpha}\right) H_2\!\left(\frac{z}{\alpha}\right) H_{\mathrm{TILT}}(z). \tag{1}$$

The first two filters, $H_1(z/\alpha)$ and $H_2(z/\alpha)$, are referred to as the formant filters because they are used to modify the approximate frequency areas of the first (F1) and second (F2) formant, respectively. The transfer function of the filters is

$$H_i\!\left(\frac{z}{\alpha}\right) = \frac{1 - 2 \cdot 0.9 \cos(\theta_i) \left(\frac{z}{\alpha}\right)^{-1} + 0.9^2 \left(\frac{z}{\alpha}\right)^{-2}}{1 - 2\, r_i \cos(\theta_i) \left(\frac{z}{\alpha}\right)^{-1} + r_i^2 \left(\frac{z}{\alpha}\right)^{-2}}, \quad i = 1, 2, \tag{2}$$

where the frequencies of the formants (in radians) are denoted by $\theta_i$ and the values of $r_i$ control whether the formants are amplified or suppressed and by how much. Parameter $\alpha$ is used in the noise adaptation of the filter. Because the goal was to suppress F1 and to amplify F2, the parameters were initially restricted so that $0 < r_1 \le 0.9$ and $0.9 \le r_2 < 1$. Suitable values for the formant locations $\theta_i$ were determined by computing the averages of the first two formant frequencies from 800 Finnish narrowband sentences (approximately 27 min of speech) from two male and two female speakers. The data is described in more detail in Vainio et al. (2005). The formants were located using an automated procedure that searched for the peaks in the linear prediction (LP) spectrum of voiced speech frames and chose two of them as F1 and F2. The peaks closest to the formants in the previous frame were selected if they were not more than 200 Hz away. In case none of the peaks satisfied this criterion, the formant estimates of the previous frame were used. The obtained average formant locations for the narrowband speech material were F1 = 500 Hz and F2 = 1600 Hz.

FIG. 1. Flowchart of the SA algorithm. The incoming narrowband speech frame is denoted by $s_{\mathrm{IN}}$ and the post-filtered speech frame by $s_{\mathrm{SA}}$. The computation of the noise estimate is explained in Sec. II A 1. Parameter $r_1$ controls the suppression of the first formant [see Eq. (2)]. Parameter $\alpha$ adjusts the maximum distance of the poles and zeros of the formant filters $H_1(z)$ and $H_2(z)$ from the origin of the z-plane [see Eq. (1)].

A subjective listening test with 17 Finnish-speaking participants was conducted to determine optimal values for the parameters $r_1$ and $r_2$. In the test, the listeners were asked for their preferred values for six different Finnish narrowband sentences produced by three male and three female speakers. All of the speech samples were played in background noise (car noise, SNR = −5 dB). The listeners were able to listen to one sample at a time and at the same time modify the values of the two parameters by clicking on a graphical user interface. There were no restrictions on how many times the sample could be played. The aim was to obtain processed speech that would be intelligible but would not contain disturbing artifacts. Most of the listeners preferred the suppression of the first formant (average $r_1 = 0.46$) as it made the speech signal stand out better from the background noise as a result of the energy normalization. However, a few listeners did not like the way the processing changed the timbre of the speech, especially with male voices.
The strong sharpening of the second formant peak was found to cause audible artifacts such as whistling and, therefore, only slight amplification was most often preferred (average $r_2 = 0.93$).

The final part of the post-filter is a tilt filter, which is designed to compensate for the possible tilt in the spectrum of the processed speech caused by the cascade of the two formant filters. An incline at higher frequencies can make the processed speech sound unnatural. The tilt filter is a first-order all-pole filter given as

$$H_{\mathrm{TILT}}(z) = \frac{1}{1 - \mu z^{-1}}, \tag{3}$$

where $\mu$ is the coefficient of a first-order LP analysis computed from the impulse response of the cascaded formant filters.
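As a rough illustration of Eqs. (1)-(3), the following Python sketch builds the coefficients of the cascade from the parameter values quoted in the text (F1 = 500 Hz, F2 = 1600 Hz, $r_1 = 0.46$, $r_2 = 0.93$). The function names are hypothetical, and the sign convention used for $\mu$ (the negated normalized lag-one autocorrelation, so that the tilt filter opposes the tilt of the cascade) is an assumption, since the text does not spell it out.

```python
import cmath
import math

FS = 8000.0                     # narrowband sampling rate (from the text)
F1_HZ, F2_HZ = 500.0, 1600.0    # average formant locations (from the text)
R_ZERO = 0.9                    # fixed zero radius in Eq. (2)


def poly_mul(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def formant_filter(freq_hz, r, alpha, fs=FS):
    """Return (b, a) of H_i(z/alpha) in Eq. (2); (z/alpha)^-1 equals alpha*z^-1."""
    c = math.cos(2.0 * math.pi * freq_hz / fs)
    b = [1.0, -2.0 * R_ZERO * c * alpha, (R_ZERO * alpha) ** 2]
    a = [1.0, -2.0 * r * c * alpha, (r * alpha) ** 2]
    return b, a


def impulse_response(b, a, n=256):
    """First n samples of the impulse response of b/a (with a[0] == 1)."""
    h = []
    for k in range(n):
        acc = b[k] if k < len(b) else 0.0
        for m in range(1, len(a)):
            if k - m >= 0:
                acc -= a[m] * h[k - m]
        h.append(acc)
    return h


def sa_postfilter(r1=0.46, r2=0.93, alpha=1.0, fs=FS):
    """Coefficients of Eq. (1): two formant filters followed by the tilt filter."""
    b1, a1 = formant_filter(F1_HZ, r1, alpha, fs)
    b2, a2 = formant_filter(F2_HZ, r2, alpha, fs)
    b, a = poly_mul(b1, b2), poly_mul(a1, a2)
    h = impulse_response(b, a)
    energy = sum(x * x for x in h)
    lag_one = sum(x * y for x, y in zip(h[1:], h[:-1]))
    # ASSUMPTION: sign chosen so that the tilt filter counteracts the cascade tilt.
    mu = -lag_one / energy
    return b, poly_mul(a, [1.0, -mu]), mu


def gain(b, a, freq_hz, fs=FS):
    """Magnitude response |H(e^{jw})| at a single frequency."""
    z1 = cmath.exp(-2j * math.pi * freq_hz / fs)
    num = sum(bk * z1 ** k for k, bk in enumerate(b))
    den = sum(ak * z1 ** k for k, ak in enumerate(a))
    return abs(num / den)
```

Calling `sa_postfilter()` yields a denominator with six coefficients, that is, a fifth-order filter, and `gain` can be used to compare the response around F1 with the response at higher frequencies. The gain control applied after filtering is not part of this sketch.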

3. Adaptation to background noise

The parameters of the post-filter are adapted to the level of the background noise using two different methods. The first method is utilized under severe noise conditions (SNR < 0 dB), and the other in moderate noise conditions (SNR > 0 dB). The parameters introduced in Sec. II A 2 are set to correspond to an SNR level of 0 dB even though the parameters were optimized with SNR = −5 dB. This 5-dB offset in the SNR level is used to increase the effect of the post-filtering in moderate noise conditions and was motivated by informal listening which suggested that the method could be enhanced slightly.

In severe noise conditions (SNR < 0 dB), the effect of the post-filter is strengthened by increasing the suppression of the first formant. This results in increasingly more energy being transferred to higher frequencies from the region of the first formant. A stronger post-filtering effect is achieved by changing the parameter $r_1$ gradually from $r_{1,\max} = 0.46$ closer to $r_{1,\min} = 0.23$. The minimum value is used because the simple tilt filter, computed by a first-order LP analysis, is unable to compensate completely for the strong incline in the spectrum obtained with smaller values of $r_1$. The exact value of $r_{1,\min}$ was determined by estimating the tilt of the resulting post-filter at high frequencies for several values of $r_1$. This was done by fitting a regression line to the magnitude spectrum of the post-filter between 3.5 kHz and 4 kHz and setting a threshold which the slope value of the regression line was not allowed to exceed. The parameter $r_1$ is linearly interpolated between the maximum and minimum values between 0 and −10 dB. For SNRs below −10 dB, the post-filter is used with the minimum value $r_{1,\min}$. Naturally, when the coefficients of the first formant filter $H_1(z)$ change, the coefficient of the tilt filter also has to be updated.

In moderate noise conditions (SNR > 0 dB), the neutralization of the post-filter is done by moving the poles and zeros of the cascade of the two formant filters gradually closer to the origin in the z-plane using the parameter $\alpha$. The parameter is interpolated linearly between 1 and 0 when the SNR changes from 0 dB to 10 dB. The post-filter obtained at 10 dB has a nearly flat amplitude response and the effect of the processing is inaudible. As $\alpha$ is interpolated, the tilt of the cascaded formant filters changes. Therefore, the coefficient of the tilt filter must be recomputed when parameter $\alpha$ is changed.

The adaptation of the post-filter is designed to prevent sudden, excessive changes in the frequency response of the post-filter between consecutive frames. This is accomplished by smoothing the SNR estimate which is used to compute the new post-filter coefficients. Even though the SNR change could be a result of a change in the level of the speech signal and not the level of the background noise, a sudden, large change in the frequency response of the filter could cause audible artifacts. The smoothing is done by restricting the maximum change of the target SNR between consecutive frames to 2 dB. This value was found sufficient to allow rapid adaptation to changing noise conditions, while preventing audible artifacts. The frequency responses of the post-filter for three different SNRs are shown in Fig. 2.
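The per-frame adaptation logic can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: all function names are hypothetical, the weight of the noise-estimate update from Sec. II A 1 is not given in the text and is set to an arbitrary value here, and holding $\alpha$ at 0 for SNRs above 10 dB is likewise an assumption. The 2-dB step limit, the interpolation ranges, and the values 0.46 and 0.23 are taken from the text.

```python
import math

R1_MAX, R1_MIN = 0.46, 0.23   # limits of r1 (from the text)
MAX_STEP_DB = 2.0             # largest change of the target SNR per frame (from the text)


def update_noise_estimate(noise_energy, frame_energy, speech_active, weight=0.9):
    """Weighted-mean update on non-speech frames; `weight` is a hypothetical value."""
    if speech_active:
        return noise_energy
    return weight * noise_energy + (1.0 - weight) * frame_energy


def snr_db(speech_energy, noise_energy):
    """SNR estimate in dB from frame energy and noise estimate."""
    return 10.0 * math.log10(speech_energy / noise_energy)


def smooth_snr(previous_db, estimate_db, max_step=MAX_STEP_DB):
    """Move the target SNR toward the new estimate by at most max_step dB."""
    step = max(-max_step, min(max_step, estimate_db - previous_db))
    return previous_db + step


def adapt_parameters(snr):
    """Map the smoothed SNR (dB) to the pair (r1, alpha)."""
    if snr >= 0.0:
        # Moderate noise: shrink poles and zeros toward the origin.
        # ASSUMPTION: alpha stays at 0 (flat response) above 10 dB.
        return R1_MAX, max(0.0, 1.0 - snr / 10.0)
    # Severe noise: strengthen F1 suppression, saturating at -10 dB.
    t = min(1.0, -snr / 10.0)
    return R1_MAX + t * (R1_MIN - R1_MAX), 1.0
```

For example, if the previous target SNR was 10 dB and the new estimate is 0 dB, `smooth_snr` moves the target only to 8 dB, so the filter approaches its 0-dB setting over several 20-ms frames instead of jumping there at once.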

4. Adaptation of the post-filter to wideband speech

The post-filtering algorithm has been developed for narrowband speech but can also be adapted to process wideband speech. In this case, the sampling frequency is 16 kHz throughout the processing chain. The adaptation of the post-filter structure is done by changing the values of $\theta_i$ to correspond to the 16-kHz sampling frequency. Otherwise, the post-filter remains the same as in the narrowband case.
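The change of $\theta_i$ amounts to re-evaluating the standard relation between a frequency in hertz and its angle in radians per sample at the new sampling rate. The short Python sketch below shows this conversion; it assumes that the formant frequencies in hertz (500 Hz and 1600 Hz) are kept unchanged in the wideband case, which the text implies but does not state explicitly, and the function name is hypothetical.

```python
import math


def formant_angle(freq_hz, fs_hz):
    """Angle in radians per sample of a frequency given in Hz."""
    return 2.0 * math.pi * freq_hz / fs_hz


# Narrowband (8 kHz) and wideband (16 kHz) angles for F1 and F2.
for fs in (8000.0, 16000.0):
    angles = [formant_angle(f, fs) for f in (500.0, 1600.0)]
    print(fs, [round(x, 4) for x in angles])
```

Doubling the sampling frequency halves both angles, so the poles and zeros of the formant filters rotate toward the positive real axis while their radii stay the same.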


FIG. 2. Frequency responses of the post-filters used in the subjective listening tests: the SA post-filter for SNRs 20 dB, 0 dB, and −5 dB as well as the FE post-filter. Only the shapes of the amplitude responses of the post-filters are depicted in the figure. The effects of the gain control done after the filtering are not taken into account.

B. FE post-filter

Hall and Flanagan (2010) derived the FE post-filter by inverting the average amplitudes of the first two formants of adult male speakers. In other words, the first formant is suppressed and higher frequencies are amplified by the processing. The post-filter was originally designed for wideband telephone speech with a 22.05-kHz sampling frequency, but in this study it was adapted to the lower 8-kHz or 16-kHz sampling frequency by using the z transform given in the appendix of the original article (Hall and Flanagan, 2010). The frequency response of the FE post-filter is shown in Fig. 2. To facilitate comparison between the two post-filtering methods, maintaining the processing chains of the methods as similar as possible both before and after the post-filtering was desirable. In the original implementation of the FE method, the speech samples were equalized according to ITU-T standard P.56 (ITU-T, P.56, 1993) before being presented to the listeners. Here, the energy levels of the post-filtered frames were equalized to the level of the unprocessed frames with the same AGC algorithm used in the SA method.

III. EVALUATION

Subjective listening tests were arranged to evaluate the performances of the two post-filtering algorithms (SA and FE) and to compare them with unprocessed speech (UN) in both narrowband and wideband conditions. The tests consisted of a word-error rate (WER) test and a preference test. The idea of the WER test is to determine the percentage of speech the listener is not able to recognize correctly, thus providing a quantitative measure of intelligibility. The second test type was a preference test containing questions regarding the subjective estimate of intelligibility, naturalness, and preference of the processing methods under evaluation.

A. Speech material

The speech material used in the subjective tests was originally developed for the speech reception threshold test by Vainio et al. (2005). The material consists of 400 phonetically balanced sentences in Finnish which have been recorded from four speakers (two males, two females). In the following, the expression speech sample is used to refer to one sentence recorded from one of the speakers. Altogether, the test material contains 800 speech samples that have been calibrated in terms of intelligibility by Vainio et al. (2005). The calibration was done by first determining intelligibility scores for all of the samples from all speakers and then comparing the scores of individual sentences to the average intelligibility score of the entire corpus. The presentation levels of samples with intelligibility scores below the average were increased and those above the average decreased. The whole process was repeated with samples calibrated to the presentation levels obtained in the first round to enhance the accuracy of the final calibration.

Both the narrowband and wideband speech samples were obtained using the processing chain depicted in Fig. 3. First, the samples were downsampled from 48 kHz to 16 kHz and then filtered with the MSIN filter (ITU-T, G.191, 2005). The MSIN filter is a high-pass filter which is used to simulate mobile terminal input characteristics. After this, the speech samples were encoded and decoded once with the AMR codec (3GPP, TS 26.104, 2009) at a 12.2-kbit/s bitrate, and then equalized to −26 dBov with SV56 (ITU-T, G.191, 2005) based on the P.56 standard (ITU-T, P.56, 1993). The unit dBov measures decibels compared to the digital overload signal level (ITU-T Users’ Group on Software Tools, 2005). Next, the samples were processed with one of the post-filters, SA or FE, with the desired SNR used as side information. In case of the unprocessed reference condition this step was bypassed completely.
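The level equalization to −26 dBov mentioned above can be illustrated with a simplified Python sketch. It is not a substitute for SV56: that tool measures the active speech level according to P.56, whereas the sketch below uses the plain RMS level of the whole signal, which is an assumption made only to keep the example short. Samples are assumed to be normalized so that full scale is ±1.0, and the function names are hypothetical.

```python
import math


def level_dbov(samples):
    """RMS level in dBov for samples normalized to full scale (+/-1.0)."""
    mean_square = sum(x * x for x in samples) / len(samples)
    return 10.0 * math.log10(mean_square)


def scale_to_dbov(samples, target_dbov=-26.0):
    """Apply one constant gain so that the RMS level equals target_dbov."""
    gain = 10.0 ** ((target_dbov - level_dbov(samples)) / 20.0)
    return [gain * x for x in samples]


# A full-scale sine wave (10 whole periods) as a stand-in for a speech sample.
sine = [math.sin(2.0 * math.pi * k / 100.0) for k in range(1000)]
equalized = scale_to_dbov(sine)
```

With this definition a full-scale sine wave lies about 3 dB below the overload level, and after scaling its level is −26 dBov.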
Finally, the samples were equalized with SV56 according to the intelligibility calibration levels. This means that the level of each individual sample was set to −26 dBov plus the calibration term of that sample. Before the final equalization with SV56, speech signals were degraded with two different noise types, car noise and factory noise. The car noise was stationary, low-pass type noise, which is often encountered in real-life situations where mobile phones are used. The factory noise (Varga and Steeneken, 1993) is non-stationary, containing short bursts, such as sharp clangs, and it is expected to affect speech perception more severely than stationary car noise of the same intensity level. The wideband test was conducted only with factory noise, whereas in the narrowband test, both noise conditions were used. The samples were presented to the listeners using an 8-kHz and a 16-kHz sampling frequency in the narrowband and wideband conditions, respectively.

FIG. 3. Chain used to process the speech samples for the listening tests. The incoming wideband speech signal is denoted by $s_{\mathrm{WB}}$ and the noisy test signal by $s_{\mathrm{PF}}$. MSIN is a high-pass filter that simulates mobile station input, AMR refers to adaptive multi-rate coding and decoding, and SV56 is utilized to set the energy level of the sample. All of the operations are explained in more detail in Sec. III A. The blocks having a dashed outline utilize a sampling frequency of 8 kHz in the narrowband condition. Otherwise, all processing is done with a sampling frequency of 16 kHz.

B. Participants

The listening test used 32 listeners between the ages of 20 yr and 35 yr (mean = 26 yr) for the narrowband condition and 12 listeners between the ages of 21 yr and 48 yr (mean = 28 yr) for the wideband condition. They all considered themselves to have normal hearing without any noticeable hearing loss. In the narrowband test, the listeners were divided equally between the two noise conditions, 16 listeners each, whereas in the wideband test, all 12 participated in the single noise condition. All of the listeners were native speakers of Finnish, either university students or staff.

C. Test procedure

The narrowband tests were conducted in a quiet office space with Sennheiser (Wedemark, Germany) HDA 200 headphones. The headphones are closed and therefore capable of blocking environmental noise effectively. The speech samples were played with a laptop computer using the machine’s internal sound card. The wideband tests were conducted later, and they were held in a sound-proof listening booth with Sennheiser HD 650 headphones.

At the beginning of the test session, the listener was given written instructions explaining the progression of the test. After reading the instructions, the listener was asked to complete a small practice test. The listener was instructed to adjust the volume to a comfortable listening level during the practice test and the chosen volume settings were used throughout the test session.

The actual test consisted of a WER test and a preference test, both conducted at three SNR levels: 20 dB, 0 dB, and −5 dB. The SNR was progressively deteriorated from one part to the next, and each time a minimal practice test was given to help the listeners become acquainted with the change in the background noise. Each part of the WER and the preference test contained 30 samples or sample pairs. In the WER test, the speech sample was played once to the listener and he or she was asked to type the sentence on the computer. In the preference test, both of the samples in a pair were from the same speaker and contained the same sentence; only the processing of the two samples was different. The listener was able to listen to both speech samples as many times as he or she desired and was presented with the following questions:

(i) Which sample is more intelligible?
(ii) Which sample sounds more natural?
(iii) Which sample do you prefer to listen to?

The answer was given by choosing one of the options A, B, or “No preference.” The listeners were instructed to choose the “No preference” response even if they heard a difference but had no preference. The play buttons for the speech samples were marked with the corresponding letters A and B.

J. Acoust. Soc. Am., Vol. 132, No. 6, December 2012

The WER score was calculated by comparing the typed answer with the original sentence. Vainio et al. (2005) suggested that for some words multiple forms could be accepted because of their similarity in Finnish. For instance, the present and the past tense of verbs are often almost identical: “puhuu/puhui” (“speaks/spoke”) or “löytyy/löytyi” (“is found/was found”). Thus, some words had multiple correct variants that were easily confused but did not change the overall meaning of the statement. Additionally, due to the inflected nature of Finnish, in some cases the stem and the suffix of an inflected word were scored separately. Half of the word score was given for the correct stem and the other half for the correct suffix. The existence of extraneous incorrect words did not affect the scoring, but the ordering of the correct words in the typed answer had to be right in order to achieve a full score.

The three processing types under comparison (FE, SA, and UN) were each evaluated ten times in the WER test in each part, half of them for female and half for male speakers. In the preference test, nine processing pairs were compared in all: six pairs with different processing types in different order and three pairs where both processing types were the same, i.e., control pairs. Each part of the test had six control pairs and, as a result, each of the three distinct control pairs was once presented with a male and once with a female speaker. The remaining 24 pairs of each part contained samples with different processing types. They were divided so that each of the six distinct pairs was presented once with each of the four speakers. The presentation order of the sample pairs was completely randomized as were also the sentences used for each pair. The only restriction was that all of the sentences of one listener were unique so as to avoid learning effects.

D. Data analysis
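As a rough illustration of the word-scoring rules of the WER test (accepted variants, half credit for a correct stem or suffix, no penalty for extra words, and order-sensitive matching), the following Python sketch scores one typed answer. It is not the authors' scoring tool: the `RefWord` structure, the prefix/suffix test used for half credit, the order-preserving alignment, and the example sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefWord:
    """One word of the reference sentence (hypothetical structure)."""
    variants: frozenset  # accepted full forms, e.g. {"puhuu", "puhui"}
    stem: str = ""       # optional stem, worth half of the word score
    suffix: str = ""     # optional suffix, worth the other half


def word_credit(ref: RefWord, typed: str) -> float:
    """Credit for one typed word against one reference word: 0, 0.5, or 1."""
    if typed in ref.variants:
        return 1.0
    if ref.stem and typed.startswith(ref.stem):
        return 0.5  # assumed test for "correct stem, wrong suffix"
    if ref.suffix and typed.endswith(ref.suffix):
        return 0.5  # assumed test for "correct suffix, wrong stem"
    return 0.0


def word_error_rate(reference: list, typed: list) -> float:
    """1 - (best order-preserving credit) / (number of reference words).

    Extra typed words are skipped at no cost; words typed in the wrong
    order cannot all be matched, so a reordered answer loses credit.
    """
    n, m = len(reference), len(typed)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = best[i - 1][j - 1] + word_credit(reference[i - 1], typed[j - 1])
            best[i][j] = max(best[i - 1][j], best[i][j - 1], match)
    return 1.0 - best[n][m] / n


# Made-up three-word reference sentence, used only for this example.
reference = [
    RefWord(frozenset({"mies"})),
    RefWord(frozenset({"puhuu", "puhui"})),
    RefWord(frozenset({"talossa"}), stem="talo", suffix="ssa"),
]
print(round(word_error_rate(reference, "mies puhui talosta".split()), 3))  # 0.167
```

In this example the tense variant earns full credit and the wrongly inflected last word earns half credit, so 2.5 of 3 words are scored as correct.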

All obvious spelling and typing errors found in the answers of the WER test were corrected before the analysis was conducted on the scores. In addition, the scores of the preference test were checked for consistency using the scores given to the control pairs. This was done by computing the average absolute null-pair score over all the listeners in the condition and then comparing the individual averages to that score. All of the listeners were found to be sufficiently consistent in their ratings. The distributions of the responses in the preference tests are shown in Fig. 4. The mean values of the WER and the preference scores across different experimental conditions were compared with a repeated measures analysis of variance (ANOVA). The degrees of freedom (and, thus, the p values) of the ANOVA effects were corrected with the Greenhouse-Geisser epsilon when appropriate. Pairwise post hoc comparisons between mean values were performed with Tukey’s honestly significant difference (HSD) tests. In Sec. IV, all statistically significant ANOVA effects are shown.

IV. RESULTS

A. Narrowband test
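The Greenhouse-Geisser correction named in the data analysis leaves the F statistic unchanged and multiplies both of its degrees of freedom by the estimated epsilon before the p value is looked up. The short Python sketch below shows only that arithmetic; the function name is hypothetical, and the example values (F(2,60) with an epsilon of 0.68) are those reported in this section for the effect of SNR on the intelligibility summary score.

```python
def greenhouse_geisser_df(df_effect: float, df_error: float, epsilon: float):
    """Return the epsilon-corrected (effect, error) degrees of freedom."""
    if not 0.0 < epsilon <= 1.0:
        raise ValueError("epsilon must lie in (0, 1]")
    return epsilon * df_effect, epsilon * df_error


df1, df2 = greenhouse_geisser_df(2, 60, 0.68)
print(f"F({df1:.2f}, {df2:.2f})")  # F(1.36, 40.80)
```

An epsilon of 1 leaves the test unchanged; smaller values shrink both degrees of freedom and make the test more conservative.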

The mean error rates from the WER test are presented in Fig. 5. They were analyzed with an ANOVA where factors “SNR” (0 dB and −5 dB), “method” (FE, SA, and UN), and “speaker gender” (male and female) were included along with one categorical predictor, “noise type” (car and factory noise). Because the WER scores at 20 dB are saturated, they were dropped from the statistical analysis, and it was only conducted with two SNR levels: 0 dB and −5 dB. WER scores depended on the noise type [F(1,158) = 578.70, p < 0.001], SNR [F(1,158) = 472.33, p < 0.001], and on the method [F(2,316) = 60.0, p < 0.001]. Interactions were found between SNR and noise type [F(1,158) = 114.72, p < 0.001], SNR and method [F(2,316) = 6.65, p < 0.001], as well as between noise type, SNR, and method [F(2,316) = 4.90, p < 0.01]. The error rate was lower in the case of car noise (WER = 11%) as opposed to factory noise (WER = 48%). The WER score also decreased with increasing SNR (p values for all post hoc comparisons < 0.001) so that scores 43% and 17% were obtained for SNR of −5 dB and 0 dB, respectively. The effect of SNR was more pronounced in the case of factory noise than in the case of car noise. Of the three speech processing methods, SA received the lowest error rate (WER = 23%) followed by FE (WER = 28%), and UN (WER = 38%). While on the average these differences were statistically significant (p values < 0.01), the differences were observed only when the SNR was reduced to 0 dB and −5 dB (factory noise) or to −5 dB (car noise). Furthermore, the difference in the WER score between FE and SA was less pronounced than the difference in the WER score between UN and the other methods; in the detailed analysis the difference between SA and FE was statistically significant only in the case of factory noise with SNR = 0 dB.

FIG. 4. Distribution of the responses in the preference tests. The percentage of times that each of the methods under comparison (FE, SA, and UN) and the “No preference” response were selected in all of the pairwise comparisons. The results are shown for the narrowband (NB) and wideband (WB) speech conditions with car and factory noise (NB car noise, NB factory noise, and WB factory noise) and for all SNR levels (20 dB, 0 dB, and −5 dB). The different questions in each of the conditions are indicated with the abbreviations INT (intelligibility), NAT (naturalness), and PRE (preference).

FIG. 5. Results of the narrowband WER tests for both noise types and all speakers shown for three SNR values (20 dB, 0 dB, and −5 dB). The mean WER scores along with the standard errors of the means are shown.

The summary scores for intelligibility, naturalness, and preference were calculated as the percentage of times that a given type of processing was preferred in the course of all comparisons. The mean values of the scores along with the standard errors of the means are depicted in Figs. 6–8 for intelligibility, naturalness, and preference, respectively. The means were compared with a repeated measures ANOVA where three factors “SNR” (20 dB, 0 dB, and −5 dB), “method” (FE, SA, and UN), and “speaker gender” (male and female) as well as a categorical predictor, “noise type” (car and factory noise), were included.

FIG. 6. Summary scores for intelligibility in the narrowband preference test shown for three SNR values (20 dB, 0 dB, and −5 dB). The results are aggregated across male and female speakers. The summary score was calculated as the percentage of times that a given type of speech was preferred in the course of all comparisons. The means and the standard errors of the means are shown.

FIG. 7. Summary scores for naturalness in the narrowband preference test shown for three SNR values (20 dB, 0 dB, and −5 dB). The summary score was calculated as the percentage of times that a given type of speech was preferred in the course of all comparisons. The means and the standard errors of the means are shown.

FIG. 8. Summary scores for preference in the narrowband preference test shown for three SNR values (20 dB, 0 dB, and −5 dB). The summary score was calculated as the percentage of times that a given type of speech was preferred in the course of all comparisons. The means and the standard errors of the means are shown.

The summary score for intelligibility depended on the SNR [F(2,60) = 5.75, ε = 0.68, p < 0.05], method [F(2,60) = 162.62, p < 0.001], and on the interaction between these two variables [F(4,120) = 69.48, p < 0.001]. In the case of SNR = 20 dB, FE received the highest intelligibility scores (p values < 0.001). For SNR = 0 dB, both FE and SA received higher scores than UN (p values < 0.001). Finally, in the case of SNR = −5 dB, the methods could be ranked with respect to intelligibility scores from highest to lowest in the following order: SA, FE, and UN (p values < 0.001). The intelligibility did not depend on the noise type [F(1,30) = 2.15, p > 0.05].

The summary score for naturalness depended on the SNR [F(2,60) = 3.70, ε = 0.76, p < 0.05] and on the interactions between SNR and method [F(4,120) = 6.10, ε = 0.61, p < 0.01], between method and speaker gender [F(2,60) = 6.18, p < 0.01], and between SNR, method, and speaker gender [F(4,120) = 6.58, p < 0.001]. Also noise type had an effect on naturalness [F(1,30) = 5.44, p < 0.05]; the naturalness scores were higher in the case of car noise (33.6%) than in the case of factory noise (27.6%), but otherwise similar naturalness scores were obtained for both noise types. When the SNR was 20 dB, the FE method received higher naturalness scores than the other methods (p values < 0.001). This effect was more pronounced in the case of female speakers than in the case of male speakers. In the case of female speakers, the SA method also received higher naturalness scores than the other methods when the SNR was 0 dB or −5 dB.

The summary preference score depended on the method [F(2,60) = 60.51, p < 0.001]. Also, interactions were found between SNR and noise type [F(2,60) = 4.36, ε = 0.86, p < 0.05], SNR and method [F(4,120) = 20.94, p < 0.001], method and speaker gender [F(2,60) = 6.65, ε = 0.89, p < 0.01], as well as between SNR, method, and speaker gender [F(4,120) = 6.03, p < 0.001]. Although the preference scores were not entirely consistent between the male and female speakers, the dependency of preference on the SNR and on the method was, on average, very similar to that observed in the case of the intelligibility summary score. In fact, the correlation between intelligibility and preference scores was fairly high (r = 0.79). In the case of SNR = 20 dB, FE was preferred more often than the other methods (p values < 0.001). In the case of SNR = 0 dB, both FE and SA were preferred more often than UN (p values < 0.001). Finally, in the case of SNR = −5 dB, the methods could be ranked with respect to preference scores from highest to lowest in the following order: SA, FE, and UN (p values < 0.001). These contrasts appeared to be more prominent in the female speaker data than in the case of male speakers.
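The summary scores discussed in this subsection are described as the percentage of times that a given type of processing was preferred in the course of all comparisons. The Python sketch below shows one way such a percentage could be computed from raw responses. Several details are assumptions rather than statements from the source: the `(method_a, method_b, choice)` trial layout, the response labels `"A"`, `"B"`, and `"none"`, the exclusion of control pairs, and the use of the number of comparisons in which a method took part as the denominator.

```python
from collections import Counter


def summary_scores(trials):
    """Percentage of comparisons in which each method was chosen.

    trials: iterable of (method_a, method_b, choice), where choice is
    "A", "B", or "none" (hypothetical encoding of the listener's answer).
    """
    wins, appearances = Counter(), Counter()
    for method_a, method_b, choice in trials:
        if method_a == method_b:
            continue  # assumption: control pairs do not enter the score
        appearances[method_a] += 1
        appearances[method_b] += 1
        if choice == "A":
            wins[method_a] += 1
        elif choice == "B":
            wins[method_b] += 1
    return {m: 100.0 * wins[m] / appearances[m] for m in appearances}


# Made-up responses for one question (e.g., intelligibility).
example_trials = [
    ("UN", "SA", "B"),
    ("SA", "UN", "A"),
    ("UN", "FE", "B"),
    ("FE", "SA", "none"),
    ("FE", "FE", "A"),  # control pair, ignored
]
scores = summary_scores(example_trials)
for method in sorted(scores):
    print(method, round(scores[method], 1))
```

A "No preference" response counts as an appearance for both methods but as a win for neither, which is why the scores of the three methods need not sum to 100%.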
In addition to the summary scores, the data were also analyzed in more detail in terms of the original pairwise comparisons between the methods. These pairwise comparisons were UN vs SA, UN vs FE, and SA vs FE. The intelligibility, naturalness, and preference ratings were compared against the scores obtained from the control pairs. If the comparison score differed from this reference score, one of the speech samples in the pair could be stated to have been rated consistently higher in intelligibility, naturalness, or preference. All comparisons between identical samples were aggregated into a single variable. Therefore, the ANOVA could be performed using only a single factor: “condition” (19 levels comprising all the combinations of SNR, method, and speaker gender as well as the condition with identical samples). The noise type (car and factory noise) was added as a categorical predictor. The intelligibility score in these original pairwise comparisons depended on condition [F(1,18) = 43.64, p < 0.001] and on the interaction between condition and noise type [F(1,18) = 2.56, ε = 0.41, p < 0.05]. The pairs where the difference of the comparison score from the reference score was statistically significant are shown in Tables I and II for car and factory noise, respectively. The pairwise intelligibility scores indicate that the SA and FE methods were rated more intelligible than unprocessed speech.
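The pairwise ratings are reported on a scale from −1 (former method chosen every time) to +1 (latter method chosen every time) and are compared against a reference obtained from the control pairs. One plausible way to obtain such numbers, assumed here rather than taken from the source, is to code each response as −1, 0, or +1 and average the codes. The trial layout and the response labels in the Python sketch below are hypothetical, and the statistical comparison itself (ANOVA with Tukey's HSD) is not reproduced.

```python
def pairwise_score(trials, former, latter):
    """Mean response code for one method pair, in [-1, 1].

    +1: latter method chosen, -1: former method chosen, 0: no preference.
    Both presentation orders (former as A or as B) are pooled.
    """
    codes = []
    for method_a, method_b, choice in trials:
        if sorted((method_a, method_b)) != sorted((former, latter)):
            continue
        if choice == "none":
            codes.append(0)
        else:
            chosen = method_a if choice == "A" else method_b
            codes.append(1 if chosen == latter else -1)
    if not codes:
        raise ValueError("no trials for this pair")
    return sum(codes) / len(codes)


def control_score(trials):
    """Reference score from identical-sample pairs, coded by position."""
    position_code = {"A": -1, "none": 0, "B": 1}
    codes = [position_code[c] for a, b, c in trials if a == b]
    if not codes:
        raise ValueError("no control pairs")
    return sum(codes) / len(codes)


# Made-up responses: four UN/SA comparisons and two control pairs.
example_pairs = [
    ("UN", "SA", "B"), ("SA", "UN", "A"), ("UN", "SA", "none"),
    ("SA", "UN", "A"), ("FE", "FE", "none"), ("UN", "UN", "B"),
]
print(pairwise_score(example_pairs, "UN", "SA"))  # 0.75
print(control_score(example_pairs))               # 0.5
```

A pair would then be reported only if its score differs reliably from the control reference, which guards against listeners who favor one button regardless of content.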


TABLE I. Intelligibility rating in pairwise comparisons for car noise in the narrowband condition. Values near +1 indicate maximal intelligibility for the latter method and values near −1 indicate maximal intelligibility for the former method. Values near zero indicate that neither of the methods was considered more intelligible than the other. All pairs with statistically significant differences are shown. The p values are calculated with Tukey’s HSD post hoc test.

Condition                   Intelligibility score [−1, 1]   p
UN vs SA, −5 dB, female     0.92
UN vs SA, 0 dB, female      0.88
UN vs SA, 0 dB, male        0.81
UN vs SA, −5 dB, male       0.78
UN vs FE, 0 dB, female      0.75
UN vs FE, −5 dB, male       0.72
UN vs FE, −5 dB, female     0.72
UN vs FE, 0 dB, male        0.66
UN vs FE, 20 dB, female     0.53
SA vs FE, 20 dB, female     0.53

TABLE III. Preference rating in pairwise comparisons in the narrowband condition. The results are aggregated across noise types (car and factory noise). Values near +1 indicate preference for the latter method and values near −1 indicate preference for the former method. Values near zero indicate that neither of the methods was preferred over the other. All pairs where one of the methods was consistently preferred against the other are shown. The p values are calculated with Tukey’s HSD post hoc test.

Condition
