Perceptually Inspired Noise-Reduction Method for Binaural Hearing Aids

Jorge I. Marin-Hurtado, Member, IEEE, Devangi N. Parikh, Student Member, IEEE, and David V. Anderson, Senior Member, IEEE
Abstract—Different noise-reduction methods have been proposed in the literature for single- and multiple-microphone applications. For binaural hearing aids, multiple-microphone noise-reduction methods offer two significant psycho-acoustical advantages over single-microphone methods. First, noise-reduction strategies based on binaural processing (processing that uses the information received at the left and right ears) are more effective than independent monaural processing because of the added information. Second, users prefer noise-reduction methods that preserve the localization cues of both the target and interfering signals. Although different multiple-microphone noise-reduction techniques have been proposed in the literature, only a small set is able to preserve the localization cues for both target and interfering signals. This paper proposes a binaural noise-reduction method that preserves the localization cues for both target and interfering signals. The proposed method is based on blind source separation (BSS) followed by a postprocessing technique inspired by a human auditory model. The performance of the proposed method is analyzed using objective and subjective measurements and compared to existing binaural noise-reduction methods based on BSS and the multichannel Wiener filter (MWF). Results show that, for some scenarios and conditions, the proposed method outperforms the existing methods on average in terms of noise reduction and provides nearly similar sound quality.

Index Terms—Array signal processing, blind source separation (BSS), hearing aids, speech enhancement.

Manuscript received May 27, 2011; revised October 07, 2011 and November 21, 2011; accepted November 21, 2011. Date of publication December 09, 2011; date of current version February 24, 2012. This work was supported by Texas Instruments, Inc. (formerly National Semiconductor Corporation). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Sharon Gannot. J. I. Marin-Hurtado is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA, and also with the Department of Electronics Engineering, Universidad del Quindio, Armenia-Quindio, Colombia (e-mail: [email protected]; [email protected]). D. N. Parikh and D. V. Anderson are with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TASL.2011.2179295
I. INTRODUCTION
A binaural hearing aid consists of a hearing aid placed on each ear and a wireless link to exchange information between the two devices. This arrangement provides several user benefits that have been identified through psycho-acoustic studies of hearing perception in hostile environments (e.g., babble noise). These psycho-acoustic studies have identified a user preference for binaural noise-reduction strategies and the relevance of the
preservation of localization cues for target identification and speech intelligibility [1], [2]. These two facts are important for the design of hearing aids that deal with hearing losses present in both ears, i.e., a binaural hearing loss. A typical approach to dealing with binaural hearing losses is to use independent monaural hearing aids at each ear. However, the lack of synchronization between the two hearing aids may lead to a loss of localization cues, which has been recognized as perceptually annoying [2]–[4]. Hence, binaural hearing aids require multi-microphone noise-reduction methods that are able to reduce the background noise and preserve the original direction of arrival of both the target and interfering signals. These algorithms are the main focus of this paper.

Most noise-reduction methods for hearing aids have been designed to enhance only a target signal coming from the front. However, in the last decade, different binaural techniques have been proposed for the enhancement of a target signal arriving from an arbitrary direction. These binaural noise-reduction strategies are based on scene analysis [5], [6], spectral subtraction [7], statistical methods [8], beamforming [9], [10], multichannel Wiener filtering (MWF) [11]–[14], and blind source separation (BSS) [15]–[17]. The majority of the reported binaural noise-reduction techniques preserve the localization cues for the target signal, but only a few of them preserve the localization cues for the interfering signals [9], [10], [12], [13], [16], [17].

In [9], Lotter and Vary proposed a postprocessing method to recover the localization cues at the output of a minimum variance distortionless response (MVDR) beamformer. Rohdenburg et al. [10] improved Lotter and Vary's algorithm by incorporating the ability to track moving target signals. Lotter and Vary [9] and Rohdenburg et al. [10] showed that their approaches preserve the localization cues for both the target and interfering signals. On the other hand, Doclo et al. [12] proposed an MWF technique that preserves the localization cues for the target and interfering signals. In this case, the authors added an extra term to the cost function to preserve the interaural time difference (ITD), an important cue used by the human auditory system for localization. Since the MWF method proposed in [12] is computationally expensive, Klasen et al. [13] proposed a simplification known as MWF-N (MWF with partial noise). The ability of MWF-N to preserve the localization cues for the target and interfering signals simultaneously has been demonstrated through subjective tests [18] and theoretical analysis [19]. Other binaural noise-reduction methods that preserve the localization cues are based on blind source separation (BSS) [16], [17]. These BSS-based methods differ in the postprocessing used to recover the localization
cues. Whereas the method in [16] uses two adaptive filters (Aichner-07), one for the left ear and the other for the right ear, the method in [17] uses a Wiener filter (Reindl-10). Reindl et al. showed that Aichner-07 works only in the determined case, i.e., when the number of interfering signals is lower than the number of microphones [17]. Hence, Aichner-07 is unable to provide benefits for complex interfering signals such as babble noise [17]. Moreover, Aichner-07 is claimed to preserve the localization cues for both target and interfering signals; however, different tests conducted in this work (Section IV-C and Appendix B) show that this claim holds only in the determined case. These limitations are overcome by Reindl-10.

Some limitations of the above methods make them less practical for a binaural hearing aid. First, the SNR improvement achieved by the beamforming techniques in [9] and [10] is very small compared to that of MWF-N [20], and a comparison between MWF-N and a binaural BSS-based noise-reduction method has not been published yet. Second, MWF-N and Reindl-10 use block processing with a large frame length, which entails a long processing delay. For a hearing aid, processing delay is a critical parameter.

This paper proposes an alternative approach that uses BSS and perceptual postprocessing to provide binaural noise reduction while preserving the localization cues for both target and interfering signals. Moreover, the proposed method can be implemented on a sample-by-sample basis, reducing the processing delay. In the proposed method, BSS is used to obtain estimates of the target and interfering signals, and these estimates are employed by the perceptual postprocessing to compute the gains that are applied to the original unprocessed signals. The perceptual postprocessing discussed in [21] is modified so that it can be used in a binaural hearing aid. The proposed processing was introduced in [22] and validated exclusively under non-reverberant scenarios. Experimental evidence of the preservation of localization cues for both target and interfering signals is also discussed in [22]. As an extension of the ideas presented in [22], this paper presents a detailed analysis of the algorithm parameters that control the tradeoff between noise reduction and sound quality, a formal proof of the preservation of the localization cues, and an exhaustive validation over a wide range of scenarios that includes both reverberant and non-reverberant environments. In this paper, the proposed method is shown to outperform existing BSS-based methods (Aichner-07 and Reindl-10) and MWF-N in terms of noise reduction; it correctly preserves the localization cues of both target and interfering signals; and its output sound quality is comparable to that of the existing methods.

This paper is organized as follows. Section II describes the proposed method. Section III presents the scenarios and metrics used to verify and compare the performance of the proposed method with the existing methods. Section IV presents and discusses the performance results under non-reverberant and reverberant scenarios, together with a subjective test. Finally, Section V provides a summary, and the Appendix gives a proof that the proposed method preserves the localization cues.

II. PROPOSED METHOD

The binaural noise-reduction method proposed in this paper is a binaural extension of the BSS postprocessing proposed by Parikh and Anderson [21].
Fig. 1. Block diagram of the proposed method.

In [21], the postprocessing, inspired by an auditory perceptual model, uses an auditory filter-bank to analyze the outputs of a two-channel BSS algorithm. For a given sub-band, the envelopes of the primary and secondary channels are used to calculate a noise-suppression gain. These gains are applied to the primary BSS output to lower the noise floor by expanding the dynamic range of the sub-band, ensuring a perceptual removal of the noise. Finally, the outputs of all bands are combined to produce the final output. To obtain a practical method for binaural hearing aids, the following modifications are introduced.

1) To recover the localization cues, the time-domain gains obtained by the BSS and perceptual postprocessing algorithm are applied to the unprocessed signals received at each side (Fig. 1). Applying the same gains to both sides ensures that the original interaural time differences (ITDs) and interaural level differences (ILDs) remain unmodified in the enhanced signals, and hence the localization cues are preserved.

2) To achieve low processing delay, the noise-reduction gains and the output signal are computed on a sample-by-sample basis, while the parameters used to estimate these gains are updated on a frame-by-frame basis. In the original paper [21], these quantities are computed assuming prior knowledge of the entire signal.

3) To minimize artifacts and to achieve better output quality, a long-term history of the maximum values of the primary envelope is used instead of knowledge of the entire signal.

4) The algorithm parameters are computed by a noise power spectral density (PSD) estimator based on the envelopes of the primary and secondary channels.

The block diagram proposed for binaural noise reduction is shown in Fig. 1. The signals received at the left, x_L(n), and right, x_R(n), microphones are passed through a BSS algorithm to obtain the outputs y_1(n) and y_2(n). An output-selection algorithm identifies which BSS output contains the "unmixed" target signal y_p(n), or primary channel, and which the "unmixed" interfering signal y_s(n), or secondary channel. These outputs, y_p(n) and y_s(n), are analyzed using a constant-Q filter-bank, and the envelope of each sub-band is then extracted. These envelopes are used to estimate the signal-to-noise ratio (SNR) and, from it, the noise-suppression gains for each sub-band. These gains are finally applied simultaneously to the unprocessed signals by time-domain multiplication, and the outputs of all sub-bands are summed together to produce the outputs for the left and right ears.
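The overall signal flow of Fig. 1 can be summarized in pseudocode. The sketch below is illustrative only: the helper objects (bss, selector, filterbank, gain_estimator) stand in for the blocks described in Sections II-A through II-D, and their interfaces are assumptions rather than an API defined by the paper.

```python
import numpy as np

def binaural_noise_reduction(x_left, x_right, filterbank, bss, selector,
                             gain_estimator, frame_len):
    """Skeleton of the block diagram in Fig. 1 (names are illustrative).

    Heavy parameter estimation happens once per frame of `frame_len`
    samples; the output itself is just per-band multiplication and
    summation, mirroring the sample-by-sample / frame-by-frame split.
    """
    n_samples = len(x_left)
    out_left = np.zeros(n_samples)
    out_right = np.zeros(n_samples)

    for start in range(0, n_samples, frame_len):
        stop = min(start + frame_len, n_samples)
        xl, xr = x_left[start:stop], x_right[start:stop]

        # 1) Blind source separation of the two microphone signals.
        y1, y2 = bss.process(xl, xr)

        # 2) Pick the BSS output holding the target (primary channel).
        y_p, y_s = selector.select(y1, y2)

        # 3) Analyze primary/secondary channels with the auditory
        #    filter-bank and refresh the per-band gains once per frame.
        gains = gain_estimator.update(filterbank.analyze(y_p),
                                      filterbank.analyze(y_s))

        # 4) Apply the SAME per-band gains to the unprocessed left and
        #    right signals so ITD/ILD cues are untouched, then recombine.
        out_left[start:stop] = np.sum(gains * filterbank.analyze(xl), axis=0)
        out_right[start:stop] = np.sum(gains * filterbank.analyze(xr), axis=0)

    return out_left, out_right
```

The key design point visible here is step 4: the gains derived from the BSS outputs never touch the interaural relationship between the two unprocessed microphone signals.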
The noise-suppression gains are computed to expand the dynamic range of the noisy signal in such a way that the maximum signal level is maintained while the noise level is pushed down. The maximum signal level, which tracks the envelope of the target speech, is estimated from the primary channel, and the noise level from the secondary channel. The theoretical analysis conducted in [23] shows that an ICA-based BSS algorithm¹ provides an accurate noise estimate for non-point-source noise scenarios (e.g., diffusive or babble noise). Therefore, the performance of the proposed method under these scenarios is expected to be high. In addition, since the proposed algorithm tracks the envelopes of the target speech and the noise level simultaneously, good performance is expected under highly nonstationary environments. On the other hand, when the interfering signals are a few point sources, the BSS algorithm can provide an accurate noise estimate only if the target signal is dominant. Thus, the performance of the proposed algorithm is expected to be low under these scenarios at very low input SNR. Fortunately, this kind of scenario is uncommon. All of the above statements are verified through the experiments discussed in Section IV. Specific details about the algorithm are given in the following sections.

¹ICA stands for independent component analysis. The BSS algorithm used in this paper belongs to this category.
A. Blind Source Separation

An info-max BSS algorithm was chosen for all our experiments because of its low computational complexity and short processing delay. In the info-max method used in this paper [22], adaptive FIR filters minimize the mutual information of the system outputs. To achieve this goal, the filter weights are computed based on knowledge of the cumulative density function (CDF) of the target signal. For the present application, the target signal is speech; therefore, the signal CDF can be modeled by a hyperbolic tangent function. Thus, the BSS block of Fig. 1 is described by

y_1(n) = x_L(n) + w_12^T(n) y̅_2(n)   (1)
y_2(n) = x_R(n) + w_21^T(n) y̅_1(n)   (2)
w_12(n+1) = w_12(n) − μ tanh(y_1(n)) y̅_2(n)   (3)
w_21(n+1) = w_21(n) − μ tanh(y_2(n)) y̅_1(n)   (4)

where x_L(n) and x_R(n) are the signals received at the left and right microphones, w_12(n) and w_21(n) are column vectors of length L describing the unmixing filter coefficients, y̅_1(n) and y̅_2(n) are column vectors of length L whose elements are the previous outputs of the BSS algorithm, y̅_i(n) = [y_i(n−1), ..., y_i(n−L)]^T, μ is the adaptation step size, and n is the time index. The same filter length L (at a 22-kHz sampling rate) and step size μ were used for all our experiments.

Modern hearing aids may include two or three microphones per device. Although the source separation could be improved by using these extra microphones, transmitting their signals increases the computational complexity and the transmission bandwidth, and hence the power consumption. Moreover, on a single hearing aid, the microphone separation is usually very small (less than 1 cm), so only a very small benefit is expected in the source separation performed by the BSS algorithm. Hence, the proposed solution, using only one microphone per hearing aid, is a practical way to keep the computational cost, transmission bandwidth, and power consumption low.
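A minimal sketch of a feedback info-max update in the spirit of (1)–(4) follows; the step size, sign convention, history layout, and initialization are assumptions chosen for illustration, not the paper's exact implementation.

```python
import numpy as np

def infomax_bss_step(xl_n, xr_n, w12, w21, y1_hist, y2_hist, mu=1e-4):
    """One sample of a feedback info-max BSS update (illustrative).

    w12/w21 are the length-L cross-channel unmixing filters and
    y1_hist/y2_hist hold the L previous outputs (newest first).
    """
    y1 = xl_n + w12 @ y2_hist            # Eq. (1): output 1
    y2 = xr_n + w21 @ y1_hist            # Eq. (2): output 2
    # Eqs. (3)-(4): weight updates; tanh models the speech CDF.
    w12 -= mu * np.tanh(y1) * y2_hist
    w21 -= mu * np.tanh(y2) * y1_hist
    # Shift the output histories (newest sample in front).
    y1_hist = np.concatenate(([y1], y1_hist[:-1]))
    y2_hist = np.concatenate(([y2], y2_hist[:-1]))
    return y1, y2, w12, w21, y1_hist, y2_hist
```

The per-sample cost is two length-L dot products plus two length-L weight updates, which is what makes this structure attractive for the low-delay requirement discussed in Section I.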
B. Output Selector

In the BSS algorithm, the content of the BSS outputs y_1(n) and y_2(n) depends on the direction of arrival of the target signal. If the target signal is closer to the left microphone, the output y_1(n) will hold the unmixed target signal and y_2(n) the unmixed interfering signal, and vice versa. Hence, it is necessary to detect the BSS output that contains the target signal. To avoid the use of a direction-of-arrival or permutation algorithm, which are computationally expensive, a simple approach based on comparing the long-term energy of the envelopes of the signals y_1(n) and y_2(n) is used to identify the BSS output holding the target signal. This update takes place every N samples. The selection of N is discussed in Section III-B. The envelopes of the outputs y_1(n) and y_2(n) are detected by

e_i(n) = (1 − β) e_i(n−1) + β |y_i(n)|,   i = 1, 2   (5)

where β is a time constant whose value was found empirically to provide good performance in our experiments. These envelopes are processed in non-overlapping frames of length N. For each kth frame, the frame energy is computed and averaged using a first-order estimator. Then, the output with the higher time-averaged energy is selected as the primary output y_p(n).

C. Filter-Bank and Envelope Detectors

Let x_L(k) and x_R(k) be the vectors of length N corresponding to the time-domain unprocessed input signals at the left and right microphones, respectively, at frame index k. These signals, along with the outputs of the BSS algorithm, y_p(n) and y_s(n), are passed through a filter-bank that resembles the auditory system. At a 22-kHz sampling rate, the signal is decomposed into 24 one-third-octave sub-bands using fourth-order Butterworth filters. At the outputs of the filter-banks, the signals x_{L,j}(k), x_{R,j}(k), y_{p,j}(k), and y_{s,j}(k) are obtained, where k corresponds to the frame index and j to the sub-band number. For each filter-bank output, the envelope is extracted using a full-wave rectifier followed by a low-pass filter. In particular, the primary envelope E_{p,j}(k) is extracted from y_{p,j}(k), and the secondary envelope E_{s,j}(k) from y_{s,j}(k). The low-pass filters are implemented using a first-order IIR filter whose cutoff frequency is selected to be a fraction of the bandwidth of the corresponding band [24]. These cutoff frequencies were set to 1/5, 1/8, and 1/15 of the bandwidth of the low-, medium-, and high-frequency bands, respectively. These fractions ensure that the envelope tracks the signal closely but does not change so rapidly that it causes the gain to change rapidly. In addition, these bandwidths also ensure that the localization cues are preserved, as will be shown in Appendix A.
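The filter-bank and envelope detectors of this section can be sketched as follows. The lowest center frequency and the band-edge convention are assumptions, since the paper only specifies 24 one-third-octave bands, fourth-order Butterworth filters, a 22-kHz sampling rate, and the 1/5 to 1/15 cutoff fractions.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def third_octave_filterbank(fs=22000, n_bands=24, f0=40.0):
    """Fourth-order Butterworth one-third-octave band-pass bank
    (Sec. II-C). f0, the lowest center frequency, is an assumption."""
    sos_bank = []
    for j in range(n_bands):
        fc = f0 * 2.0 ** (j / 3.0)                 # one-third-octave spacing
        lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)
        sos_bank.append(butter(4, [lo, hi], btype="bandpass",
                               fs=fs, output="sos"))
    return sos_bank

def analyze(x, sos_bank):
    """Split a signal into sub-band signals, shape (n_bands, len(x))."""
    return np.stack([sosfilt(sos, x) for sos in sos_bank])

def band_envelope(band_signal, fs, band_width, fraction=8):
    """Full-wave rectifier + first-order IIR low-pass envelope detector.
    `fraction` is 5, 8, or 15 depending on the band (Sec. II-C)."""
    fc = band_width / fraction
    a = np.exp(-2 * np.pi * fc / fs)               # one-pole smoother
    env = np.empty_like(band_signal)
    state = 0.0
    for i, v in enumerate(np.abs(band_signal)):    # full-wave rectification
        state = a * state + (1 - a) * v
        env[i] = state
    return env
```

The slow, band-proportional envelope cutoffs are not just a smoothing choice; they are what bounds the bandwidth of the gain signal, which Appendix A uses to prove cue preservation.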
D. Gain Computation

The final outputs at the left, z_L(k), and right, z_R(k), sides are computed using the time-domain gains produced by the perceptual postprocessing stage:

z_c(k) = Σ_j g_j(k) ⊙ x_{c,j}(k),   c ∈ {L, R}   (6)

where x_{c,j}(k) is the filter-bank output at the jth sub-band and ⊙ denotes the element-wise product. The above equation is written in vector form to be consistent with the mathematical description and to emphasize that the gains are computed using parameters updated on a frame-by-frame basis. However, the gains and output values can also be computed on a sample-by-sample basis, as will be shown later.

In [24], a method inspired by a perceptual model is used to estimate these multiplicative gains. In this method, the gain modifies the envelope of each sub-band such that Ê_j = g_j E_j. To provide noise reduction, the gains are chosen to expand the dynamic range of the signal. In this expansion, the maximum envelope value, assumed to correspond to the target signal, is preserved, while the minimum envelope value, which corresponds to the background noise, is lowered to push the noise level below a desired level. In other words,

Ê_j = E_{max,j} (E_j / E_{max,j})^{p_j}   (7)

where p_j is the expansion coefficient. In the model presented in [24], the gain g_j and the parameters that define it are given by

g_j = Ê_j / E_j = (E_j / E_{max,j})^{γ_j}   (8)
γ_j = p_j − 1   (9)
p_j = 1 + log(1/δ) / log(SNR_{LT,j})   (10)
SNR_{LT,j} = E_{max,j} / E_{min,j}   (11)

where SNR_{LT,j} can be interpreted as the long-term signal-to-noise ratio (SNR) at the jth sub-band. The perceptual BSS postprocessing method described in [21] uses the above framework, with the maximum and minimum envelope values for each sub-band replaced by the envelopes of the primary and secondary channels provided by the BSS algorithm. In other words, E_{max,j} becomes the maximum of the primary envelope E_{p,j}, and E_{min,j} becomes the secondary envelope E_{s,j}.

According to (8), the gains can be computed on a sample-by-sample basis. However, the parameters γ_j and E_{max,j} must be estimated over the entire signal. To provide a realistic implementation, the proposed method updates γ_j and E_{max,j} on a frame-by-frame basis every N samples. In this case,

g_j(k) = exp{ γ_j(k) [ ln E_{p,j}(k) − ln max M_j ] }   (12)

where E_{p,j}(k) is the envelope of the primary channel at the jth sub-band and frame index k. Equation (12) is equivalent to (8), but it provides more numerical stability at limited precision. The factors γ_j(k) and p_j(k) are derived from (9) and (10) as

γ_j(k) = p_j(k) − 1   (13)
p_j(k) = 1 + log(1/δ) / log SNR_j(k)   (14)

where δ is the factor that describes the amount of expansion to be applied to the signal, SNR_j(k) is the estimated SNR at the jth sub-band and frame k, and M_j is a vector that holds the maximum values of the primary envelopes at sub-band j obtained from the previous K frames:

M_j = [ max E_{p,j}(k−1), max E_{p,j}(k−2), ..., max E_{p,j}(k−K) ]   (15)

To avoid computational overflow and to ensure the preservation of the localization cues (see the Appendix), the value of γ_j(k) was constrained to a bounded range. To minimize artifacts and to achieve better output quality, it is necessary to hold a long-term history of the maximum values of the primary envelope. After different tests, we determined that the vector M_j should store the maximum envelopes for at least one second. All experiments use a two-second memory.

To estimate the SNR, SNR_j(k), at the given sub-band and frame, signal and noise power estimates are obtained from the envelopes of the primary and secondary channels, since the primary channel provides an estimate of the target signal, and the secondary channel an estimate of the interfering signals. The signal in the secondary channel may include information about the target signal. Thus, to remove the effect of the primary channel, the noise power is updated using a rule derived from the noise PSD estimator proposed in [25]:

if σ̂²_{s,j}(k) < ξ P_{N,j}(k−1)
    P_{N,j}(k) = τ_N P_{N,j}(k−1) + (1 − τ_N) σ̂²_{s,j}(k)
end   (16)

where P_{N,j}(k) is the noise power at the jth sub-band and frame k, σ̂²_{s,j}(k) is an estimate of the variance of the secondary envelope E_{s,j}(k), τ_s and τ_N are time constants that smooth the estimation, and ξ is a threshold coefficient. Similarly, the primary channel may contain information about the interfering signals. Hence, the frame SNR is estimated by means of

SNR_j(k) = P_{p,j}(k) / P_{N,j}(k) − 1   (17)

where P_{p,j}(k) is the power of the primary channel, estimated by

P_{p,j}(k) = τ_p P_{p,j}(k−1) + (1 − τ_p) E̅²_{p,j}(k)   (18)

with E̅²_{p,j}(k) the mean squared primary envelope over the kth frame. The constraint SNR_j(k) > 1 is imposed to avoid exceptions in (14). The values of the time constants and of the threshold ξ were chosen empirically to provide good performance in our experiments. The initial values of P_{N,j}, P_{p,j}, and M_j are estimated using the information of the first and second frames.
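As an illustration of how the frame update of this section fits together, the following sketch combines the gated noise-power rule of (16), the SNR estimate of (17)–(18), and the gain rule of (12)–(14). All names and the numeric constants (δ, time constants, threshold, bounds) are placeholders, since the paper's tuned values are not reproduced here.

```python
import numpy as np

def update_gains(E_p, E_s, M, P_N, P_p, delta=0.1,
                 tau_p=0.9, tau_N=0.9, xi=2.0):
    """Frame update of the per-band expansion gains (Sec. II-D, sketch).

    E_p, E_s : primary/secondary envelope frames, shape (n_bands, N)
    M        : per-band history of frame maxima, shape (n_bands, K)
    P_N, P_p : running noise / primary power estimates, shape (n_bands,)
    """
    # Gated noise-power update (Eq. 16): only frames whose secondary
    # variance is plausibly noise-like are allowed to update P_N.
    var_s = np.mean(E_s ** 2, axis=1)
    gate = var_s < xi * P_N
    P_N = np.where(gate, tau_N * P_N + (1 - tau_N) * var_s, P_N)

    # Primary power (Eq. 18) and frame SNR (Eq. 17); the SNR is floored
    # above 1 so the logarithm in Eq. (14) stays well defined.
    P_p = tau_p * P_p + (1 - tau_p) * np.mean(E_p ** 2, axis=1)
    snr = np.maximum(P_p / np.maximum(P_N, 1e-12) - 1.0, 1.0 + 1e-3)

    # Expansion coefficient (Eq. 14), bounded as discussed in Appendix A.
    p = 1.0 + np.log(1.0 / delta) / np.log(snr)
    p = np.minimum(p, 6.0)  # placeholder bound on p (i.e., on gamma = p-1)

    # Slide the maximum-envelope history (Eq. 15) and compute the gains
    # (Eq. 12): the recent maximum is preserved, lower levels pushed down.
    M = np.column_stack([E_p.max(axis=1), M[:, :-1]])
    E_max = M.max(axis=1, keepdims=True)
    gains = (np.maximum(E_p, 1e-12) / E_max) ** (p[:, None] - 1.0)
    return np.minimum(gains, 1.0), M, P_N, P_p
```

Clipping the gains at one keeps the stage attenuation-only whenever the current envelope momentarily exceeds the stored maximum; this is a defensive choice in the sketch rather than a rule stated in the paper.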
III. EXPERIMENTAL SETUP
A. Experiment

The performance of the proposed method (BSS-PP) is compared to the existing methods described in [13], [16], [17]. The method in [16] is a BSS-based binaural noise-reduction method that uses two adaptive filters to recover the localization cues (Aichner-07); the method in [17] is another BSS-based binaural noise-reduction method that uses a postprocessing Wiener filter (Reindl-10); and the method in [13] is a multichannel Wiener filter that preserves the localization cues for the signal and noise (MWF-N). The Aichner-07 and Reindl-10 implementations use the parameters described in [16] and [17]. In the MWF-N implementation, the second-order statistics were estimated on-line using a voice activity detector. The MWF-N implementation uses two microphones per hearing aid (microphone separation of 8 mm) in a BTE configuration.

In the absence of reverberation, four different scenarios were used for testing: diffusive noise, babble noise, a single interfering speech signal, and four distinguishable speakers placed at different locations (four interfering signals). Mixtures for these scenarios were created by filtering the target signal with the head-related transfer functions (HRTFs) measured for a KEMAR manikin in the absence of reverberation [26]. The target signal was placed at eight different azimuth angles: 0°, 30°, 90°, 120°, 180°, 240°, 270°, and 330°, where 0° corresponds to the front of the KEMAR, 90° to the right ear, and 270° to the left ear. Target signals were speech recordings of ten different speakers and sentences taken from the wide-band recordings of the NOIZEUS database [27]. For all scenarios, the interfering signals were added to the target signal at different SNRs. For the diffusive noise scenario, uncorrelated pink noise sources were played simultaneously at 18 different spatial locations. For the babble noise scenario, the noise source was recorded in a cafeteria² and added to the speech samples processed with the HRTFs. For the single-interfering scenario, the interference was located at 40°, and for the four-interfering scenario, the interfering signals are speech samples placed at 40°, 80°, 200°, and 260°. All mixtures are 10–13 seconds long and sampled at 22 kHz.

The performance of any noise-reduction system is usually degraded when reverberation is present. To analyze the effect of reverberation on the proposed system, the HRTF database described in [28] and [29] was used to generate the test samples. In particular, the experiments assume a babble noise scenario under four different room conditions, in order of increasing reverberation time T60: studio, meeting room, office, and lecture room.

The performance was analyzed using two metrics: the broadband intelligibility-weighted SNR improvement (ΔSNR-SII) [30], and the objective quality assessment measure PEMO-Q [31]. The ΔSNR-SII is computed by [30]

ΔSNR-SII = Σ_i w_i [ ΔSNR_{L,i} + ΔSNR_{R,i} ] / 2   (19)

where L and R correspond to the left and right ears, ΔSNR_{c,i} represents the SNR improvement, in dB, at the ith frequency bin for ear c, and w_i is a weighting factor that depends on the importance of the given frequency bin for speech intelligibility. These weighting factors are taken from [32]. The ΔSNR-SII values reported in this paper correspond to the average over all target speakers and angles. To avoid false estimates due to transients in the algorithms, ΔSNR-SII values are estimated after 3 seconds.

²Real environments include both background noise (as in the diffusive noise scenario) and interfering signals (as in the four-interfering-signals scenario). The babble noise used in the experiments was recorded in a cafeteria, and it includes a mixture of pure background noise (unrecognizable speech) and some interfering speech signals (recognizable speech utterances during short periods of time). Hence, the results for this scenario are close to the performance under a realistic environment.

B. Selection of the Parameters in the Proposed Method

The performance of the proposed method depends on the expansion parameter δ and the update frame length N. Since δ controls how much the dynamic range of the signal is expanded, a smaller δ is expected to yield better noise reduction. However, choosing a small δ may degrade the sound quality. Hence, it is necessary to select a δ small enough to achieve a good noise-reduction level and large enough to achieve good sound quality. On the other hand, the noise-suppression gains and the output signal can be computed on a sample-by-sample basis, while the parameters required to estimate these gains are updated every N samples. To reduce memory resources in a real-time implementation, the value of N should be chosen small, but using a small frame length implies rapid changes in the gain parameters, which in turn causes the envelope in each sub-band to change rapidly. These rapidly changing envelope modifications produce a modulation of the input signal that leads to distortion of the speech quality. Hence, the value of N should be chosen large enough to achieve good quality and small enough to reduce memory requirements.

Simulations for all scenarios using different values of δ and N were conducted to determine the best values for these parameters. For the babble noise scenario (Fig. 2), the SNR-SII improvement is very sensitive to the parameter δ and almost independent of the parameter N. In particular, sufficiently small values of δ provide similar SNR-SII improvement, whereas larger values produce a noticeable performance reduction. Results for the diffusive noise scenario are similar to those for the babble noise scenario. For the four-interfering (Fig. 3) and single-interfering scenarios (whose behavior is similar to that of the four-interfering scenario), the performance of the proposed method with respect to the parameter δ is similar to that for the babble noise scenario, i.e., a small δ provides good SNR-SII improvement. On the contrary, the SNR-SII improvement depends strongly on the update frame length N. For this particular scenario, the update frame length should be large, although no significant improvement was achieved beyond a certain frame length.

In addition to the noise reduction, another important concern is the sound quality. The PEMO-Q test was employed to measure the quality of the enhanced signals. For the babble noise scenario (Fig. 4), a larger δ provides higher sound quality but lower SNR-SII improvement.
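For reference, the computation in (19) amounts to the following; the array layout, and the assumption that the band-importance weights are normalized to sum to one, are illustrative choices.

```python
import numpy as np

def delta_snr_sii(snr_in_db, snr_out_db, band_weights):
    """Intelligibility-weighted SNR improvement of Eq. (19), as a sketch.

    snr_in_db, snr_out_db : per-band SNRs before/after processing,
        arrays of shape (n_bands, 2) holding the (left, right) ears.
    band_weights : SII band-importance weights (e.g., ANSI S3.5-1997),
        assumed normalized so they sum to one.
    """
    delta = snr_out_db - snr_in_db      # per-band, per-ear improvement, dB
    binaural = delta.mean(axis=1)       # average the left and right ears
    return float(np.dot(band_weights, binaural))
```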
Fig. 2. SNR-SII improvement for the babble noise scenario under different parameters δ and N, at a fixed input SNR.

Fig. 5. Objective quality (PEMO-Q) for the four-interfering scenario under different parameters δ and N. A value of one corresponds to a clean signal.
Fig. 6. SNR-SII improvement for diffusive noise (no reverberation).
Fig. 3. SNR-SII improvement for the four-interfering scenario under different parameters δ and N, at a fixed input SNR.
Fig. 7. SNR-SII improvement for babble noise (no reverberation).
Fig. 4. Objective quality (PEMO-Q) for the babble noise scenario under different parameters δ and N. A value of one corresponds to a clean signal.
Moreover, PEMO-Q scores across a range of δ values are similar. On the contrary, for the four-interfering scenario (Fig. 5), there is no significant dependence between the PEMO-Q score and δ. Since the PEMO-Q score for babble noise is higher than that of the four-interfering scenario, and a large δ provides lower SNR-SII improvement for all scenarios, an intermediate value of δ was found to be a good choice for all scenarios. On the other hand, the sound quality increases with a larger update frame length N; however, no significant improvement in the sound quality was found beyond a certain frame length under any scenario. Hence, a suitably large N can provide high SNR-SII improvement and good sound quality.

In principle, a large update frame length N seems impractical for a real-time system such as a digital hearing aid because of the memory requirements. However, the proposed framework allows the memory resources to be reduced by using sub-frames shorter than N. For example, a memory reduction can be achieved by computing the statistics in (16) and (18) over short sub-frames and updating them every N samples.
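One way to realize this sub-frame idea is sketched below; the class structure, names, and the decision to store only running sums are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

class SubframeAccumulator:
    """Memory-saving update scheme for the statistics of Eqs. (16)/(18).

    Powers are accumulated over short sub-frames, so only per-band
    running sums, not N raw samples per band, must be stored; the gain
    parameters are refreshed once a full update frame N is collected.
    """
    def __init__(self, n_bands, frame_len):
        self.frame_len = frame_len
        self.acc = np.zeros(n_bands)   # running sum of squared envelopes
        self.count = 0

    def push(self, env_subframe):
        """env_subframe: (n_bands, sub_len) chunk of band envelopes."""
        self.acc += np.sum(env_subframe ** 2, axis=1)
        self.count += env_subframe.shape[1]
        if self.count >= self.frame_len:        # time to update the gains
            mean_power = self.acc / self.count  # feeds Eqs. (16) and (18)
            self.acc[:] = 0.0
            self.count = 0
            return mean_power
        return None                             # keep accumulating
```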
IV. RESULTS AND DISCUSSION

This section presents the performance of the proposed method with respect to the SNR-SII improvement under non-reverberant and reverberant scenarios, followed by a subjective test that verifies the effectiveness of the proposed method.

A. Performance Under Non-Reverberant Scenarios

For comparison purposes, the SNR-SII improvements for the proposed method (BSS-PP), Aichner-07, Reindl-10, and MWF-N are plotted in Figs. 6–9. Simulations of the proposed method use fixed values of δ and N, derived from the analysis in Section III-B. The SNR-SII improvement for all techniques and scenarios decreases as the input SNR increases, except for the proposed method, for which the SNR-SII improvement increases with the input SNR. This behavior is due to the dynamic-range expansion provided by the algorithm. In the proposed method, the noise level is mapped down to the noise floor set by δ, while the maximum signal level is preserved; in the ideal case, the expansion therefore maps the input SNR to a larger output SNR. Hence, the larger the input SNR, the larger the expected output SNR.
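The direction of this trend can be made concrete with a short calculation; this is an illustrative sketch assuming the power-law envelope expansion of (7) with expansion coefficient p:

\[
\frac{\hat{E}_{\max}}{\hat{E}_{\min}}
= \frac{E_{\max}}{E_{\max}\left(E_{\min}/E_{\max}\right)^{p}}
= \left(\frac{E_{\max}}{E_{\min}}\right)^{p}
\quad\Longrightarrow\quad
\mathrm{SNR}_{\mathrm{out}}\,[\mathrm{dB}] \approx p \cdot \mathrm{SNR}_{\mathrm{in}}\,[\mathrm{dB}].
\]

For p > 1 the improvement, approximately (p − 1) times the input SNR in dB, therefore grows as the input SNR grows, which matches the behavior observed for BSS-PP in Figs. 6–9.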
Fig. 8. SNR-SII improvement for four interfering signals (no reverberation). The dashed line is the performance of the proposed method assuming an ideal output selector.

Fig. 9. SNR-SII improvement for a single interfering signal (no reverberation). The dashed line is the performance of the proposed method assuming an ideal output selector.
The proposed method (BSS-PP) outperforms the existing methods overall when the number of interfering sources is large (Figs. 6–8), in other words, when the interfering sources are spread in space or the noise is not a point source. For example, the SNR-SII improvement achieved by the proposed method under the diffusive and babble noise scenarios is considerably superior to that of the existing methods (Figs. 6 and 7). Although the performance of the proposed method under the four-interfering scenario is significant, the algorithm fails at input SNRs lower than 0 dB (Fig. 8), i.e., when the target signal is not the dominant signal. This limitation is due to the output-selection algorithm (Section II-B). This algorithm accomplishes the selection by comparing the long-term energy of the envelopes of the primary and secondary channels. When the number of interfering signals is high, the envelope of the background noise is almost flat, and this selection is an easy task. However, when the number of interfering signals is small (as in the single-interfering and four-interfering scenarios), the envelope of the background noise resembles that of the target speech. Hence, the output selector may select the wrong output and enhance the interfering signal instead of the target signal, producing a negative SNR improvement. To show how the output-selection algorithm degrades the performance of the proposed method at low input SNR, the performance using an ideal output selector is shown as dashed lines in Figs. 8 and 9. Even in the ideal case, the performance of the proposed method under the single-interfering scenario is lower than that of Aichner-07.
Fig. 10. SNR-SII improvement for babble noise at a fixed input SNR. The angles correspond to the direction of arrival of the target signal.
It is important to remark that non-point-source scenarios (such as babble noise) are challenging for a hearing-impaired listener, while point-source noise scenarios (such as four interfering signals) at very low input SNR are uncommon. The robustness of the proposed algorithm can be improved by using DoA-estimation algorithms (e.g., methods based on sub-spaces) or a BSS permutation algorithm, at the expense of increased computational complexity. The proposed output-selection method is a low-complexity solution that provides acceptable performance for the most challenging and common scenarios. Since a failure of the output-selection algorithm is very unlikely to occur in practice, the proposed algorithm is a good choice for binaural hearing aids.

An important property of any binaural noise-reduction system is the ability to provide a symmetric SNR improvement for all directions of arrival of the target signal. BSS-PP, Reindl-10, and MWF-N ensure a symmetric SNR-SII (Fig. 10), which in perceptual terms sounds more comfortable and natural. On the contrary, the asymmetric SNR-SII behavior of Aichner-07 is perceptually uncomfortable for the subject, as explained next. When the direction of arrival of the target signal is at 90° (right ear), Aichner-07 provides a high SNR for the left ear and a low SNR for the right ear. This means that the background noise is no longer heard at the left ear but is still present at the right ear. This asymmetric noise reduction is a result of the adaptive noise cancellation performed by the postprocessing filters. In Aichner-07, the noise estimate is taken from the BSS output in which the target signal is not present. Therefore, the noise cancellation is worse at the side where the target signal is stronger.

Up to this point, the performance of the proposed method has been analyzed in terms of SNR improvement. This metric provides information about the ability of the algorithm to reduce the background noise and to enhance the speech signal. However, a high SNR improvement does not necessarily imply high sound quality. PEMO-Q scores were used to assess the objective sound quality of the enhanced signals for different scenarios and input SNRs (Fig. 11). The PEMO-Q scores show that the quality of the proposed method is comparable to (or, for some scenarios and conditions, better than) that of the existing methods, which suggests that the proposed method provides better noise removal while maintaining acceptable sound quality. This is verified through the subjective test in Section IV-C.
Fig. 13. SNR-SII improvement for the babble noise scenario in a lecture room (the most reverberant condition tested).
Fig. 11. Objective quality (PEMO-Q) for (a) the diffusive noise scenario, (b) the babble noise scenario, and (c) the four-interfering-signals scenario.
Fig. 12. SNR-SII improvement for the babble noise scenario under four different reverberant conditions and two input SNRs (top and bottom).
B. Performance Under Reverberant Scenarios

To analyze how the performance of the proposed and existing noise-reduction methods is affected by the presence of reverberation, the SNR improvement was estimated for four different reverberant room conditions using babble noise (Fig. 12). The SNR improvement provided by the proposed method (BSS-PP) is reduced compared to the SNR-SII improvement when no reverberation is present (Figs. 7 and 13). Nevertheless, for the proposed method, the SNR-SII improvement for every room analyzed was above 3 dB, which is an acceptable SNR improvement for a noise-reduction method under reverberant conditions. In addition, the SNR-SII improvement provided by BSS-PP is similar or superior to that of the existing methods.

For the room with the longest reverberation time (the lecture room), the behavior of the SNR-SII improvement with respect to the input SNR is shown in Fig. 13. The proposed method offers a significant advantage over the existing methods, since its SNR-SII improvement increases much faster with increasing input SNR. The SNR-SII improvement of the proposed method is consistently superior to that of the existing methods, except at very low input SNRs, where it is low compared to the existing methods.

The fact that the SNR improvement of the proposed method is consistently superior to that of the existing methods can be explained by the way in which the noise reduction is performed. In the proposed method, the noise level is reduced by expanding the dynamic range of the unprocessed signals, using a gain determined from the maximum and minimum values of the envelopes of the primary channel (an estimate of the target signal) and the secondary channel (an estimate of the interfering signals). When reverberation is present, the BSS algorithm is unable to provide a good separation of the target and background noise; both the primary and secondary channels then contain reverberant components of the target signal. As a result, the envelopes of the primary and secondary channels used to estimate the maximum and minimum envelopes required by (13), (16), and (18) are expected to have values less than or equal to those of the non-reverberant case. Therefore, a similar or smaller expansion is produced, and the SNR-SII improvement is only slightly reduced. On the contrary, in Aichner-07, the information in the secondary channel is used as the noise estimate for the noise cancellation performed by the adaptive filters. Since this noise estimate contains information about the target signal (present in the reverberant components), the adaptive filters tend to cancel part of the target signal, degrading the SNR-SII. Reindl-10 and MWF-N are highly dependent on the statistics of the signal and noise. Since the statistics are updated on-line, when reverberation is present the noise statistics may contain information about the target signal, resulting in a reduction of the overall performance.
proposed method offers a significant advantage over the existing methods since the SNR-SII increases much faster with the increasing of the input SNR. The SNR-SII improvement in the proposed method is consistently superior to the existing methods except at very low input SNR ( dB), in which the SNR-SII improvement is very low compared to the existing methods. The fact that the SNR improvement in the proposed method is consistently superior to the existing methods can be explained by the way in which the noise reduction is performed. In the proposed method, the noise level is reduced by expanding the dynamic range of the unprocessed signals, using a gain that is determined using the maximum and minimum values of the envelopes at the primary channel (estimate of the target signal) and the secondary channel (estimate of the interfering signals). When reverberation is present, the BSS algorithm is unable to provide a good separation of the target and background noise. In this sense, both primary and secondary channels contain reverberant components of the target signal. As a result, the envelope of the primary and secondary channels used to estimate the maximum and minimum envelope required by (13), (16), and (18), are expected to have a value less than or equal to the non-reverberant case. Therefore, a similar or lower expansion is produced, and the SNR-SII improvement is slightly reduced. On the contrary, in Aichner-07, the information of the secondary channel is used as noise estimate for the noise cancellation performed by the adaptive filters. Since this noise estimate contains information about the target signal (present in the reverberant components), it is expected that the adaptive filters attempt to cancel the target signal, degrading the SNR-SII. Reindl-10 and MWF-N are highly dependent on the statistics of the signal and noise. Since the statistics are updated online, when reverberation is present, the noise statistics may have information about the target signal, resulting in a reduction of overall performance. C. Subjective Test A subjective test was conducted to verify the efficiency of the proposed method to reduce the background noise and to preserve the localization cues. The test was a multi-stimulus test with hidden reference and anchor (MUSHRA) according to [33]. The test was composed by two parts. In the first part, the subject was asked to grade the speech quality and noise reduction. For this part, the test samples included clean speech, unprocessed speech in babble noise at an input SNR of 0 dB, and enhanced speech with the Aichner-07, Reindl-10, MWF-N,
and BSS-PP methods. The reference and hidden reference signals were unprocessed noisy speech, while the anchor signal was noisy speech distorted according to [33]. In the second part, the subject was asked to identify the preservation of the direction of arrival of both the target and interfering signals. The reference and hidden reference had one or two interfering speech signals at an input SNR of 8 dB. The anchor signal was a single-channel unprocessed signal applied to both ears, i.e., a signal in which the target and interfering signals are always heard coming from the front. The other test samples were the signals enhanced with the existing and proposed methods. All test samples had a length of 5 seconds, and they were presented to the subject in random order. For each part of the test, four groups of samples were presented to the subject. A total of 20 normal-hearing subjects participated in the experiment.

Fig. 14. Subjective test results for speech quality (left) and noise reduction (right). Reference: speech in babble noise; anchor: noisy speech distorted according to [33].

Fig. 15. Subjective test results for the preservation of the localization cues. Reference: speech with one and two interfering speech signals; anchor: single-channel noisy speech.

Results of the subjective test are shown in Figs. 14 and 15. The proposed method (BSS-PP) provides a noise reduction similar to MWF-N and superior to the other BSS-based methods (Aichner-07 and Reindl-10). There is some distortion of the speech quality for all methods. Because of the subjects' preference for the speech quality of the unprocessed noisy signal, the methods that provided the lowest noise reduction (Aichner-07 and Reindl-10) achieved the best speech quality, and the methods with the highest noise reduction (MWF-N and BSS-PP), the lowest speech quality. On the other hand, all methods are able to preserve the localization cues of the target signal, but the Aichner-07 method is unable to preserve the localization cues of the interfering signal, particularly when the noisy signal has two interfering signals (Appendix B).

V. CONCLUSION

This paper describes a noise-reduction method based on blind source separation (BSS) and perceptual postprocessing that preserves the localization cues of both the target and interfering signals. The method provides low processing delay, since it can be implemented on a sample-by-sample basis with parameters updated on a frame-by-frame basis. The proposed method was compared to existing binaural noise-reduction methods based on BSS and MWF. Using objective and subjective metrics, and different test scenarios under reverberant and non-reverberant conditions at different input SNR values, the proposed method has been shown to provide, on average, better noise reduction for all scenarios analyzed. Its performance is best under hostile environments such as babble noise, and it is worse in environments with a small number of interfering signals. Although the proposed method outperforms the other methods in terms of noise reduction, the quality of the output sound was rated lower than or similar to that of the existing methods.

APPENDIX A
PRESERVATION OF LOCALIZATION CUES IN BSS-PP
As mentioned in Section II, the gains are applied simultaneously to the filter-bank outputs of the unprocessed input signals (Fig. 1). In a filter-bank, the acoustic signals received at the left and right sides can be expressed by means of the envelopes and excitations at the jth sub-band as [24]

x_{c,j}(n) = E_{c,j}(n) s_{c,j}(n)   (20)

where c represents the channel, left (c = L) or right (c = R), and E_{c,j}(n) and s_{c,j}(n) are the envelopes and excitations, respectively. In the proposed method, the gains modify the envelopes at the jth sub-band, leaving the excitations unmodified (Section II). These gains can be expressed in terms of the envelope of the primary channel as

g_j(n) = a_j E^{γ_j}_{p,j}(n)

where a_j = (max M_j)^{−γ_j}. Thus, the final output is given by

z_c(n) = Σ_j a_j E^{γ_j}_{p,j}(n) E_{c,j}(n) s_{c,j}(n)   (21)

The interaural transfer function (ITF) provides insight into the localization cues. Its magnitude is called the interaural level difference (ILD), and its phase, the interaural time difference (ITD). To preserve the localization cues, the method should ensure an output ITF similar to the input ITF. These ITFs are defined by the ratios

ITF_in(ω) = X_L(ω) / X_R(ω),   ITF_out(ω) = Z_L(ω) / Z_R(ω)   (22)

The goal is to show that ITF_out(ω) = ITF_in(ω). Suppose a pure-tone excitation s(n) = e^{jω₀n} produced at an arbitrary spatial location. This signal is received at the left and right sides as x_L(n) = H_L(ω₀) e^{jω₀n} and x_R(n) = H_R(ω₀) e^{jω₀n}, respectively, with H_L(ω₀) and H_R(ω₀) the head-related transfer functions (HRTFs) that describe the propagation model of the signal to the microphones located at each side. For this signal,

ITF_in(ω₀) = H_L(ω₀) / H_R(ω₀)   (23)

When this tone is used in (20), x_{c,j}(n) becomes H_c(ω₀) e^{jω₀n} only if the frequency ω₀ is within the jth critical band. This result reduces (21) to

z_c(n) = a_{j₀} E^{γ_{j₀}}_{p,j₀}(n) H_c(ω₀) e^{jω₀n}   (24)

In the above equations, the index j was replaced by j₀ to state that only the j₀th sub-band, which contains the frequency ω₀, is active. The above equations show the nonlinear nature of the processing, in which the output signal has a frequency content not limited strictly to the frequency ω₀. From (24), the frequency components of the outputs are described by

Z_c(ω) = a_{j₀} H_c(ω₀) V_{j₀}(ω − ω₀)   (25)

where

V_{j₀}(ω) = F{ E^{γ_{j₀}}_{p,j₀}(n) }   (26)

The localization cues are preserved if the output ITF derived from (25) is equal to (23) for all frequencies ω.

Proposition 1: Localization cues are preserved if the bandwidth of V_{j₀}(ω) is within the critical bandwidth of the auditory filter associated with the frequency ω₀.

Proof: The proposed method uses a filter-bank that resembles the auditory filters, and the final output is constructed by summation of the sub-bands (21). To prevent the frequency content of a particular sub-band output from overlapping other sub-bands, the bandwidth of each sub-band output must be within the critical bandwidth of the corresponding auditory filter. If this criterion is met, each sub-band can be analyzed independently. To meet this criterion in (25), V_{j₀}(ω − ω₀) has to be within the critical band of the auditory filter associated with the j₀th band, which contains the frequency ω₀. Therefore, the ITF for the frequency region covering this critical band becomes

ITF_out(ω) = Z_L(ω) / Z_R(ω) = H_L(ω₀) / H_R(ω₀) = ITF_in(ω₀)

In a more general case, the total output can be constructed as a linear combination of terms of the form (25). Hence, if each V_{j₀}(ω) is within its critical band, the above relationship is still valid, since a particular j₀th band does not overlap any other band.

Proposition 2: V_{j₀}(ω) is within the critical bandwidth of the auditory filter associated with the frequency ω₀ only for certain specifications of the envelope detector.

Proof: V_{j₀}(ω), as stated in (26), is the Fourier transform of the envelope of the primary channel raised to the power γ_{j₀}. It is known that the envelope detector provides a low-pass signal of bandwidth B_{j₀}/q_{j₀}, where B_{j₀} is the bandwidth of the j₀th band and 1/q_{j₀} is the fraction used for the envelope-detector cutoff. Thus, V_{j₀}(ω) is expected to be a low-pass signal with a bandwidth approximately equal to γ_{j₀} B_{j₀}/q_{j₀}. We need to show that this bandwidth lies within the critical bandwidth of each auditory filter. The envelope detectors used in the proposed method employ a low-pass filter whose cutoff frequencies are 1/5, 1/8, and 1/15 of the bandwidth of the low-, medium-, and high-frequency bands, respectively (Section II-C). Therefore, the value of q_{j₀} lies in the range [5, 15], depending on the band number. Since γ_{j₀}(k) = p_{j₀}(k) − 1 with p_{j₀}(k) as in (14), constraining γ_{j₀}(k) so that it does not exceed the smallest value of q_{j₀} ensures the preservation of the localization cues. Experiments have shown that γ_{j₀}(k) lies in the range [0, 3] even without any upper-bound constraint. Thus, the localization cues are preserved.

APPENDIX B
PRESERVATION OF LOCALIZATION CUES IN AICHNER-07

In [17], the output of the two-microphone Aichner-07 method in the frequency domain is given by

Z_m(ω) = X_m(ω) − G_m(ω) Y_n(ω)   (27)

where

X_m(ω) = Σ_{l=0}^{Q} H_{l,m}(ω) S_l(ω)   (28)
Y_n(ω) = Σ_{l=1}^{Q} B_l(ω) S_l(ω)   (29)

Here, m is the microphone index; S_l(ω) is the lth source (l = 0 for the target signal), and Q is the number of interfering sources; H_{l,m}(ω) is the transfer function from the lth source to the mth microphone; B_l(ω) collects the effect of the BSS unmixing filters on the lth source in the noise estimate Y_n(ω); and G_m(ω) is the frequency response of the mth adaptive filter. The ITFs for an interfering signal are defined as

ITF_l(ω) = H_{l,L}(ω) / H_{l,R}(ω)   (30)

Replacing (28) in (30), the output ITF for the lth interfering signal takes the form

ITF_{l,out}(ω) = ITF_l(ω) D_l(ω)

where the factor D_l(ω), which depends on the adaptive filters G_m(ω) and on the unmixing responses B_l(ω), is the ITF displacement. In other words, there is a shift in the perceived direction of arrival of each interfering signal. Reference [17] also showed that in the determined case, i.e., when the number of sources is equal to the number of microphones, the interfering signals can be completely removed, and the following conditions are satisfied:

G_m(ω) B_l(ω) = H_{l,m}(ω),   l = 1, ..., Q,   m ∈ {L, R}   (31)
which leads to an ITF displacement D_l(ω) = 1, i.e., no shift of the interfering signals. Hence, an ITF displacement is expected in the underdetermined case, since the conditions in (31) are not met.

REFERENCES

[1] J. Jerger, R. Darling, and E. Florin, "Efficacy of the cued-listening task in the evaluation of binaural hearing aids," J. Amer. Acad. Audiol., vol. 5, no. 5, pp. 279–285, 1994.
[2] P. Smith, A. Davis, J. Day, S. Unwin, G. Day, and J. Chalupper, "Real world preferences for linked bilateral processing," Hear. J., vol. 61, no. 7, pp. 33–38, 2008.
[3] T. Van den Bogaert, T. J. Klasen, M. Moonen, L. Van Deun, and J. Wouters, "Horizontal localization with bilateral hearing aids: Without is better than with," J. Acoust. Soc. Amer., vol. 119, no. 1, pp. 515–526, 2006.
[4] B. C. J. Moore, "Binaural sharing of audio signals: Prospective benefits and limitations," Hear. J., vol. 60, no. 11, pp. 46–48, 2007.
[5] T. Wittkop and V. Hohmann, "Strategy-selective noise reduction for binaural digital hearing aids," Speech Commun., vol. 39, no. 1–2, pp. 111–138, 2003.
[6] J. Li, M. Akagi, and Y. Suzuki, "Extension of the two-microphone noise reduction method for binaural hearing aids," in Proc. Int. Conf. Audio, Lang., Image Process. (ICALIP), 2008, pp. 97–101.
[7] A. Kamkar-Parsi and M. Bouchard, "Improved noise power spectrum density estimation for binaural hearing aids operating in a diffuse noise field environment," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 521–533, May 2009.
[8] J. Li, S. Sakamoto, S. Hongo, M. Akagi, and Y. Suzuki, "Two-stage binaural speech enhancement with Wiener filter based on equalization-cancellation model," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust. (WASPAA), 2009, pp. 133–136.
[9] T. Lotter and P. Vary, "Dual-channel speech enhancement by superdirective beamforming," EURASIP J. Appl. Signal Process., vol. 2006, pp. 175–175, 2006.
[10] T. Rohdenburg, S. Goetze, V. Hohmann, K. Kammeyer, and B. Kollmeier, "Objective perceptual quality assessment for self-steering binaural hearing aid microphone arrays," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2008, pp. 2449–2452.
[11] T. Klasen, M. Moonen, T. Van den Bogaert, and J. Wouters, "Preservation of interaural time delay for binaural hearing aids through multichannel Wiener filtering based noise reduction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2005, pp. 29–32.
[12] S. Doclo, R. Dong, T. Klasen, J. Wouters, S. Haykin, and M. Moonen, "Extension of the multi-channel Wiener filter with ITD cues for noise reduction in binaural hearing aids," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust. (WASPAA), Oct. 2005, pp. 70–73.
[13] T. Klasen, T. Van den Bogaert, M. Moonen, and J. Wouters, "Binaural noise reduction algorithms for hearing aids that preserve interaural time delay cues," IEEE Trans. Signal Process., vol. 55, no. 4, pp. 1579–1585, Apr. 2007.
[14] S. Doclo, M. Moonen, T. Van den Bogaert, and J. Wouters, "Reduced-bandwidth and distributed MWF-based noise reduction algorithms for binaural hearing aids," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 1, pp. 38–51, Jan. 2009.
[15] S. Wehr, M. Zourub, R. Aichner, and W. Kellermann, "Post-processing for BSS algorithms to recover spatial cues," in Proc. Int. Workshop Acoust. Echo Noise Control (IWAENC), 2006.
[16] R. Aichner, H. Buchner, M. Zourub, and W. Kellermann, "Multichannel source separation preserving spatial information," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2007, vol. 1, pp. I-5–I-8.
[17] K. Reindl, Y. Zheng, and W. Kellermann, "Speech enhancement for binaural hearing aids based on blind source separation," in Proc. Int. Symp. Commun. Control Signal Process. (ISCCSP), 2010, pp. 1–6.
[18] T. Van den Bogaert, S. Doclo, J. Wouters, and M. Moonen, "The effect of multimicrophone noise reduction systems on sound source localization by users of binaural hearing aids," J. Acoust. Soc. Amer., vol. 124, no. 1, pp. 484–497, 2008.
[19] B. Cornelis, S. Doclo, T. Van den Bogaert, M. Moonen, and J. Wouters, "Theoretical analysis of binaural multimicrophone noise reduction techniques," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 2, pp. 342–355, Feb. 2010.
[20] J. I. Marin-Hurtado and D. V. Anderson, "Comparative study of eleven noise reduction techniques for binaural hearing aids," in Proc. Int. Hear. Aid Res. Conf. (IHCON), Aug. 2010, p. 65.
[21] D. N. Parikh and D. V. Anderson, "Blind source separation with perceptual post processing," in Proc. IEEE 2011 DSP/SPE Workshop, Jan. 2011, pp. 321–325.
[22] J. I. Marin-Hurtado, D. N. Parikh, and D. V. Anderson, "Binaural noise reduction method based on blind source separation and perceptual post processing," in Proc. Interspeech '11, 2011, vol. 1, pp. 217–220.
[23] Y. Takahashi, T. Takatani, K. Osako, H. Saruwatari, and K. Shikano, "Blind spatial subtraction array for speech enhancement in noisy environment," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 4, pp. 650–664, May 2009.
[24] D. N. Parikh, S. Ravindran, and D. V. Anderson, "Gain adaptation based on signal-to-noise ratio for noise suppression," in Proc. IEEE Workshop Applicat. Signal Process. Audio Acoust. (WASPAA), 2009, pp. 185–188.
[25] C. Ris and S. Dupont, "Assessing local noise level estimation methods: Application to noise robust ASR," Speech Commun., vol. 34, no. 1–2, pp. 141–158, 2001.
[26] B. Gardner and K. Martin, "HRTF measurements of a KEMAR dummy head microphone," MIT Media Lab Perceptual Computing, Tech. Rep. 280, 1994. [Online]. Available: http://sound.media.mit.edu/KEMAR.html
[27] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC, 2007.
[28] M. Jeub, M. Schafer, and P. Vary, "A binaural room impulse response database for the evaluation of dereverberation algorithms," in Proc. Int. Conf. Digital Signal Process., 2009, pp. 1–5.
[29] RWTH Aachen Univ., Aachen Impulse Response (AIR) Database—Version 1.2, 2010. [Online]. Available: http://www.ind.rwth-aachen.de/AIR
[30] J. E. Greenberg, P. M. Peterson, and P. M. Zurek, "Intelligibility weighted measures of speech-to-interference ratio and speech system performance," J. Acoust. Soc. Amer., vol. 94, no. 5, pp. 3009–3010, 1993.
[31] R. Huber and B. Kollmeier, "PEMO-Q: A new method for objective audio quality assessment using a model of auditory perception," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 6, pp. 1902–1911, Nov. 2006.
[32] American National Standard Methods for Calculation of the Speech Intelligibility Index, ANSI S3.5-1997, Acoust. Soc. Amer., 1997.
[33] Method for the Subjective Assessment of Intermediate Quality Levels of Coding Systems, Recommendation ITU-R BS.1534-1, 2003.

Jorge I. Marin-Hurtado (M'01) received the Licenciado degree in electrical engineering and the M.S. degree in applied physics from the Universidad del Quindío, Armenia-Quindío, Colombia, in 1997 and 2004, respectively. He is currently pursuing the Ph.D. degree in electrical and computer engineering at the Georgia Institute of Technology, Atlanta. Since 2001, he has been with the Department of Electronics Engineering, Universidad del Quindío, where he is an Assistant Professor. His research interests include signal processing algorithms, DSP hardware systems, and hearing aids.

Devangi N. Parikh (S'10) received the B.E. degree in electronics and communication engineering from Gujarat University, Ahmedabad, India, in 2006 and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 2008. She is currently pursuing the Ph.D. degree in electrical and computer engineering at the Georgia Institute of Technology. Her research interests include single- and multichannel speech enhancement and noise-suppression algorithms.

David V. Anderson (SM'04) received the B.S. and M.S. degrees in electrical engineering from Brigham Young University, Provo, UT, in 1993 and 1994, respectively, and the Ph.D. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 1999. Since 1999, he has been with the School of Electrical and Computer Engineering, Georgia Institute of Technology, where he is currently an Associate Professor. His research interests include efficient signal processing algorithms and hardware.