


Cepstrum Prefiltering for Binaural Source Localization in Reverberant Environments

Raffaele Parisi, Senior Member, IEEE, Flavia Camoes, Michele Scarpiniti, Member, IEEE, and Aurelio Uncini, Member, IEEE

Abstract—Binaural sound source localization can be performed by imitating the fundamental mechanisms of the human auditory system, which is based on the integrated effects of the ears, pinnae, head and torso. In particular, two physical cues can be exploited, i.e., the Interaural Time Difference (ITD) and the Interaural Level Difference (ILD). It is known that the joint use of ITD and ILD provides good source azimuth estimates [1]. In many practical situations binaural localization has to be performed in closed environments, where the presence of reverberation degrades the performance of available position estimators. In this paper a possible solution to this difficult problem is introduced. The proposed solution is based on the proper use of cepstral prefiltering prior to source localization by ITD and ILD. It is shown that the cepstrum can help in reducing the effects of reverberation, thus yielding better location estimates.

Index Terms—Binaural sound localization, cepstral filtering, reverberation.

I. INTRODUCTION


Human beings are able to localize sound sources with great accuracy and under different environmental conditions [2]. This fact has suggested imitating the mechanisms of the human auditory system in order to realize effective artificial binaural source localization systems. The field of possible applications is vast and includes, for instance, the design of hearing aids, interactive robotics and augmented reality audio [1]. Recently a significant number of models of the human auditory system have been proposed [3]. Among the possible approaches, those based on the estimation of the Interaural Level Difference (ILD) and the Interaural Time Difference (ITD) are often referred to [2], [4]. The ILD is proportional to the difference between the sound levels reaching the left and right ears, while the ITD is the measure of the time difference of arrival between the signals at each ear. These cues can separately give information on the position of the source with respect to the listener within a specified frequency range. More specifically, variations of the ILD values due to the shadowing effects caused by the head and the torso can give

Manuscript received October 24, 2011; revised December 08, 2011; accepted December 08, 2011. Date of publication December 19, 2011; date of current version January 09, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Mads Græsbøll Christensen. The authors are with the Department of Information, Electronics and Telecommunications (DIET), University of Rome La Sapienza, 00184 Roma, Italy (e-mail: [email protected]). Digital Object Identifier 10.1109/LSP.2011.2180376

information on the source position. This effect occurs especially at frequencies above 1.5 kHz, where the size of the head is large with respect to the wavelength of the signal [3]. On the other hand, the ITD depends directly on the direction of arrival of the source signal through a geometrical relationship based on a spherical model of the head. This relationship is valid only for a range of frequencies approximately below 1.5 kHz, where unambiguous decoding of periodic signals is assured [3]. As a matter of fact, the ITD yields position estimates with a smaller standard deviation than the ILD, but it suffers from an azimuth ambiguity caused by the a priori unknown phase unwrapping factor $p$. This fact suggested combining both types of cues into a novel binaural localization method [1]. Unfortunately, in closed environments the presence of even moderate reverberation can originate gross localization errors. In these cases reverberation typically induces self-masking and overlap masking of phonemes, thus making it impractical to rely on early reflections [3]. This requires proper preprocessing of the signals [5]. Cepstral analysis can be successfully employed to perform this task.

II. MODEL DESCRIPTION

Signals received by the left and right ears in a reverberant environment can be modelled in the discrete-time domain as

$$x_L[n] = h_L[n] * s[n] + v_L[n] \quad (1)$$

$$x_R[n] = h_R[n] * s[n] + v_R[n] \quad (2)$$

where $h_L[n]$ and $h_R[n]$ are the impulse responses between the source and the left and right ears (the binaural room impulse responses, BRIRs), $s[n]$ is the sound signal emitted by the source and $v_L[n]$, $v_R[n]$ are the corresponding uncorrelated noise terms. Each impulse response takes into account two independent effects: the first depends on the acoustics of the room (i.e. reverberation, [6]); the second is the directional filtering of the head, which weighs the arriving sound components according to their direction of arrival. The additive noise terms are usually modelled as uncorrelated, zero-mean, stationary Gaussian random processes.
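As a simple illustration of the model in (1) and (2), the following Python sketch builds the two ear signals from a source signal and a pair of BRIRs. The function name, the SNR-based noise scaling and all parameter values are assumptions made for the example, not specifications taken from the letter.

```python
import numpy as np

def binaural_mixture(s, h_left, h_right, snr_db=30.0, rng=None):
    """Hypothetical sketch of the signal model (1)-(2): each ear signal is the
    source convolved with the corresponding BRIR plus uncorrelated,
    zero-mean Gaussian noise (the SNR-based scaling is an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    x_l = np.convolve(s, h_left)
    x_r = np.convolve(s, h_right)
    # Scale the noise so that the average SNR at the two ears matches snr_db.
    sig_power = 0.5 * (np.mean(x_l ** 2) + np.mean(x_r ** 2))
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    x_l += rng.normal(0.0, np.sqrt(noise_power), x_l.shape)
    x_r += rng.normal(0.0, np.sqrt(noise_power), x_r.shape)
    return x_l, x_r
```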



III. AZIMUTH ESTIMATION VIA ILD AND ITD

Binaural sound localization can be performed by following the approach described in [1], which is briefly recalled in this section. It should be remarked that the interest here is in azimuth estimation only; elevation estimation is not considered.

The ILD and ITD for the generic $m$-th time frame of the acquired signals can be defined as

$$\Delta L_m(\omega) = 20\log_{10}\left|\frac{X_{R,m}(\omega)}{X_{L,m}(\omega)}\right| \quad (3)$$

$$\Delta t_{m,p}(\omega) = \frac{1}{\omega}\left(\angle X_{R,m}(\omega) - \angle X_{L,m}(\omega) + 2\pi p\right) \quad (4)$$

where $\omega$ is frequency, $X_{R,m}(\omega)$ and $X_{L,m}(\omega)$ are the Short Time Fourier Transforms (STFTs) of the right and left ear signals and $p$ is the phase unwrapping factor, which is unknown [7]. The azimuth of the source can be obtained by comparing the estimated ILD and ITD with a reference set built by exploiting Head Related Transfer Functions (HRTFs) [8]. In this case, (3) and (4) are written as

$$\Delta L(\omega, \theta) = 20\log_{10}\left|\frac{H_R(\omega, \theta)}{H_L(\omega, \theta)}\right| \quad (5)$$

$$\Delta t(\omega, \theta) = \frac{1}{\omega}\left(\angle H_R(\omega, \theta) - \angle H_L(\omega, \theta)\right) \quad (6)$$

where $H_R(\omega, \theta)$ and $H_L(\omega, \theta)$ are the HRTFs of the right and left ears respectively and $\theta$ is the azimuth angle. In particular, smoothing across azimuth is performed on the ILD lookup set in order to model the limits of human interaural level difference perception [1]. More specifically, a Gaussian filter with constant standard deviation can be employed, as indicated in the CIPIC database [8].

Localization can be performed using the ILD and ITD sets of (5) and (6) as a reference for the lookup algorithm. The ILD-only azimuth of the source is estimated from the absolute value of the difference between the ILD lookup set and the ILD computed from the real signals arriving at the ears: the azimuth minimizing this difference across the available frequencies is the estimate. ITD-only azimuth localization follows an analogous procedure, with the addition of the phase unwrapping module. In particular, for each STFT time frame the difference between the ITD lookup set and the experimental ITD is computed across azimuth for each possible value of the unwrapping factor $p$. The correct $p$ is then selected by minimizing the difference between the ITD-only and ILD-only estimates. This $p$-estimation procedure is repeated for each available time frame and a time average across frames is performed. The final azimuth estimate is the one displaying a minimum of the difference function that is consistent across frequencies. A sketch of this lookup procedure is given below.

The described procedure can be preceded by the application of cepstral prefiltering. This approach is briefly recalled in the next section.
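The following Python sketch gives one possible reading of this lookup procedure under simplifying assumptions: lookup errors are summed over all frequency bins rather than tracked per frequency, ILD and ITD errors are combined without normalization, and the set of candidate unwrapping factors is fixed in advance. The names (estimate_azimuth, ild_ref, itd_ref, thetas, p_range) are hypothetical.

```python
import numpy as np

def estimate_azimuth(XL, XR, ild_ref, itd_ref, freqs, thetas, p_range=range(-2, 3)):
    """Hypothetical sketch of the joint ILD/ITD lookup of Section III.

    XL, XR  : (F, M) STFTs of the left/right ear signals (F bins, M frames).
    ild_ref : (F, A) ILD lookup set built from the HRTFs, eq. (5).
    itd_ref : (F, A) ITD lookup set built from the HRTFs, eq. (6).
    freqs   : (F,) angular frequencies of the bins (DC bin excluded).
    thetas  : (A,) candidate azimuths of the lookup sets.
    """
    eps = 1e-12
    joint_err = np.zeros(len(thetas))
    for m in range(XL.shape[1]):
        # Measured ILD, eq. (3), and its lookup error per azimuth (summed over bins).
        ild = 20.0 * np.log10((np.abs(XR[:, m]) + eps) / (np.abs(XL[:, m]) + eps))
        ild_err = np.abs(ild[:, None] - ild_ref).sum(axis=0)
        theta_ild = thetas[np.argmin(ild_err)]               # ILD-only estimate

        # Measured ITD, eq. (4), for every unwrapping factor p; keep the p whose
        # ITD-only estimate agrees best with the ILD-only one.
        dphi = np.angle(XR[:, m]) - np.angle(XL[:, m])
        best = None
        for p in p_range:
            itd = (dphi + 2.0 * np.pi * p) / freqs
            err = np.abs(itd[:, None] - itd_ref).sum(axis=0)
            theta_itd = thetas[np.argmin(err)]
            if best is None or abs(theta_itd - theta_ild) < best[0]:
                best = (abs(theta_itd - theta_ild), err)
        # Accumulate the joint (ILD + ITD) error across frames (time average).
        joint_err += ild_err + best[1]
    return thetas[np.argmin(joint_err)]
```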

IV. CEPSTRAL PREFILTERING

Cepstral prefiltering was shown to be effective in reducing the effects of reverberation on received signals [9], [10]. The complex cepstrum of the signal $x[n]$ arriving at the generic ear is

$$\hat{x}[q] = \mathcal{F}^{-1}\{\log(\mathcal{F}\{x[n]\})\} \quad (7)$$

In this formula $\mathcal{F}\{x[n]\} = X(\omega)$ is the Fourier transform of $x[n]$, $\log(\cdot)$ is the complex logarithm operator, $\mathcal{F}^{-1}\{\cdot\}$ is the inverse Fourier transform and $q$ is the quefrency. Convolution of two signals in the time domain corresponds to an addition in the quefrency domain, so that application of the cepstral transformation to (1) and (2) leads to

$$\hat{x}[q] = \hat{h}[q] + \hat{s}[q] + \hat{e}[q] \quad (8)$$

where $\hat{h}[q]$ and $\hat{s}[q]$ are the cepstra of the impulse response and of the source signal respectively. The term $\hat{e}[q]$ represents the contribution of the additive noise term and is given by

$$\hat{e}[q] = \mathcal{F}^{-1}\left\{\log\left(1 + \frac{V(\omega)}{H(\omega)S(\omega)}\right)\right\} \quad (9)$$

where $H(\omega)$, $S(\omega)$ and $V(\omega)$ are the Fourier transforms of $h[n]$, $s[n]$ and $v[n]$ respectively. In most practical applications the background noise can be assumed to be low enough that $v[n]$ and its cepstrum can be neglected [10]. In the cepstral domain the global system impulse response can be written as the sum of a minimum phase component (MPC) $\hat{h}_{mp}[q]$ and an all pass component (APC) $\hat{h}_{ap}[q]$ [7]. Equation (8) becomes

$$\hat{x}[q] = \hat{h}_{mp}[q] + \hat{h}_{ap}[q] + \hat{s}[q] \quad (10)$$

The assumption justifying the use of cepstral prefiltering is that the MPC of the source signal cepstrum varies from frame to frame and is zero-mean, while the MPC of the room impulse response is slowly varying and can be estimated by averaging through time [10]. Generally only a few frames are sufficient for convergence [9]. The final estimate of the MPC of the channel, $\hat{h}_{mp}[q]$, is then subtracted from the received signal cepstrum $\hat{x}[q]$, which after filtering is transformed back to the time domain. In particular, the computation of the cepstral transform of each frame is preceded by the application of an exponential window $w[n] = \beta^n$, $0 < \beta < 1$, $n = 0, \ldots, N-1$, to the received data, $N$ being the frame size. The objective is to move the poles and zeros of the system function towards the interior of the unit circle, thus increasing the weight of the MPC with respect to the APC [10].

V. STEPS OF THE ALGORITHM

In summary, the azimuth estimation process is organized in the following steps (a sketch of the prefiltering stage is given after the list).
1) Apply the exponential window to each frame of the two signals $x_L[n]$ and $x_R[n]$.
2) Compute the MPC of the received signal cepstrum at each frame.
3) Average the MPC through successive signal frames to get an estimate of $\hat{h}_{mp}[q]$.
4) Subtract the estimate of $\hat{h}_{mp}[q]$ from the cepstrum of each signal frame.
5) Transform back to the time domain and apply the inverse of the exponential window.
6) Apply the azimuth estimation process based on joint ILD-ITD estimation.
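A minimal Python sketch of steps 1)-5) follows. It computes the MPC of each windowed frame from the real cepstrum through its causal fold and removes the averaged channel MPC by spectral division, which is equivalent to the cepstral-domain subtraction described above; the function name, the value of $\beta$ and the numerical safeguards are assumptions, not details from the letter.

```python
import numpy as np

def cepstral_prefilter(frames, beta=0.995):
    """Simplified sketch of cepstral prefiltering (steps 1-5 of Section V).

    `frames` is an (M, N) array of time frames from one ear signal. Each frame
    is exponentially weighted, the MPC of the channel is estimated by averaging
    the folded real cepstrum over frames, and each frame spectrum is divided by
    the minimum-phase channel estimate before going back to the time domain.
    """
    M, N = frames.shape
    w = beta ** np.arange(N)                 # exponential window, pushes poles/zeros inward
    spectra = np.fft.fft(frames * w, axis=1)
    log_mag = np.log(np.abs(spectra) + 1e-12)

    # Real cepstrum of each frame and its causal fold (the MPC cepstrum).
    ceps = np.fft.ifft(log_mag, axis=1).real
    fold = np.zeros(N)
    fold[0] = 1.0
    fold[1:(N + 1) // 2] = 2.0               # double the causal part, zero the anti-causal part
    if N % 2 == 0:
        fold[N // 2] = 1.0
    mpc = ceps * fold

    # Channel MPC estimate: average over frames (the source MPC is assumed zero-mean).
    h_mpc = mpc.mean(axis=0)
    H_mp = np.exp(np.fft.fft(h_mpc))         # minimum-phase channel spectrum

    # Remove the channel MPC, return to the time domain and undo the window.
    filtered = np.fft.ifft(spectra / H_mp, axis=1).real
    return filtered / w
```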


Fig. 1. Localization results for a female voice in a central azimuth position at different reverberation times. For each reverberation time, results are shown with no prefiltering on the left side and with cepstral prefiltering on the right side.

VI. EXPERIMENTAL RESULTS

Experiments were realized by simulating a rectangular room with the image method [11]. Different reverberation times were considered, up to 700 ms.¹ The head was placed inside the room and a female voice was used as the source signal, positioned at different azimuth angles with respect to the head, while keeping the elevation angle at zero degrees. The source distance was set to one meter, as done in the CIPIC database [8]. The reference KEMAR head impulse responses were used (subject 21 in the CIPIC HRIR database [8]). The sampling frequency was the same as in the CIPIC database. Cepstral prefiltering was performed on 12 ms time frames, using an exponential window as indicated in [10].

¹The reverberation time is defined as the time needed for the sound energy to decay by 60 dB with respect to its initial value [6].
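As a usage illustration only, the fragment below shows how the sketches of the previous sections could be chained on a pair of simulated ear signals; the sampling rate (assumed here to be the 44.1 kHz CIPIC rate), the non-overlapping 12 ms framing and the lookup sets ild_ref, itd_ref, thetas are assumptions made for the example.

```python
import numpy as np

fs = 44100              # assumed sampling rate (CIPIC rate); not stated explicitly above
N = int(0.012 * fs)     # 12 ms frames, as in Section VI

def frame_signal(x, N):
    """Split a signal into non-overlapping frames of length N (the hop size is
    an assumption; the letter does not specify it)."""
    M = len(x) // N
    return x[:M * N].reshape(M, N)

# Hypothetical end-to-end use on two ear signals xL, xR (e.g., from binaural_mixture):
#   fL = cepstral_prefilter(frame_signal(xL, N))
#   fR = cepstral_prefilter(frame_signal(xR, N))
#   XL = np.fft.rfft(fL, axis=1).T[1:]                      # (F, M) STFT, DC bin dropped
#   XR = np.fft.rfft(fR, axis=1).T[1:]
#   freqs = 2 * np.pi * np.fft.rfftfreq(N, d=1.0 / fs)[1:]  # angular frequencies
#   theta_hat = estimate_azimuth(XL, XR, ild_ref, itd_ref, freqs, thetas)
```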

Fig. 2. Histograms of localization results at a lateral azimuth, at different reverberation times, (a) without and (b) with cepstral prefiltering.

Fig. 1 shows the results obtained without and with cepstral prefiltering, at different reverberation times, for a source positioned in a central position. In the figure, each column shows the detected azimuth as a function of frequency, obtained from top to bottom by ILD only, ITD only and joint use of ILD and ITD. Darkest regions indicate the most likely angles. The improvements obtained with


cepstral prefiltering can be seen, especially for higher reverberation times. Figs. 2 and 3 illustrate the results obtained by the joint method for two sources placed at more lateral azimuth angles. The histograms of the estimated azimuth angles are shown for reverberation times up to 700 ms. It is clear that localization performance is in any case worse for more lateral sources. The figures confirm the effectiveness of cepstral prefiltering in the presence of reverberation. It should be noted that for lateral positions of the source the efficacy of cepstral prefiltering is seriously hampered when the reverberation time is higher than 500 ms.

VII. CONCLUSION

Binaural source localization can be performed by using ILD and ITD jointly. The presence of reverberation, which is typical of closed environments, can limit the quality of this solution. In this work a method to limit the adverse effect of reverberation on binaural signals was described. The proposed approach is based on cepstral filtering of the acquired signals prior to azimuth estimation. It was shown that the performance of joint ITD-ILD azimuth estimators can be improved, even for lateral positions of the source.

REFERENCES

Fig. 3. Histograms of localization results at a second, more lateral azimuth, at different reverberation times, (a) without and (b) with cepstral prefiltering.

[1] M. Raspaud, H. Viste, and G. Evangelista, "Binaural source localization by joint estimation of ILD and ITD," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 1, pp. 68-77, 2010.
[2] J. Blauert, Spatial Hearing—The Psychophysics of Human Sound Localization. Cambridge, MA: MIT Press, 1996.
[3] D. L. Wang and G. J. Brown, Computational Auditory Scene Analysis—Principles, Algorithms, and Applications. Piscataway, NJ: IEEE Press/Wiley Interscience, 2006.
[4] C. Faller and J. Merimaa, "Source localization in complex listening situations: Selection of binaural cues based on interaural coherence," J. Acoust. Soc. Amer., vol. 116, no. 5, pp. 3075-3089, 2004.
[5] C. Zannini, R. Parisi, and A. Uncini, "Binaural sound source localization in the presence of reverberation," in Proc. 17th Int. Conf. Digital Signal Processing (DSP 2011), Jul. 2011.
[6] H. Kuttruff, Room Acoustics. London, U.K.: Spon, 1999.
[7] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 1989.
[8] V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano, "The CIPIC HRTF database," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA'01), 2001.
[9] R. Parisi, R. Gazzetta, and E. Di Claudio, "Prefiltering approaches for time delay estimation in reverberant environments," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP'02), 2002, vol. 3, pp. 2997-3000.
[10] A. Stéphenne and B. Champagne, "A new cepstral prefiltering technique for estimating time delay under reverberant conditions," Signal Process., vol. 59, no. 3, pp. 253-266, 1997.
[11] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," J. Acoust. Soc. Amer., vol. 65, pp. 943-950, Apr. 1979.