IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 8, AUGUST 2013


Sound Source Distance Estimation in Rooms based on Statistical Properties of Binaural Signals Eleftheria Georganti, Tobias May, Steven van de Par, and John Mourjopoulos

Abstract—A novel method for the estimation of the distance of a sound source from binaural speech signals is proposed. The method relies on several statistical features extracted from such signals and their binaural cues. Firstly, the standard deviation of the difference of the magnitude spectra of the left and right binaural signals is used as a feature. In addition, an extended set of statistical features that can improve distance detection is extracted from an auditory front-end which models the peripheral processing of the human auditory system. The method incorporates the above features into two classification frameworks based on Gaussian mixture models and support vector machines, and the relative merits of those frameworks are evaluated. The proposed method achieves distance detection when tested in various acoustical environments and performs well in unknown environments. Its performance is also compared to an existing binaural distance detection method.

Index Terms—Binaural distance estimation, room transfer functions, spectral standard deviation.

I. INTRODUCTION

Manuscript received September 03, 2012; revised February 05, 2013; accepted April 01, 2013. Date of publication April 25, 2013; date of current version May 09, 2013. This work was supported in part by the European Union (European Social Fund—ESF) and Greek national funds through the Operational Program "Education and Lifelong Learning" of the National Strategic Reference Framework (NSRF)—Research Funding Program: Heracleitus II. Investing in knowledge society through the European Social Fund. The associate editor coordinating the review of this manuscript and approving it for publication was Mr. James Johnston. E. Georganti and J. Mourjopoulos are with the Audio and Acoustic Technology Group, Wire Communications Laboratory, Electrical and Computer Engineering Department, University of Patras, 26500 Patras, Greece (e-mail: [email protected]). T. May is with the Centre for Applied Hearing Research, Department of Electrical Engineering, Technical University of Denmark, DK-2800, Kgs. Lyngby, Denmark. S. van de Par is with the Institute of Physics, University of Oldenburg, 26111 Oldenburg, Germany. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TASL.2013.2260155

Sound source localization and distance detection have a broad range of applications, such as intelligent hearing aid devices [1], auditory scene analysis [2], [3] and hands-free communication systems [4]–[6]. Distance detection problems are usually addressed with the use of microphone arrays, but in many applications only two microphones are available (e.g. two microphones placed in an artificial head), which makes the task of distance detection more demanding. More specifically, knowledge of the actual distance between the source and receiver can be advantageous for various audio and speech applications such as denoising, dereverberation [7] and sound source separation, where the relative distances between the microphones and the sound source(s) are generally not known. In real-life scenarios, such distance detection applications (e.g. wearable mobile devices) should be able to perform well in unknown acoustical environments. In such cases, a training stage cannot take place in the specific rooms in advance, indicating the need for novel methods that can predict distance in spaces with unknown acoustical properties. This work aims in this direction by proposing a method that is able to detect distance in rooms where the system has not been trained in advance.

The problem of absolute distance detection becomes more difficult inside enclosed spaces, where reverberation can be a significant component of the received signals. Hence, the problem is also closely related to the estimation of the direct-to-reverberant ratio (DRR), as has been highlighted in several studies [8]–[10]. The DRR can typically be extracted from a measured room impulse response (RIR), but in practice RIRs are not always available, since intrusive measurements within the rooms are required. For this reason, several methods have been developed recently that can blindly estimate the DRR from the reverberant signals. Lu et al. [8] extract the DRR from binaural signals by segregating the energy arriving from the estimated direction of the direct source, assuming that reverberant components result in a spatially diffuse field. This method can also be used as a distance estimator, but it is designed to operate for distances above 2 m and requires knowledge of the room's reverberation time. Recently, Hioka et al. [9] introduced a method for DRR estimation using a direct and reverberant sound spatial correlation matrix model. This method was also utilized for measuring the absolute distance and the results show that it can be used for a restricted close range of distances.
However, this method has been mainly designed for microphone arrays, which usually consist of more than two microphones and where the impact of the Head Related Transfer Function (HRTF) is not considered. More recently, a method for the DRR estimation has been proposed in [10], where an analytical relationship between the DRR and the magnitude-squared coherence is introduced. The method does not require any special measurement signals or training data and the mean DRR estimation error was found to be higher for low frequencies and lower for higher frequencies. However, this method was not designed specifically as a distance estimator, thus the calculation of the absolute distance from the DRR would require knowledge of either the critical distance or the reverberation time and volume of the room. Other methods for absolute distance estimation using features related to the DRR have been proposed in [11] and [12]. In [11], the cross-spectra of the signals recorded by two microphones are



used in order to recognize the position of a sound source. This method was later improved by Vesa [12] in order to account for positions that have the same azimuth angle, using the magnitude squared coherence between the left and right ear signals as a feature for the training of a Gaussian maximum likelihood scheme for distance detection.

Because the human auditory system is capable of analyzing the spatial characteristics (distance, reverberation, orientation angle, etc.) of complex acoustic and reverberant environments by exploiting the interaural differences between the two ears [13], most of the aforementioned DRR and distance estimation methods aim at mimicking, to some degree, processing stages of the human auditory system. These binaural cues are the interaural time differences (ITDs), interaural level differences (ILDs) and the interaural coherence (IC): the ILD is the level difference between the sounds arriving at the two ears, the ITD is the difference in the arrival time of a sound between the two ears and the IC expresses the strength of correlation between the left and right ear signals. These binaural cues are affected by the reverberant energy as a function of the listener's location in a room and the source location relative to the listener [14]–[16]. More specifically, reverberant energy decreases the magnitude of ILDs and this effect depends on the actual distance between source and listener [14]. In addition, ITD fluctuations across time depend on the amount of reverberation in the signal [15], and the IC is directly related to the DRR [10], [17]. Thus, the analysis of binaural cues can assist distance detection methods, and these well-established features will be exploited by the proposed method. In this study, a novel feature for the detection of distance is proposed, based solely on the received binaural signals.
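As a toy illustration of the three cues named above (not the paper's auditory-model implementation), the following sketch estimates ILD, ITD and IC for a hypothetical right-ear signal that is simply a delayed and attenuated copy of the left-ear signal; the signal, delay and gain are invented for the example:

```python
import math
import random

def binaural_cues(left, right, fs, max_lag_ms=1.0):
    """Estimate ILD (dB), ITD (s) and IC from a pair of ear signals.

    ILD: RMS level difference in dB; ITD: lag maximising the normalized
    cross-correlation; IC: that maximum correlation value.
    """
    rms = lambda x: math.sqrt(sum(v * v for v in x) / len(x))
    ild = 20.0 * math.log10(rms(left) / rms(right))

    max_lag = int(max_lag_ms * 1e-3 * fs)
    norm = math.sqrt(sum(v * v for v in left) * sum(v * v for v in right))
    best_lag, ic = 0, -1.0
    for lag in range(-max_lag, max_lag + 1):
        c = sum(left[n] * right[n - lag]
                for n in range(len(left)) if 0 <= n - lag < len(right))
        if c / norm > ic:
            ic, best_lag = c / norm, lag
    return ild, best_lag / fs, ic

# Right ear: the left signal delayed by 10 samples and attenuated by 6 dB.
fs = 16000
random.seed(0)
sig = [random.gauss(0.0, 1.0) for _ in range(2000)]
delay, gain = 10, 10 ** (-6.0 / 20.0)
left = sig
right = [gain * sig[n - delay] if n >= delay else 0.0 for n in range(len(sig))]
ild, itd, ic = binaural_cues(left, right, fs)
```

With this construction the estimator recovers an ILD of about 6 dB, an ITD of 10 samples and an IC close to one; in a reverberant room these values fluctuate over time, which is what the statistical features of Section III exploit.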
The proposed method does not require a priori knowledge of the room impulse response, the reverberation time or any other acoustical parameter, and relies on a set of novel features extracted from the reverberant binaural signals. The features are incorporated into two different classification frameworks based on Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs), and the relative merits of those two frameworks are evaluated. For this method, a novel distance estimation feature is introduced, exploiting the standard deviation of the difference of the magnitude spectra of the left and right binaural signals (termed here the Binaural Spectral Magnitude Difference Standard Deviation (BSMD STD)). This feature is shown here to be related to the statistics of the corresponding room transfer function (RTF). The RTF magnitude spectral standard deviation has been extensively studied in the past [18] and it has been shown that it is highly correlated with the source/receiver distance. In this direction, the current study also introduces novel relationships and extends previous findings regarding the single-channel RTF spectral standard deviation and its dependence on the source/receiver distance to binaural room transfer functions (BRTFs). Moreover, an extended and novel set of additional features based on the statistical properties of binaural cues (ILDs, ITDs, ICs) is extracted from an auditory front-end [19] which models the peripheral processing of the human auditory system. The proposed BSMD STD feature along with the additional

binaural features are incorporated into two classification frameworks for distance detection. The work also employs two different feature selection approaches in order to identify which of the above features are the most important for the proposed distance estimation method (more details will be given in Section IV-A). The selected features are used by the distance detection method and the method is evaluated under realistic and variable test conditions. Aspects such as the number of selected features, the number of distance classes (detection resolution) and the ability of the algorithm to generalize to unknown acoustical environments are discussed. The method is compared to the distance learning method previously developed by Vesa [12] and the ability of both methods to perform in unknown environments is discussed.

This paper is organized as follows. Given the significance of room reverberation for distance detection, Section II discusses the dependence of the spectral standard deviation of RTFs on distance, based on the existing literature [18], [20], [21] and on previous work by the authors [22]. In Section III, the proposed features are presented: in Section III-A the novel BSMD STD feature is introduced and in Section III-B the set of binaural features based on the statistical properties of the binaural cues (ITD, ILD and IC) is presented. Section IV describes the complete implementation of the distance detection model based on the extracted features described in Section III. In Section V, the experimental results on distance detection are presented and the proposed method is compared with the distance learning method developed by Vesa [12] in Section V-C. Finally, the paper concludes with a summary of the present work.

II. SPECTRAL STANDARD DEVIATION OF ROOM TRANSFER FUNCTIONS

As already mentioned, distance detection in rooms is substantially affected by the RTF properties. The analysis of RTFs, for frequencies where there is high modal overlap, relies on the early statistical model developed by Schroeder [18]. Assuming a sine wave excitation which is recorded with a microphone located in the room's reverberant field (beyond the critical distance), the recorded signal consists of the sum of contributions of a large number of modes, where the real and the imaginary parts of the complex sound pressure can be considered as independent Gaussian processes with the same variance. This two-dimensional Gaussian density arises from the central limit theorem, assuming independence between the modes, and implies that the magnitude frequency response follows a Rayleigh distribution. These statistical properties are valid irrespective of the microphone position (assuming that the direct sound contribution is negligible compared to the reverberant energy) or the room dimensions and properties, as long as the frequencies are above Schroeder's frequency [23]. In the work of Ebeling [20], the dependence of the RTF spectral standard deviation on the DRR has been derived mathematically as

$\sigma_n = \frac{\sqrt{1 + 2\,\mathrm{DRR}}}{1 + \mathrm{DRR}}$ (1)
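To illustrate this dependence numerically, the following sketch assumes the classic statistical-acoustics forms: a normalized spectral standard deviation $\sqrt{1 + 2\,\mathrm{DRR}}/(1 + \mathrm{DRR})$, $\mathrm{DRR} = 1/\delta^2$ with $\delta$ the distance in critical-distance units, and the textbook critical-distance formula; the room volume and reverberation time are invented for the example:

```python
import math

def critical_distance(volume_m3, rt60_s):
    """Critical distance r_c = sqrt(0.161*V / (16*pi*T60)), i.e. ~0.057*sqrt(V/T60)."""
    return math.sqrt(0.161 * volume_m3 / (16.0 * math.pi * rt60_s))

def normalized_rtf_std(distance_m, volume_m3, rt60_s):
    """Normalized RTF spectral standard deviation as a function of distance."""
    delta = distance_m / critical_distance(volume_m3, rt60_s)  # distance in r_c units
    drr = 1.0 / delta ** 2                                     # direct-to-reverberant ratio
    return math.sqrt(1.0 + 2.0 * drr) / (1.0 + drr)

# In a 200 m^3 room with T60 = 0.5 s (critical distance ~1.13 m), the
# normalized std rises with distance and saturates toward 1 beyond r_c.
rc = critical_distance(200.0, 0.5)
sigmas = [normalized_rtf_std(d, 200.0, 0.5) for d in (0.25, 0.5, 1.0, 2.0, 4.0)]
```

The monotonic rise and saturation of these values is exactly the behavior plotted in Fig. 1, and it is what makes the spectral standard deviation usable as a distance cue up to roughly the critical distance.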


Note that the term $\sigma_n$ refers to the normalized spectral standard deviation of the RTF, taking a maximum value of one. In addition, according to [21], the DRR is equal to

$\mathrm{DRR} = 1/\delta^2$ (2)

where $\delta$ is the distance between the sound source and receiver given in critical distance units as

$\delta = r/r_c$ (3)

where $r$ is the actual source/receiver distance (in m) and $r_c$ is the critical distance of the room, given by [24]:

$r_c = \sqrt{\gamma V / (16 \pi T_{60})}$ (4)

where $V$ is the volume of the room, $\gamma$ is a constant equal to 0.161 and $T_{60}$ (sec) is the reverberation time of the room. Combining (1) and (2) leads to

$\sigma_n = \delta \sqrt{\delta^2 + 2} / (\delta^2 + 1)$ (5)

Equation (5) indicates the systematic dependency of the RTF spectral standard deviation on the distance between source and receiver. This dependency of the spectral magnitude standard deviation of the RTF as a function of the distance between source and receiver, given in critical distance units, is plotted in Fig. 1. The curve is plotted according to the model (the actual values can be found in [21]) and it is evident that the RTF spectral standard deviation depends on the distance between source and receiver up to the room's critical distance value.

The standard deviation of RTFs can be calculated for the full frequency range (i.e. the audible range) or for specific sub-bands, i.e. using fractional-octave bandwidths. Here, the RTF standard deviation for a specific frequency band is defined as

$\sigma_H = \sqrt{ \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left( H(k) - \bar{H} \right)^2 }$ (6)

where $H(k)$ is the RTF magnitude spectrum in dB, $k_1$ and $k_2$ define the frequency range of interest, $k$ is the discrete frequency bin and $\bar{H}$ is the spectral magnitude mean for the specific frequency band, given by

$\bar{H} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} H(k)$ (7)

Fig. 1. Spectral magnitude standard deviation of the RTF as a function of the distance between source and receiver given in critical distance units. This figure has been plotted according to [21].

In previous work by the authors [22], the spectral magnitude standard deviation of RTFs measured at various positions within different rooms of various reverberation times was examined. It was shown that the standard deviation values increase with distance and converge to values close to the expected 5.57 dB when using a fractional octave band analysis (e.g. 1/3-octave) for frequencies above Schroeder's frequency [23].

III. BINAURAL FEATURES

Here, the extraction of the proposed binaural features for distance detection is described. Fig. 2 demonstrates the overall procedure of the binaural distance detection method, consisting of binaural feature extraction, feature selection and classification. In this section, the first block, the binaural feature extraction procedure, is described in detail. More details on the feature selection and classification blocks will be given in Section IV.

A. Spectral Standard Deviation Feature

In this section, we discuss how the previously described statistical analysis of RTFs can be related and extended to the statistical properties of BRTFs. Based on this analysis, a novel feature for distance detection is proposed. Let us assume an anechoic signal $s(t)$ that is reproduced at a certain position within the room and is binaurally recorded using a pair of microphones attached to the ears of a manikin. The left and the right ear recordings, denoted by $x_L(t)$ and $x_R(t)$, can be written as:

$x_L(t) = s(t) \ast h_L(t)$ (8)
$x_R(t) = s(t) \ast h_R(t)$ (9)

where $h_L(t)$ and $h_R(t)$ are the corresponding left and right ear BRIRs at the corresponding source/receiver positions. The magnitude spectra of the binaural signals, $X_L(k)$ and $X_R(k)$, in dB are given by

$X_L(k) = 20 \log_{10} \left| \mathcal{F}\{ x_L(t) \} \right|$ (10)
$X_R(k) = 20 \log_{10} \left| \mathcal{F}\{ x_R(t) \} \right|$ (11)


Fig. 2. Block diagram demonstrating the binaural feature extraction, feature selection and classification procedure based on the binaural signals.

where $k$ corresponds to the frequency index. Similarly, the magnitude spectra of the left and right ear BRIRs, $H_L(k)$ and $H_R(k)$, and of the anechoic signal, $S(k)$, in dB are given by

$H_L(k) = 20 \log_{10} \left| \mathcal{F}\{ h_L(t) \} \right|$ (12)
$H_R(k) = 20 \log_{10} \left| \mathcal{F}\{ h_R(t) \} \right|$ (13)
$S(k) = 20 \log_{10} \left| \mathcal{F}\{ s(t) \} \right|$ (14)

Applying the Fourier transform to (8) and (9) and using (10)–(14), it can be written:

$X_L(k) = S(k) + H_L(k)$ (15)
$X_R(k) = S(k) + H_R(k)$ (16)

Denoting the long-term (i.e. 2 sec) magnitude spectrum difference of the two signals between the two ears as

$D(k) = X_L(k) - X_R(k)$ (17)

and combining (15), (16) and (17), it yields that:

$D(k) = H_L(k) - H_R(k)$ (18)

where $D(k)$ denotes the left and right binaural signal difference. Given that $H_L(k)$ and $H_R(k)$ in (18) express the difference of a pair of random variables, a relationship for the variance value of their probability distributions can be derived as [25]:

$\sigma_D^2 = \sigma_{H_L}^2 + \sigma_{H_R}^2 - 2\,\mathrm{cov}(H_L, H_R)$ (19)

where $\sigma_D^2$ corresponds to the variance of $D(k)$. Similarly, the corresponding standard deviation (STD) values [25] are also related as

$\sigma_D = \sqrt{ \sigma_{H_L}^2 + \sigma_{H_R}^2 - 2\,\mathrm{cov}(H_L, H_R) }$ (20)

where $\sigma_{H_L}$ and $\sigma_{H_R}$ correspond to the spectral standard deviation values of the BRTFs, $H_L(k)$ and $H_R(k)$. Eq. (20) implies that the standard deviation of the difference of the magnitude spectra of the left and right binaural signals depends on the standard deviation values of the corresponding BRTFs and their covariance. The value of $\sigma_D$ is the BSMD STD feature.

Therefore, it is now of interest to discuss whether the spectral standard deviation values of the BRTFs, $\sigma_{H_L}$ and $\sigma_{H_R}$ (see (20)), present a distance-dependent behavior similar to that of the corresponding single-channel RTF spectral standard deviation discussed in Section II. In [26], it has been shown that the spectral standard deviation value of a BRTF is related to the spectral standard deviation of the corresponding RTF measured at the same position without the presence of the head as:

$\sigma_{\mathrm{BRTF}} = \sigma_{\mathrm{RTF}} + \varepsilon$ (21)

where $\varepsilon$ is a positive constant that mainly depends on the orientation angle with respect to the source, but also on the reverberation time and the source/receiver distance. The factor $\varepsilon$ cannot be readily deduced by analytical methods and such a study is beyond the scope of the present work (see [27] for a preliminary model of the effect of a reflection on the anechoic HRTF). For this reason, the effect of this additional "binaural factor" on the RTF statistics will be experimentally deduced here via dummy head measurements taken for different source/receiver distances in various rooms. The measurement details will be given in Section IV-C. These measurements have indicated that the factor $\varepsilon$ is in the range of 1.4 dB. In Fig. 3, the standard deviation of the spectral magnitude of the RTF and the BRTF (left & right channel) is plotted using the squares and the black and gray circles, respectively. The black solid line illustrates Jetzt's experimental values (taken from [21]) corresponding to the omnidirectional microphone receivers. The gray line illustrates the same values shifted by 1.4 dB in order to conform to (21). Note that the value of 1.4 dB corresponds to the mean increase of the spectral standard deviation values for the BRTFs calculated from all the available measurements corresponding to various source/receiver distances in various rooms. It is evident therefore, that the presence of the


Fig. 3. Spectral standard deviation of RTFs and BRTFs (left & right) measured at the same positions. The black solid line shows Jetzt’s model [21] and the gray line shows Jetzt’s model shifted by 1.4 dB.

pinna and head increases the standard deviation values of the corresponding RTFs. Combining now (20) and (21), it yields that

$\sigma_D = \sqrt{ (\sigma_L + \varepsilon_L)^2 + (\sigma_R + \varepsilon_R)^2 - 2\,\mathrm{cov}(H_L, H_R) }$ (22)

where $\varepsilon_L$ and $\varepsilon_R$ are the corresponding constants for the left and right BRTFs and $\sigma_L$ and $\sigma_R$ are the standard deviation values of the left and right single-channel RTFs. The BSMD STD of the left and right binaural signals can be calculated according to (6) and (7) as

$\sigma_D = \sqrt{ \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} \left( D(k) - \bar{D} \right)^2 }$ (23)

where $k_1$ and $k_2$ define the frequency range of interest, $k$ is the discrete frequency bin and $\bar{D}$ is the spectral magnitude mean for the specific frequency band, given by

$\bar{D} = \frac{1}{k_2 - k_1 + 1} \sum_{k=k_1}^{k_2} D(k)$ (24)

From now on, this novel differential spectrum feature will be referred to as BSMD STD and is employed in the proposed feature extraction procedure, as can be seen in Fig. 2. The proposed feature is expected to present distance-dependent values according to the previous discussion and the analysis in Section II. This dependence has been confirmed experimentally using several signals and for various source/receiver positions within various rooms, and here some indicative results are presented. Fig. 4 shows such typical results for the BSMD STD feature extracted from speech signals as a function of time and the corresponding histograms for the same feature. For these examples, the room had a reverberation time of 0.12 sec and the measurements were taken at different source/receiver distances (0.5 m, 1 m, 1.5 m). It is evident that the proposed feature values present a clear distance-dependent behavior and there is not


Fig. 4. (a) The proposed BSMD STD feature extracted from speech signals recorded at different source/receiver distances in a small room with a reverberation time of 0.12 sec and (b) the corresponding histogram of the extracted BSMD STD values.

much overlap among the three distance classes. Nevertheless, given the well-established relation of other binaural cues to distance perception, the proposed feature is combined with additional features derived from such binaural cues, as will be discussed in the next section.

B. Binaural Features Based on Statistical Properties of Binaural Cues

In this work, an auditory front-end which models the peripheral processing stages of the human auditory system [19] is used. First, the acoustic signal is split into 8 auditory channels with center frequencies equally spaced on the equivalent rectangular bandwidth (ERB)-rate scale using a fourth-order Gammatone filterbank. Note here that by choosing a number of 8 auditory channels, the bandwidth of the individual filters will be larger, which in turn is expected to increase the robustness of the estimated binaural cues, while maintaining the frequency-specific characteristics. The system was also tested using a higher number of auditory channels (16 & 32), but performance was not very sensitive to the particular choice of the number of auditory filters. For this reason, 8 channels are used in the current study. More specifically, phase-compensated Gammatone filters are employed in order to synchronize the analysis of binaural cues across all frequency channels at a given time point [28]. Then, the neural transduction process in the inner hair cells is approximated by halfwave rectification and square-root compression. Based on this representation, the binaural cues of ITD, ILD and IC are estimated for short frames of 20 ms for each Gammatone channel. The ITD is defined as the time delay of the arrival of sound between the left and right ears, and is estimated by computing the normalized cross-correlation function for time lags within the range of [−1, 1] ms.
The time lag which corresponds to the maximum in the normalized cross-correlation function reflects the estimated time delay in samples. Furthermore, the maximum value of the cross-correlation function indicates the strength of correlation between the left and right signals. The ILD is given by the level difference between the


two ears. The estimated binaural cues for the $i$-th Gammatone channel and the $k$-th time frame are denoted by $\phi_\lambda(i,k)$, where $\lambda$ refers to each respective binaural cue:

$\lambda \in \{ \mathrm{ITD}, \mathrm{ILD}, \mathrm{IC} \}$ (25)

Based on the binaural analysis of the auditory front-end, a set of features is extracted to capture the statistical properties of the estimated binaural cues. The feature extraction procedure is depicted in Fig. 2. Here, it should be noted that since the ILDs and ITDs are inherently dependent on the sound source direction, the statistical quantities are chosen such that they do not reflect this direction-dependent behavior.

Standard deviation (STD): The standard deviation extracted from a binaural cue is defined here as

$\mathrm{STD}_\lambda(i) = \sqrt{ \frac{1}{N} \sum_{k=1}^{N} \left( \phi_\lambda(i,k) - \bar{\phi}_\lambda(i) \right)^2 }$ (26)

where $i$ is the index of the Gammatone channel, $k$ indexes the frame, $N$ is the total number of frames and $\bar{\phi}_\lambda(i)$ is the average of $\phi_\lambda(i,k)$ over frames, given by

$\bar{\phi}_\lambda(i) = \frac{1}{N} \sum_{k=1}^{N} \phi_\lambda(i,k)$ (27)

Variance (VAR): The variance extracted from a binaural cue is given by

$\mathrm{VAR}_\lambda(i) = \frac{1}{N} \sum_{k=1}^{N} \left( \phi_\lambda(i,k) - \bar{\phi}_\lambda(i) \right)^2$ (28)

Kurtosis (KURT): The kurtosis extracted from a binaural cue is given by (29).

Skewness (SKEW): The skewness extracted from a binaural cue is given by (30).

Average deviation (ADEV): The average deviation extracted from a binaural cue is given by

$\mathrm{ADEV}_\lambda(i) = \frac{1}{N} \sum_{k=1}^{N} \left| \phi_\lambda(i,k) - \bar{\phi}_\lambda(i) \right|$ (31)

Percentile width (PWIDTH): Ludvigsen [29] proposed a set of statistical features based on the percentile properties and their relationships. The $m$-th percentile of a binaural cue for the $i$-th Gammatone channel, denoted $P_m(i)$, is that value of the binaural cue that corresponds to a cumulative frequency of $mN/100$, where $N$ is the total number of frames. The width of the histogram is well described by the distance between an upper and a lower percentile as

$\mathrm{PWIDTH}_\lambda(i) = P_{m_u}(i) - P_{m_l}(i)$ (32)

Percentile symmetry (PSYM): The symmetry of a histogram can be investigated by looking at the difference of the percentiles (33). This property is near zero for symmetrical distributions, positive for left-sided distributions, and negative for right-sided distributions.

Percentile skewness (PSKEW): The percentile skewness (34) is defined as the difference between the 50% percentile and the approximated median. For asymmetrical distributions this difference should be large, for symmetrical distributions approximately zero.

Percentile kurtosis (PKURT): The percentile kurtosis corresponds to the approximation (35), which indicates whether the distribution has a narrow or broad peak.

Lower half percentile (PLHALF): The lower half percentile feature is expressed as (36) and indicates whether the histogram is right-sided.
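The block-based feature extraction described in this section, together with the BSMD STD of Section III-A, can be sketched as follows; the percentile pair used for the width feature (90% and 10%) is illustrative, since the paper's exact percentile choices are not stated in this excerpt:

```python
import math

def bsmd_std(spec_left_db, spec_right_db):
    """BSMD STD: std of the left/right magnitude-spectrum difference (dB)."""
    d = [l - r for l, r in zip(spec_left_db, spec_right_db)]
    mean = sum(d) / len(d)
    return math.sqrt(sum((x - mean) ** 2 for x in d) / len(d))

def percentile(values, m):
    """m-th percentile via nearest rank on the sorted sample (0 <= m <= 100)."""
    s = sorted(values)
    idx = min(len(s) - 1, int(round(m / 100.0 * (len(s) - 1))))
    return s[idx]

def cue_statistics(cue):
    """STD, ADEV and a percentile-width descriptor of one binaural-cue track."""
    mean = sum(cue) / len(cue)
    std = math.sqrt(sum((x - mean) ** 2 for x in cue) / len(cue))
    adev = sum(abs(x - mean) for x in cue) / len(cue)
    pwidth = percentile(cue, 90) - percentile(cue, 10)  # illustrative percentiles
    return {"STD": std, "ADEV": adev, "PWIDTH": pwidth}

# Identical spectra at both ears give a BSMD STD of exactly zero.
flat = bsmd_std([0.0, 3.0, -2.0], [0.0, 3.0, -2.0])
stats = cue_statistics([0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35])
```

Because the statistics are computed over frames within each block, absolute cue values (which encode direction) are largely averaged out while their spread, which grows with reverberant energy and hence distance, is retained.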

IV. DISTANCE DETECTION METHOD

The proposed distance detection method is based on the set of binaural features described in the previous section. The method employs two feature selection algorithms (see Section IV-A) and two classification techniques (GMMs, SVMs) in order to model and evaluate the degree of distance dependency of the features. The complete procedure of the distance detection method can be seen in Fig. 2.

A. Feature Selection Algorithm

Feature selection algorithms are widely used in pattern recognition and machine learning to identify the features that are relevant for a particular classification task. In this work, two different feature selection approaches have been tested in order to extract the most relevant and beneficial features for the distance detection task. Initially, a "heuristic" (HEUR) feature selection approach was evaluated, where on each iteration the feature set is extended with the best performing feature. Such an approach can investigate the contribution of each of the statistical features to the distance detection method and determine the features


that are distance-dependent based on an empirical procedure. Apart from the HEUR approach, the minimal-redundancy-maximal-relevance criterion (mRMR) [30] is used in order to select features that strongly correlate with the class labels (the distance class in this case). The mRMR method is able to directly select the top features (the user defines the number of selected features) and this has the practical advantage that it is a non-iterative method selecting subsets rather than the full set of features. The main difference between the two approaches is that the HEUR approach leads to the selection of one feature at a time, whereas the mRMR approach leads to the selection of feature subsets. The two feature selection approaches were compared and it was found that the mRMR approach leads to higher performance rates than the HEUR approach. For this reason, the mRMR method has been employed in this work. The results of this comparison can be found in the Appendix, along with the details of the HEUR feature selection procedure. The exact procedure that was followed employing the mRMR feature selection algorithm will be given along with the presentation of the experimental results in Section V.

B. Classification Methods

The proposed method is based on the features described in Section III, which are used by the classifiers. In this study, two classification approaches have been used, namely GMMs and SVMs.

1) Gaussian Mixture Models: The features described in Section III are used to train a classifier based on GMMs. GMMs can be used to approximate arbitrarily complex distributions and are therefore chosen to model the distance-dependent distribution of the extracted features [31], [32]. In this work, each discrete distance that should be recognized is represented by a GMM. The distance which maximizes the a posteriori probability for the given feature vector is selected as the detected distance among all trained models.
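This maximum a posteriori decision rule can be sketched as follows; the two-class, diagonal-covariance mixture parameters for a single BSMD-STD-like feature are invented for illustration and are not the paper's trained models:

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | GMM) via log-sum-exp over the mixture components."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def classify(x, class_gmms, priors):
    """Pick the distance class maximising log p(x | c) + log P(c)."""
    scores = {c: gmm_log_likelihood(x, *g) + math.log(priors[c])
              for c, g in class_gmms.items()}
    return max(scores, key=scores.get)

# Hypothetical 1-D feature: "1 m" sources around 4 dB, "3 m" around 6 dB.
gmms = {
    "1 m": ([0.5, 0.5], [[3.8], [4.2]], [[0.1], [0.1]]),
    "3 m": ([0.5, 0.5], [[5.8], [6.2]], [[0.1], [0.1]]),
}
priors = {"1 m": 0.5, "3 m": 0.5}
decision = classify([5.9], gmms, priors)
```

With equal priors this reduces to maximum likelihood over the per-distance mixtures, which is the rule used here.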
For the actual implementation, the MATLAB code from the Netlab toolbox [33] was used. Each class is represented by its three GMM parameter sets (mixture weights, means and covariances). The GMM was initialized using the k-means algorithm [34]. The expectation-maximization (EM) algorithm [31] was used to estimate the set of GMM parameters, with a maximum number of 300 iterations. Five Gaussian components and diagonal covariance matrices were used.

2) Support Vector Machines: A multi-class SVM with a radial basis function (RBF) kernel is constructed using the LIBSVM library [35]. The classifier is designed to distinguish between the distance classes.

C. Recordings Database

In order to train and evaluate the system, several anechoic speech recordings were used, taken from various recording databases [36], [37]. In total, 58 minutes of speech recordings have been used; 65% of the recordings dataset was used for training and 35% for testing of the method. In addition, BRIRs were taken from the Aachen Database [38] and from [12]. The recordings were convolved with the BRIRs measured at different distances between source and receiver


TABLE I. VOLUME AND REVERBERATION TIME OF THE ROOMS

in five different rooms. The volume, reverberation time $T_{60}$, and the source/receiver distances for each of the rooms can be seen in Table I. The sampling frequency was 44.1 kHz. The orientation angle of the measurements was 0° for rooms A, D and E, and 0°, 30°, 60°, 90° and 180° for rooms B and C.

D. Feature Extraction Details

The BSMD STD feature is extracted from the binaural reverberant signals, which are segmented into blocks of 1 sec (using Hanning windowing). For each block, the BSMD STD feature is calculated for a specific frequency band. In [22] and [39], it has been shown that the RTF standard deviation appears to be highly distance-dependent when it is calculated from sub-bands (i.e. fractional octave bands) for frequencies higher than the Schroeder frequency. For frequencies lower than the Schroeder frequency [23], some ambiguity in the RTF standard deviation value appears and the derived value is highly dependent on the bandwidth that is used. Since the BSMD STD feature is related to the RTF standard deviation, as discussed in Section III-A, several tests were initially performed in order to choose the frequency band that appears to be most distance-dependent. For this reason, the BSMD STD feature was also calculated for narrow frequency bands (i.e. matching the center frequencies of the Gammatone filterbank), but the results indicated that broader frequency bands lead to better results. After several informal tests, the frequency band of 0.2–2.5 kHz was found to contribute most to the performance of the method, thus it was used for the extraction of the feature. Note that the frequency of 0.2 kHz is above the Schroeder frequency for all tested rooms. For the extraction of the features described in Section III-B, the signals are split into 8 auditory channels with center frequencies equally spaced on the equivalent rectangular bandwidth (ERB)-rate scale between 50 and 22050 Hz using a fourth-order Gammatone filterbank.
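This ERB-rate spacing can be reproduced with the standard Glasberg-Moore mapping, which is assumed here (the text states only the end frequencies and the number of channels):

```python
import math

def erb_rate(f_hz):
    """Glasberg & Moore ERB-rate: number of ERBs below f_hz."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_inv(erb):
    """Inverse mapping: ERB-rate back to frequency in Hz."""
    return (10.0 ** (erb / 21.4) - 1.0) * 1000.0 / 4.37

def erb_center_frequencies(f_low, f_high, n_channels):
    """n_channels center frequencies equally spaced on the ERB-rate scale."""
    lo, hi = erb_rate(f_low), erb_rate(f_high)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_inv(lo + i * step) for i in range(n_channels)]

# Eight channels between 50 Hz and 22050 Hz.
cfs = erb_center_frequencies(50.0, 22050.0, 8)
```

The resulting values agree to within a few hertz with the channel center frequencies listed next.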
More specifically, the center frequencies were: 50 Hz (channel 1), 293 Hz (channel 2), 746 Hz (channel 3), 1594 Hz (channel 4), 3179 Hz (channel 5), 6144 Hz (channel 6), 11686 Hz (channel 7) and 22050 Hz (channel 8). Then, the binaural cues of ILD, ITD and IC are estimated for 20-ms frames with 50% overlap for each Gammatone channel. From the binaural cues, the statistical features discussed in Section III-B are computed, also for blocks of 1 sec duration. In total, 240 features (3 cues × 8 channels × 10 statistical features) are extracted from the binaural cues. This yields a total set of 241 available features with the addition of the BSMD STD feature for one frequency band (0.2–2.5 kHz).

V. EXPERIMENTAL RESULTS

In this section, the method is evaluated for various test cases. The performance measure used in this work is the mean


TABLE II. TEST CASES (FINE DISTANCE DETECTION)

Fig. 5. Performance of the method when using the BSMD STD feature for rooms A-E for speech signals for three distance classes (coarse distance detection) using two classification approaches (SVMs, GMMs). The orientation angle was zero degrees for all rooms.

classification performance, which is calculated by taking the mean of the diagonal of the confusion matrix. First, the method is evaluated using the proposed BSMD STD feature within the five different rooms for three distance classes (Section V-A). Then, the method is tested for more distance classes (seven distance classes), both using only the proposed BSMD STD feature and using the additional proposed features selected by the feature selection algorithms (Section V-B). The effect of the number of selected features on the distance detection performance is discussed. In Section V-C, the proposed method is compared to the distance detection method proposed by Vesa [12]. It is also compared to a modified version of the Vesa method, and the generalization ability of the methods is examined. More specifically, we examine whether the methods are able to detect the sound source distance in rooms which were not included in the model training.

A. Coarse Distance Detection

The method is initially tested using only the proposed BSMD STD feature in rooms A, B, C, D and E for three different distance classes, illustrating its performance for coarse distance detection tasks. For rooms B, C, D and E these distance classes were 1 m, 2 m and 3 m, and for the smallest room A, the distance classes were 0.5 m, 1 m and 1.5 m, as a distance measurement above 2 m was not available due to the small size of the room. The performance of the method across all rooms using GMMs and SVMs can be seen in Fig. 5. It is evident that the method employing the BSMD STD feature alone achieves high performance, always higher than 90% for all rooms. The two classification approaches (SVMs, GMMs) lead to very similar performance rates. For the larger and more reverberant rooms, there is a decrease in performance of approximately 5%. Note that for these tests the orientation angle was zero degrees for all rooms.

B. Fine Distance Detection

In Section V-A, it has been shown that the method using the BSMD STD feature alone can successfully identify the correct

Fig. 6. Performance of the method for rooms B, C and E as a function of the number of features used by the classifiers (SVMs and GMMs) for seven distance classes (fine distance detection). The first feature is always the BSMD STD and the orientation angle was zero degrees for all rooms.

distance with performance rates higher than 90% within all rooms for three distance classes. It is now of interest to examine whether the method using the same feature can perform equally well when a higher distance resolution is required. Here, the method is evaluated for seven distance classes (fine distance detection) across three different rooms (B, C and E); the specific tested distances can be seen in Table II. When the method is tested for seven distance classes using the BSMD STD feature only, the performance decreases by at least 20% for rooms B, C and E compared to the coarse distance estimation results (for only three distance classes) shown in Fig. 5. More specifically, the performance rates decrease to 72%, 75% and 64% for rooms B, C and E respectively, when GMMs are used for the classification. These results can be seen in Fig. 6, where the method's performance is plotted as a function of the number of employed features. The first feature is always the BSMD STD for all rooms. More details on this figure will be given in the next paragraph. Since the BSMD STD feature is not sufficient for higher distance resolution, the additional binaural features discussed in Section III-B are used in order to increase the performance of the method. For the selection of the most relevant and beneficial features for distance detection, the mRMR feature selection approach has been used (see Section IV-A). The mRMR method was employed for rooms B and C using the speech training dataset consisting of all orientation angles (0°, 30°, 60°, 90° and 180°), and fourteen experiments were conducted for each room. Using the dataset of both rooms B and C, in the first experiment the top (one) performing feature was selected, in the second experiment the top two features were selected, in the third experiment the top three features


Fig. 7. (a) Performance of the method for rooms B and C as a function of the number of features used by the classifiers, when the system is trained and tested with each orientation angle separately and the mean test performance over the five experiments (0°, 30°, 60°, 90° and 180°) is taken. (b) Performance of the method for rooms B and C when the system is trained and tested with all orientation angles (0°–180°). The detected distance classes were seven (1 m, 1.5 m, 2 m, 2.5 m, 3 m, 3.5 m, 4 m).

were selected, etc. The features selected using the mRMR approach can be found in Table V in the appendix. It is worth noting that the proposed BSMD STD feature was chosen by the mRMR method as the top feature among the 241 features and thus appears to be particularly important for the distance detection task. Apart from the BSMD STD, the next four selected features are all related to the statistical properties of the ILD cue, and more specifically to the PSKEW statistical quantity. The PSKEW indicates the degree of asymmetry of a distribution; for symmetrical distributions its value will be close to zero. The results here indicate that an increase of distance leads to more asymmetrical distributions of the ILDs across time, due to the domination of the diffuse field. The results for increasing numbers of features (see appendix, Table V) for rooms B, C and E using the two classification approaches (SVMs, GMMs) can be seen in Fig. 6. For these results, the method was trained and tested in each room separately, and the orientation angle of the recordings was zero degrees for all three rooms. It is evident that adding more features increases the performance of the method, but when many more features (i.e. more than 15) are employed by the classifiers, the performance degrades. For this reason, results are presented for up to fourteen features only. The performance in rooms B and C increases with up to nine features, and room E presents its highest performance for eleven features. In all three rooms, when GMMs are used, the addition of the 13th feature, which is the VAR ILD of the 746 Hz frequency band (see Appendix), decreases the method's performance. Note that the features used here lead to high performance rates for room E, although room E was not used in the feature selection procedure.
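Since PSKEW turns out to be so prominent among the selected features, a minimal sketch of a skewness-of-ILD statistic may help. The paper does not spell out PSKEW's exact definition, so the standardized third moment is assumed here, and the frame-wise ILD is computed as a simple broadband energy ratio in dB rather than per Gammatone channel as in the auditory front-end of [19].

```python
import numpy as np

def ild_skewness(left, right, fs=44100, frame_ms=20, hop_ms=10):
    """Frame-wise ILDs (energy ratio in dB) over a 1-sec block,
    followed by their sample skewness; a sketch of the
    PSKEW-of-ILD statistic (standardized third moment assumed)."""
    n = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    ilds = []
    for start in range(0, len(left) - n + 1, hop):
        el = np.sum(left[start:start + n] ** 2) + 1e-12
        er = np.sum(right[start:start + n] ** 2) + 1e-12
        ilds.append(10 * np.log10(el / er))
    ilds = np.array(ilds)
    # Standardized third moment: near zero for a symmetric ILD
    # distribution; growing asymmetry is what the paper associates
    # with increasing source distance.
    return np.mean(((ilds - ilds.mean()) / ilds.std()) ** 3)
```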
In rooms B and C, a steep increase in performance is achieved by adding four features to the top one, and SVMs led to lower performance rates than GMMs.

In order to determine whether the selected features (see appendix, Table V) depend on the orientation angles of the measurements, some additional experiments were conducted. In Fig. 7(a), the mean test performance of the method for rooms B and C can be seen when the system is trained and tested with each orientation angle separately (i.e. the system is trained with measurements taken at 30° and then tested with measurements also taken at 30°, etc.). Following this procedure for all orientation angles, the mean over the five experiments (0°, 30°, 60°, 90° and 180°) is taken. The results shown in Fig. 7(a) correspond to the case where the orientation angle is assumed to be known, and it can be observed that GMMs lead to higher performance than SVMs. Additionally, the performance of the method is also plotted (Fig. 7(b)) for the case where the system is trained and tested with all orientation angles (0°–180°) within rooms B and C. In such a challenging task, the SVMs achieve better performance rates than the GMMs. Note that the results in Fig. 7 are presented for seven distance classes (1 m, 1.5 m, 2 m, 2.5 m, 3 m, 3.5 m, 4 m). Based on these results, it is evident that performance in rooms B and C increases with additional features up to nine features, where maximum performance is achieved. However, employing the top five features selected using the mRMR approach (see appendix, Table V) leads to performance rates that are always above 75% for all three rooms, for the seven distance classes and for any orientation angle. These results indicate that the BSMD STD feature and the selected additional features are distance-dependent for any orientation angle. As in the case of zero degrees orientation angle (see Fig. 6), the addition of the 13th feature (VAR ILD of 746 Hz) decreases the performance for known orientation angles when GMMs are used. When the orientation angle is not known, the performance is affected by the addition of this feature for both SVMs and GMMs.
It also appears that the optimal number of features for maximum classifier performance is nine. However, if there is a need for lower computational power and faster processing, five features may be used, in which case a 5–10% lower performance rate can be expected.
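The classification framework itself follows a standard pattern: one GMM per distance class, maximum-likelihood class selection per block, and performance measured as the mean of the diagonal of the (row-normalized) confusion matrix. The sketch below uses scikit-learn as an assumed stand-in; the paper itself used Netlab GMMs [33] and LIBSVM [35].

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import confusion_matrix

def train_gmms(features, labels, n_components=4):
    """Fit one GMM per distance class (sklearn stand-in for the
    Netlab models used in the paper; component count assumed)."""
    gmms = {}
    for c in np.unique(labels):
        g = GaussianMixture(n_components=n_components,
                            covariance_type="diag", random_state=0)
        gmms[c] = g.fit(features[labels == c])
    return gmms

def classify(gmms, features):
    """Per block, pick the class whose GMM assigns the highest
    log-likelihood to the feature vector."""
    classes = sorted(gmms)
    ll = np.column_stack([gmms[c].score_samples(features)
                          for c in classes])
    return np.array(classes)[np.argmax(ll, axis=1)]

def mean_diag_performance(y_true, y_pred):
    """Mean of the diagonal of the row-normalized confusion
    matrix, i.e. per-class accuracy averaged over classes."""
    cm = confusion_matrix(y_true, y_pred, normalize="true")
    return float(np.mean(np.diag(cm)))
```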


Fig. 8. Performance comparison of the proposed method to the Vesa method [12] for: (a) three different distance classes using the BSMD STD feature and, (b) seven different distance classes using the top five and nine selected features (see appendix—Table V). The orientation angle was zero degrees.

C. Comparison to the Baseline Method

In this section, the proposed method is compared to the binaural distance detection method previously introduced by Vesa in [12]. This method employs only one feature (the magnitude squared coherence), and here the optimal parameters have been used for the evaluation of the method (for details see [12]). More specifically, the frequency range for the extraction of the magnitude squared coherence was 215–13566 Hz, the percentage parameter was 1% (indicating the percentage of time frames that are used for the distance class decision), the time-domain Hanning window size was 128 samples, the hop size was 32 samples, the FFT size was 1024 samples, and a time constant of 300 ms was used. A decision for the distance class was taken for each frame of 1 sec duration (clip-level performance, [12]), using a frame size equal to that of the proposed method. In [12], the training data of the Vesa method consisted of white noise convolved with BRIRs. The noise signal had a duration of 4 sec, indicating that the Vesa method requires less training material than the proposed method. Here, a modified version of Vesa's method is also tested, where speech material (“Vesa-speech”) is used in the training stage instead of white noise (“Vesa-noise”). The proposed method is compared to both versions of the baseline method. The comparison was performed both on the same rooms as those tested in [12] (rooms B and C) and on the additional rooms A, D and E. The results from the performance comparison for all rooms (see Table I) for three distance classes (coarse distance detection) can be seen in Fig. 8(a). For rooms B, C, D and E the distance classes were 1 m, 2 m and 3 m, and for the smallest room A, the distance classes were 0.5 m, 1 m and 1.5 m. In all cases, the orientation angle was zero degrees and GMMs were used for the distance classification.
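The baseline's single feature, the magnitude squared coherence between the two ear signals, can be approximated with Welch-style averaging; a sketch using the window, hop and FFT sizes listed above. Note this replaces the 300-ms recursive time-constant smoothing of [12] with plain segment averaging, which is an assumption of the sketch.

```python
import numpy as np
from scipy.signal import coherence

def msc_feature(left, right, fs=44100, f_lo=215.0, f_hi=13566.0):
    """Mean magnitude squared coherence between the ear signals in
    the 215-13566 Hz band used by the baseline. Window 128 samples,
    hop 32 (noverlap 96), FFT size 1024, as listed in the text;
    Welch averaging replaces the recursive smoothing (assumption)."""
    f, cxy = coherence(left, right, fs=fs, window="hann",
                       nperseg=128, noverlap=96, nfft=1024)
    band = (f >= f_lo) & (f <= f_hi)
    return float(np.mean(cxy[band]))
```

Identical signals yield a coherence of one at every frequency, while independent diffuse-field components drive the statistic toward zero, which is what makes it distance dependent.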
For such a coarse distance detection task, only the BSMD STD feature has been used by the proposed method (see the discussion in Section V-A). It is evident that the performance for rooms A, B and C is almost identical for both methods, with detection performance above 95% in all cases. In rooms A and D, the performance

of the Vesa method is slightly higher than that of the proposed method. The Vesa method fails to detect distance correctly in room E, while the proposed method achieves a very high performance rate (94%). However, the Vesa method has been tuned to perform well in small rooms; with appropriate tuning of the employed parameters, it might also perform well in larger rooms. The modified Vesa method that used speech signals for the training stage leads to performance rates similar to the original method (i.e. training with a noise signal) for all tested rooms. Moreover, a performance comparison was made for seven distance classes for rooms B, C and E according to Table II (fine distance detection), and the results can be seen in Fig. 8(b). In this case, the orientation angle was zero degrees. For the proposed method, the top five and top nine features (see appendix, Table V) were used, following the discussion in Section V-B. The Vesa method leads to similar performance rates when trained with either speech or noise signals for all tested rooms. Although both methods show similar performance in small rooms, the proposed method appears to be more robust in reverberant environments. Increasing the number of employed features to nine is beneficial only for room B. In the case of room C, the proposed method presents lower performance than the Vesa method. Note that the parameters employed by the Vesa method were tuned to work in rooms B and C (see [12]).

1) Effect of Time Frame Length: For all the results presented in the previous sections, the length of the time frame was 1 sec. Here, the effect of the time frame length on the methods' performance is discussed (seven distance classes). For this reason, several additional experiments were conducted, where the methods were evaluated using longer time frames.
More specifically, for the proposed method, evidence was accumulated over longer time periods by multiplying the class-dependent probabilities over multiple frames of 1 sec. For the Vesa method, the length of the frames used for the assignment of distance was increased. More details can be found in [12] (referred to as “clip-level performance”).
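Accumulating evidence over multiple 1-sec frames by multiplying class-dependent probabilities is equivalent to summing log-likelihoods; a minimal sketch (the per-frame log-likelihood matrix is a hypothetical input, e.g. from the per-class GMM scores):

```python
import numpy as np

def accumulate_decision(frame_loglik):
    """frame_loglik: (n_frames, n_classes) array of per-frame class
    log-likelihoods. Multiplying probabilities across frames is a
    sum in the log domain; the class with the largest accumulated
    log-likelihood is returned."""
    total = np.sum(frame_loglik, axis=0)
    return int(np.argmax(total))
```

Working in the log domain avoids the numerical underflow that multiplying many small probabilities would otherwise cause.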


TABLE IV. TEST CASES AND COMPARATIVE PERFORMANCE FOR THE PROPOSED AND VESA METHODS FOR SEVEN DISTANCE CLASSES (FINE DISTANCE DETECTION)

Fig. 9. Performance of the proposed method using five features (see appendix—Table V) compared to the Vesa method for rooms B and C as a function of the length of the time frame used for the distance detection.

TABLE III. TEST CASES AND COMPARATIVE PERFORMANCE FOR THE PROPOSED AND VESA METHODS FOR THREE DISTANCE CLASSES (COARSE DISTANCE DETECTION)

In Fig. 9, the performance of the proposed method using five features (see appendix—Table V) is compared to the Vesa method, trained here using a noise signal, given that similar performance rates were previously obtained with speech training signals. For the proposed method, GMMs were used for the classification. It is evident that longer frames lead to better performance rates for both methods. The Vesa method now converges to a performance rate of 100% for rooms B and C when frames of approximately 6 sec and 7 sec are used, respectively. The proposed method also converges to a 100% rate for room B when time frames of 7 sec are employed, whereas for room C the performance was lower by approximately 3% even for the 10 sec frame.

2) Generalization Ability of the Methods: The previously described results require training and testing within the same room, but it is of interest to examine whether the proposed method is able to generalize to unknown rooms (i.e. trained in one room and tested in another). For this reason, additional experiments were conducted where various combinations of rooms were examined. Throughout these experiments, the GMM classifier was used. The room combinations for training and testing can be seen in Table III. The first column contains the room where the method is trained and the second column the room where the method is tested. The third column shows the mean performance for each room combination for three distance classes (1 m, 2 m, 3 m) using the BSMD STD feature (coarse distance detection). The fourth and fifth columns show the performance of the two versions of the Vesa method (trained with speech or noise).

It is evident that for the proposed method, the highest performance is achieved for the combination of rooms B and C, which have a small difference in their reverberation times (approximately 0.2 sec). In this case the Vesa method performs worse. However, for rooms C and D, having a difference in their reverberation times of approximately 0.3 sec, the performance of the proposed method is lower (70.5% and 77.6%, respectively), and for the combination of rooms C and D the Vesa method performs better. Similarly, when the method is trained in room C and tested in room E, the Vesa method performs better than the proposed method. For all other room combinations, the proposed method achieves better performance rates than the Vesa method. The combinations of rooms E-D and E-C lead to low performance rates for both methods, which might be explained by the substantial difference in their corresponding reverberation times. Hence, it can be concluded that for crude distance classification, the proposed method may generalize to unknown acoustical room environments, provided that the model is trained with reverberation characteristics similar to those expected in the testing phase. Extending the training of the model with a large collection of different rooms could potentially further increase the performance of the method. For finer distance detection resolution (according to Table II) and when using the BSMD STD feature, the comparative results can be seen in Table IV. Here, the performance appears to be lower than when training and testing in the same room; for the proposed method the mean performance is approximately 62%. The Vesa method presents 10 to 20% lower performance rates. These results were extracted using only the BSMD STD feature and suggest that more features could potentially be employed in order to further improve the distance recognition performance in unknown room environments (as described in Section V-B).
Hence, several additional tests were performed, where more features (based on either the mRMR or the HEUR approach) were employed by the distance classifiers. However, the average performance was always lower than when using only the BSMD STD feature. As discussed in Section V-C-1, another way to increase the performance is to increase the time frame size. For this reason, the generalization ability of the methods was also examined for longer time frames. Fig. 10 shows the performance of the two methods as a function of the length of the time frame (using the BSMD STD feature). In this case, the system is trained in room B and tested in room C, and vice versa. It is evident that an increase of the time frame size significantly improves the classification ability of the proposed method, leading to performance rates above 75% for both rooms for time frames of up to 10 sec. Similarly, the Vesa method benefits from an increase of the time frame length; however, its performance remains significantly lower than that of the proposed method.


Fig. 10. Performance of the proposed method (using the BSMD STD feature) and the Vesa method when the system is trained in room B and tested in room C and vice versa as a function of the length of the time frame for seven distance classes.

VI. CONCLUSIONS

This study introduces a novel method for detecting the source/receiver distance within rooms, based solely on features extracted from the received binaural signals. The proposed method employs a novel feature (BSMD STD) along with several other binaural features based on the statistical properties of binaural cues. The BSMD STD feature is shown to be related to the standard deviation of the magnitude spectrum of RTFs, which is known to depend on the source/receiver distance [21]. In order to extract the most relevant and beneficial features for the distance detection task, two different feature selection approaches (HEUR, mRMR) were tested. It was found that the mRMR approach leads to higher performance rates than the HEUR approach, and for this reason the mRMR method has been employed in this work. Based on the selected features, two classification frameworks based on GMMs and SVMs were employed for the implementation of a distance detection system. It has been shown that the proposed BSMD STD feature alone can successfully predict the distance class with high performance rates (above 90%) when tested in five different acoustical environments for coarse distance detection (three distance classes). For finer distance detection (seven distance classes) and various orientation angles, a decrease of the method's performance was observed; thus, additional binaural features were included based on the mRMR feature selection algorithm. The BSMD STD was chosen by the mRMR method as the most important feature among the 241 features, and the statistical property PSKEW of the ILD binaural cue also appears to be particularly important and angle-independent for this distance detection task. The PSKEW indicates the degree of asymmetry of a distribution, implying that an increase of distance leads to more asymmetrical distributions of the ILDs across time.
In future work, we will examine in more detail the dependency of the statistical quantities of binaural cues on distance, in order to derive their mathematical or perceptual relationships (e.g. by performing psychoacoustic tests, modelling the variation of the

statistical distributions with distance, etc.). It was also found that GMMs lead to higher performance rates than SVMs when the orientation angle is known. On the other hand, when the orientation angle is unknown, SVMs classify distance more accurately. A thorough comparison of the proposed method was performed against an earlier binaural distance estimation method developed by Vesa [12]. In this work, an additional modified version of the Vesa method was also employed, where speech material (“Vesa-speech”) was used in the training stage instead of white noise (“Vesa-noise”). It was shown that the Vesa method performs equally well for both training signals. The comparison results indicated that the proposed method presents performance rates of similar order to those of the Vesa method for a rough distance detection task, except for a room with a high reverberation time, for which the proposed method outperformed the Vesa method. For higher distance resolution, the proposed method performed moderately worse than the Vesa method for the two tested rooms, but it outperformed this baseline method for the most reverberant room. Note that the Vesa method has been tuned to perform well in small rooms; with appropriate tuning of the employed parameters, it might also perform well in larger rooms. Finally, the ability of the methods to generalize to rooms in which they were not trained was examined. For the proposed method, the BSMD STD feature again appeared to be sufficient for a rough distance estimation task, as the addition of more features did not increase the performance. For rough distance estimation, the proposed method performs in most cases moderately better than the Vesa method, the only requirement being that the training and testing environments have reverberation times of a similar order.

For fine distance detection (seven distance classes), the proposed method achieved 20% higher performance rates than the Vesa method when time frames of 1 sec are used. Increasing the time frame length provided further benefits to the proposed method, achieving performance rates above 75% for both tested rooms. The performance of the Vesa method in unknown environments also increased for longer time frames; however, it remained lower than that of the proposed method. In conclusion, the performance of the proposed method is similar to that of the Vesa method when the system is trained and tested in the same room. However, the proposed method has superior performance when the training and testing are carried out in different enclosures. Nevertheless, the Vesa method has the advantage that it requires less training material (i.e. 4 sec) than the proposed method (several minutes). This work introduces a distance detection method utilizing cues and parameters of binaural signals. The method achieves improved performance for all tested head orientation angles. It also allows scalable implementation complexity, since for coarse distance detection (e.g. three distance classes) it only requires a single novel feature (BSMD STD) to be calculated. For finer distance detection tasks (e.g. seven distance classes), the method requires up to nine additional features to be estimated from the binaural signals. Furthermore, the method has been shown to achieve higher performance rates in unknown


TABLE V. SELECTED ORDER OF FEATURES USING THE MRMR METHOD FOR ROOMS B AND C AND VARIOUS ORIENTATION ANGLES

TABLE VI. SELECTED ORDER OF FEATURES USING THE HEUR METHOD FOR ROOMS B, C AND E

Fig. 11. Performance comparison of the proposed method for rooms B, C and E using the two classification approaches (SVMs, GMMs) and the two feature selection approaches (mRMR, HEUR) when the orientation angle is assumed to be known (a, c, e) and unknown (b, d). (a) Room B—Known angle. (b) Room B—Unknown angle. (c) Room C—Known angle. (d) Room C—Unknown angle. (e) Room E—Known angle.

rooms when the processed time length of recorded data is increased to a total of approximately 10 sec. Therefore, the robustness of this method, its scalable complexity and its latency-dependent performance make it suitable for many different applications related to sound source position detection (e.g. in wearable mobile devices), where very often the acoustical environment is completely unknown.

APPENDIX II. DESCRIPTION OF THE mRMR AND HEUR FEATURE SELECTION APPROACHES

A. mRMR Feature Selection Approach

A detailed description of the use of the mRMR feature selection approach can be found in Section V-B. Table V presents the selected features using the recording datasets of rooms B and C for all orientation angles, when the mRMR feature selection method is employed.

B. Heuristic Feature Selection Approach

In order to examine more thoroughly the individual contribution of each statistical feature to the distance detection method, the heuristic (HEUR) feature selection approach was

used. Here, the distance detection performance rate was examined for each feature separately. Based on the method's performance using one feature at a time, the features that led to the highest performance rates were selected and used for the training of the classifiers. The feature that led to the highest performance was chosen first, then the feature leading to the second highest performance, etc. This procedure was followed individually for each room. Table VI contains the selected feature order for rooms B, C and E using this HEUR approach. The datasets used for rooms B and C consisted of recordings taken at all orientation angles; for room E, the orientation angle was zero degrees.

C. Comparison of the mRMR and HEUR Feature Selection Approaches

In Fig. 11, the mean performance of the method is presented for the two feature selection approaches (HEUR and mRMR) using SVMs and GMMs for seven distance classes (fine distance detection). The results are presented for the case where the orientation angle is assumed to be known (the system is trained and tested with each orientation angle separately and the mean over all orientation angles is taken) and unknown (the system is trained and tested with all orientation angles at the


same time). Note that the orientation angle for room E was zero degrees ("known angle"). It is evident that higher performance can be achieved by employing the mRMR approach rather than the HEUR approach; for this reason, the mRMR method was mostly used in this study.

ACKNOWLEDGMENT

The authors would like to thank Sampo Vesa for offering the MATLAB code of the distance detection method and the database. Furthermore, we thank the anonymous reviewers for their corrections, helpful comments, and constructive suggestions to improve this paper.

REFERENCES

[1] V. Hamacher, J. Chalupper, J. Eggers, E. Fischer, U. Kornagel, H. Puder, and U. Rass, "Signal processing in high-end hearing aids: State of the art, challenges, and future trends," EURASIP J. Appl. Signal Process., vol. 2005, pp. 2915–2929, 2005.
[2] D. F. Rosenthal and H. G. Okuno, Eds., Computational Auditory Scene Analysis. Mahwah, NJ, USA: Lawrence Erlbaum Associates, 1998.
[3] D. Wang and G. J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. New York, NY, USA: Wiley-IEEE, Oct. 2006.
[4] S. Oh, V. Viswanathan, and P. Papamichalis, "Hands-free voice communication in an automobile with a microphone array," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Los Alamitos, CA, USA, 1992, vol. 1, pp. 281–284.
[5] S. Gustafsson, R. Martin, and P. Vary, "Combined acoustic echo control and noise reduction for hands-free telephony," Signal Process., Special Issue on Acoust. Echo and Noise Control, vol. 64, no. 1, pp. 21–32, 1998.
[6] A. Härmä, "Ambient human-to-human communication," in Handbook of Ambient Intelligence and Smart Environments. New York, NY, USA: Springer, 2009, pp. 795–823.
[7] A. Tsilfidis, "Signal processing methods for enhancing speech and music signals in reverberant environments," Ph.D. dissertation, Univ. of Patras, Patras, Greece, 2011. [Online]. Available: http://www.wcl.ece.upatras.gr/audiogroup/publications/Tsilfidis_PhD.pdf
[8] Y.-C. Lu and M. Cooke, "Binaural estimation of sound source distance via the direct-to-reverberant energy ratio for static and moving sources," IEEE Trans. Speech Audio Process., vol. 18, no. 7, pp. 1793–1805, Sep. 2010.
[9] Y. Hioka, K. Niwa, S. Sakauchi, K. Furuya, and Y. Haneda, "Estimating direct-to-reverberant energy ratio using D/R spatial correlation matrix model," IEEE Trans. Speech Audio Process., vol. 19, no. 8, pp. 2374–2384, Nov. 2011.
[10] M. Kuster, "Estimating the direct-to-reverberant energy ratio from the coherence between coincident pressure and particle velocity," J. Acoust. Soc. Amer., vol. 130, no. 6, pp. 3781–3787, 2011.
[11] P. Smaragdis and P. Boufounos, "Position and trajectory learning for microphone arrays," IEEE Trans. Speech Audio Process., vol. 15, no. 1, pp. 358–368, Jan. 2007.
[12] S. Vesa, "Binaural sound source distance learning in rooms," IEEE Trans. Speech Audio Process., vol. 17, no. 8, pp. 1498–1507, Nov. 2009.
[13] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization. Cambridge, MA, USA: MIT Press, 1996.
[14] B. G. Shinn-Cunningham, N. Kopčo, and T. J. Martin, "Localizing nearby sound sources in a classroom: Binaural room impulse responses," J. Acoust. Soc. Amer., vol. 117, no. 5, pp. 3100–3115, 2005.
[15] J. van Dorp Schuitman, "Auditory modelling for assessing room acoustics," Ph.D. dissertation, Delft Univ. of Technol., Delft, The Netherlands, 2011.
[16] W. Hartmann, B. Rakerd, and A. Koller, "Binaural coherence in rooms," Acta Acust. united with Acust., vol. 91, no. 3, pp. 451–462, 2005.
[17] P. Bloom and G. Cain, "Evaluation of two-input speech dereverberation techniques," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP '82), 1982, vol. 7, pp. 164–167.
[18] M. Schroeder, "The statistics of frequency responses in large rooms (Die statistischen Parameter der Frequenzkurven von großen Räumen, in German)," Acustica, no. 4, pp. 594–600, 1954.
[19] T. May, S. van de Par, and A. Kohlrausch, "A probabilistic model for robust localization based on a binaural auditory front-end," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 1–13, Jan. 2011.
[20] K. J. Ebeling, "Influence of direct sound on the fluctuations of the room spectral response," J. Acoust. Soc. Amer., vol. 68, no. 4, pp. 1206–1207, 1980.
[21] J. J. Jetzt, "Critical distance measurement of rooms from the sound energy spectral response," J. Acoust. Soc. Amer., vol. 65, pp. 1204–1211, 1979.
[22] E. Georganti, J. Mourjopoulos, and F. Jacobsen, "Analysis of room transfer function and reverberant signal statistics," in Proc. Acoustics '08, Paris, France, 2008.
[23] M. Schroeder, "The 'Schroeder frequency' revisited," J. Acoust. Soc. Amer., vol. 99, no. 5, pp. 3240–3241, 1996.
[24] H. Kuttruff, Room Acoustics, 3rd ed. Amsterdam, The Netherlands: Elsevier, 1991.
[25] J. A. Gubner, Probability and Random Processes for Electrical and Computer Engineers. Cambridge, U.K.: Cambridge Univ. Press, 2006.
[26] E. Georganti, T. May, S. van de Par, and J. Mourjopoulos, "On the statistics of binaural room transfer functions," in Proc. 132nd AES Conv., Budapest, Hungary, 2012.
[27] B. Gourevitch and R. Brette, "The impact of early reflections on binaural cues," J. Acoust. Soc. Amer., vol. 132, no. 1, pp. 9–27, 2012.
[28] G. J. Brown and M. Cooke, "Computational auditory scene analysis," Comput. Speech Lang., vol. 8, pp. 297–336, 1994.
[29] C. Ludvigsen, "Schaltungsanordnung für die automatische Regelung von Hörhilfsgeräten [An algorithm for an automatic program selection mode]," German Patent DE 43 40 817 A1, 1993.
[30] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.
[31] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). New York, NY, USA: Springer, Aug. 2006.
[32] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech Audio Process., vol. 3, no. 1, pp. 72–83, Jan. 1995.
[33] Netlab Neural Network Software, Aston Univ., Engineering and Applied Science, 1999. [Online]. Available: http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/
[34] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inf. Theory, vol. IT-28, no. 2, pp. 129–137, Mar. 1982.
[35] C. C. Chang and C. J. Lin, LIBSVM: A Library for Support Vector Machines, 2001. [Online]. Available: www.csie.ntu.edu.tw/cjlin/libsvm
[36] E. Georganti, T. May, S. van de Par, A. Härmä, and J. Mourjopoulos, "Speaker distance detection using a single microphone," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 1949–1961, Sep. 2011.
[37] Bang & Olufsen, Music for Archimedes, Compact Disc CD B&O 101, 1992.
[38] M. Jeub, M. Schäfer, and P. Vary, "A binaural room impulse response database for the evaluation of dereverberation algorithms," in Proc. Int. Conf. Digital Signal Process. (DSP), Santorini, Greece, 2009.
[39] E. Georganti and J. Mourjopoulos, "Statistical relationships of room transfer functions and signals," in Proc. Forum Acusticum, Aalborg, Denmark, 2011, pp. 1917–1921.

Eleftheria Georganti received her diploma from the Department of Electrical and Computer Engineering, University of Patras, Patras, Greece, in 2007. In 2007, she began her Ph.D. work on "Modeling, analysis and processing of room transfer functions under reverberant conditions" at the Audio and Acoustic Technology Group of the Wire Communications Laboratory. She carried out several months of her research at the Technical University of Denmark (DTU) within the framework of the Marie Curie Host Fellowships for Early Stage Research Training (EST), and spent another nine months with the DSP group of Philips Research (Eindhoven, The Netherlands) working on ambient telephony technologies. She has also spent three months at the Carl von Ossietzky University of Oldenburg and is now in the final stage of her Ph.D. work. Her research interests include the modelling of acoustical room responses, auditory signal processing, psychoacoustics, auditory localization, and room acoustics perception.

Tobias May received the Dipl.-Ing. (FH) degree in hearing technology and audiology from the Jade University of Applied Science, Oldenburg, Germany, in 2005 and the M.Sc. degree in hearing technology and audiology from the University of Oldenburg, Oldenburg, Germany, in 2007. In 2012, he obtained his Ph.D. degree from the Eindhoven University of Technology, Eindhoven, The Netherlands, in collaboration with the University of Oldenburg, Oldenburg, Germany. He is currently a Postdoctoral Researcher at the Centre for Applied Hearing Research, Technical University of Denmark, Denmark. His research interests include computational auditory scene analysis, binaural signal processing, and automatic speaker recognition.

Steven van de Par studied physics at the Eindhoven University of Technology, Eindhoven, The Netherlands, and received the Ph.D. degree from the Eindhoven University of Technology in 1998 on a topic related to binaural hearing. As a Postdoctoral Researcher at the Eindhoven University of Technology, he studied auditory-visual interaction and was a Guest Researcher at the University of Connecticut Health Center. In early 2000, he joined Philips Research, Eindhoven, to do applied research in digital signal processing and acoustics. His main fields of expertise are auditory and multisensory perception, low-bit-rate audio coding, and music information retrieval. He has published various papers on binaural auditory perception, auditory-visual synchrony perception, audio coding, and music information retrieval (MIR)-related topics. Since April 2010, he has held a professorship in acoustics at the University of Oldenburg, Oldenburg, Germany.

John Mourjopoulos received the B.Sc. degree in engineering from Coventry University in 1978, and the M.Sc. and Ph.D. degrees in 1980 and 1985, respectively, from the Institute of Sound and Vibration Research (ISVR), Southampton University. Since 1986, he has been with the Electrical and Computer Engineering Department of the University of Patras, where he is now Professor of Electroacoustics and Digital Audio Technology and head of the Audio and Acoustic Technology Group of the Wire Communications Laboratory. In 2000, he was a visiting professor at the Institute for Communication Acoustics at Ruhr-University Bochum, Germany. He has authored and presented more than 100 papers in international journals and conferences. He has worked in national and European projects, has organized seminars and short courses, has served on the organizing committees and as session chairman in many conferences, and has contributed to the development of digital audio devices. He was awarded the Fellowship of the Audio Engineering Society (AES) in 2006. His research covers many aspects of the digital processing of audio and acoustic signals, especially focusing on room acoustics equalization. He has worked on perceptually-motivated models for such applications, as well as for speech and audio signal enhancement. His recent research also covers aspects of the all-digital audio chain, the direct acoustic transduction of digital audio streams, WLAN audio, and amplification. He is a member of the AES, the IEEE, and the Hellenic Institute of Acoustics, and is currently its vice-president.
