6'th International Symposium on Telecommunications (IST'2012)
Combination of Nested Microphone Array and Subband Processing for Multiple Simultaneous Speaker Localization
Ali Dehghan Firoozabadi and Hamid Reza Abutalebi
Speech Processing Research Lab (SPRL), Electrical and Computer Eng. Dept., Yazd University, Yazd, Iran
Emails:
[email protected],
[email protected]
Abstract— Speaker localization is one of the active topics in the speech processing field. In this paper, we use a two-step method based on Time Difference Of Arrival (TDOA) for the localization of multiple simultaneous speech sources. In this method, the directions of the speakers are estimated by computing the Generalized Cross Correlation (GCC) between microphone signals. We propose a method based on the combination of subband processing and nested microphone arrays. The use of subband processing is effective in increasing the accuracy of multiple speaker localization. Also, the nested array can remove spatial aliasing by the intelligent selection of microphone subsets and their assignment to different subbands. Once the microphones of each subband have been determined, subband processing is applied only to the data from that microphone subset. Moreover, targeting high-noise environmental conditions, we use the GCC-Maximum Likelihood (GCC-ML) function as the localization core of the proposed method. The combination of all these leads to the removal of spatial aliasing and increased localization accuracy. Simulation results on different environmental scenarios validate the superior performance of the proposed method in the localization of multiple simultaneous speakers.
The approaches to multiple speaker localization are classified into two main categories. The first category consists of Time Difference Of Arrival (TDOA)-based methods, which search for multiple source locations. Some of these methods are based on correlation and Cross Power Spectrum Phase (CPSP) analysis [4]. Also, there are methods that estimate the locations of multiple speakers by use of the Interaural Time Difference (ITD) [5]. Another sub-category comprises techniques that determine the speaker locations based on Blind Source Separation (BSS) algorithms [6]. The second category consists of methods that are based on the energy emitted by the source and search the space to find the point/direction of maximum emitted energy [7]. In this paper, we focus on methods that use a correlation function to estimate the TDOA between pairs of microphones and then estimate the source direction from the TDOAs. The Generalized Cross Correlation (GCC) function is considered as the basis of the method. The use of the popular PHAse Transform (PHAT) filter with the GCC function (the so-called GCC-PHAT) increases the localization performance in reverberant environments; however, GCC-PHAT does not work well in noisy conditions [8]. In this research, considering speaker localization in noisy or noisy-reverberant environments, we employ the GCC-Maximum Likelihood (GCC-ML) function as a remedy.
Keywords— Speech Source Localization; Direction Of Arrival (DOA); Time Difference Of Arrival (TDOA); Subband Processing; Nested Microphone Array
I. INTRODUCTION
Human life largely depends on working with various machines. Some of these machines require human interfaces and do their job based on the signals received from the user and the surrounding environment [1]. The speech signal is one of the main media for human-machine interaction, so the capture of a high quality speech signal is very important. Microphone arrays are widely used for signal acquisition in such machines. To reach its best performance, the beam-pattern of the microphone array should be directed toward the speech source location. Therefore, source localization is essential for achieving high quality signal acquisition [2,3]. Sound source localization also has many other applications, such as environmental monitoring and surveillance.
On the other side, motivated by the inherent sparseness of the speech signal in the time-frequency domain, our proposed method works on the subbands of the microphone signals. Subband processing enables the localization of multiple simultaneous speakers based on the differences in the energy distribution of the speakers across the subbands. In microphone array signal processing, a long inter-microphone distance results in spatial aliasing, which in turn degrades the output signal in the higher frequency subbands. Consequently, microphones with long distances decrease the localization accuracy. To overcome this problem, we propose the use of a nested circular microphone array, where the active frequency range of each sub-array is inversely related to the distance between its microphones. Practically, by using nested arrays, in each subband a subset of the microphones is used instead of all microphones in all bands; the subset includes those microphones that produce no spatial aliasing in that subband.
The speaker localization topic can be divided into two sub-topics: 1) single speaker localization, and 2) multiple speaker localization. Although single speaker localization methods have attracted much more attention in the last two decades, most of the work on multiple speaker localization has been done in the recent decade.
978-1-4673-2073-3/12/$31.00 ©2012 IEEE
Overall, in the proposed method for multiple speaker localization in noisy environments, the GCC-ML is employed in frequency subbands in order to estimate the TDOA values. For each nested sub-array, the higher frequency range (highest active subband) is determined based on the inter-microphone distance. Experiments on different noisy and noisy-reverberant scenarios show that the accuracy of the proposed method is significantly better than that of the baseline (subband and fullband) methods.

The rest of the paper is organized as follows. In Section II, we introduce the GCC function and the PHAT and ML filters as basic concepts for the proposed method; the ideal and realistic microphone signal models are also introduced. In Section III, the proposed method based on the nested array, subband processing, and the ML filter is explained. In Section IV, simulation results in different environmental scenarios are presented. Section V includes the conclusion, and Section VI describes our ongoing and future work in this field.

II. GCC-ML FUNCTION FOR TDOA ESTIMATION IN NOISY ENVIRONMENTS

The microphone signals can be modeled in two ways. In the first model, it is supposed that there is no reverberation in the environment, and the signal received by each microphone is a delayed and attenuated version of the source signal:

x_m(t) = (1/r_m) s(t − τ_m) + v_m(t)    (1)

where x_m(t) is the signal received by the mth microphone, r_m is the distance between the source and the mth microphone, τ_m is the delay between the source and the mth microphone, and v_m(t) is the additive noise at the mth microphone. However, to model the sound propagation more realistically, we should consider the room impulse response (i.e., the reverberation effect). In this model, the microphone signal becomes:

x_m(t) = (1/r_m) s(t − τ_m) * γ_m(d, t) + v_m(t)    (2)

In equation (2), γ_m(d, t) is the impulse response between the source and the mth microphone. We will consider this realistic model in our research.

The GCC function is often implemented on framed signals: the input signals to the array are segmented into short frames. Assuming that the source location is fixed during time frame b, the GCC can be expressed based on the Discrete Fourier Transform (DFT). The GCC of the signals received at microphones l and q in frame b, R_lq,b(τ), is expressed as:

R_lq,b(τ) = (1/K) Σ_{k=0}^{K−1} ψ_lq[k] X_l,b[k] X′_q,b[k] e^{jk(2π/K)τ}    (3)

In equation (3), K is the length of the DFT, X_l,b[k] is the DFT of the signal received at microphone l in frame b, ′ is the conjugate operator, and ψ_lq[k] is a discrete version of the frequency weighting function. Replacing X_l,b[k] X′_q,b[k] in equation (3) with C_lq,b[k] results in:

R_lq,b(τ) = (1/K) Σ_{k=0}^{K−1} ψ_lq[k] C_lq,b[k] e^{jk(2π/K)τ}    (4)

The physical distance between the microphones limits the range of the time delays. Consider the "end-fire" mode, in which the source lies on the axis of microphones l and q; this mode yields the maximum TDOA, which is equal to d/c (d is the distance between the microphones and c is the sound velocity). The delay parameter τ can therefore be defined as a function of the angle/Direction Of Arrival (DOA), θ, as:

τ = d sin θ / c    (5)

So, equation (4) can be expressed as a function of this angle:

R_lq,b(θ) = (1/K) Σ_{k=0}^{K−1} ψ_lq[k] C_lq,b[k] e^{jk(2π/K)(d sin θ / c)},  0 ≤ θ ≤ π    (6)

Assuming M simultaneous speakers, we can find the TDOAs related to the speakers' locations by finding the M highest maxima of the above function:

θ̂_{1,2,…,M} = argmax_{0 ≤ θ ≤ π} R_lq,b(θ)    (7)

A. ML Weighting Function

The ML weighting function is expressed as a function of the (clean) source signal spectral density, S[k], and the spectral densities of the noise signals, V_l[k] and V_q[k] [9]:
ψ_lq^ML[k] = ( S[k] / (V_l[k] V_q[k]) ) { 1 + S[k]/V_l[k] + S[k]/V_q[k] }^{−1}    (8)
Combining the information gathered over all subbands and the different microphone subsets, we estimate the DOAs of the two simultaneous speakers. Figure 1 shows the block diagram of the proposed method. A circular microphone array with eight identical microphones is considered.
This weighting function can be rewritten in terms of the spectral densities of the received (microphone) signals, X_l[k] and X_q[k], as follows:

ψ̂_lq^ML[k] = |X_l[k]| |X_q[k]| / ( |V_l[k]|² |X_q[k]|² + |V_q[k]|² |X_l[k]|² )    (9)
where V_l[k] and V_q[k] are the additive noise terms in the signals of microphones l and q, estimated from silent parts of the signals. The GCC function combined with the ML weighting filter is called GCC-ML. This function performs much better than GCC-PHAT in noisy and noisy-reverberant conditions. In this research, considering noisy and noisy-reverberant situations as the target scenarios, we use GCC-ML. Nevertheless, the localization accuracy of GCC-ML in purely reverberant conditions is slightly worse than that of GCC-PHAT. Practically, the combination of nested arrays, the subband processing method, and the GCC-ML function improves localization accuracy and enables highly accurate localization of simultaneous speakers in different environmental scenarios.

III. PROPOSED METHOD BASED ON NESTED ARRAY AND SUBBAND PROCESSING

To take advantage of the information in individual subbands of the speech signal, and motivated by the inherent sparseness of speech in the time-frequency domain, subband processing can be utilized in many speech processing applications, including sound source localization. While conventional methods treat the whole signal spectrum identically in the localization procedure, subband processing-based methods exploit the differences between the frequency bands of the mixed speech to localize multiple speakers. However, subband processing still suffers from the spatial aliasing effect, which occurs in the higher frequency subbands when the inter-microphone distances are large.
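To make equations (3)-(9) concrete, the following Python sketch builds the estimated ML weight of equation (9) from noise power spectra measured on silent frames, applies it to the cross-spectrum, and scans the DOA grid of equations (5)-(7). The synthetic signals, flat noise spectra, and one-degree grid are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def ml_weight(Xl, Xq, Nl, Nq, eps=1e-12):
    """Estimated ML weighting of equation (9):
    psi[k] = |Xl||Xq| / (|Vl|^2 |Xq|^2 + |Vq|^2 |Xl|^2),
    with Nl = |Vl|^2 and Nq = |Vq|^2 the noise power spectra
    estimated from speech-free (silent) parts of the recordings."""
    num = np.abs(Xl) * np.abs(Xq)
    den = Nl * np.abs(Xq) ** 2 + Nq * np.abs(Xl) ** 2
    return num / np.maximum(den, eps)

def gcc_ml_doa(xl, xq, Nl, Nq, d, fs, c=343.0, n_angles=181):
    """GCC-ML of one frame evaluated on a DOA grid (equations (3)-(6))."""
    Xl, Xq = np.fft.rfft(xl), np.fft.rfft(xq)
    C = Xl * np.conj(Xq)                       # cross-spectrum C_lq[k]
    psi = ml_weight(Xl, Xq, Nl, Nq)
    f = np.fft.rfftfreq(len(xl), 1.0 / fs)     # bin frequencies in Hz
    thetas = np.linspace(0.0, np.pi, n_angles)
    taus = d * np.sin(thetas) / c              # equation (5)
    E = np.exp(2j * np.pi * np.outer(taus, f)) # steering phases per angle
    R = (E @ (psi * C)).real / len(xl)         # R(theta), equation (6)
    return thetas, R

# toy check: broadband source, microphone l lags microphone q by 4 samples,
# which for d = 10 cm and fs = 16 kHz corresponds to a DOA near 59 degrees
rng = np.random.default_rng(0)
fs, d = 16000, 0.1
s = rng.standard_normal(2048)
xl, xq = np.roll(s, 4), s
Nl = Nq = np.full(1025, 1e-3)                  # assumed flat noise spectra
thetas, R = gcc_ml_doa(xl, xq, Nl, Nq, d, fs)
theta_hat = np.rad2deg(thetas[np.argmax(R)])   # equation (7) with M = 1
```

For M simultaneous speakers, one would keep the M largest well-separated peaks of R instead of the single argmax.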
Our proposed method combines subband processing and nested arrays, and utilizes the GCC-ML function to increase estimation accuracy. The spatial aliasing effect can be eliminated by using a nested microphone array: in each subband, only a subset of the microphones is considered (instead of using all microphones in all subbands). Linear nested arrays have already been used for speech enhancement [10], but our aim is to generalize the concept of nested microphone arrays to circular arrays and to use it for localization.
Figure 1. Block diagram of the proposed method using the nested circular array and subband processing.
A. Nested Circular Microphone Array

Speech signals with a sampling frequency of 16000 Hz are used in the simulations. The nested array is designed so that it covers the frequency range of [50-7200] Hz through 4 sub-arrays, each including 8 microphones (M=8). Sub-array 1 is designed for the highest frequency band, B1=[3600-7200] Hz. Considering the relation between wavelength and inter-microphone distance (d < λ/2), and according to the central frequency of this band (5400 Hz), the distance between the microphones of this sub-array should be less than 3.15 cm. Sub-array 2 is designed for the frequency range B2=[1800-3600] Hz; its microphone distance is 6.3 cm (2d = 6.3 cm). Sub-array 3 is designed for the frequency range B3=[900-1800] Hz, with a microphone distance of 12.6 cm (4d = 12.6 cm). Sub-array 4 covers the remaining frequency band, B4=[50-900] Hz, with a distance between microphones of 25.2 cm (8d = 25.2 cm). We have arranged the circular array according to the above inter-microphone distances. We have set the distances between the nearest microphones (i.e., microphone pairs (1,2), (2,3), (3,4), (4,5), (5,6), (6,7), (7,8), and (8,1)) to 3.25 cm. Also, we have chosen 6.5 cm as the distance between the microphones of the next sub-array (i.e., microphone pairs (2,4), (4,6), (6,8), (8,2), (1,3), (3,5), (5,7), and (7,1)). The third sub-array has been established by the microphones with 9 cm distance (i.e., microphone pairs (1,4), (2,5), (3,6), (4,7), (5,8), (6,1), (7,2), and (8,3)). Finally, we have chosen 10 cm as the inter-microphone distance in the fourth sub-array (i.e., microphone pairs (1,5), (2,6), (3,7), and (4,8)).

B. Multi-rate Filterbank

Each sub-array needs an analysis filterbank so that it only processes its own frequency band. Besides the smallest bandwidth covered by each array, the time resolution in each sub-array is also affected by the interpolator and decimator. The analysis filterbank (H_i(z)) and decimator (D_i) can be implemented in the multi-level tree structure shown in Figure 2.
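As a numeric check of the d < λ/2 spacing rule used for the sub-array design above, the bound can be evaluated at each band's centre frequency. This is a sketch; the sound speed c = 340 m/s is our assumption, chosen to match the paper's 3.15 cm figure:

```python
# Maximum inter-microphone spacing per sub-array from the spatial
# sampling rule d < lambda/2 = c / (2 * f_c), evaluated at each
# sub-band's centre frequency (c = 340 m/s assumed).
c = 340.0
bands_hz = {1: (3600.0, 7200.0), 2: (1800.0, 3600.0),
            3: (900.0, 1800.0), 4: (50.0, 900.0)}
max_spacing_cm = {}
for i, (lo, hi) in bands_hz.items():
    f_c = 0.5 * (lo + hi)                         # centre frequency of band i
    max_spacing_cm[i] = 100.0 * c / (2.0 * f_c)   # d < lambda/2, in cm
```

The doubled nested spacings quoted above (3.15, 6.3, 12.6, and 25.2 cm) track the first three bounds closely, while sub-array 4's 25.2 cm lies well inside the roughly 35.8 cm bound of its own band.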
After selecting a microphone subset based on the nested array concept, subband processing is applied to the selected microphones. Finally, the GCC-ML method is applied to the subband signals from the different microphone pairs of the corresponding subset.
The output signals of H1(z) are again decomposed into four subbands by four linear-phase FIR filters: one LPF, one HPF, and two BPFs are used in this case to produce the 4-5 kHz, 5-6 kHz, 6-7 kHz, and 7-8 kHz bands. The outputs of H2(z) are decomposed into four subbands, too; the filters used in this step are like those used in the previous step, except for their frequency ranges, which are 2-2.5 kHz, 2.5-3 kHz, 3-3.5 kHz, and 3.5-4 kHz. The outputs of H3(z) are decomposed into two subbands with frequency ranges 1-1.5 kHz and 1.5-2 kHz using one LPF and one HPF. Finally, the outputs of H4(z) are also decomposed into two subbands with frequency ranges 0-0.5 kHz and 0.5-1 kHz by one LPF and one HPF.
The resulting subband signals (eight signals in each subband) are then used to compute the subband GCC-ML for each microphone pair. After determining the positions of the peaks of the GCC-ML curves (DOA candidates), the histogram of the DOA candidates is calculated for each microphone pair in each subband. Considering the number of microphone pairs in the different subbands, we have 4 histograms for each of the two lowest-frequency subbands and 8 histograms for each of the other 10 subbands, resulting in 88 histograms in total. The information in the different histograms can be combined via several methods; our experimental results have shown that weighted averaging of the histograms is a proper choice in this regard.
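The weighted-averaging fusion described above can be sketched as follows; the histogram values and the uniform default weights are illustrative, since the paper does not specify its weighting choice:

```python
import numpy as np

def fuse_histograms(histograms, weights=None):
    """Weighted average of per-pair DOA-candidate histograms.
    `histograms`: array of shape (n_hist, n_bins), one row per
    microphone pair / subband (88 rows in the paper's setup);
    `weights`: optional per-histogram weights (illustrative)."""
    H = np.asarray(histograms, dtype=float)
    if weights is None:
        weights = np.ones(len(H))
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalise the weights
    return w @ H                              # fused histogram, (n_bins,)

# toy example: two speakers appear as two dominant bins after fusion
hists = np.array([[0, 5, 1, 0],
                  [0, 4, 0, 2],
                  [1, 6, 0, 1]])
fused = fuse_histograms(hists)
doa_bins = np.argsort(fused)[::-1][:2]        # M = 2 highest bins
```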
Figure 2. Tree structure implementation for analysis filterbank.
Each level of this structure includes a High Pass Filter (HPF) (HP_i(z)), a Low Pass Filter (LPF) (LP_i(z)), and a decimator. The LPFs and HPFs of the different levels are related to the subband analysis filters (H_i(z) in Figure 1) as follows:
H_1(z) = HP_1(z)
H_2(z) = LP_1(z) HP_2(z²)
H_3(z) = LP_1(z) LP_2(z²) HP_3(z⁴)
H_4(z) = LP_1(z) LP_2(z²) LP_3(z⁴)    (10)
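A minimal sketch of the tree of equation (10), using windowed-sinc half-band filters (our assumption; the paper's actual FIR designs are not specified) and decimation by 2 at every level:

```python
import numpy as np

def halfband_filters(numtaps=63):
    """Linear-phase FIR half-band pair: a windowed-sinc LPF with
    cutoff fs/4 and its spectral-inversion HPF (the LP_i / HP_i
    used at each level of the tree)."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    lp = 0.5 * np.sinc(0.5 * n) * np.hamming(numtaps)
    hp = -lp
    hp[(numtaps - 1) // 2] += 1.0        # HP = delta - LP
    return lp, hp

def analysis_tree(x):
    """Tree implementation of equation (10): at each level the low
    band is split by LP/HP and decimated by 2; the HP branch of
    level i gives the H_i(z) output, and the final all-LP path
    gives H_4(z)."""
    lp, hp = halfband_filters()
    outputs = {}
    low = np.asarray(x, dtype=float)
    for i in (1, 2, 3):
        outputs[i] = np.convolve(low, hp)[::2]   # H_i output, decimated
        low = np.convolve(low, lp)[::2]          # LP branch feeds next level
    outputs[4] = low                             # H_4 output
    return outputs

# sanity check at fs = 16 kHz: tones land in the expected branch
fs = 16000
t = np.arange(1024) / fs
bands_hi = analysis_tree(np.sin(2 * np.pi * 7000 * t))   # lies in B1
bands_lo = analysis_tree(np.sin(2 * np.pi * 300 * t))    # lies in B4
```

Each further split into the 12 final subbands (LPF/HPF/BPFs of sub-section III-B) would follow the same filter-then-decimate pattern on these branch outputs.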
IV. SIMULATIONS AND RESULTS

We have evaluated the proposed method in different environmental conditions to show its superiority over the baseline methods. Our aim is to localize two simultaneous speakers; in this paper, we suppose that the number of speakers is known. The simulations are done in a room with dimensions of 6m*6m*4m, using an 8-microphone circular array with a diameter of 10 cm.
The frequency responses of the analysis filters are shown in Figure 3.
Figure 4 shows the room plot and the locations of the speakers and microphones. We assume two simultaneous speakers, using utterances from the TIMIT database. The first speaker is at (310,150,140) cm and the second one is at (240,150,140) cm. To model the sound propagation realistically, we use the Image algorithm [11] in our simulations. Given the source location, the microphone locations, the room reverberation time, the room size, the sampling frequency, and the surface reflection coefficients, the Image algorithm provides the impulse response between the source and each microphone. We can then calculate the received (microphone) signals by convolving these impulse responses with the source signals. We use Additive White Gaussian Noise (AWGN) for noise modeling. In our simulations, the signals have been framed by 60 ms, 50% overlapped Hamming windows; this window length provides the required semi-stationarity of the signals. The simulations have been done on 5-second utterances from the speakers. It should also be mentioned that the reported results have been extracted from 10 consecutive frames.
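The 60 ms, 50%-overlap Hamming framing can be sketched as follows (the function name and array layout are our own, not the paper's code):

```python
import numpy as np

def frame_signal(x, fs=16000, frame_ms=60, overlap=0.5):
    """Split a signal into Hamming-windowed frames
    (60 ms with 50% overlap, as used in the simulations)."""
    N = int(fs * frame_ms / 1000)            # 960 samples at 16 kHz
    hop = int(N * (1 - overlap))             # 480-sample hop
    win = np.hamming(N)
    n_frames = 1 + (len(x) - N) // hop
    return np.stack([x[i * hop : i * hop + N] * win
                     for i in range(n_frames)])

# a 5-second utterance at 16 kHz yields 165 frames of 960 samples
rng = np.random.default_rng(0)
frames = frame_signal(rng.standard_normal(5 * 16000))
```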
Figure 3. Frequency response of the analysis filters.
C. Subband Sound Source Localization

According to the proposed block diagram in Figure 1, the nested array structure, and the analysis filterbank, we can now explain the subband processing step. Each of the eight received (microphone) signals passes through the analysis filters (H_i(z), i = 1, 2, 3, 4), producing 8 subband signals at the output of each filter. The frequency ranges of the signals at the outputs of the four analysis filters are B1, B2, B3, and B4 (mentioned in sub-section III-A), respectively.
The experiments have been done on three different environmental scenarios. The first scenario corresponds to a reverberant environment, in which the reverberation effect is dominant. The Room Reverberation Time (RT60) in this
scenario is 500 ms (RT60 = 500 ms) and the SNR is 20 dB. The second scenario is called the noisy environment; here, the room reverberation time and SNR have been set to RT60 = 200 ms and SNR = 5 dB. The third scenario simulates a noisy-reverberant environment, where both noise and reverberation are strongly present; in this scenario, RT60 = 500 ms and SNR = 5 dB.
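Scaling additive white Gaussian noise to a target SNR, as used to set up the 5 dB and 20 dB scenarios, can be sketched as follows (an illustrative helper, not the paper's code):

```python
import numpy as np

def add_awgn(x, snr_db, rng=None):
    """Add white Gaussian noise scaled so that the signal-to-noise
    ratio of the returned mixture equals `snr_db` (in dB)."""
    rng = np.random.default_rng(0) if rng is None else rng
    noise = rng.standard_normal(len(x))
    p_sig = np.mean(x ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return x + scale * noise

# example: a 1-second 440 Hz tone at 16 kHz, corrupted at 5 dB SNR
x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
y = add_awgn(x, snr_db=5.0)
```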
Figure 4. Room plot and the locations of the speakers and microphone array.
Speaker 1 and Speaker 2 are located at the angles 40° and 68°, respectively, in accordance with their locations in the room. Based on the block diagram of Figure 1, the eight microphone signals enter the analysis filterbank and, after decimation, are passed to the subband processing stage. In this stage, after applying the subband filters, the GCC-ML function is calculated for the microphone pairs of each subband. Finally, the produced histograms are fused; we use weighted averaging in the fusion step. It should be mentioned that only a subset of all microphone pairs is considered in each subband; for example, for the first subband, the GCC-ML is only computed for the microphone pairs (1,2), (2,3), (3,4), (4,5), (5,6), (6,7), (7,8), and (8,1). Similarly, for the other subbands, the process is done only on the corresponding microphone pairs.
Figure 5. The histogram of DOA candidates produced by the estimations at (a) the LPF output, and (b) the HPF output of the analysis filterbank branch H_3(z).
Figure 5 shows the histograms of the DOAs produced by the LPF and HPF outputs that are placed in the branch of the analysis filterbank H_3(z) (see Figure 1). As shown in Figures 5(a-b), most of the estimated angles are around the DOA of the first speaker, which in turn shows that this band mainly contains the spectral content of the first speaker.
We then compare the proposed method with its subband and fullband counterparts. Both are based on the same microphone array used in this research. In the subband method, we do not consider the nesting of the array and use the standard circular array; the microphone signals are decomposed into eight uniform subbands, the GCC-ML is calculated, and the DOA histograms are computed. In the fullband counterpart, the GCC-ML calculation and histogram computation are done on the fullband microphone signals. For these three methods, the correct estimation percentages have been calculated in different acoustic conditions. Figures 6(a-b) show the results for speakers 1 and 2, respectively. As shown in Figures 6(a-b), in the reverberant scenario the proposed method is slightly better than the subband method; however, in the noisy and noisy-reverberant scenarios, the results of the proposed method are much better than those of the subband and fullband methods.
Figure 6. Correct estimation percentage in different scenarios for (a) speaker 1, and (b) speaker 2.
Since the considered methods are based on the structure of the implemented array, their performance is related to the inter-speaker distance. Figures 7(a-b) show the correct estimation percentage in terms of the angular distance between the two speakers, comparing the above three methods. As shown in Figures 7(a-b), the use of the nested array with subband processing leads to superior localization accuracy at different inter-speaker distances. As the inter-speaker distance increases, the performance of all three methods improves; however, this improvement saturates for large distances between the speakers.
REFERENCES

[1] H. Nakashima and T. Mukai, "3D Sound Source Localization System Based on Learning of Binaural Hearing," in Proc. IEEE SMC 2005, pp. 3534-3539, 2005.
[2] Y. Sasaki et al., "2D Sound Localization from Microphone Array Using a Directional Pattern," in The 25th Annual Conference of The Robotics Society of Japan, 2007.
[3] M. Brandstein and D. Ward, Microphone Arrays, Springer Verlag, 2001.
[4] T. Nishiura, T. Yamada, S. Nakamura, and K. Shikano, "Localization of multiple sound sources based on a CSP analysis with a microphone array," in Proc. ICASSP, vol. 1, pp. 1053-1056, Istanbul, Turkey, 2000.
[5] S. Y. Lee and H. M. Park, "Multiple Reverberant Sound Localization Based on Rigorous Zero-Crossing-Based ITD Selection," IEEE Signal Processing Letters, vol. 17, no. 7, July 2010.
[6] M. S. Pederson, J. Larsen, U. Kjems, and L. C. Parra, "A survey on convolutive blind source separation methods," in Springer Handbook of Speech (Eds.: J. Benesty, Y. Huang, and M. Sondhi), November 2007.
[7] G. Lathoud and I. A. McCowan, "A Sector-Based Approach for Localization of Multiple Speakers with Microphone Arrays," in Workshop on Statistical and Perceptual Audio Processing (SAPA-2004), Jeju, Korea, Oct. 2004.
[8] A. Dehghan and H. R. Abutalebi, "SRP-ML: A Robust SRP-Based Speech Source Localization Method for Noisy Environments," in Proc. of the 18th Iranian Conference on Electrical Engineering (ICEE), Isfahan, Iran, May 2010.
[9] X. Lai and H. Torp, "Interpolation methods for time delay estimation using cross-correlation for blood velocity measurement," IEEE Trans. Ultrasonics, Ferroelectrics, and Frequency Control, vol. 46, no. 2, pp. 277-290, 1999.
[10] Y. R. Zheng, R. A. Goubran, and M. El-Tanany, "Experimental Evaluation of a Nested Microphone Array With Adaptive Noise Cancellers," IEEE Transactions on Instrumentation and Measurement, vol. 53, no. 3, June 2004.
[11] J. Allen and D. Berkley, "Image method for efficiently simulating small-room acoustics," Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950, 1979.
Figure 7. Correct estimation percentage versus the angular distance between the two speakers for (a) speaker 1, and (b) speaker 2.
V. CONCLUSION

In this paper, we proposed the combination of nested arrays, subband processing, and the GCC-ML function for the localization of multiple simultaneous speakers. The nested array is used to eliminate the spatial aliasing effect: by means of the nested array and the analysis filterbank, a specific subset of microphones is assigned to each subband, and subband processing is then performed on the signals from the assigned microphones. The GCC-ML function has been used to calculate the cross correlation between the subband signals. Finally, the histograms have been fused by the weighted averaging method. Experimental results in noisy and noisy-reverberant scenarios show that the correct estimation percentage of the proposed method is significantly better than those of the subband and fullband methods. In the reverberant environment, the performance of the proposed method is slightly better than that of the other methods. Also, the localization accuracy of the method improves as the inter-speaker distance increases.

VI. FUTURE WORK

One idea for future work is to first estimate the SNR and RT60 for each subband. We can then use the GCC-ML function for the subbands with high SNR and high RT60 values and employ the GCC-PHAT function for those with low SNR and high RT60. This may lead to high localization accuracy in reverberant environments.

As another idea for future work, the GCC-based localization core of the method can be replaced with other alternatives, such as the Average Directivity Pattern (ADP) or the State Coherence Transform (SCT), which extract the cross information between the microphones.