Audio Engineering Society
Convention Paper 9299 Presented at the 138th Convention 2015 May 7–10 Warsaw, Poland
This paper was peer-reviewed as a complete manuscript for presentation at this Convention. This paper is available in the AES E-Library, http://www.aes.org/e-lib. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
Direction of Arrival Estimation of Multiple Sound Sources Based on Frequency-Domain Minimum Variance Distortionless Response Beamforming

Seung Woo Yu¹, Kwang Myung Jeon¹, Dong Yun Lee¹, and Hong Kook Kim¹,²

¹ School of Information and Communications, Gwangju Institute of Science and Technology (GIST), Gwangju 500-712, Korea
{yuseungwoo, kmjeon, ldy, hongkook}@gist.ac.kr

² Dept. of Electrical and Computer Engineering, City University of New York, NY 10031, USA
ABSTRACT

In this paper, a method for estimating the directions-of-arrival (DOAs) of multiple non-stationary sound sources is proposed on the basis of a frequency-domain minimum variance distortionless response (FD-MVDR) beamformer. First, an FD-MVDR beamformer is applied to multiple sound sources, where the beamformer weights are updated according to the surrounding environment to reduce the sidelobe effect of the beamformer. Then, multi-stage DOA estimation is performed to reduce the computational complexity of the beam search. Finally, a median filter is applied to improve the DOA estimation accuracy. It is demonstrated that the average DOA estimation error of the proposed method is smaller than those of methods based on conventional GCC-PHAT, MVDR-PHAT, and FD-MVDR, with lower computational complexity than that of the conventional FD-MVDR-based DOA estimation method.
1. INTRODUCTION
There are many acoustic array applications, such as sound localization, source separation, and direction estimation. Among them, direction-of-arrival (DOA) estimation has been applied to many areas, such as radar, communication, sonar, and aeronautics [1]. To this end, the generalized cross-correlation phase transform (GCC-PHAT) and steered response power phase transform (SRP-PHAT) have been proposed [2], and they are known to provide robust performance in reverberant environments [3]. However, these methods focus on tracking the single sound source that has the greatest phase power according to the correlation between the microphone channels [5].
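For context, the GCC-PHAT time-delay estimate for one microphone pair can be sketched as follows. This is a generic textbook illustration in NumPy, not the implementation evaluated in this paper; the function name `gcc_phat` and the small stabilizing constant are illustrative choices.

```python
import numpy as np

def gcc_phat(x, y, fs, max_tau=None):
    """Estimate the time delay (s) of x relative to y via GCC-PHAT."""
    n = 2 * max(len(x), len(y))          # zero-pad to avoid circular wrap
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12               # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # center lag 0
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# A copy delayed by 5 samples should yield a lag of 5 / fs seconds.
fs = 48000
rng = np.random.default_rng(0)
s = rng.standard_normal(4096)
delayed = np.roll(s, 5)
tau_hat = gcc_phat(delayed, s, fs)
print(tau_hat)
```

The PHAT weighting whitens the cross-spectrum so that only phase information drives the correlation peak, which is what gives the method its robustness to reverberation.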
In order to estimate the directions of multiple sound sources that are recorded simultaneously, beamforming-based DOA estimation methods have been proposed. Among them, the minimum variance distortionless response phase transform (MVDR-PHAT) [5, 6] minimizes the total acoustic power while the directional gain toward the desired sound source is set through the phase transform with fixed MVDR-PHAT weights [7]. However, the MVDR-PHAT beamformer is not suitable for detecting the directions of non-stationary sound sources, even when all possible directions are exhaustively searched [7]. To remedy this problem, a frequency-domain minimum variance distortionless response (FD-MVDR)-based DOA estimation method was proposed [8, 9], in which the covariance matrix for the beamformer weights is updated by taking environmental information into account in order to estimate the directions of non-stationary sound sources. However, its computational complexity is higher due to the weight update and the beam search.

In this paper, a computationally efficient DOA estimation method is proposed on the basis of FD-MVDR. In other words, the proposed method applies FD-MVDR to multiple non-stationary sound sources, and a multi-stage DOA estimation approach is performed to reduce the computational complexity of the beam search. After that, a median filter [10] is applied to improve the DOA estimation accuracy.

Following this introduction, Section 2 describes the proposed DOA estimation method, including the weight updating procedure of the FD-MVDR, the multi-stage DOA estimation, and the application of a median filter. Section 3 evaluates the performance of the proposed DOA estimation method applied to multiple non-stationary sound sources and compares it with those of conventional GCC-PHAT, MVDR-PHAT, and FD-MVDR. Section 4 concludes the paper.

2. PROPOSED DOA ESTIMATION METHOD
[Fig. 1 shows the processing chain: microphone signals z_0(n), …, z_{M−1}(n) → STFT → covariance matrix adaptation Φ_{Z,i}(ω) → FD-MVDR weight adaptation Ŵ_MVDR(ω), with steering input d(ω, θ) → multi-stage DOA estimation θ̂_{i,j} → median filtering → θ̄_{i,j}.]

Fig. 1. Block diagram of the proposed FD-MVDR-based DOA estimation method.
2.1. Overview

Fig. 1 shows a block diagram of the proposed FD-MVDR-based DOA estimation. As shown in the figure, the proposed method first converts the time-domain signal of the m-th microphone, z_m(n), into the frequency-domain signal, Z_m(ω), using a short-term Fourier transform (STFT), in which the number of frequency bins for the STFT is determined based on spatial sampling theory. A covariance matrix, Φ_Z(ω), is then calculated to update the FD-MVDR weight matrix, W(ω), once every frame. Next, a multi-stage approach is applied for the DOA estimation. That is, the region in which a sound source is located is first identified, and then a detailed search is performed within that region. As a result, the DOA at the i-th frame for the j-th sound source, θ̂_{i,j}, is estimated. Finally, a median filter is applied to θ̂_{i,j} to reduce the estimation error.

2.2. FD-MVDR beamformer

For a given j-th sound source, S_j(ω), the propagated signal recorded by an M-channel linear microphone array can be expressed in the frequency domain as [8, 9]

    Z_i(ω) = d(ω, θ) S_j(ω) + N(ω)                                      (1)
where N(ω) is the background noise and d(ω, θ) is the steering vector. In addition, Z_i(ω) is the (M×1)-dimensional column vector obtained by concatenating all the microphone signals at the i-th frame, such as

    Z_i(ω) = [Z_0(ω), …, Z_m(ω), …, Z_{M−1}(ω)]^T                       (2)

where T is the transpose operator and Z_m(ω) is the STFT of the input signal from the m-th microphone. In Eq. (1), the steering vector is represented as [8, 9]

    d(ω, θ) = [e^{jωτ_0(θ)}, …, e^{jωτ_m(θ)}, …, e^{jωτ_{M−1}(θ)}]^T    (3)
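The construction of the steering vector in Eq. (3), with the far-field delay model τ_m(θ) = f_s c⁻¹ l_{0,m} sin(θ) defined in Section 2.2, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the function name `steering_vector` and the default sampling rate and speed of sound are assumptions.

```python
import numpy as np

def steering_vector(omega, theta_deg, M, l, fs=48000.0, c=343.0):
    """Far-field ULA steering vector d(omega, theta) as in Eq. (3).

    omega: angular frequency normalized to the sampling rate (rad/sample),
    theta_deg: candidate DOA in degrees, M: number of microphones,
    l: spacing between adjacent microphones in meters.
    """
    theta = np.deg2rad(theta_deg)
    m = np.arange(M)
    # tau_m(theta) = fs * l_{0,m} * sin(theta) / c, with l_{0,m} = m * l
    tau = fs * (m * l) * np.sin(theta) / c       # delay in samples
    return np.exp(1j * omega * tau)

# Six microphones at 4.2 cm spacing, one frequency bin, DOA of 30 degrees.
d = steering_vector(omega=0.3 * np.pi, theta_deg=30.0, M=6, l=0.042)
print(d.shape)
```

Each entry has unit magnitude; only the phase varies across microphones, encoding the per-channel propagation delay for the hypothesized direction θ.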
where τ_m(θ) = f_s c⁻¹ l_{0,m} sin(θ) is the time delay that arises at the m-th microphone relative to the first one. In addition, f_s, c, and l_{u,v} are the sampling rate, the speed of sound in air, and the microphone spacing between the u-th and v-th microphones, respectively.

The FD-MVDR gives a weight to the input signals according to

    Ŵ_MVDR(ω) = argmin_W W^H(ω) Φ_Z(ω) W(ω)                             (4)

where Ŵ_MVDR(ω) minimizes Eq. (4) under the constraint W^H(ω) d(ω, θ) = 1 [8]. By applying a Lagrange multiplier to Eq. (4), the weights for the FD-MVDR beamformer are obtained as

    Ŵ_MVDR(ω) = Φ_Z^{−1}(ω) d(ω, θ) / (d^H(ω, θ) Φ_Z^{−1}(ω) d(ω, θ))   (5)

Therefore, the proper estimation of the covariance matrix, Φ_Z(ω), in Eq. (4) or (5) plays a crucial role in the performance of the FD-MVDR. Since the aim of this paper is to estimate the DOAs of target sound sources in non-stationary environments, Φ_Z(ω) is updated using the microphone signals up to the i-th frame, such as

    Φ_{Z,i}(ω) = (1/F) [ Z_{1,i}(ω)Z_{1,i}^H(ω) + PF   ⋯   Z_{1,i}(ω)Z_{M,i}^H(ω)
                                    ⋮                  ⋱              ⋮
                         Z_{M,i}(ω)Z_{1,i}^H(ω)        ⋯   Z_{M,i}(ω)Z_{M,i}^H(ω) + PF ]   (6)

where H denotes the conjugate transpose. In Eq. (6), Z_{m,i}(ω) = [Z_{m,i−F+1}(ω), …, Z_{m,i−1}(ω), Z_{m,i}(ω)] is a block matrix built from the (F−1) previous frames and the current i-th frame, and P is a regularization factor that prevents the covariance matrix from being singular.

In order to realize the FD-MVDR with reduced complexity, we need to select a minimum frequency for the DOA search. For a given minimum wavelength of the sources that can be separated by the linear microphone array, λ_min, we set the minimum frequency as ω_spatial = 2π f_min / f_s, where f_min = c / λ_min. In this paper, ω_spatial = 0.2761 for f_min = 2109 Hz and N_fft = 2048.

2.3. DOA estimation

In order to accelerate the DOA search, we first identify the region in which a sound source is assumed to be located. That is, we divide the whole space, I (θ ∈ [−90°, 90°]), into R different regions, such as

    I_r = { θ | Δ(r−1) − 90° ≤ θ < Δr − 90° },  r = 1, …, R             (7)

where Δ = 180°/R is the interval of each region. In this case, the central angle, θ_{C_r}, of I_r is defined as θ_{C_r} = Δ(r − 1/2) − 90°, r = 1, …, R. We then select the region index r^o that provides the maximum average of P^i_MVDR(θ_{C_r}), such as

    r^o = argmax_{1 ≤ r ≤ R} P^i_MVDR(θ_{C_r})                          (8)

where P^i_MVDR(θ) = 1 / ( d^H(ω, θ) Φ_{Z,i}^{−1}(ω) d(ω, θ) ). In fact, we use a smaller number of microphones than M in this coarse stage, because coarse spatial resolution improves region identification. Then, in the r^o-th region selected from Eq. (8), FD-MVDR beamforming with all M microphones is conducted, because the narrower beam pattern is advantageous for estimating the DOA, i.e.,

    θ̂_{i,j} = argmax_{θ ∈ I_{r^o}} P^i_MVDR(θ)                          (9)
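The two-stage search of Eqs. (7)–(9), built on the MVDR pseudo-power of Eq. (8) with the regularized covariance of Eq. (6), can be sketched as follows. This is a single-frequency-bin illustration, not the authors' implementation: the function names, the parameter choices (R = 6 regions, three coarse microphones, P = 10⁻³), and the synthetic test signal are all assumptions made for the sketch.

```python
import numpy as np

FS, C = 48000.0, 343.0   # sampling rate (Hz) and speed of sound (m/s)

def steering(omega, theta_deg, M, l):
    """Steering vector of Eq. (3) for a uniform linear array."""
    tau = FS * np.arange(M) * l * np.sin(np.deg2rad(theta_deg)) / C
    return np.exp(1j * omega * tau)

def mvdr_power(Z, omega, theta_deg, l, P=1e-3):
    """Pseudo-power P_MVDR(theta) = 1 / (d^H Phi^-1 d) at one frequency bin.
    Z holds the F most recent STFT frames of M microphones, shape (M, F)."""
    M, F = Z.shape
    Phi = Z @ Z.conj().T / F + P * np.eye(M)   # regularized covariance, Eq. (6)
    d = steering(omega, theta_deg, M, l)
    return 1.0 / np.real(d.conj() @ np.linalg.solve(Phi, d))

def two_stage_doa(Z, omega, l, R=6, M_coarse=3, step=1.0):
    """Coarse region selection (Eq. 8) followed by a fine search (Eq. 9)."""
    delta = 180.0 / R
    centers = [delta * (r + 0.5) - 90.0 for r in range(R)]
    # Stage 1: scan only the R region centers with a reduced sub-array.
    r_o = int(np.argmax([mvdr_power(Z[:M_coarse], omega, th, l)
                         for th in centers]))
    lo = delta * r_o - 90.0
    # Stage 2: dense scan inside the winning region with all M microphones.
    grid = np.arange(lo, lo + delta, step)
    return grid[int(np.argmax([mvdr_power(Z, omega, th, l) for th in grid]))]

# Synthetic check: one narrowband source at +40 degrees in light noise.
M, l, omega = 6, 0.042, 0.1 * np.pi      # 6 mics, 4.2 cm spacing, ~2.4 kHz bin
rng = np.random.default_rng(1)
S = rng.standard_normal(32) + 1j * rng.standard_normal(32)   # 32 source frames
Z = np.outer(steering(omega, 40.0, M, l), S)
Z = Z + 0.01 * (rng.standard_normal(Z.shape) + 1j * rng.standard_normal(Z.shape))
theta_hat = two_stage_doa(Z, omega, l)
print(theta_hat)
```

The coarse stage evaluates only R candidate angles instead of the full grid, which is where the complexity saving over an exhaustive FD-MVDR beam search comes from; the fine stage then restricts the dense scan to one region of width 180°/R.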
2.4. Median filtering

In order to reduce the DOA estimation errors, a smoothing technique using a median filter is applied. In particular, the median filtering is applied once every frame, shifting by one frame with a window size of 11. As a boundary condition, the values of the first and last frames are replicated. Consequently, a median-filtered version of θ̂_{i,j}, denoted θ̄_{i,j}, is obtained.

3. PERFORMANCE EVALUATION

The performance of the proposed method was compared with those of the conventional GCC-PHAT [2], MVDR-PHAT [5], and FD-MVDR [9] methods. The evaluation database was recorded in an anechoic room of approximately 12 m² with 30 different environmental sound sources, including TV news, air conditioning, and cooking sounds. The test database was composed of two different scenarios, one source and two sources, referred to as Test1 and Test2, respectively. The data was recorded at a 48-kHz sampling rate with 16-bit resolution. All the methods, including the proposed one, were designed to handle 2048 samples per frame, overlapping with half of the previous frame. The uniform linear microphone array was composed of six electret condenser microphones with an inter-microphone distance of l = 4.2 cm; thus, the total aperture was l_total = 21.0 cm. The performance of the DOA estimation was measured using the mean absolute error (MAE), ε, and the real-time factor (RTF). In other words, the MAE was defined as [11]

    ε = (1 / (I·J)) Σ_{i=1}^{I} Σ_{j=1}^{J} | θ^0_{i,j} − θ̄_{i,j} |      (10)

where I and J represent the numbers of frames and sound sources, respectively. In Eq. (10), θ^0_{i,j} and θ̄_{i,j} are the reference and estimated DOAs, respectively, at the i-th frame for the j-th source. Additionally, the RTF was defined as

    RTF = D_O / D_I                                                      (11)

where D_I is the length of the input audio signal in seconds, and D_O is the time elapsed for estimating the DOAs. The RTF was measured on an Intel(R) Core i7-4790K CPU clocked at 4 GHz with 32 GB of RAM, running the Windows 7 64-bit operating system.

[Fig. 2 residue: panels (a)–(f) plot amplitude and DOA (−90° to 90°) against the frame index (50–450) for baby-crying and cat sounds.]

Fig. 2. Comparison of DOA estimation performance for (a) a given waveform, (b) the reference DOAs, and the DOAs estimated by (c) GCC-PHAT, (d) MVDR-PHAT, (e) FD-MVDR, and (f) the proposed method.

Fig. 2 shows the DOAs estimated by the different methods when applied to the two sound sources (Test2 scenario) shown in Fig. 2(a). Fig. 2(b) shows the reference DOA for each sound source, and Figs. 2(c)–(f) show the DOAs estimated by GCC-PHAT, MVDR-PHAT, FD-MVDR, and the proposed method, respectively. As shown in the figure, MVDR-PHAT provided performance superior to GCC-PHAT, but MVDR-PHAT failed
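The median smoothing of Section 2.4 and the MAE of Eq. (10) can be sketched together for a single source track as follows. This is an illustrative sketch, not the evaluation code: the helper names `median_smooth` and `mae` and the synthetic outlier track are assumptions.

```python
import numpy as np

def median_smooth(doa, win=11):
    """Median-filter a per-frame DOA track (Section 2.4, window size 11).
    Edge frames are replicated as the boundary condition."""
    half = win // 2
    padded = np.concatenate((np.full(half, doa[0]), doa, np.full(half, doa[-1])))
    return np.array([np.median(padded[i:i + win]) for i in range(len(doa))])

def mae(ref, est):
    """Mean absolute error of Eq. (10) for a single source track."""
    return float(np.mean(np.abs(np.asarray(ref) - np.asarray(est))))

# A constant 30-degree reference track with a few spurious outlier frames.
ref = np.full(200, 30.0)
est = ref.copy()
est[[20, 90, 150]] = -60.0           # isolated estimation errors
print(mae(ref, est))                 # error before smoothing
print(mae(ref, median_smooth(est)))  # isolated outliers removed by the filter
```

Because the window covers 11 frames, any error burst shorter than 6 frames is voted out by the surrounding correct estimates, which is exactly the behavior that lowers the frame-wise MAE.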
to estimate the DOAs of non-stationary sound sources, such as air-conditioning noise. On the other hand, FD-MVDR outperformed MVDR-PHAT, and the proposed FD-MVDR-based estimation method produced DOAs similar to the reference DOAs.

Tables I and II compare the MAE of the different DOA estimation methods in the Test1 and Test2 scenarios, respectively. The tables show that GCC-PHAT achieved a smaller MAE in the Test1 scenario than in the Test2 scenario. Moreover, the proposed method achieved performance similar to FD-MVDR and a smaller MAE than GCC-PHAT and MVDR-PHAT. Table III compares the average RTF of GCC-PHAT, MVDR-PHAT, FD-MVDR, and the proposed method. The table shows that the proposed method was faster than the FD-MVDR method. Thus, it can be concluded that the proposed method obtained DOAs of multiple sources with accuracy comparable to FD-MVDR, while its complexity was lower than that of FD-MVDR.
4. CONCLUSION

In this paper, an FD-MVDR-based multi-stage DOA estimation method was proposed to reduce the computational complexity of DOA estimation compared to the conventional FD-MVDR-based method. The performance of the proposed method was evaluated in terms of the mean absolute error of the estimated DOAs and the processing time. The evaluation showed that the proposed method was faster than FD-MVDR while maintaining comparable estimation accuracy.
5. ACKNOWLEDGEMENTS

This work was supported in part by a National Research Foundation of Korea (NRF) grant funded by the Ministry of Science, ICT & Future Planning (MSIP) (No. 2012-010636), by the ICT R&D program of MSIP/IITP [2014-044-055-002, Loudness Based Broadcasting Loudness and Stress Assessment of Indoor Environment Noises], and by the MSIP under the ITRC (Information Technology Research Center) support program (NIPA-2014-H0301-14-1019) supervised by the NIPA (National IT Industry Promotion Agency).
TABLE I. Comparison of MAE (°) between different DOA estimation methods in the Test1 scenario.

Degree (°)   GCC-PHAT   MVDR-PHAT   FD-MVDR   Proposed
-60          18.77      19.99       20.27     16.59
-30           6.92       8.10        7.22      7.12
  0           2.35       1.58        4.93      6.41
 30           6.03       7.72        5.52      6.28
 60           8.72      12.45        8.99      8.18
Avg.          8.55       9.96        9.38      8.91
TABLE II. Comparison of MAE (°) between different DOA estimation methods in the Test2 scenario.

Degrees (°)   GCC-PHAT   MVDR-PHAT   FD-MVDR   Proposed
-60, 45       43.44      44.38       47.33     44.98
-45, 30       29.77      31.92       33.66     35.12
-30, 15       18.43      19.88       19.02     19.52
-20, 0        13.24      13.66        9.72     11.37
-5, 5         17.25      26.09        5.97      9.23
Avg.          24.42      27.18       23.14     24.04
TABLE III. Comparison of RTF between different DOA estimation methods in each scenario.

Scenario   GCC-PHAT   MVDR-PHAT   FD-MVDR   Proposed
Test1      4.07       23.96       29.86     13.66
Test2      4.06       21.98       27.21     16.29
Avg.       4.06       22.97       28.53     14.97

6. REFERENCES
[1] Z. Xiaofei, et al., "A novel DOA estimation algorithm based on eigen space," in Proc. of IEEE Int. Symp. on Microwave, Antenna, Propagation, and EMC Technologies for Wireless Communications, pp. 551-554 (2007).

[2] M. F. Font, Multi-microphone Signal Processing for Automatic Speech Recognition in Meeting Rooms, MS Thesis, Universitat Politecnica de Catalunya, Spain (2005).

[3] K. C. Kwak and S. S. Kim, "Sound source localization with the aid of excitation source information in home robot environments," IEEE Transactions on Consumer Electronics, vol. 54, no. 2, pp. 852-856 (2008).

[4] C. J. Chun and H. K. Kim, "Sound source separation using interaural intensity difference in real environments," in Proc. of 136th Audio Engineering Society Convention, New York, NY, preprint 8976 (2013).

[5] H. Do and H. F. Silverman, "Robust cross-correlation-based techniques for detecting and locating simultaneous, multiple sound sources," in Proc. of ICASSP, pp. 201-204 (2012).

[6] M. Brandstein and D. Ward, Microphone Arrays: Signal Processing Techniques and Applications, Springer-Verlag: Berlin, Germany (2001).

[7] J. J. M. Van de Sande, "Real-time beamforming and sound classification parameter generation in public environments," TNO Report TNO-DV 2012 S007 (2012).

[8] M. E. Lockwood, et al., "Effect of multiple nonstationary sources on MVDR beamformers," in Proc. of 37th Asilomar Conf. on Signals, Systems and Computers, pp. 730-734 (2003).

[9] M. E. Lockwood, et al., "Performance of time- and frequency-domain binaural beamformers based on recorded signals from real rooms," Journal of the Acoustical Society of America, vol. 115, no. 1, pp. 379-391 (2004).

[10] W. K. Pratt, Digital Image Processing, 4th Ed., John Wiley & Sons: Hoboken, NJ (2007).

[11] R. J. Hyndman and A. B. Koehler, "Another look at measures of forecast accuracy," International Journal of Forecasting, vol. 22, no. 4, pp. 679-688 (2006).