
2016 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 13–16, 2016, SALERNO, ITALY

ON THE USE OF MACHINE LEARNING IN MICROPHONE ARRAY BEAMFORMING FOR FAR-FIELD SOUND SOURCE LOCALIZATION

Daniele Salvati, Carlo Drioli, Gian Luca Foresti

Department of Mathematics, Computer Science and Physics, University of Udine

978-1-5090-0746-2/16/$31.00 ©2016 IEEE

ABSTRACT

This paper presents a weighted minimum variance distortionless response (WMVDR) algorithm for far-field sound source localization in a noisy environment. The broadband beamforming is computed in the frequency domain by calculating the response power on each frequency bin and by fusing the narrowband components. A machine learning method based on a support vector machine (SVM) is used for selecting only the narrowband components that positively contribute to the broadband fusion. We investigate the direction of arrival (DOA) estimation problem using a uniform linear array (ULA). The skewness measure of the response power function is used as input feature for the supervised SVM learning. Simulations demonstrate the effectiveness of the WMVDR in an outdoor noisy environment.

Index Terms: Weighted minimum variance distortionless response, far-field sound source localization, microphone array, support vector machine, machine learning.

1. INTRODUCTION

Acoustic source localization is of interest in many applications such as audio surveillance, system monitoring, and scene analysis in outdoor noisy environments [1] [2] [3] [4] [5] [6] [7] [8], or teleconferencing systems, musical control interfaces, and medical intervention in indoor reverberant environments [9] [10] [11] [12]. Beamforming is a robust method for source localization, which aims at estimating the source position by maximizing the steered response power (SRP) output of the spatial filter in the source direction. The conventional data-independent beamformer [13] is based on a delay-and-sum procedure, which has its roots in time-series analysis.

In acoustic applications, the broadband SRP is in general computed by calculating the response power on each frequency bin and by fusing the narrowband components. The goal of a spatial filter is to leave undistorted the signal with a given DOA and to attenuate the response power for all the other directions. The minimum variance distortionless response (MVDR) [14] beamformer is a data-dependent filter which is aimed at minimizing the energy of noise and sources coming from different directions, while keeping a fixed gain on the desired DOA. To increase the spatial resolution of the broadband SRP, a normalization of the narrowband power maps is usually computed before the fusion of the maps. The SRP phase transform (SRP-PHAT) algorithm [15] considers only the phase information to compute the normalization. In [6], it is shown that a post-filter normalization of each narrowband power map substantially improves the spatial resolution for broadband sources of the MVDR beamformer [14], which is more robust against noise than other algorithms. Hence, the normalization provides a high-resolution broadband spatial filter.

Unfortunately, the normalization has the disadvantage of emphasizing the noise in those frequencies in which the signal-to-noise ratio (SNR) is low, resulting in large errors that may cause an inaccurate final frequency data combination. Recently, an SRP weighted MVDR (SRP-WMVDR) was proposed in [16], in which a machine learning approach for selecting narrowband components was introduced, using a radial basis function network (RBFN) classifier and the marginal distribution of the narrowband components as input. The approach in [16] was extended in [17], in which an SVM learning component, which outperforms the RBFN, and statistical features of the marginal distributions were used. Whereas [16] and [17] considered the problem of near-field sound localization in reverberant environments, the present paper discusses the application of the method to far-field sound localization.

In this paper, we study the problem of far-field sound localization in outdoor noisy environments and we investigate the use of machine learning for the direction of arrival (DOA) estimation problem using a uniform linear array (ULA). We use the skewness measure of the response power function as input feature for the supervised learning, since it has been demonstrated in [17] that it is an effective feature for distinguishing the constructively from the disruptively contributing SRP narrowband functions. We provide simulations for the DOA estimation of a car sound signal in noisy conditions using a ULA of four sensors. The machine learning component is trained with a USASI noise signal for a generalization of the method, as in [17].

2. THE SRP-WMVDR ALGORITHM FOR DOA ESTIMATION

Consider a single source that impinges upon an array of $N$ sensors in a ULA configuration with an angle $\theta_s$, and let $s(t)$ denote the signal generated by a sound source at time $t$. The output of the $n$th ($n = 1, 2, \ldots, N$) sensor in a free-field noisy environment can be expressed as

$$x_n(t) = \delta_n s(t - \tau_n) + v_n(t) \tag{1}$$

where $\delta_n$ is the attenuation of the sound propagation (inversely proportional to the distance from source $s(t)$ to microphone $n$), $\tau_n$ is the propagation time from the source to the $n$th sensor, and $v_n(t)$ is an additive noise, which is assumed to be uncorrelated. Given a block signal vector of length $L$ at time $t$

$$\mathbf{x}_n(t) = [x_n(t), x_n(t-1), \ldots, x_n(t-L+1)]^T \tag{2}$$

where $(\cdot)^T$ denotes the transpose operator, the frequency-domain transformation of the $n$th received signal is given by

$$X_n(f) = \sum_{k=0}^{L-1} x_n(t-k)\, e^{-\frac{2\pi j f k}{L}}, \qquad f = 0, 1, \ldots, L-1 \tag{3}$$

where $j$ is the imaginary unit. In vector notation, the data model of the frequency-domain observed vector for $N$ sensors can be expressed as

$$\mathbf{x}(f) = [X_1(f), X_2(f), \ldots, X_N(f)]^T. \tag{4}$$

The output of a beamformer $Y(f, \theta)$ for frequency $f$ in the look direction $\theta$ is obtained by weighting and summing the sensor signals

$$Y(f, \theta) = \mathbf{w}^H(f, \theta)\,\mathbf{x}(f) \tag{5}$$

where the superscript $H$ represents the Hermitian (complex conjugate) transpose, and $\mathbf{w}(f, \theta)$ is a vector for weighting and steering the data in the direction $\theta$. Then, the power spectral density (PSD) of the spatially filtered signal is

$$P(f, \theta) = E\{|Y(f, \theta)|^2\} = \mathbf{w}^H(f, \theta)\,\mathbf{\Phi}(f)\,\mathbf{w}(f, \theta) \tag{6}$$

where $\mathbf{\Phi}(f) = E\{\mathbf{x}(k, f)\,\mathbf{x}^H(k, f)\}$ is the PSD matrix of the microphone signals, and $E\{\cdot\}$ denotes mathematical expectation. We now select the first sensor ($n = 1$) as the reference sensor, and we assume that all the sensors are omnidirectional and identical. The array steering vector for a ULA in the far-field can then be expressed as

$$\mathbf{a}(f, \theta) = \left[1,\; e^{-\frac{j 2\pi f d \sin(\theta)}{cL}},\; \ldots,\; e^{-\frac{j 2\pi f d \sin(\theta)(N-1)}{cL}}\right]^T \tag{7}$$

where $c$ is the speed of wave propagation and $d$ is the distance between microphones for the ULA.

The MVDR beamformer [14] is a data-dependent spatial filter technique which is aimed at minimizing the energy of noise and sources coming from different directions, while keeping the gain on the desired direction constant. The MVDR filter using a diagonal loading regularization [18] [19] [20] [21] [22], which is a popular approach to improving numerical stability, relies on the solution of the following minimization problem

$$\begin{aligned} \text{minimize} \quad & \mathbf{w}^H(f, \theta)\,(\mathbf{\Phi}(f) + \mu \mathbf{I})\,\mathbf{w}(f, \theta) \\ \text{subject to} \quad & \mathbf{w}^H(f, \theta)\,\mathbf{a}(f, \theta) = 1 \end{aligned} \tag{8}$$

where $\mathbf{I}$ is the identity matrix, and the data-dependent $\mu$ factor is given by

$$\mu = \frac{1}{N}\,\mathrm{tr}[\mathbf{\Phi}(f)]\,\Delta \tag{9}$$

where $\Delta$ is the loading constant, and $\mathrm{tr}[\cdot]$ denotes the sum of the elements on the main diagonal of the PSD matrix. Solving (8) using the method of Lagrange multipliers, we obtain

$$\mathbf{w}(f, \theta) = \frac{(\mathbf{\Phi}(f) + \mu \mathbf{I})^{-1}\,\mathbf{a}(f, \theta)}{\mathbf{a}^H(f, \theta)\,(\mathbf{\Phi}(f) + \mu \mathbf{I})^{-1}\,\mathbf{a}(f, \theta)}. \tag{10}$$

Hence, the PSD of the regularized MVDR beamformer is given by

$$P_{\text{MVDR}}(f, \theta) = \frac{1}{\mathbf{a}^H(f, \theta)\,(\mathbf{\Phi}(f) + \mu \mathbf{I})^{-1}\,\mathbf{a}(f, \theta)}. \tag{11}$$
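As a rough illustration of the regularized MVDR power in (9) and (11), the following sketch scans a grid of look directions for a single frequency. It is not the authors' implementation: the helper names (`steering_vector`, `solve`, `mvdr_power`), the use of frequency in Hz with c = 343 m/s instead of the bin index scaling of (7), and the toy PSD matrix (one unit-power source at 20 degrees plus white noise) are all assumptions made here for a dependency-free example.

```python
import cmath
import math


def steering_vector(freq_hz, theta_deg, n_mics, spacing_m, c=343.0):
    """Far-field ULA steering vector with the first sensor as reference."""
    phase = -2.0 * math.pi * freq_hz * spacing_m * math.sin(math.radians(theta_deg)) / c
    return [cmath.exp(1j * phase * n) for n in range(n_mics)]


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    z = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][k] * z[k] for k in range(r + 1, n))
        z[r] = acc / aug[r][r]
    return z


def mvdr_power(psd, a, delta=1e-3):
    """Regularized MVDR power 1 / (a^H (Phi + mu I)^-1 a), mu = tr(Phi) * delta / N."""
    n = len(a)
    mu = delta * sum(psd[i][i].real for i in range(n)) / n
    loaded = [[psd[i][k] + (mu if i == k else 0.0) for k in range(n)] for i in range(n)]
    z = solve(loaded, a)  # z = (Phi + mu I)^-1 a
    denom = sum(a[i].conjugate() * z[i] for i in range(n))
    return 1.0 / denom.real


# Toy PSD matrix (assumption): unit-power source at 20 degrees plus white noise of power 0.1.
N, D, F = 4, 0.2, 1000.0
a_src = steering_vector(F, 20.0, N, D)
psd = [[a_src[i] * a_src[k].conjugate() + (0.1 if i == k else 0.0) for k in range(N)]
       for i in range(N)]

angles = list(range(-60, 61))
power_map = [mvdr_power(psd, steering_vector(F, th, N, D)) for th in angles]
doa = angles[power_map.index(max(power_map))]
print(doa)  # 20
```

The linear system is solved directly rather than forming the matrix inverse, which is sufficient because (11) only needs the product of the inverse with the steering vector. In practice a linear algebra library would replace the hand-written `solve`.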

The broadband PSD can be formalized by a normalized incoherent frequency fusion [6] defined as

$$P_{\text{SRP-NMVDR}}(\theta) = \sum_{f=f_{\min}}^{f_{\max}} \frac{P_{\text{MVDR}}(f, \theta)}{\max_{\theta}[P_{\text{MVDR}}(f, \theta)]} \tag{12}$$

where $f_{\min}$ and $f_{\max}$ denote the frequency range of the broadband source, and $\max[\cdot]$ denotes the maximum value. The normalization lends a high resolution to the spatial spectrum, but emphasizes the noise in those narrowband beamformers in which the SNR is low, thus providing a misleading contribution to the fusion. To avoid using this disruptive information, the WMVDR uses weighting factors that are modeled by an SVM classifier

$$P_{\text{SRP-WMVDR}}(\theta) = \sum_{f=f_{\min}}^{f_{\max}} \frac{\gamma_f + 1}{2}\, \frac{P_{\text{MVDR}}(f, \theta)}{\max_{\theta}[P_{\text{MVDR}}(f, \theta)]} \tag{13}$$

where $\gamma_f$ are binary variables, which take values $-1$ or $1$, and are estimated with the SVM supervised model defined as

$$\gamma_f = \mathrm{sgn}\left(\sum_{i=0}^{Q} \alpha_i \gamma^i \psi(\varsigma^i, \varsigma(f)) + b\right) \tag{14}$$
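The effect of the weights in (13) can be seen on a small made-up example. The sketch below is only illustrative: the function name `fuse_maps`, the three toy narrowband maps, and the hand-set values of $\gamma_f$ are assumptions (in the paper the $\gamma_f$ come from the SVM of (14)). With all bands kept, the fusion reduces to (12).

```python
def fuse_maps(narrowband_maps, gammas=None):
    """Sum peak-normalized narrowband maps; bands with gamma = -1 get weight 0."""
    if gammas is None:
        gammas = [1] * len(narrowband_maps)  # every band kept, as in eq. (12)
    fused = [0.0] * len(narrowband_maps[0])
    for power_map, gamma in zip(narrowband_maps, gammas):
        weight = (gamma + 1) / 2  # 1 for gamma = 1, 0 for gamma = -1
        peak = max(power_map)
        for i, value in enumerate(power_map):
            fused[i] += weight * value / peak
    return fused


angles = [-30, 0, 30]
maps = [
    [1.2, 2.0, 1.0],  # band peaking at 0 degrees
    [0.7, 1.0, 0.4],  # band peaking at 0 degrees
    [5.0, 1.0, 1.5],  # noisy band peaking at -30 degrees
]

nmvdr = fuse_maps(maps)                     # eq. (12): all bands contribute
wmvdr = fuse_maps(maps, gammas=[1, 1, -1])  # eq. (13): noisy band discarded

print(angles[nmvdr.index(max(nmvdr))])  # -30
print(angles[wmvdr.index(max(wmvdr))])  # 0
```

Because each map is divided by its own maximum, the noisy band contributes a full-height peak at the wrong angle, which is enough here to shift the unweighted maximum away from the direction on which the other two bands agree.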

where $Q$ is the training sample size, $\psi(\varsigma^i, \varsigma(f))$ is the inner-product kernel for the $i$-th training sample input $\varsigma^i$ and the sample input $\varsigma(f)$ for the narrowband PSD at frequency $f$, $\gamma^i$ is the $i$-th target value, which takes values in $\{1, -1\}$, $\alpha_i \geq 0$, and $b$ is a real constant. The parameters $\alpha_i$ can be found by solving the following convex quadratic programming problem

$$\begin{aligned} \max \quad & \sum_{i=0}^{Q} \alpha_i - \frac{1}{2} \sum_{i,j=0}^{Q} \alpha_i \alpha_j \gamma^i \gamma^j \psi(\varsigma^i, \varsigma^j) \\ \text{subject to} \quad & \sum_{i=0}^{Q} \alpha_i \gamma^i = 0, \qquad 0 \leq \alpha_i \leq \lambda, \qquad i = 1, 2, \ldots, Q \end{aligned} \tag{15}$$

where $\lambda$ is a user-specified parameter that provides a trade-off between the distance of the support vectors from the separating margin and the training error. In this paper, we use the sequential minimal optimization [23] algorithm for solving (15). By taking any support vector $\varsigma^j$ with $\alpha_j < \lambda$, the parameter $b$ can be calculated by

$$b = \gamma^j - \sum_{i=0}^{Q} \alpha_i \gamma^i \psi(\varsigma^i, \varsigma^j). \tag{16}$$

The sample input is calculated with the skewness of the normalized narrowband PSD. The skewness is a measure of the asymmetry of a distribution, and it is defined as

$$\varsigma(f) = \frac{E[(P_{\text{NMVDR}}(f, \theta) - \mu(f))^3]}{\left(E[(P_{\text{NMVDR}}(f, \theta) - \mu(f))^2]\right)^{\frac{3}{2}}} \tag{17}$$

where

$$P_{\text{NMVDR}}(f, \theta) = \frac{P_{\text{MVDR}}(f, \theta)}{\max_{\theta}[P_{\text{MVDR}}(f, \theta)]} \tag{18}$$

and $\mu(f)$ is the mean of $P_{\text{NMVDR}}(f, \theta)$ for all considered $\theta$.

The SVM classifier is trained on known DOAs. Given a reference USASI noise source signal that is fixed in training DOAs $\theta_t$, the estimated DOA using the NMVDR narrowband beamformer is

$$\hat{\theta}_t(f) = \underset{\theta}{\mathrm{argmax}}\,[P_{\text{NMVDR}}(f, \theta)]. \tag{19}$$

The contribution to the localization error related to frequency $f$ is

$$\Omega(f, \theta_t) = |\theta_t - \hat{\theta}_t(f)|. \tag{20}$$

The SVM classifier is trained to remove those narrowband components which contribute negatively to the localization. Namely, the $i$-th training set output $\gamma^i$ of the SVM is set as

$$\gamma^i = \begin{cases} -1, & \text{if } \Omega(f, \theta_t) > \eta \\ 1, & \text{if } \Omega(f, \theta_t) \leq \eta \end{cases} \tag{21}$$

where $\eta$ is a given threshold. The parameter $\eta$ was empirically set to the smallest value that allows the constructively and disruptively contributing bands to be effectively distinguished [17]. Finally, the DOA is estimated by picking the maximum value of the SRP-WMVDR

$$\hat{\theta}_s = \underset{\theta}{\mathrm{argmax}}\,[P_{\text{SRP-WMVDR}}(\theta)]. \tag{22}$$

Fig. 1. The simulated far-field setup with the positions for the training (40 positions) and the testing (100 positions) phases, respectively.

3. SIMULATIONS
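Before the experiments are described, the skewness feature of (17) and (18) and the training labels of (19) to (21) can be illustrated on toy data. This is a sketch under stated assumptions, not the authors' code: the expectation in (17) is taken here as the sample mean over the angle grid, the function names `skewness` and `training_label` are hypothetical, and the two power maps are invented to contrast a peaked map with a flat, noisy one.

```python
def skewness(power_map):
    """Sample skewness of a peak-normalized power map, taken over the angle grid."""
    peak = max(power_map)
    normalized = [p / peak for p in power_map]  # eq. (18)
    mean = sum(normalized) / len(normalized)
    m2 = sum((p - mean) ** 2 for p in normalized) / len(normalized)
    m3 = sum((p - mean) ** 3 for p in normalized) / len(normalized)
    return m3 / m2 ** 1.5  # eq. (17)


def training_label(true_doa, angles, power_map, eta=3.0):
    """Label a band +1 if its narrowband DOA error is within eta degrees, else -1."""
    estimated = angles[power_map.index(max(power_map))]  # eq. (19)
    return 1 if abs(true_doa - estimated) <= eta else -1  # eqs. (20)-(21)


angles = list(range(-60, 61, 10))  # 13 look directions, in degrees

# Invented maps: a single sharp peak at 20 degrees versus a flat, noisy response.
peaked = [0.1] * 13
peaked[angles.index(20)] = 1.0
noisy = [0.8, 1.0, 0.9, 0.8, 1.0, 0.9, 0.8, 1.0, 0.9, 0.8, 1.0, 0.9, 0.9]

print(skewness(peaked) > 3.0)               # True
print(abs(skewness(noisy)) < 1e-6)          # True
print(training_label(20.0, angles, peaked))  # 1
print(training_label(20.0, angles, noisy))   # -1
```

On these toy maps the sharply peaked response has a large positive skewness while the flat one is close to zero, which is the kind of separation a classifier trained on this feature could exploit. The actual decision boundary in the paper is learned by the SVM and is not reproduced here.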

The localization performance of the SRP-WMVDR is illustrated through a set of simulated experiments. A uniform linear array of 4 microphones was used. The distance between microphones was 0.2 m. The setup used in these simulations is depicted in Figure 1. The considered DOA range was [−62.24, 62.24] degrees. We consider the training source positions at a distance from the array of 5 m and 10 m. A USASI noise signal was used in the training phase. The parameter η was set to 3 degrees, since this value yields a good SVM learning. The training was conducted by setting an average SNR of 0 dB, which was obtained by adding mutually independent white Gaussian noise to each channel. The sampling frequency was 44.1 kHz, and the block size L was 2048 samples. When using the SVM learning in the NMVDR algorithm, it is required that the geometry of the array is kept similar in the training and in the testing or operating phase, and that a sufficiently high frequency resolution is used in the fast Fourier transform analysis step [17]. A frequency range between fmin = 50 Hz and fmax = 16000 Hz was used. The loading constant ∆ was set to 0.001, since a small value keeps a high resolution in each narrowband beamformer. The radial basis function kernel was adopted for the SVM classifier by setting σ = 1 and λ = 1 using a cross-validation in accordance with [17].

The test phase was conducted with a car sound signal in 100 random positions with a distance from the array in the range 5-10 m, as shown in Figure 1. The tests were conducted by setting different SNR conditions. We compare the performance of SRP-NMVDR [6], SRP-WMVDR, SRP-PHAT [15], and SRP-MVDR [14].

Performance is reported in terms of the accuracy rate (AR), i.e., the percentage of estimates with an error below a given threshold υ, and of the root mean square error (RMSE) over all the estimates. The RMSE is defined as

$$\text{RMSE} = \sqrt{\frac{\sum_{r=1}^{R} (\theta_s - \hat{\theta}_s^r)^2}{R}} \tag{23}$$

where $R$ is the total number of analysis frames and $\hat{\theta}_s^r$ is the estimated DOA for the frame $r$. The AR is defined as

$$\text{AR} = 100\, \frac{\sum_{r=1}^{R} \Gamma_r}{R} \tag{24}$$

where

$$\Gamma_r = \begin{cases} 1, & \text{if } |\theta_s - \hat{\theta}_s^r| < \upsilon \\ 0, & \text{if } |\theta_s - \hat{\theta}_s^r| \geq \upsilon. \end{cases} \tag{25}$$

Table 1. The AR (%) and RMSE (degrees) for varying SNR (dB).

            SRP-NMVDR         SRP-WMVDR         SRP-PHAT          SRP-MVDR
SNR (dB)    AR      RMSE      AR      RMSE      AR      RMSE      AR      RMSE
∞           84.00   0.757     84.00   0.757     84.00   0.757     69.44   1.834
15          83.49   0.768     83.85   0.766     81.08   0.807     63.54   2.087
10          81.87   0.796     82.82   0.780     72.51   1.011     58.85   2.125
5           72.03   1.042     76.36   0.927     56.10   1.579     50.69   2.288
0           47.72   1.951     53.10   1.706     36.46   2.706     35.95   3.294
-5          27.54   5.320     25.49   17.881    22.38   8.556     22.64   4.926
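The two metrics, the RMSE of (23) and the AR of (24) and (25), can be computed as in the following minimal sketch. The function names and the per-frame estimates are made up for illustration, and a single true DOA for all frames is assumed.

```python
import math


def rmse(true_doa, estimates):
    """Root mean square error over all frame estimates, eq. (23)."""
    return math.sqrt(sum((true_doa - e) ** 2 for e in estimates) / len(estimates))


def accuracy_rate(true_doa, estimates, upsilon=1.0):
    """Percentage of frames whose absolute error is strictly below upsilon, eqs. (24)-(25)."""
    hits = sum(1 for e in estimates if abs(true_doa - e) < upsilon)
    return 100.0 * hits / len(estimates)


# Made-up per-frame DOA estimates (degrees) for a source at 20 degrees.
estimates = [20.0, 20.5, 19.0, 23.0]

error = rmse(20.0, estimates)
rate = accuracy_rate(20.0, estimates)
print(rate)  # 50.0
```

Note that the estimate at 19.0 degrees has an error of exactly 1 degree and is therefore not counted as accurate, because the inequality in (25) is strict.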

The parameter υ was set to 1 degree. Table 1 shows the DOA estimation results. As we can observe, the SRP-WMVDR outperforms the other algorithms in the SNR range of 0-15 dB. The best performance for this system configuration is achieved in the noise-free condition (SNR=∞), in which all normalized algorithms have the same performance, whereas the SRP-MVDR DOA estimation is degraded due to its lower spatial resolution. For SNR levels of 10 and 15 dB, the improvement of the SRP-WMVDR over the SRP-NMVDR is only slight, since the DOA estimation performance is already very close to the best possible localization (noise-free case). When the noise level is high (SNR of 0 and 5 dB), the SRP-WMVDR provides a good performance, with a better accuracy for an error of less than 1 degree. In the case of SNR=-5 dB, the machine learning component is not able to select the narrowband components correctly. We can observe a large RMSE, since the noisy response power functions are characterized by a small peak corresponding to the source direction, which makes the learning more difficult for the SVM. Note that we have also tested the machine learning with a training phase using a USASI signal at an SNR level of -5 dB, but in that case the SVM fails in the classification of constructive and disruptive narrowband components.

Figure 2 shows the steered response power with an SNR of 5 dB for a specific block of analysis in which only the SRP-WMVDR estimates the correct DOA of the source. Figure 3 depicts the normalized broadband PSD of SRP-NMVDR, SRP-WMVDR, SRP-MVDR, and SRP-PHAT with an SNR of 10 dB for a specific block. We can observe the effect of the normalization, which leads to a power attenuation in the directions different from the source DOA. The SRP-WMVDR has a good attenuation in comparison to SRP-NMVDR and SRP-PHAT.

Fig. 2. The broadband PSD of SRP-NMVDR, SRP-WMVDR, SRP-MVDR, and SRP-PHAT with an SNR of 5 dB for a specific block. Only the SRP-WMVDR estimates the correct DOA of the source.

Fig. 3. The normalized broadband PSD of SRP-NMVDR, SRP-WMVDR, SRP-MVDR, and SRP-PHAT with an SNR of 10 dB for a specific block.

4. CONCLUSIONS

The paper has presented an SRP-WMVDR algorithm for far-field sound source localization in noisy conditions. We showed that a machine learning component in the broadband fusion improves the DOA estimation down to an SNR of 0 dB. The machine learning is based on an SVM which classifies the narrowband NMVDR functions into positively and negatively contributing bands. The skewness measure of the narrowband NMVDR function is used as input feature for the supervised learning. The performance of the method has been illustrated on acoustic data generated by numerical simulations. An extension of the experimental assessment to real-world datasets is presently under consideration.

5. REFERENCES

[1] J. C. Chen, K. Yao, T. L. Tung, C. W. Reed, and D. Chen, “Source localization of a wideband source using a randomly distributed beamforming sensor array,” International Journal of High Performance Computing Applications, vol. 16, no. 3, pp. 259–272, 2002.

[2] Q. Huang, Q. Zhong, and Q. Zhuang, “Source localization with minimum variance distortionless response for spherical microphone arrays,” Journal of Shanghai University, vol. 15, no. 1, pp. 21–25, 2011.

[3] D. J. Mennill, M. Battiston, D. R. Wilson, J. R. Foote, and S. M. Doucet, “Field test of an affordable, portable, wireless microphone array for spatial monitoring of animal ecology and behaviour,” Methods in Ecology and Evolution, vol. 3, no. 4, pp. 704–712, 2012.

[4] D. Salvati and S. Canazza, “Incident signal power comparison for localization of concurrent multiple acoustic sources,” The Scientific World Journal, vol. 2014, pp. 1–13, 2014.

[5] C. R. Rowell, D. Fee, C. A. L. Szuberla, K. Arnoult, R. S. Matoza, P. P. Firstov, K. Kim, and E. Makhmudov, “Three-dimensional volcano-acoustic source localization at Karymsky Volcano, Kamchatka, Russia,” Journal of Volcanology and Geothermal Research, vol. 283, pp. 101–115, 2014.

[6] D. Salvati, C. Drioli, and G. L. Foresti, “Incoherent frequency fusion for broadband steered response power algorithms in noisy environments,” IEEE Signal Processing Letters, vol. 21, no. 5, pp. 581–585, 2014.

[7] T. Hori, Z. Chen, H. Erdogan, J. R. Hershey, J. Le Roux, V. Mitra, and S. Watanabe, “The MERL/SRI system for the 3rd CHiME challenge using beamforming, robust feature extraction, and advanced speech recognition,” in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 2015, pp. 475–481.

[8] J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach, “BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge,” in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, 2015, pp. 444–451.

[9] F. Ribeiro, C. Zhang, D. A. Florencio, and D. E. Ba, “Using reverberation to improve range and elevation discrimination for small array sound source localization,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 7, pp. 1781–1792, 2010.

[10] D. Salvati, S. Canazza, and A. Rodà, “A sound localization based interface for real-time control of audio processing,” in Proceedings of the International Conference on Digital Audio Effects, 2011, pp. 177–184.

[11] D. Salvati and S. Canazza, “Adaptive time delay estimation using filter length constraints for source localization in reverberant acoustic environments,” IEEE Signal Processing Letters, vol. 20, no. 6, pp. 507–510, 2013.

[12] Y. Li, K. C. Ho, and M. Popescu, “A microphone array system for automatic fall detection,” IEEE Transactions on Biomedical Engineering, vol. 59, no. 5, pp. 1291–1301, 2012.

[13] M. S. Bartlett, “Smoothing periodograms from time-series with continuous spectra,” Nature, vol. 161, pp. 686–687, 1948.

[14] J. Capon, “High resolution frequency-wavenumber spectrum analysis,” Proceedings of the IEEE, vol. 57, no. 8, pp. 1408–1418, 1969.

[15] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, Microphone Arrays: Signal Processing Techniques and Applications, chapter Robust localization in reverberant rooms, Springer, 2001.

[16] D. Salvati, C. Drioli, and G. L. Foresti, “Frequency map selection using a RBFN-based classifier in the MVDR beamformer for speaker localization in reverberant rooms,” in Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH), 2015, pp. 3298–3301.

[17] D. Salvati, C. Drioli, and G. L. Foresti, “A weighted MVDR beamformer based on SVM learning for sound source localization,” Pattern Recognition Letters, 2016.

[18] H. Cox, R. Zeskind, and M. Owen, “Robust adaptive beamforming,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 35, no. 10, pp. 1365–1376, 1987.

[19] B. D. Carlson, “Covariance matrix estimation errors and diagonal loading in adaptive arrays,” IEEE Transactions on Aerospace and Electronic Systems, vol. 24, no. 4, pp. 397–401, 1988.

[20] J. Li, P. Stoica, and Z. Wang, “On robust Capon beamforming and diagonal loading,” IEEE Transactions on Signal Processing, vol. 51, no. 7, pp. 1702–1715, 2003.

[21] X. Mestre and M. A. Lagunas, “Finite sample size effect on minimum variance beamformers: optimum diagonal loading factor for large arrays,” IEEE Transactions on Signal Processing, vol. 54, no. 1, pp. 69–82, 2006.

[22] Y. L. Chen and J.-H. Lee, “Finite data performance analysis of MVDR antenna array beamformers with diagonal loading,” Progress In Electromagnetics Research, vol. 134, pp. 475–507, 2013.

[23] J. C. Platt, “Sequential minimal optimization: A fast algorithm for training support vector machines,” Tech. Rep., Microsoft Research, 1998.