Multiple Sound Sources Localization with Perception Sensor Network


2013 IEEE RO-MAN: The 22nd IEEE International Symposium on Robot and Human Interactive Communication Gyeongju, Korea, August 26-29,2013

Quang Nguyen and JongSuk Choi

Abstract - This paper presents a general framework for localization of multiple sound sources in Cartesian coordinates with a perception sensor network (PSN). The PSN consists of Kinect sensors, each with a color camera, a depth camera, and an internal microphone array, and our experimental pan-tilt-zoom cameras with attached microphone arrays. Sound localization with the PSN is based on a three-stage analysis. Short-time narrowband directional sound localization based on phase difference of arrival (PDOA) is obtained at every time-frequency point, utilizing the sparseness assumption of audio mixtures. Multi-sensor directional localizations are transformed to Cartesian coordinates by a simple triangulation. The results are accumulated over all frequency bins for a block of frames and then clustered to obtain mid-term broadband localization. Furthermore, the framework is able to integrate any Bayesian filtering algorithm for long-term localization. Simulation results with four arrays (each with four microphones) show that the proposed framework successfully localizes three simultaneous sources when the distance among sources is about one meter.

I. INTRODUCTION

Recently, the availability of off-the-shelf consumer hardware such as the Microsoft Kinect and Asus Xtion has opened up much research in the robotics field. Using a single sensor mounted on a robot head, such as Willow Garage's PR2 or the TurtleBot, it is possible to perform face/body detection, source localization and mapping, and even 3D perception [1]. Besides utilizing color and depth images, those sensors also have an internal microphone array (four microphones for the Kinect, two for the Xtion), which enables localization of single or multiple surrounding sound sources [2]. Together with the work in [3], we combine multiple sensors into a network, aiming to extend the working area (i.e. the field of view of the cameras) and to enhance performance through the redundant information of multiple sensors. Specifically, this paper focuses on localization of multiple sound sources with the multiple microphone arrays of the perception sensor network. Traditional single-array sound source localization (SSL) techniques provide only the direction of the sound source, such as azimuth and elevation [4]; they lack the range (distance) between the sound source and the microphone array. Thus, sound localization in the Cartesian coordinate system requires at least two sensors for estimating the source's direction and then performing triangulation with the arrays' known positions. Although this approach sounds like a trivial extension of general single-array directional SSL techniques, it poses three challenges to maintaining good performance when multiple sound sources are present, as shown in Fig. 1. First, consider the case in which two sound sources are observed by two arrays: triangulation produces four intersection points, including the two targets and two confusing "ghosts". Second, typical sound localization algorithms for a single sensor have low accuracy, with a mean error around 5 degrees, which corresponds to an absolute error of 0.35 meter at a source-array distance of 4 meters; this error rises significantly as the number of simultaneous sound sources increases. Last but not least, a small microphone array itself has low discrimination resolution, i.e. two sources should be separated by 30 degrees or more.

*Research supported by the Implementation of Technologies for Identification, Behavior, and Location of Human based on Sensor Network Fusion Program through the Ministry of Knowledge Economy (Grant Number: 10041629).

Quang Nguyen is with University of Science and Technology (UST) and Korea Institute of Science and Technology (KIST), Seoul, Republic of Korea (email: [email protected]). JongSuk Choi is with Korea Institute of Science and Technology (KIST), Seoul, Republic of Korea (phone: +82-2-958-5618; fax: +82-2-958-5629; email: [email protected]).

978-1-4799-0509-6/13/$31.00 ©2013 IEEE



Fig. 1 Multiple microphone arrays fusion

Due to the complexity of the hardware setup, only a few techniques for sound localization in Cartesian coordinates have been suggested in the literature. For single-source localization, almost any trivial extension of directional SSL algorithms performs well. For multiple sources, Nesta and Omologo [5] presented a method based on spatial likelihood functions observed at multiple microphone pairs. Rather than a trivial summation of pairwise likelihoods, the authors proposed a Gaussian kernel cost function to reduce ambiguous locations with high likelihood (i.e. ghosts). Another approach is presented in [6], which estimates the TDOAs of distributed microphone pairs and formulates Bayesian tracking on the multidimensional TDOA; the spatial positions are derived by applying triangulation to each TDOA estimate independently. However, both methods were demonstrated with only two pairs of microphones. For larger sensor systems such as ours, which consists of 4 sensors with 6 pairs each, extending the multidimensional TDOA method is infeasible due to the computational complexity of the higher-dimensional space.


In this paper, we propose a new framework for multiple-source localization with multiple microphone arrays. The framework mainly relies on the sparseness of sound sources [7], i.e. there is only a single source at any single time-frequency point. Single-source directional sound localization in each array is obtained by a phase difference of arrival (PDOA) algorithm at every time-frequency point. Short-term narrowband localization in Cartesian coordinates is obtained by early fusion (i.e. linear least-squares triangulation) of the directional localizations of all arrays. A preliminary removal of localization outliers is performed by back-projecting the triangulation results onto the results of the PDOA algorithm. Surviving localizations are accumulated and clustered to obtain mid-term broadband results. Finally, a comparative analysis of the different approaches in terms of localization rate is performed.

This paper is organized as follows. Section II briefly describes directional sound localization with a single microphone array. Section III explains a trivial extension of the method in Section II and the Gaussian kernel method for localization in Cartesian coordinates with multiple arrays. The new localization framework is presented in detail in Section IV. Simulation results are shown in Section V, followed by concluding remarks and future work in Section VI.

II. SINGLE ARRAY SOUND LOCALIZATION

In this section, we provide a brief introduction to phase difference of arrival (PDOA) sound localization, which is used to estimate the direction of a sound source with a single small microphone array.

Fig. 2. Microphone array and sound source coordinate system

A. PDOA Map Calculation

Suppose that a microphone pair is located at $\mathbf{r}_i = [x_i, y_i]^T$ and $\mathbf{r}_j = [x_j, y_j]^T$, and that a sound source lies in the direction of azimuth $\theta$, as shown in Fig. 2. The time difference of arrival (TDOA) between the two microphones is

$$\tau_{ij}(\theta) = \frac{d_{ij}(\theta)}{c} \qquad (1)$$

where $c$ is the sound velocity and $d_{ij}(\theta)$ is the relative distance between the two microphones with respect to that sound source:

$$d_{ij}(\theta) = (\mathbf{r}_i - \mathbf{r}_j)^T \, [\cos\theta \;\; \sin\theta]^T \qquad (2)$$

In the frequency domain, i.e. at frequency $f$, the above TDOA can be expressed as a phase difference of arrival (PDOA):

$$M_{ij}(\theta, f) = 2\pi f \, \tau_{ij}(\theta) \qquad (3)$$

The PDOA above depends only on the microphone array configuration and the sound velocity. In practice, we pre-calculate the PDOA over a variety of directions and frequencies; the results are stored as the PDOA map. Ideally, the PDOA map is independent of the surrounding environment, although reverberation could degrade the sound localization performance.

B. PDOA Map Sound Localization

Suppose that the source signal at time frame $k$ and frequency $f$ is $S(k, f)$; then the received signal at microphone $i$ is

$$X_i(k, f) = A_i \, S(k, f) \, e^{-j 2\pi f \tau_i} \qquad (4)$$

where $A_i$ is the loss of magnitude and $\tau_i$ is the delay time from the sound source to the microphone.

Given the Short-Time Fourier Transform (STFT) of two microphones, the PDOA measurement at each time-frequency (TF) point is

$$\omega_{ij}(k, f) = \angle X_i(k, f) - \angle X_j(k, f) = 2\pi f (\tau_j - \tau_i) \qquad (5)$$

The narrowband PDOA error is calculated by comparing the PDOA measurement with the pre-calculated PDOA map:

$$e_{ij}(k, f, \theta) = \omega_{ij}(k, f) - M_{ij}(\theta, f) \qquad (6)$$

Since the signal is processed in the frequency domain over various microphone pairs, the broadband directional score is summed over all frequency bins and all microphone pairs:

$$C(k, \theta) = \sum_{ij} \sum_{f} \cos\!\big(e_{ij}(k, f, \theta)\big) \qquad (7)$$

The cosine function eliminates the effect of spatial aliasing at high frequencies [8]. Finally, the directions and the number of sources are determined by searching for all local peaks of $C(k, \theta)$:

$$\{\theta_l(k)\}_{l=1}^{L(k)} = \arg\operatorname{peak}_{\theta} \, C(k, \theta) \qquad (8)$$

where $L(k)$ is the number of local peaks of $C(k, \theta)$. In a real implementation, a maximum number of sources or a threshold on the peak values should be used to reduce false positives.
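To make the single-array procedure of Eqs. (1)-(8) concrete, the sketch below builds the PDOA map for one microphone pair and scores candidate azimuths with the cosine score. This is an illustrative NumPy reimplementation, not the authors' code; the far-field projection used for $d_{ij}(\theta)$ and the toy single-frame signal are our assumptions.

```python
import numpy as np

C_SOUND = 343.0  # sound velocity [m/s]

def pdoa_map(mic_i, mic_j, thetas, freqs):
    """PDOA map M_ij(theta, f) = 2*pi*f*tau_ij(theta), Eqs. (1)-(3).
    Far-field assumption: d_ij is the projection of the microphone
    baseline onto the unit vector pointing at azimuth theta."""
    u = np.stack([np.cos(thetas), np.sin(thetas)])            # (2, T) unit directions
    d_ij = (mic_i - mic_j) @ u                                # relative distance, Eq. (2)
    tau_ij = d_ij / C_SOUND                                   # TDOA, Eq. (1)
    return 2.0 * np.pi * freqs[:, None] * tau_ij[None, :]     # (F, T) map, Eq. (3)

def broadband_score(X_i, X_j, M):
    """C(k, theta) = sum_f cos(omega_ij(k,f) - M_ij(theta,f)) for one pair,
    Eqs. (5)-(7); X_i, X_j are one STFT frame (shape (F,)) per microphone."""
    omega = np.angle(X_i) - np.angle(X_j)                     # PDOA measurement, Eq. (5)
    err = omega[:, None] - M                                  # narrowband error, Eq. (6)
    return np.cos(err).sum(axis=0)                            # broadband score, Eq. (7)

# Toy check: a far-field source at 60 degrees observed by a 10 cm pair.
thetas = np.deg2rad(np.arange(0, 181, 5))
freqs = np.linspace(300.0, 3200.0, 50)
M = pdoa_map(np.array([0.05, 0.0]), np.array([-0.05, 0.0]), thetas, freqs)
X_i = np.ones(freqs.size, dtype=complex)                      # reference phase
X_j = np.exp(-1j * M[:, 12])                                  # delayed copy, theta = 60 deg
estimate = thetas[np.argmax(broadband_score(X_i, X_j, M))]
```

Because the score wraps phase through the cosine, spatial aliasing at high frequencies cancels out over the summation, which is the motivation given for Eq. (7).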

III. MULTI-ARRAY FUSION

A. Trivial extension of single array

Suppose that we have $N$ microphone arrays ($1, \dots, n, \dots, N$) distributed around the space. Knowing the arrays' positions, it is trivial to calculate the PDOA map with respect to the Cartesian coordinate system, $M^{(n)}_{ij}(f, x, y)$. Consequently, the PDOA error in Eq. (6) is replaced by

$$e^{(n)}_{ij}(k, f, x, y) = \omega_{ij}(k, f) - M^{(n)}_{ij}(f, x, y) \qquad (9)$$

and the broadband multi-array PDOA score in Eq. (7) is now

$$C_1(k, x, y) = \sum_{n} \sum_{ij} \sum_{f} \cos\!\big(e^{(n)}_{ij}(k, f, x, y)\big) \qquad (10)$$

Thus, the source localizations are determined by searching for peak values:

$$\{x, y\}_l(k) = \arg\operatorname{peak}_{x, y} \, C_1(k, x, y), \quad l = 1, \dots, L(k) \qquad (11)$$

where $L(k)$ is the number of local peaks of $C_1(k, x, y)$.

B. Enhancement by Gaussian kernel

The above method requires that $C_1(k, x, y)$ has local peaks only at the sources' positions. However, due to background noise, the self-noise of the microphones, or a reverberant environment, many "ghost" peaks are introduced into the total broadband score. Thus, rather than a simple summation over frequencies as in Eq. (10), Nesta and Omologo [5] apply a normalization over the narrowband PDOA errors:

$$\bar{e}(k, f, x, y) = \frac{1}{P} \sum_{n} \sum_{ij} \big| e^{(n)}_{ij}(k, f, x, y) \big|^2 \qquad (12)$$

where $P$ is the total number of microphone pairs over all arrays. The broadband PDOA score is then obtained by a kernel function:

$$C_2(k, x, y) = \sum_{f} g\big(\bar{e}(k, f, x, y)\big) \qquad (13)$$

where

$$g(\bar{e}) = \frac{1}{\sqrt{2\pi}\, h} \exp\!\left(-\frac{\bar{e}^{\,2}}{2 h^2}\right)$$

is a Gaussian kernel with bandwidth $h$ that aims to reduce ghosts, with a consequent increase in localization performance.

IV. PROPOSED FRAMEWORK

Fig. 3 illustrates our proposed framework for localization of multiple sound sources in Cartesian coordinates with multiple microphone arrays. In this section, we present in detail the first two main steps of the framework; the last, post-processing step remains future work.

A. Narrow-band Early Fusion

Utilizing the sparseness assumption of sound sources, a narrowband single-source azimuth is estimated at each time-frequency point:

$$\theta^{(n)}(k, f) = \arg\operatorname{peak}_{\theta} \, C^{(n)}(k, f, \theta) \qquad (14)$$

where $C^{(n)}(k, f, \theta) = \sum_{ij} \cos\!\big(e^{(n)}_{ij}(k, f, \theta)\big)$ is the single-array narrowband score and the error $e^{(n)}_{ij}(k, f, \theta)$ is calculated individually in each array as in Eq. (6).

The azimuth estimates of all arrays are fused to determine the x-y position by linear least-squares triangulation [9] with the known position (i.e. central point) of each microphone array, $(x^{(n)}, y^{(n)})$:

$$\mathbf{Q} = \begin{bmatrix} Q_x \\ Q_y \end{bmatrix} = (\mathbf{B}^T \mathbf{B})^{-1} \mathbf{B}^T \mathbf{U} \qquad (15)$$

where

$$\mathbf{B} = \begin{bmatrix} \tan\theta^{(1)} & -1 \\ \vdots & \vdots \\ \tan\theta^{(N)} & -1 \end{bmatrix}, \qquad \mathbf{U} = \begin{bmatrix} x^{(1)} \tan\theta^{(1)} - y^{(1)} \\ \vdots \\ x^{(N)} \tan\theta^{(N)} - y^{(N)} \end{bmatrix}$$

Here $\mathbf{Q}$ is the source's location and $\theta^{(n)}$ is the source's direction with respect to array $n$.

Furthermore, a bad triangulation result is eliminated:

accept $\mathbf{Q}$ for clustering if $\Delta_\theta < th_\theta$; reject if $\Delta_\theta \geq th_\theta$

where $\Delta_\theta = \| \mathbf{B}\mathbf{Q} - \mathbf{U} \|$ is the angular mean-square error of back-projecting the triangulation result onto the real estimations of the PDOA algorithm.
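The least-squares fusion of Eq. (15) together with the back-projection test can be sketched as below. This is an illustrative NumPy version; the specific bearing-line form $\tan\theta^{(n)} x - y = \tan\theta^{(n)} x^{(n)} - y^{(n)}$ and the threshold value are our assumptions.

```python
import numpy as np

def triangulate(array_pos, azimuths, th=0.1):
    """Linear least-squares triangulation, Eq. (15): each array n at
    (x_n, y_n) observes bearing theta_n, giving the line
    tan(theta_n) * x - y = tan(theta_n) * x_n - y_n.  Stacking the lines
    as B Q = U and solving for Q = [Qx, Qy] fuses all bearings; the
    residual ||B Q - U|| plays the role of the back-projection error
    used to accept or reject the triangulation."""
    t = np.tan(azimuths)
    B = np.stack([t, -np.ones_like(t)], axis=1)
    U = t * array_pos[:, 0] - array_pos[:, 1]
    Q, *_ = np.linalg.lstsq(B, U, rcond=None)
    residual = np.linalg.norm(B @ Q - U)
    return (Q, residual) if residual < th else (None, residual)

# Three arrays looking at a source at (1, 2): the bearings are consistent,
# so the residual is ~0 and the estimate is accepted.
arrays = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
bearings = np.arctan2([2.0, 2.0, -1.0], [1.0, -2.0, 1.0])
Q, res = triangulate(arrays, bearings)
```

Note that the $\tan\theta$ parameterization is singular near $\pm 90$ degrees; a homogeneous line representation would avoid this, but the form above stays close to Eq. (15).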

Fig. 3 Proposed framework for localization of multiple sound sources in the x-y plane using multiple microphone arrays. (Pipeline: multi-source environment → signal decomposer → short-term narrowband single source → buffer & score → GMM clustering → mid-term broadband multi-source → long-term broadband multi-source.)

B. Clustering

In this step, we find all clusters (sound sources) formed by

the surviving triangulation results $\mathbf{Q}$. Suppose that there are $G$ clusters in this step. Each cluster is represented by its mean $\mu_m$, covariance $P_m$, and weight $\alpha_m$. The summation of all clusters is a Gaussian mixture model (GMM):

$$f = \sum_{m=1}^{G} \alpha_m \, \mathcal{N}(\mu_m, P_m) \qquad (16)$$

where one Gaussian is assigned to one position (one source). The above model is fitted to the dataset by the iterative Expectation-Maximization (EM) algorithm [10]. To enhance the performance of GMM clustering, we assume that one Gaussian ($m = 0$) belongs to the background measurements (clutter) and has a uniform distribution; for simple implementation, this Gaussian is set to a fixed mean value ($\{0, 0\}$) and a very large covariance. The optimal model (the number of clusters $G$) is selected by the Bayesian information criterion [10].

V. EXPERIMENT RESULTS

This paper aims to develop an efficient and accurate localization method for the perception sensor network shown in Fig. 4. This network consists of two Kinect sensors (each with a linear array of four microphones) and two pan-tilt-zoom cameras with attached microphone arrays (four microphones in a three-dimensional array). However, due to the current lack of a reliable multi-sensor calibration, we evaluate our method only in a simulated environment. Room impulse responses were simulated by the image method using the ArrayToolbox [11] according to the geometry shown in Fig. 5. The mixtures were obtained by convolving each speech signal (sampled at 8 kHz) with the impulse responses and summing the contributions of the sources at each microphone. Furthermore, the mixtures were perturbed with noise recorded in a real office room to achieve an average SNR (signal-to-noise ratio) of 20 dB.
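The noise-perturbation step can be sketched as scaling the recorded noise so that the mixture reaches the target average SNR. The helper below is our illustration of this step, not part of the ArrayToolbox.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db=20.0):
    """Add recorded noise to a simulated multichannel mixture, scaled so the
    average signal-to-noise ratio equals snr_db (signals: (channels, samples))."""
    p_sig = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_sig / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

# Example: perturb a 4-channel, 1-second mixture to 20 dB average SNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 8000))
noise = rng.standard_normal((4, 8000))
mixed = mix_at_snr(clean, noise, snr_db=20.0)
```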

Fig. 5. Simulation setup with four microphone arrays (two 3D arrays + two linear arrays): full-side setting (left) and one-side setting (right)

A. Qualitative comments

In the first qualitative evaluation, we compared the localizations obtained by the different methods, as shown in Fig. 6. Since the cosine function of the trivial method has low resolution (i.e. the cosine of 30 degrees, 0.866, is very close to the cosine of 0 degrees), this method tends to merge all sources into a single one. Although it may be adequate for single-source localization, it is not suitable for the multi-source case. The kernel method narrows down the accepted error through the kernel bandwidth, and therefore significantly increases the resolution of multi-source localization. However, the normalization (Eq. (12)) behaves poorly with a large number of microphone pairs; our PSN setting has 24 pairs across 4 arrays. This is a result of the squaring of each term, which effectively weights large errors more heavily than small ones: the contributions of many low-error pairs are rejected even if only a single pair has a high PDOA error (Eq. (6)). A solution could be to use a median-squared normalization rather than the mean, since the median is more robust to outliers. However, validation of this suggestion is outside the scope of this paper.

(a) Trivial PDOA extension (Eq. (10))

(b) Gaussian kernel (Eq. (13))

Fig. 4 Perception sensor network setting

The PDOA algorithms process signals frame by frame with a frame length of 32 milliseconds (256 samples) and a shift of 20 milliseconds (160 samples). Each frame is transformed to the frequency domain with a 256-point FFT. Since background noise dominates the signal at low and high frequencies, the broadband summation is performed only in the band from 300 Hz to 3.2 kHz, which is sufficient for speech signals. The length of each block is 15 frames, corresponding to 300 milliseconds, with no overlap. Note that the block length is much shorter than the one in [5] (1 second), since we aim this approach at robotic applications with a short response time.
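With these parameters (8 kHz sampling, 256-point FFT), the 300 Hz - 3.2 kHz band contains exactly 93 bins. The framing can be sketched as below; the Hanning window is our assumption, as the paper does not name a window.

```python
import numpy as np

FS = 8000     # sampling rate [Hz]
N_FFT = 256   # 32 ms frame
HOP = 160     # 20 ms shift

def stft_band(x):
    """Frame the signal (32 ms frames, 20 ms shift), take a 256-point FFT,
    and keep only the 300 Hz - 3.2 kHz bins used in the broadband summation."""
    freqs = np.fft.rfftfreq(N_FFT, d=1.0 / FS)
    band = (freqs >= 300.0) & (freqs <= 3200.0)
    win = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    X = np.stack([np.fft.rfft(win * x[k * HOP : k * HOP + N_FFT])
                  for k in range(n_frames)])
    return X[:, band], freqs[band]

# One second of audio at 8 kHz yields 49 frames of 93 in-band bins.
X, band_freqs = stft_band(np.zeros(8000))
```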

(c) Scatter plot of proposed method (Eq. (15))

(d) 2D histogram of (c)

Fig. 6 Simulation results obtained by different methods. Red circles indicate the microphones' locations and red crosses indicate the true sources' locations.

It is worth noting that the block length plays an important role in determining the performance of each method. In our preliminary results, with a block length of 50 frames (1 second), the Gaussian kernel method can localize four or even five sound sources. However, with a shorter block length, for example 5 or 10 frames, this method is good only for one or two sources and struggles to localize more. Although the result of our proposed method is a mixture of sources, it is still well separated by GMM clustering; therefore, the proposed method achieves a better result with a short block length. The current block length (15 frames) was chosen to maximize the performance of our proposed framework. In the future, we expect that the integration of a Bayesian tracking system can reduce the block length, resulting in a faster response time for the PSN system. Also of importance is the fact that, although a simple k-means clustering algorithm could be applied in the clustering step (Section IV.B), GMM clustering is more suitable here since it is more robust to overlapping and/or ellipsoidal clusters than k-means.
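The clustering step of Section IV.B can be sketched with scikit-learn's GaussianMixture, selecting the number of clusters by BIC. This is a simplified sketch of the idea rather than the exact implementation: the paper's extra uniform clutter component is omitted here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_sources(points, max_sources=5, seed=0):
    """Fit GMMs with G = 1..max_sources components to the surviving
    triangulation results and keep the model with the lowest Bayesian
    information criterion (Section IV.B, Eq. (16))."""
    best, best_bic = None, np.inf
    for g in range(1, max_sources + 1):
        gmm = GaussianMixture(n_components=g, random_state=seed).fit(points)
        bic = gmm.bic(points)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best.means_, best.weights_

# Three well-separated toy sources about one meter apart.
rng = np.random.default_rng(1)
pts = np.concatenate([rng.normal(c, 0.05, (150, 2))
                      for c in [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]])
means, weights = cluster_sources(pts)
```

Because each component carries its own full covariance, elongated (ellipsoidal) clusters from the one-side setting are modeled naturally, which is the robustness advantage over k-means noted above.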

B. Evaluation on reverberant rooms

In the first quantitative evaluation, we measured the localization rate over 60 simulated mixtures (obtained by segmenting the signals into 300-millisecond blocks) under various reverberant room conditions. A sound source is considered successfully localized if its root-mean-square error (RMSE) is less than 0.5 meter.

Fig. 7 Simulation results: localization rate [%] of the trivial, kernel, and proposed methods for RT60 = 100, 200, 300, 400, and 500 ms

Fig. 7 shows the simulation results under various reverberant conditions with three simultaneous sources. It can be seen that the trivial extension method and the Gaussian kernel method have localization rates below 67%, which means they mostly detect only two of the three sources. On the other hand, the proposed method achieves approximately an 83% localization rate for three sources, equivalent to 2.5 sources out of three.

C. Evaluation on the number of sound sources

This simulation aims to verify the robustness of the SSL algorithms in a multi-source environment. As shown in Fig. 8, all methods work perfectly in the single-source case. When the number of sources increases, the trivial method can localize mostly a single source, resulting in the poorest performance. The Gaussian kernel method obtains mostly two sources due to rejection by the kernel bandwidth, as discussed earlier. Lastly, the proposed method achieves on average three sources, although its performance degrades significantly in the case of four or five sources.

Fig. 8 Simulation results for various numbers of sound sources [RT60 = 300 ms]

D. Evaluation on the one-side setting

In this experiment, all four microphone arrays are placed on the same side, as shown in the right setting of Fig. 5. Since all sensors are on one side, the spatial PDOA score is stretched into a long, thin ellipsoid, as discussed in [6]. The local peaks of the ellipsoidal PDOA score may fall at positions far from the arrays, yielding poor performance for the trivial algorithm. This setting is more challenging than the previous all-side setting; however, it is close to our real PSN setting, as shown in Fig. 4. Simulation results are shown in Fig. 9. For a high number of sources, such as four or five, the performance drops significantly. The main reason might be that the assumption of sparse sources no longer holds. From this point of view, the kernel method requires all microphone pairs to see the same source at any TF point; because of this tight assumption, its localization rate is good only with fewer than three sound sources. The proposed method requires all arrays to see the same source at any TF point, but it is acceptable if one or two pairs inside an array look toward another direction. The proposed method thus relaxes the sparseness assumption more than the kernel method and hence yields a better localization rate.

Fig. 9 Simulation results of the "one-side" setting for various numbers of sources [RT60 = 300 ms]


E. Computational Time

Finally, we consider the computational time of all algorithms, which is crucially important for real-time applications such as robotics. Although we implemented the algorithms non-optimally in Matlab and tested them offline, these results may give an insightful view of the cost of a real-time implementation.

TABLE I. COMPUTATIONAL TIME FOR PROCESSING A 10-SECOND MIXTURE

    Algorithm                     Time [second]
    Trivial extension of PDOA     94.99
    PDOA with Gaussian kernel     153.33
    Proposed                      28.83

Table I shows the computational time for processing 10 seconds (33 blocks) of mixtures. The computer is a normal PC with an Intel i5 processor running at 2.5 GHz and 2 GB of RAM. Both the trivial and Gaussian kernel methods rely on calculating the PDOA score in five dimensions (an x-y grid with 0.2-meter resolution, 24 microphone pairs, 93 FFT points, and 15 frames per block). Furthermore, the calculation involves cosine or Gaussian functions, resulting in a very high cost. For the proposed method, the calculation is reduced to four dimensions (azimuth instead of the x-y plane) and the triangulation is done simply by a matrix multiplication. Although GMM clustering is quite heavy, it runs only once per block and thus does not contribute much to the total cost.

VI. CONCLUSION AND FUTURE WORK

This paper presented a framework for localization of multiple sound sources in Cartesian coordinates with multiple microphone arrays. Rather than calculating a broadband PDOA score with a cost function, as existing methods do, we proposed to estimate the source location in every narrowband frequency bin. The results are accumulated in both time and frequency and clustered to obtain broadband localization. We carried out extensive simulation tests, and the results confirmed that the proposed algorithm delivers a better localization rate than existing algorithms in a variety of scenarios. In the near future, we will implement the proposed algorithm in C/C++ to test it in real time. For audio-only applications, it will be augmented with Bayesian filtering such as a Kalman filter or a particle filter. Furthermore, the proposed algorithm will also be integrated with a vision tracking algorithm to fully utilize the audio-visual sensors of the perception sensor network.

REFERENCES

[1] ROS OpenNI camera package. Website: http://www.ros.org/wiki/openni_camera
[2] Kinect for Windows SDK. Website: http://www.microsoft.com/en-us/kinectforwindows/develop/developer-downloads.aspx
[3] K. Ahn and J. Choi, "3D Simulation Approach for Multi-human Detection using Multi-sensor," submitted to IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN), 2013.
[4] J.-M. Valin, F. Michaud, and J. Rouat, "Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering," Robotics and Autonomous Systems, vol. 55, pp. 216-228, 2007.
[5] F. Nesta and M. Omologo, "Enhanced multidimensional spatial functions for unambiguous localization of multiple sparse acoustic sources," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 213-216, March 2012.
[6] A. Brutti and F. Nesta, "Tracking of multidimensional TDOA for multiple sources with distributed microphone pairs," Computer Speech & Language, 2012.
[7] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Transactions on Signal Processing, vol. 52, no. 7, pp. 1830-1847, 2004.
[8] B. Loesch and B. Yang, "Blind source separation based on time-frequency sparseness in the presence of spatial aliasing," in Proc. LVA/ICA, 2010, pp. 1-8.
[9] L. M. Kaplan, Q. Le, and N. Molnar, "Maximum likelihood methods for bearings-only target localization," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, 2001.
[10] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.
[11] K. Donohue, "Audio Systems Array Processing Toolbox," 2009. Website: http://www.engr.uky.edu/~donohue/audio/Arrays/MAToolbox.htm