6th Congress of Alps-Adria Acoustics Association, 16.-17. October 2014, Graz, Austria, ISBN: 978-3-200-03921-6
LOCALIZATION OF MULTIPLE ACOUSTIC SOURCES WITH A DISTRIBUTED ARRAY OF UNSYNCHRONIZED FIRST-ORDER AMBISONICS MICROPHONES

Christian Schörkhuber, Philipp Hack, Markus Zaunschirm, Franz Zotter, Alois Sontacchi
Institute of Electronic Music and Acoustics ([email protected])

Abstract: Distributed microphone arrays exploit the spatial diversity of an acoustic scene and obtain higher signal-to-noise ratios than compact microphone arrays that sample the sound field only locally. However, as distances between distributed microphones grow, wired connections become infeasible and Wireless Acoustic Sensor Networks (WASN) need to be employed. Due to synchronization and bandwidth issues in such networks, many established algorithms, e.g. for self-calibration, source localization or source extraction, that rely on synchronized data and the availability of all microphone signals at a central processing unit are no longer applicable. In this paper we propose a solution to the acoustic source localization problem for multiple sources in 3 dimensions using unsynchronized sensor nodes, each equipped with a compact tetrahedral microphone array and a local processing unit.

Keywords: Wireless acoustic sensor networks, acoustic source localization

1. INTRODUCTION
Acoustic sensor networks consist of several spatially distributed sensor nodes that are randomly placed in the area of interest, each equipped with a small number of microphones (i.e. compact microphone arrays).¹ That is, microphone array processing algorithms can be applied to the individual compact arrays and the acquired data can be jointly processed at the fusion center. If distances between the sensor nodes are large or wired connections between the nodes and the fusion center are infeasible, Wireless Acoustic Sensor Networks (WASN) are preferred.

The main challenges for practical applications of such networks arise from the limited bandwidth, clock synchronization issues [1] and limited energy resources [2][3]. As transmission of raw audio data from all sensor nodes to the fusion center is frequently impossible, each sensor node performs local processing of its sensor signals and transmits only meta-data or compressed data to the fusion center or neighbouring nodes.

Applications of WASN are numerous, e.g. hands-free telephony, teleconferencing, signal enhancement, dereverberation, acoustic monitoring in domestic, urban, industrial or wildlife scenarios, ambient intelligence, assisted living, room acoustics, intelligent multi-microphone recording and virtual acoustics. To tap the full potential of such networks in real-world scenarios, numerous sub-tasks need to be addressed. We discriminate between three classes of problems, namely system parameter estimation, source parameter estimation and signal estimation.

System parameters describe the setting of the scene to be analysed, such as relative or absolute locations of microphones as well as the locations and acoustic properties of reflective surfaces (e.g. walls or furniture). That is, typical problems in this class are sensor self-calibration, room geometry inference and reflection coefficient estimation.

Source parameters provide meta-information about the content of an acoustic scene, such as the number of active sources, their locations and orientations along with corresponding spatial trajectories. Other problems belonging to this class are acoustic event detection and sound classification.

Finally, signal estimation aims at extracting actual signals from the acoustic scene. Source separation, dereverberation and speech enhancement are well researched topics residing in this class. Another example, however, is spatial sound acquisition, which refers to the problem of finding a compact (parametric) description of a complex acoustic scene to facilitate its transmission, alteration and reproduction.

With the Wireless Large-Scale Microphone Array (WiLMA) developed at the Institute of Electronic Music and Acoustics (IEM) [4] we are able to develop and evaluate algorithms for WASN under both synchronized and unsynchronized conditions with different microphone array configurations. In this contribution we present preliminary results for the 3-dimensional source localization task assuming unsynchronized sensor nodes.

¹ Depending on the application, the number of deployed sensor nodes can range from two to several thousand.
In section 2 we give a brief overview of the WiLMA system. In section 3 an approach for multiple source localization in 3D space using unsynchronized sensor nodes is presented. In section 4 we present measurement results for a setup of 4 sensor nodes and two sources. We conclude with section 5.

2. WILMA SYSTEM

Traditionally, the sensor nodes of a wireless sensor network that captures sound events are populated with low quality microphones, amplifiers and analogue-to-digital converters (ADCs) in order to decrease sensor node size, power consumption and cost. The Wireless Large-Scale Microphone Array (WiLMA) introduces high quality audio processing in wireless sensor networks. Each of the sixteen sensor modules (SM) allows for capturing of up to four high-end microphone signals, which in turn enables the use of a 4-channel microphone array (e.g. a first-order tetrahedral Ambisonics microphone) per SM. Thus, the system operates as a large scale microphone array with a total of 64 audio channels. A single SM and the used microphone array are depicted in Figure 1.

Fig. 1: Sensor module and microphone array (Oktava 4D-Ambient).

WiLMA was designed to cover a large range of research applications, from scenarios with unsynchronized, low bandwidth and low power devices, to research steered towards novel distributed recording techniques where synchronized, high quality audio streams are desired. To this end, the sensor modules can be wirelessly synchronized with reasonable accuracy.

2.1. Modes of Operation

To meet the different requirements mentioned above, each sensor module can operate in one of three modes.

Recording: The simplest operational mode is to record the microphone signals locally on the SM. The recordings should be time-stamped, so the recordings of multiple SMs can be time-aligned later in an offline process.

Streaming: For recording and monitoring purposes, it might often be desirable not to collect the audio data decentralised on the SMs and gather them later, but rather to have all audio channels available immediately at the fusion center by means of realtime streaming. This allows the sensor network to be used as a de-centralised capture-only multichannel sound card.

Processing: Each SM is also equipped with a local processing unit that can be used to analyze the local signals, parallelizing the computational load. The result could be either an enhanced signal, meta-data about the signal, or a mixture of both (e.g. using signal identification on the 4-channel recording, it is possible to stream only a mono version of the signal together with positional meta-data).

Mixed: Multiple connected SMs need not operate in the same mode. For instance, some SMs could be streaming audio, whereas other SMs would only do processing and send meta-data to the fusion centre.

For the source localization task described in the next section, all sensor modules operate in processing mode using unsynchronized nodes, each equipped with a first-order Ambisonics microphone (Oktava 4D-Ambient, see Figure 1).

3. SOURCE LOCALIZATION

In the last decades, numerous approaches to acoustic source localization have been proposed. With the main focus on applications in audio-video conferencing, traditionally compact microphone arrays were considered [5]. More recently, applications in ambient intelligence and large scale acoustic monitoring steered the research interest towards source localization using distributed microphone arrays. As we consider unsynchronized sensor nodes, we split the problem into two sub-tasks, namely the estimation of direction-of-arrival (DOA) histograms, which is performed locally on each sensor module, and the combination of DOA histograms to localize multiple sources in 3D space.

3.1. DOA Histogram Estimation

Arguably the most prominent approaches to DOA estimation are based on time-difference-of-arrival (TDOA) estimates between microphone pairs utilizing either the Generalized Cross Correlation with Phase Transform (GCC-PHAT) [6] or the Steered Response Power with Phase Transform (SRP-PHAT) [7]. As we use tetrahedral microphone arrays², however, we focus on a modal approach where the raw microphone signals are first decomposed into zeroth- and first-order spherical harmonics [8] (B-format). The zeroth-order harmonic corresponds to the omnidirectional signal w, and the first-order signals x, y and z correspond to dipoles for orthogonal axes (see Figure 2). Using these B-format signals rather than the raw microphone signals, DOAs can be estimated using an intensity vector based approach [9].

² Each of the individual microphones exhibits a cardioid characteristic.
To enable localization of multiple (concurrent) sources, the temporal and spectral disjointness property exhibited by most sound mixtures is exploited by first transforming all B-format signals into instantaneous frequency domain representations w[k], x[k], y[k] and z[k], respectively, using the windowed discrete Fourier transform, where k = 0, 1, ..., K−1 is the frequency index. The instantaneous sound intensity vector at frequency index k is computed as [9]

    I_k = (1 / (ρ₀ c)) ℜ{ w*[k] [x[k], y[k], z[k]]ᵀ },   (1)

where ρ₀ and c are the density and the speed of sound, respectively. We express the instantaneous DOA at the frequency index k by the unit vector

    θ_k = I_k / ‖I_k‖.   (2)

Fig. 2: Spherical harmonics of zeroth and first order. The phase information is color encoded, where dark and light gray correspond to phase values of 0 and π, respectively.

Assuming approximate spectral disjointness of all active sources, such that all frequency bins k are either dominated by a single source or noise, we can obtain DOA estimates for all active sources by computing a DOA histogram h over all frequency indices k. The binning of the histogram is defined by a dictionary {θ_q}_Q of Q directions. To attenuate the contribution of DOA estimates of noise components, we weight each estimate by a signal-strength-related factor g_k rather than simply counting the number of DOA estimates corresponding to a histogram entry h[q], i.e.

    h[q] = Σ_k g_k δ_{q,q̌_k},   (3)

    q̌_k = arg min_{q′} ‖θ_{q′} − θ_k‖,

    g_k = |w[k]|^α,  with α ≥ 0,

where the binary detector

    δ_{q,q̌_k} = 1 if q = q̌_k, and 0 otherwise,

indicates the closest dictionary entry θ_{q̌} by a Kronecker delta at the corresponding index. Note that for α = 0 the standard (unweighted) histogram is obtained; however, we recommend α > 0 to give more weight to DOA estimates that correspond to frequency indices containing more energy.

The codebook {θ_q} is chosen such that the unit sphere is partitioned into equal areas [10]. Another possible codebook would be to sample the unit sphere with increments of equal angles. However, the drawback of such a codebook is that, due to the increasing density of sampling points towards the poles, histogram entries will be biased towards the equator. A comparison between these two spherical sampling schemes is given in Figure 4. It can be observed that the equal area partition scheme in Figure 4b eliminates the increasing density of sampling points towards the poles that can be observed in Figure 4a.

Figure 3 shows an exemplary DOA histogram.

Fig. 3: Exemplary histogram h over azimuth and elevation angles for 3 concurrently active sources. Bin values are color coded, with dark regions corresponding to high values.

3.2. Source Localization in 3D
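The per-frame pipeline of Eqs. (1)–(3) reduces to a few array operations. The sketch below is illustrative and not the authors' implementation: the function names are ours, the constant 1/(ρ₀c) is dropped since it does not affect directions, and a Fibonacci lattice stands in for the equal-area partition codebook of [10].

```python
import numpy as np

def fibonacci_codebook(Q):
    """Near-uniform codebook {theta_q} of Q unit vectors.

    Stand-in for the equal-area partition of [10]; a Fibonacci
    lattice avoids the polar clustering of equiangular sampling."""
    i = np.arange(Q)
    z = 1.0 - (2.0 * i + 1.0) / Q                 # uniform in z
    phi = np.pi * (1.0 + 5**0.5) * i              # golden-angle azimuth
    r = np.sqrt(1.0 - z**2)
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)

def doa_histogram(w, x, y, z, codebook, alpha=1.0):
    """Weighted DOA histogram h[q] (Eq. 3) for one STFT frame.

    w, x, y, z : complex spectra (length K) of the B-format channels."""
    # Eq. (1), up to the constant 1/(rho_0 * c):
    I = np.real(np.conj(w)[:, None] * np.stack([x, y, z], axis=1))
    norm = np.linalg.norm(I, axis=1)
    valid = norm > 0
    theta = I[valid] / norm[valid, None]          # Eq. (2): unit DOA vectors
    # nearest codebook entry: max inner product = min Euclidean distance
    q_hat = np.argmax(theta @ codebook.T, axis=1)
    g = np.abs(w[valid]) ** alpha                 # energy-related weights g_k
    h = np.zeros(len(codebook))
    np.add.at(h, q_hat, g)                        # h[q] = sum_k g_k delta
    return h
```

For a single plane wave the histogram collapses onto the codebook entry closest to the true DOA; with several spectrally disjoint sources it develops one peak per source, as in Figure 3.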
By combining the information obtained by the histograms h_m, m = 1, 2, ..., M of M distributed sensor nodes (i.e. microphone arrays), multiple sources can be localized in 3-dimensional space. One obvious approach is to pick candidate source DOAs for each array and perform localization by some sort of triangulation. The notion of triangulation, however, usually degenerates for multiple sources and noisy DOA estimates. Furthermore, estimating the number of active sources based on a single local observation is both non-trivial and prone to errors [11]. Hence, we choose to jointly process all gathered histograms at the fusion center.

(a) Equiangular sampling. (b) Equal area sampling.
Fig. 4: Distribution of histogram bins on the unit sphere. The codebooks {θ_q}_Q with Q = 1800 are obtained by applying two different spherical sampling schemes.

In the approach presented here, localization of multiple sources is performed in an iterative fashion, similar to the method presented in [12]. Firstly, a discrete spatial likelihood function L(·) is computed as a weighted superposition of all histograms projected into the volume of interest, similar to the approach presented in [13]. Secondly, a single source is identified as the most prominent peak in L(·) and subsequently subtracted from all spherical histograms. This procedure is repeated until no salient peaks can be detected in L(·).

3.2.1. Spatial Likelihood by Histogram Projection
For the sake of readability and without loss of generality, we assume that all microphone arrays are aligned with the global coordinate system. Their locations in Cartesian coordinates are denoted by r_m, m = 1, 2, ..., M; the spatial likelihood function is computed for a discrete set of points g_p, p = 1, 2, ..., P, in the volume of interest, where P is the total number of points at which L(·) is evaluated. As illustrated in Fig. 5, we define the vector from sensor node m to grid point p as

    v_{p,m} = g_p − r_m   (4)

and express the direction of the grid point g_p as seen from the sensor at position r_m by the unit vector

    Φ_{p,m} = v_{p,m} / ‖v_{p,m}‖.   (5)

Fig. 5: 2D illustration of the relation between sensor node locations r_m, grid points g_p, look-direction vectors v_{p,m} and Φ_{p,m}, respectively, and the histogram codebook {θ_q}_Q. In this example, the histogram index q̌_{p,1} = arg min_{q′} ‖θ_{q′} − Φ_{p,1}‖ = 5.
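The projection geometry of Eqs. (4)–(5) amounts to a few vector operations per sensor. The following sketch (our naming, not the paper's code) also returns the nearest-codebook index used to read out the "projected" histogram entries, plus the distances needed later for the weighting D(‖v_{p,m}‖):

```python
import numpy as np

def project_grid(r_m, grid, codebook):
    """Look directions Phi_{p,m} from sensor m to all grid points
    (Eqs. 4-5) and their nearest codebook indices.

    r_m      : (3,) sensor node position
    grid     : (P, 3) evaluation points g_p
    codebook : (Q, 3) unit-vector dictionary {theta_q}
    Returns (q_check, dist): per grid point, the histogram index
    q_check_{p,m} and the distance ||v_{p,m}||."""
    v = grid - r_m                                 # Eq. (4): v_{p,m}
    dist = np.linalg.norm(v, axis=1)
    phi = v / dist[:, None]                        # Eq. (5): unit vectors
    # nearest theta_q: max inner product = min Euclidean distance
    q_check = np.argmax(phi @ codebook.T, axis=1)
    return q_check, dist
```

Since the grid and sensor positions are fixed, q_check and dist can be precomputed once per sensor node and reused for every new set of histograms.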
The histogram entry corresponding to the direction Φ_{p,m} is denoted by h_m[q̌_{p,m}], where

    q̌_{p,m} = arg min_{q′} ‖θ_{q′} − Φ_{p,m}‖.   (6)

With the column vector

    b_p = [h_1[q̌_{p,1}], h_2[q̌_{p,2}], ..., h_M[q̌_{p,M}]]ᵀ

we denote the "projected" histogram entries of all sensor nodes at grid point g_p. Given b_p, a first version of the spatial likelihood function can then be obtained by simply summing up the elements of b_p, that is

    L′(g_p) = b_pᵀ · 1_{M×1},   (7)

where 1_{M×1} is an all-one M × 1 vector. However, since DOA estimates from sensor nodes that are located far away from g_p are likely to be less reliable, it seems reasonable to apply individual weighting factors prior to summation. Hence, we introduce a weighting function D(‖v_{p,m}‖) ∈ (0, 1) that decreases with increasing distance between g_p and r_m. Here we choose

    D(‖v_{p,m}‖) = e^{−‖v_{p,m}‖² / λ}   (8)

with the decay factor λ, which depends on the size of the volume of interest. For the measurement results presented here, λ = 12.5. With the column vector

    d_p = [D(‖v_{p,1}‖), D(‖v_{p,2}‖), ..., D(‖v_{p,M}‖)]ᵀ   (9)

we obtain a second version of the spatial likelihood function given by

    L″(g_p) = b_pᵀ d_p.   (10)

If multiple sources are active, spurious peaks in the spatial likelihood function L″ can be observed. These ghost sources stem from intersecting DOA estimates corresponding to different sources. In Figure 7 the occurrence of ghost sources is illustrated for a simple 2D case. The value of the spatial likelihood function at the positions of the ghost sources can be quite large; however, it can be observed that only few arrays contribute to this value. Utilizing this observation, we apply a penalty function P(b_p) to L″ which decreases with increasing sparsity of b_p. We define

    P(b_p) = 1 / (G(b_p) + ε),   (11)

where G ∈ (0, 1) is the Gini index and ε > 0 is a parameter that controls the impact of G and avoids singularities. The Gini index is a measure of sparsity that is commonly used to assess the inequality in income or wealth and is defined as [14]

    G(c) = 1 − 2 Σ_{l=1}^{L} (c̃(l) / ‖c̃‖₁) · ((L − l + 0.5) / L),   (12)

where c̃ is a sorted version of c such that c̃(1) ≤ c̃(2) ≤ ... ≤ c̃(L).

Fig. 7: Illustration of the emergence of ghost sources in the spatial likelihood function when multiple sources and receivers are considered.

The final version of the spatial likelihood function is thus given by

    L(g_p) = b_pᵀ d_p · P(b_p).   (13)

The effect of applying P to the likelihood function can be observed in Figure 6, where a simulated 2D spatial likelihood function with and without P is depicted for 5 microphone arrays and 3 sources. In Figure 6b it can be observed that applying the sparseness penalty function effectively attenuates ghost sources and amplifies L in regions around actual sources.

(a) L″(g_p) = b_pᵀ d_p.  (b) L(g_p) = b_pᵀ d_p · P(b_p).
Fig. 6: Simulated 2D spatial likelihood functions for 3 sources (crosses) and 5 receiver arrays (circles).
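Equations (10)–(13) combine the projected histogram column b_p with the distance weights and the Gini-based sparsity penalty. A minimal sketch, assuming b_p and d_p have already been assembled for one grid point (function names are ours, not the paper's):

```python
import numpy as np

def gini(c, tiny=1e-12):
    """Gini sparsity index G(c) of a non-negative vector (Eq. 12).

    G = 0 for a uniform vector, G -> 1 for a one-hot (sparse) vector;
    `tiny` is only a numerical guard against division by zero."""
    c = np.sort(np.abs(np.asarray(c, dtype=float)))  # c~(1) <= ... <= c~(L)
    L = len(c)
    l = np.arange(1, L + 1)
    return 1.0 - 2.0 * np.sum(c / (c.sum() + tiny) * (L - l + 0.5) / L)

def spatial_likelihood(b, d, eps=0.1):
    """L(g_p) = b_p^T d_p * P(b_p)  (Eqs. 10, 11, 13) at one grid point.

    b   : (M,) projected histogram entries b_p
    d   : (M,) distance weights D(||v_{p,m}||) from Eq. (8)
    eps : the paper's epsilon > 0 (this particular value is our choice)."""
    P = 1.0 / (gini(b) + eps)                        # Eq. (11): penalty
    return float(b @ d) * P                          # Eq. (13)
```

A grid point supported by all arrays (dense b_p, small G) is boosted, whereas a ghost-source point supported by only a few arrays (sparse b_p, G near 1) is attenuated, which is exactly the behaviour visible in Figure 6b.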
(a) First iteration: L⁽⁰⁾([g_pX, g_pY, g_pZ = 1.24]ᵀ). (b) Second iteration: L⁽¹⁾([g_pX, g_pY, g_pZ = 1.30]ᵀ).
Fig. 8: 2D slices of the 3D spatial likelihood function for two iterations.

3.2.2. Source Subtraction
Given the spatial likelihood function L(g_p), a single source can be localized by picking the maximum peak, that is

    ŝ = arg max_{g_p} L(g_p).   (14)

For the case of multiple sources, the second highest peak in L(g_p) will most likely correspond to the position of a second source, since ghost sources are attenuated by the penalty function P(·). However, to further increase robustness, we suggest applying an iterative source subtraction method, similar to the Acoustic Map De-Emphasis approach proposed in [12]. After selecting the dominant source in L, the corresponding angular regions in all histograms h_m are attenuated by multiplication with an appropriate window function to obtain a new set of histograms h_m⁽¹⁾. A new spatial likelihood function L⁽¹⁾ is then computed from the h_m⁽¹⁾ to find the second source. This procedure is repeated until no more salient peaks can be detected in the spatial likelihood function.
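The subtraction loop of this section can be sketched generically: the likelihood evaluation and the histogram de-emphasis step are passed in as callables, and the relative stopping threshold is our hypothetical stand-in for the paper's "no salient peaks" criterion.

```python
import numpy as np

def localize_iterative(likelihood_fn, subtract_fn, histograms, grid,
                       max_sources=5, rel_thresh=0.5):
    """Iterative peak picking with source subtraction (Sec. 3.2.2, sketch).

    likelihood_fn(histograms, grid) -> (P,) values of L(g_p), Eq. (13)
    subtract_fn(histograms, s_hat)  -> histograms with the angular region
        around the detected source s_hat attenuated in every h_m
    rel_thresh : stop when the current peak drops below this fraction of
        the first peak (an assumed saliency criterion, not the paper's)."""
    sources = []
    ref_peak = None
    for _ in range(max_sources):
        L = likelihood_fn(histograms, grid)        # spatial likelihood
        p_max = int(np.argmax(L))
        if ref_peak is None:
            ref_peak = L[p_max]
        elif L[p_max] < rel_thresh * ref_peak:     # no salient peak left
            break
        sources.append(grid[p_max])                # Eq. (14): s_hat
        histograms = subtract_fn(histograms, grid[p_max])  # de-emphasis
    return sources
```

In the measurement below this corresponds to picking ŝ₂ from L⁽⁰⁾, de-emphasising the histograms, and picking ŝ₁ from L⁽¹⁾.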
       x [m]   y [m]   z [m]
R1      1.50    0.00    1.60
R2      0.00    1.50    1.60
R3     -1.50    0.00    1.60
R4      0.00   -1.50    1.60
S1     -0.96    0.37    1.36
S2      0.83   -1.77    1.48

Table 1: Locations of sensor nodes (R1-R4) and speech sources (S1, S2) for the measurement setup.
4. MEASUREMENT

As a proof of concept of the proposed source localization scheme, we used a test setup with 4 sensor nodes (R1-R4) and 2 static speech sources (S1, S2), with their respective locations given in Table 1. The estimation results after two iterations are depicted in Figure 8. The most prominent peak in L⁽⁰⁾(g_p) corresponds to source S2 and is detected at ŝ₂ = [0.54, −1.74, 1.30]. After the detected source has been subtracted from all spherical DOA histograms, the second source is detected as the most prominent peak in L⁽¹⁾(g_p) at ŝ₁ = [−0.87, 0.6, 1.24]. That is, the estimated source positions are off by 0.27 m and 0.34 m, respectively.

There are two main sources of error that contribute to this estimation error. Firstly, the deviation between ideal and real microphone characteristics results in a non-ideal B-format signal, which in turn causes erroneous DOA estimates. A thorough evaluation of the DOA estimation performance of the Oktava 4D-Ambient microphones is part of an ongoing project. Secondly, errors in the test setup concerning microphone locations and orientations directly affect the result, whereas accurate manual orientation adjustment is especially challenging. Therefore, as mentioned in Section 1, a self-calibration method is required that accurately estimates both the microphone positions and their orientations to eliminate these errors.

5. CONCLUSION AND FUTURE WORK
In this paper we presented a 3-dimensional multiple acoustic source localization scheme for distributed microphone arrays using unsynchronized sensor nodes equipped with 4-channel tetrahedral microphone arrays. It has been shown that spurious peaks in the spatial likelihood function can be attenuated by applying a sparsity penalty function and multiple sources can be detected in an iterative fashion by applying a source subtraction scheme. To improve the localization accuracy, a self-calibration method that accurately estimates microphone positions and orientations is required. The self-calibration of sensor nodes of the WiLMA system is part of an ongoing project and outside the scope of this contribution.
ACKNOWLEDGMENT
This work was supported by the project ASD, which is funded by the Austrian ministries BMVIT, BMWFJ, the Styrian Business Promotion Agency (SFG), and the departments 3 and 14 of the Styrian Government. The Austrian Research Promotion Agency (FFG) conducted funding under the Competence Centers for Excellent Technologies (COMET, K-Project), a program of the above-mentioned institutions. The development of the WiLMA hardware system was partly funded by the MINT/Masse program of the BMWF.

REFERENCES

[1] Shmulik Markovich-Golan, Sharon Gannot, and Israel Cohen: Blind sampling rate offset estimation and compensation in wireless acoustic sensor networks with application to beamforming, Proc. IWAENC, pp. 4-6, Sep. 2012.
[2] Alexander Bertrand: Signal processing algorithms for wireless acoustic sensor networks, Ph.D. thesis, Katholieke Universiteit Leuven, 2011.
[3] Alexander Bertrand: Applications and trends in wireless acoustic sensor networks: A signal processing perspective, 2011 18th IEEE Symposium on Communications and Vehicular Technology in the Benelux (SCVT), pp. 1-6, Nov. 2011.
[4] Christian Schörkhuber, Markus Zaunschirm, and IOhannes Zmölnig: WiLMA - Wireless Large-Scale Microphone Array, in Linux Audio Conference 2014, 2014.
[5] Michael Brandstein and Darren Ward: Microphone Arrays: Signal Processing Techniques and Applications, Springer, 2001.
[6] Charles Knapp and G. Clifford Carter: The generalized correlation method for estimation of time delay, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.
[7] Joseph Hector DiBiase: A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays, Ph.D. thesis, 2000.
[8] Peter G. Craven and Michael A. Gerzon: Coincident microphone simulation covering three dimensional space and yielding various directional outputs, US Patent 4,042,779, 1977.
[9] Ville Pulkki: Spatial sound reproduction with directional audio coding, Journal of the Audio Engineering Society, pp. 503-516, 2007.
[10] Paul Leopardi: A partition of the unit sphere into regions of equal area and small diameter, Electronic Transactions on Numerical Analysis, vol. 25, pp. 309-327, 2006.
[11] N. Madhu and R. Martin: A scalable framework for multiple speaker localization and tracking, in Proceedings of the International Workshop on Acoustic Echo and Noise Control, 2008.
[12] Alessio Brutti, Maurizio Omologo, and Piergiorgio Svaizer: Multiple source localization based on acoustic map de-emphasis, EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, pp. 1-17, 2010.
[13] Jelmer Tiete, Federico Domínguez, Bruno da Silva, Laurent Segers, Kris Steenhaut, and Abdellah Touhafi: SoundCompass: a distributed MEMS microphone array-based sensor for sound source localization, Sensors, vol. 14, no. 2, pp. 1918-1949, Jan. 2014.
[14] Niall P. Hurley and Scott T. Rickard: Comparing measures of sparsity, Nov. 2008.