Distributed Marginalized Auxiliary Particle Filter for Speaker Tracking in Distributed Microphone Networks

Qiaoling Zhang, Zhe Chen, Senior Member, IEEE, and Fuliang Yin
Abstract—In this paper, a distributed marginalized auxiliary particle filter (DMAPF) is proposed for speaker tracking in distributed microphone networks. After marginalizing the state-space model, the speaker's velocity and position are estimated using the distributed Kalman filter and the distributed auxiliary particle filter (APF), respectively. To overcome the adverse effects of noise and reverberation, a time difference of arrival selection scheme is presented to construct the local observation vector, based on the generalized cross-correlation function of the microphone pair signals at each node. Next, the multiple-hypothesis model is used as the local likelihood function of the DMAPF. Finally, the DMAPF is employed to estimate the time-varying positions of a moving speaker. The proposed method combines the strengths of the marginalized particle filter, APF, and distributed estimation. It can track the speaker successfully in noisy and reverberant environments. Moreover, it requires only local communication among neighboring nodes, and is scalable for speaker tracking. Experimental results reveal the validity of the proposed speaker tracking method.

Index Terms—Distributed marginalized auxiliary particle filter (DMAPF), distributed microphone networks, speaker tracking, time difference of arrival (TDOA).
I. INTRODUCTION

SPEAKER localization and tracking with microphone arrays have played an important role in many applications, such as security monitoring systems, human-machine interactions, and teleconference systems [1]–[3]. Traditional microphone arrays have some limitations due to their regular geometric structure and limited physical coverage [4]. Alternatively, distributed microphone networks (or arrays) [4]–[6] have been developed, where there are no strict constraints on the microphone deployment, thus their geometric structure could be irregular. However, most existing localization and tracking methods are based on traditional regular arrays, and they cannot work well for distributed microphone networks.
Manuscript received February 02, 2016; revised May 21, 2016 and June 25, 2016; accepted July 01, 2016. Date of publication July 11, 2016; date of current version August 12, 2016. This work was supported in part by the National Natural Science Foundation of China under Grants 61172107 and 61172110, in part by the National High Technology Research and Development Program (863 Program) of China under Grant 2015AA016306, in part by the Major Projects in Liaoning Province Science and Technology Innovation under Grant 201302001, and in part by the Fundamental Research Funds for the Central Universities of China under Grant DUT13LAB06. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Richard Christian Hendriks. The authors are with the School of Information and Communication Engineering, Dalian University of Technology, Dalian 116023, China (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TASLP.2016.2590146
Recently, several methods have been developed for speaker localization with distributed microphone networks [5]–[7]. However, these methods estimate the speaker's position based only on the current observations. In real environments, spurious observations due to noise or reverberation could even mask those of the true speaker, yielding poor localization performance. To address this problem, the Bayesian filter [8] deduces the current position estimate using a series of past observations as well as the current ones, and is thus more effective in coping with the adverse effects of noise and reverberation. Essentially, the Bayesian filter describes the tracking problem with a state-space model, and estimates the source state (e.g., position and velocity) by constructing the state posterior.

When the state-space model is linear and Gaussian, the Kalman filter (KF) [9] is the optimal Bayesian estimator. In speaker tracking scenarios, the observation model (typically, the time difference of arrival (TDOA)) is usually nonlinear, so the KF may fail. In [10], by linearizing the nonlinear observation model, a distributed extended Kalman filter (EKF) was presented for speaker tracking. However, its estimation performance can deteriorate when the model is highly nonlinear. By using the true nonlinear model and deterministically choosing a set of sample points to approximate the state distribution, the unscented Kalman filter (UKF) can obtain more accurate estimates than the EKF [9]. In [11], introducing the interacting multiple model to describe the speaker dynamics, a distributed UKF was proposed for tracking a moving speaker. Nevertheless, these methods assume that the state posterior is Gaussian. In practice, the speaker state posterior could be a multimodal density rather than Gaussian [15].

By approximating the Bayesian filter via Monte Carlo simulations, the particle filter (PF) can handle nonlinear and non-Gaussian problems [8]. The core of the PF is to approximate the state posterior by a set of weighted particles drawn from a proposal function. Since the PF was introduced to speaker tracking by Vermaak and Blake [12], several advanced PFs have been developed [13]–[17]. Ward et al. [13] proposed a general PF framework for speaker tracking. Talantzis [14] combined the PF with information theory to localize a speaker. However, these approaches employed the transition prior as the proposal function for particle filtering. Though simple, such a proposal does not consider the current observations and can be sensitive to outliers [8]. To address this problem, Levy et al. [15] used the extended Kalman particle filter (EKPF) for speaker tracking, where the multi-hypothesis model was used to produce unambiguous observations which were then exploited by the EKF for the proposal function. Zhong and Hopgood [16] also employed
the EKPF to estimate the speaker's position, whereas the EKF was adapted by the amplitude of the TDOA observations for the proposal function. Since each particle is associated with an EKF for its proposal, the EKPF is somewhat complicated. Alternatively, the auxiliary particle filter (APF) [17] exploits the current observations for particle sampling by simply introducing an auxiliary variable into the proposal function. To the best of our knowledge, the APF has not been used for speaker tracking.

In many speaker tracking problems, the speaker's position follows nonlinear dynamics, while its velocity follows linear dynamics [12], [13]. The marginalized particle filter (MPF) [18] has been shown to work well for tracking problems with mixed linear/nonlinear state-space models, where the linear state variable is marginalized out of the state-space model and estimated with the KF, whereas the nonlinear variable is estimated with the PF. So far, very few studies have focused on speaker tracking based on the MPF. Zhong and Hopgood [19] developed a time-frequency masking based MPF for multiple acoustic source detection and tracking, where the source position is marginalized and the latent variables are estimated with the PF. Motivated by the strengths of the APF and the MPF, we attempt to exploit the marginalization technique for speaker tracking, and estimate the speaker's position and velocity with the APF and the KF, respectively. Fritsche et al. [20] proposed a marginalized auxiliary particle filter (MAPF) that combined the APF with the MPF, but it was not employed for speaker tracking. Moreover, like most existing PF-based speaker tracking methods, it belongs to the centralized scheme and is infeasible for distributed networks due to some limitations [21]. For instance, the data of all nodes have to be collected and transmitted to a central processor for data processing, whose computational and communication costs will be overwhelming as the number of nodes increases. Besides, the centralized scheme is not robust, since any failure of the central processor will disable tracking for the whole network. To tackle these problems, decentralized estimation methods, especially the consensus-based distributed particle filters (DPFs), have become an increasingly attractive topic in wireless sensor networks [21]–[23]. In the DPFs, no central processor is required, and all individual nodes perform local PFs to obtain the global state estimate via only local communication among neighboring nodes. All such advantages make the DPFs an appealing alternative for distributed speaker tracking.

In this paper, we propose a distributed marginalized auxiliary particle filter (DMAPF). Differing from the distributed auxiliary particle filter (DAPF) developed by Üstebay et al. [22], which aims at nonlinear, non-Gaussian tracking problems using a selective gossip filter, the proposed DMAPF is derived for nonlinear, non-Gaussian tracking problems where the state-space models contain linear substructures, based on the consensus filter. Using the marginalization technique, the DMAPF estimates the nonlinear and linear state variables by the distributed APF and the distributed KF, respectively. After marginalizing the speaker state-space model, the DMAPF is employed for speaker tracking in distributed microphone networks. To cope with the adverse effects from noise and reverberation, a TDOA selection scheme is presented, based on the generalized cross-correlation
(GCC) function between the microphone pair signals at each node. Through the TDOA selection, the underlying unreliable TDOAs are removed, and the remaining ones are used to construct the local observation vector. Besides, a multiple-hypothesis model is used as the likelihood of the DMAPF. Finally, the DMAPF is employed to estimate the time-varying speaker positions. The proposed method combines the merits of the MPF and the APF. Moreover, it requires only local communication among neighboring nodes, and is scalable for speaker tracking.

The rest of this paper is organized as follows. In Section II, the distributed microphone network and the signal model are first described, and the fundamentals of the Bayesian filter, the APF, and the centralized MAPF are then reviewed. In Section III, the DMAPF algorithm is proposed. In Section IV, the DMAPF-based speaker tracking method is formulated. In Section V, the computational complexity is addressed. In Section VI, simulations and real-world experiments are given. Finally, some conclusions are drawn in Section VII.

II. BACKGROUND

A. Distributed Microphone Networks and Signal Model

Consider the speaker tracking problem in a noisy and reverberant environment, where S unsynchronized nodes constitute a distributed microphone network, and each node contains a pair of synchronized microphones. The positions of all microphones can be obtained by some self-calibration methods in advance (see, e.g., [24]). The communication topology of the network is modeled as an undirected communication graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} = \{1, 2, \ldots, S\}$ is the vertex set and each vertex $s \in \mathcal{V}$ denotes a unique node, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the edge set, and each edge $(s, s') \in \mathcal{E}$ indicates a communication link between nodes $s$ and $s'$. The neighborhood of node $s$ is defined as the subset $\mathcal{N}_s = \{s' \in \mathcal{V} \mid (s, s') \in \mathcal{E}\}$, and the number of its elements is $N_s$.

Consider a single speech source scenario. The signal acquired by the $m$th microphone ($m = 1, 2$) of node $s$ can be modeled as [12]

$$y_s^m(t) = g_s^m(t) * y(t) + e_s^m(t), \quad \forall s \in \mathcal{V} \qquad (1)$$

where $t$ is the discrete time index, $*$ denotes the convolution operation, $y(t)$ is the speech source signal, $g_s^m(t)$ is the room impulse response from the source to the $m$th microphone at node $s$, and $e_s^m(t)$ is the additive noise, assumed to be uncorrelated with $y(t)$ and with $e_{s'}^{m'}(t)$ for $m \neq m'$ or $s \neq s'$.

Normally, the tracking procedure is required to operate in real time, so the received signals of all microphones are processed in frames [13]. Let $N_f$ and $k$ be the length and time index of the frame, respectively. At time frame $k$, the collected signal of the $m$th microphone at node $s$ is [12]

$$\mathbf{y}_s^m(k) = \big[y_s^m(kN_f),\, y_s^m(kN_f + 1),\, \ldots,\, y_s^m(kN_f + N_f - 1)\big]. \qquad (2)$$
Assume that the speaker will not move significantly during one time frame. The tracking problem is to estimate the speaker’s position at each frame based on the acquired signals of all microphones.
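As a concrete illustration of the framing in (2), the short Python sketch below (our own illustration, not code from the paper; the helper name is hypothetical) segments a microphone signal into non-overlapping frames of length N_f:

```python
import numpy as np

def segment_into_frames(y, n_f):
    """Split a 1-D microphone signal into non-overlapping frames of length n_f,
    mirroring Eq. (2); trailing samples that do not fill a frame are dropped."""
    num_frames = len(y) // n_f
    # Row k holds [y(k*n_f), ..., y(k*n_f + n_f - 1)].
    return y[:num_frames * n_f].reshape(num_frames, n_f)

# Example: a 9.6 s signal at 16 kHz with 120 ms frames gives 80 frames,
# matching the simulation setup used later in the paper.
fs = 16000
y = np.random.randn(int(9.6 * fs))          # stand-in for a recorded channel
frames = segment_into_frames(y, int(0.12 * fs))
print(frames.shape)                          # (80, 1920)
```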
B. Bayesian Filter for Tracking Problem

Let $\mathbf{x}_k$ denote the spatial information (e.g., position and velocity) of a moving target at time $k$. Each node $s$ in the network makes an observation $\mathbf{z}_{s,k}$ about $\mathbf{x}_k$. The system dynamics can be described by a state-space model as [8]

$$\mathbf{x}_k = \mathbf{f}_k(\mathbf{x}_{k-1}, \mathbf{u}_k) \qquad (3a)$$
$$\mathbf{z}_{s,k} = \mathbf{h}_{s,k}(\mathbf{x}_k, \mathbf{v}_{s,k}), \quad \forall s \in \mathcal{V} \qquad (3b)$$

where $\mathbf{f}_k$ is a general transition function, $\mathbf{u}_k$ is the process noise, and $\mathbf{h}_{s,k}$ and $\mathbf{v}_{s,k}$ are the observation function and noise of node $s$, respectively. The overall observation vector across the entire network is $\mathbf{z}_k = [(\mathbf{z}_{1,k})^T, (\mathbf{z}_{2,k})^T, \ldots, (\mathbf{z}_{S,k})^T]^T$, where the superscript $T$ denotes the matrix transpose.

The Bayesian filter for the tracking problem sequentially estimates the posterior density $p(\mathbf{x}_k|\mathbf{z}_{1:k})$ of $\mathbf{x}_k$ conditioned on all available observations up to time step $k$, i.e., $\mathbf{z}_{1:k} = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_k\}$. Based upon such a posterior, the minimum mean-square error (MMSE) estimate of $\mathbf{x}_k$ can be obtained [8]. Theoretically, given the prior density $p(\mathbf{x}_0)$ and the previous density $p(\mathbf{x}_{k-1}|\mathbf{z}_{1:k-1})$, the posterior $p(\mathbf{x}_k|\mathbf{z}_{1:k})$ can be computed using the following recursion [8]

$$p(\mathbf{x}_k|\mathbf{z}_{1:k-1}) = \int p(\mathbf{x}_k|\mathbf{x}_{k-1})\, p(\mathbf{x}_{k-1}|\mathbf{z}_{1:k-1})\, d\mathbf{x}_{k-1} \qquad (4a)$$
$$p(\mathbf{x}_k|\mathbf{z}_{1:k}) = \frac{p(\mathbf{z}_k|\mathbf{x}_k)\, p(\mathbf{x}_k|\mathbf{z}_{1:k-1})}{\int p(\mathbf{z}_k|\mathbf{x}_k)\, p(\mathbf{x}_k|\mathbf{z}_{1:k-1})\, d\mathbf{x}_k} \qquad (4b)$$

where $p(\mathbf{x}_k|\mathbf{x}_{k-1})$ is the transition density, $p(\mathbf{z}_k|\mathbf{x}_k)$ is the likelihood density, and $p(\mathbf{x}_k|\mathbf{z}_{1:k-1})$ is the predicted density of $\mathbf{x}_k$ given the previous observations $\mathbf{z}_{1:k-1}$. When the model in (3) is linear and Gaussian, the KF [9] provides the optimal solution to the Bayesian filter. For nonlinear and non-Gaussian cases, the PF [8] has been proved to work well.

C. Auxiliary Particle Filter

The PF [8] is a sequential Monte Carlo approximation of the Bayesian filter. Its key idea is to approximate the posterior $p(\mathbf{x}_k|\mathbf{z}_{1:k})$ with a set of $N$ weighted particles $\{X_k^i, w_k^i\}_{i=1}^N$, where $X_k^i$ is the $i$th particle drawn from a proposal function $q(\mathbf{x}_k|\mathbf{z}_{1:k})$, and $w_k^i$ is the weight associated with $X_k^i$. Usually, the transition prior $p(\mathbf{x}_k|\mathbf{x}_{k-1})$ is employed as the proposal function [12], [13]. Though easily sampled, such a proposal does not consider the current observations, and can be sensitive to outliers [8]. To tackle this problem, the APF [17] draws the current particles from the previous ones based on the current observations, so that the generated particles are most likely to be close to the true state. Specifically, the APF attempts to draw pairs $(X_k^i, j^i)$, $i = 1, 2, \ldots, N$, from the proposal function $q(\mathbf{x}_k, j|\mathbf{z}_{1:k})$ [17]

$$(X_k^i, j^i) \sim q(\mathbf{x}_k, j|\mathbf{z}_{1:k}) = p(\mathbf{x}_k|X_{k-1}^j)\, p(\mathbf{z}_k|\mu_k^j)\, w_{k-1}^j \qquad (5)$$

where $\sim$ denotes the sampling operation, $j$ is an auxiliary variable that is used simply to aid in drawing particles, $j^i$ is the index of the particle at time $k-1$ corresponding to $X_k^i$, and $\mu_k^j$ is a statistical characterization of $\mathbf{x}_k$ given $X_{k-1}^j$.

The weight $w_k^i$ associated with $X_k^i$ is calculated as [17]

$$w_k^i = \frac{p(X_k^i, j^i|\mathbf{z}_{1:k})}{q(X_k^i, j^i|\mathbf{z}_{1:k})} = \frac{p(\mathbf{z}_k|X_k^i)\, p(X_k^i|X_{k-1}^{j^i})\, w_{k-1}^{j^i}}{p(\mathbf{z}_k|\mu_k^{j^i})\, p(X_k^i|X_{k-1}^{j^i})\, w_{k-1}^{j^i}} = \frac{p(\mathbf{z}_k|X_k^i)}{p(\mathbf{z}_k|\mu_k^{j^i})}. \qquad (6)$$

With the weighted particle approximation of the posterior $p(\mathbf{x}_k|\mathbf{z}_{1:k})$, the MMSE estimate of $\mathbf{x}_k$ can be obtained as

$$\hat{\mathbf{x}}_k^{\mathrm{MMSE}} \approx \int \mathbf{x}_k \sum_{i=1}^{N} \tilde{w}_k^i\, \delta(\mathbf{x}_k - X_k^i)\, d\mathbf{x}_k = \sum_{i=1}^{N} \tilde{w}_k^i X_k^i \qquad (7)$$

where $\delta$ denotes the Dirac delta function, and $\tilde{w}_k^i$ is the normalized weight, i.e., $\tilde{w}_k^i = w_k^i / \sum_{j=1}^{N} w_k^j$.

D. Centralized Marginalized Auxiliary Particle Filter

In many tracking problems [18], the state vector $\mathbf{x}_k$ can be partitioned into a nonlinear variable $\mathbf{x}_k^n$ and a linear variable $\mathbf{x}_k^l$. To address such problems, the MPF [18] attempts to marginalize the linear state variable $\mathbf{x}_k^l$ out of the posterior $p(\mathbf{x}_k|\mathbf{z}_{1:k})$, i.e.,

$$p(\mathbf{x}_k|\mathbf{z}_{1:k}) = p(\mathbf{x}_k^l, \mathbf{x}_k^n|\mathbf{z}_{1:k}) = p(\mathbf{x}_k^l|\mathbf{x}_k^n, \mathbf{z}_{1:k})\, p(\mathbf{x}_k^n|\mathbf{z}_{1:k}) \qquad (8)$$

and estimates the posterior densities $p(\mathbf{x}_k^n|\mathbf{z}_{1:k})$ and $p(\mathbf{x}_k^l|\mathbf{x}_k^n, \mathbf{z}_{1:k})$ with the PF and the KF, respectively. Since the dimension of $p(\mathbf{x}_k^n|\mathbf{z}_{1:k})$ is lower than that of $p(\mathbf{x}_k|\mathbf{z}_{1:k})$ and the linear variable is estimated with the optimal linear filter, compared with the general PF that estimates $p(\mathbf{x}_k|\mathbf{z}_{1:k})$ directly, the MPF performs particle filtering in a lower-dimensional state space and can produce estimates with lower covariance [18]. Furthermore, in many cases, the computational complexity of the MPF can be significantly reduced [18], [20], [25].

Combining the merits of the APF and the MPF, Fritsche et al. [20] proposed a MAPF. However, it belongs to the centralized scheme and is infeasible for distributed microphone networks due to several limitations [21], namely: (i) large computational and communication burdens for the central processor, since the observations of all nodes have to be collected and transmitted to it for data processing; (ii) poor robustness, i.e., any failure of the central processor will disable tracking for the entire network; and (iii) lack of scalability, as adding extra nodes will increase the computational and communication costs of the central processor significantly, especially in speaker tracking scenarios.
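To make the APF recursion of Section II-C concrete, the sketch below (our own illustration, not the authors' code) implements one auxiliary-sampling step corresponding to (5)–(7) for a generic transition and likelihood; the function names and arguments are assumptions introduced only for this example:

```python
import numpy as np

def apf_step(particles, weights, transition_mean, sample_transition, likelihood, z):
    """One APF step. particles: (N, d) array, weights: (N,) array, z: current observation.
    transition_mean(x) returns mu_k^j = E{x_k | x_{k-1} = x}; sample_transition(x)
    draws from p(x_k | x_{k-1} = x); likelihood(z, x) evaluates p(z | x)."""
    N = len(weights)
    mu = np.array([transition_mean(p) for p in particles])            # mu_k^j
    # First-stage weights of Eq. (5): w_{k-1}^j * p(z_k | mu_k^j)
    eta = weights * np.array([likelihood(z, m) for m in mu])
    eta /= eta.sum()
    idx = np.random.choice(N, size=N, p=eta)                          # auxiliary indices j^i
    new_particles = np.array([sample_transition(particles[j]) for j in idx])
    # Second-stage weights of Eq. (6): p(z_k | X_k^i) / p(z_k | mu_k^{j^i})
    w = np.array([likelihood(z, x) for x in new_particles]) / \
        np.array([likelihood(z, mu[j]) for j in idx])
    w /= w.sum()
    x_mmse = (w[:, None] * new_particles).sum(axis=0)                 # Eq. (7)
    return new_particles, w, x_mmse
```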
III. DISTRIBUTED MARGINALIZED AUXILIARY PARTICLE FILTER

Since the centralized MAPF is infeasible for distributed microphone networks due to these limitations, a DMAPF is proposed in this section such that each node can obtain the global state estimate based only on local communications among neighboring nodes.

Suppose that the state-space model in (3) can be partitioned into the following form [18]

$$\mathbf{x}_k^n = \mathbf{f}_k^n(\mathbf{x}_{k-1}^n) + \mathbf{F}_{k-1}^n \mathbf{x}_{k-1}^l + \mathbf{G}_{k-1}^n \mathbf{u}_{k-1}^n \qquad (9a)$$
$$\mathbf{x}_k^l = \mathbf{f}_k^l(\mathbf{x}_{k-1}^n) + \mathbf{F}_{k-1}^l \mathbf{x}_{k-1}^l + \mathbf{G}_{k-1}^l \mathbf{u}_{k-1}^l \qquad (9b)$$
$$\mathbf{z}_{s,k} = \mathbf{h}_{s,k}(\mathbf{x}_k^n) + \mathbf{H}_{s,k} \mathbf{x}_k^l + \mathbf{v}_{s,k}, \quad \forall s \in \mathcal{V} \qquad (9c)$$

where $\mathbf{f}_k^n$ and $\mathbf{f}_k^l$ are general nonlinear functions, $\mathbf{F}_{k-1}^n$, $\mathbf{F}_{k-1}^l$, $\mathbf{G}_{k-1}^n$, and $\mathbf{G}_{k-1}^l$ are the real transition matrices, $\mathbf{H}_{s,k}$ is the observation matrix, and $\mathbf{u}_k^n$ and $\mathbf{u}_k^l$ are the process noises for the nonlinear and linear variables, respectively. Assume that $\mathbf{u}_k^n$, $\mathbf{u}_k^l$, and $\mathbf{v}_{s,k}$ are zero-mean, white Gaussian noises, with

$$E\left\{ \begin{bmatrix} \mathbf{u}_k^n \\ \mathbf{u}_k^l \\ \mathbf{v}_{s,k} \end{bmatrix} \begin{bmatrix} \mathbf{u}_{k'}^n \\ \mathbf{u}_{k'}^l \\ \mathbf{v}_{s',k'} \end{bmatrix}^T \right\} = \begin{bmatrix} \mathbf{Q}_k^n \delta_{kk'} & \mathbf{0} & \mathbf{0} \\ \mathbf{0} & \mathbf{Q}_k^l \delta_{kk'} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & \mathbf{R}_{s,k} \delta_{kk'} \delta_{ss'} \end{bmatrix} \qquad (10)$$

where $E$ denotes the mathematical expectation, and $\mathbf{Q}_k^n$, $\mathbf{Q}_k^l$, and $\mathbf{R}_{s,k}$ are the covariance matrices of $\mathbf{u}_k^n$, $\mathbf{u}_k^l$, and $\mathbf{v}_{s,k}$, respectively.

The objective here is to estimate the joint posterior $p(\mathbf{x}_k^l, \mathbf{x}_k^n, j|\mathbf{z}_{1:k})$ in a distributed fashion based on the state-space model in (9). Note from (9) that $\mathbf{x}_k^n$ is subject to nonlinear dynamics, whereas $\mathbf{x}_k^l$ has linear dynamics conditioned on $\mathbf{x}_k^n$. Therefore, we can estimate the joint posterior $p(\mathbf{x}_k^l, \mathbf{x}_k^n, j|\mathbf{z}_{1:k})$ according to

$$p(\mathbf{x}_k^l, \mathbf{x}_k^n, j|\mathbf{z}_{1:k}) = \underbrace{p(\mathbf{x}_k^l|\mathbf{x}_k^n, j, \mathbf{z}_{1:k})}_{\text{distributed KF}}\; \underbrace{p(\mathbf{x}_k^n, j|\mathbf{z}_{1:k})}_{\text{distributed APF}} \qquad (11)$$

where the posterior $p(\mathbf{x}_k^n, j|\mathbf{z}_{1:k})$ is obtained with the distributed APF, and $p(\mathbf{x}_k^l|\mathbf{x}_k^n, j, \mathbf{z}_{1:k})$ with the distributed KF. Specifically, assume that the local random number generator at each node is synchronized (i.e., the same pseudo-random numbers are generated at all nodes), and identical particle representations $\{X_{k-1}^{n,i}, w_{k-1}^{n,i}\}_{i=1}^N$ of the previous posterior $p(\mathbf{x}_{k-1}^n|\mathbf{z}_{1:k-1})$ are available at every node at time $k-1$. For each particle $X_{k-1}^{n,i}$, the associated linear variable $\hat{X}_{k-1|k-1}^{l,i}$ and its covariance $\mathbf{P}_{k-1|k-1}^{l,i}$ are also available at every node. The local MAPF for node $s$ at time $k$ works as follows.

A. Posterior Estimation of Nonlinear Variable $\mathbf{x}_k^n$

The posterior $p(\mathbf{x}_k^n, j|\mathbf{z}_{1:k})$ of the nonlinear variable $\mathbf{x}_k^n$ is estimated via the distributed APF. In the PF prediction step, the following sampling is performed

$$(X_k^{n,i}, j^i) \sim q(\mathbf{x}_k^n, j|\mathbf{z}_{1:k}) = p(\mathbf{x}_k^n|X_{k-1}^{n,j})\, p(\mathbf{z}_k|\mu_k^j)\, w_{k-1}^{n,j}. \qquad (12)$$

For such sampling, we have to predetermine the variable $\mu_k^j$ and the transition density $p(\mathbf{x}_k^n|X_{k-1}^{n,j})$. Herein, we choose $\mu_k^j$ as the conditional expectation, i.e., $\mu_k^j = E\{\mathbf{x}_k^n|X_{k-1}^{n,j}\}$. Based on (9a), $\mu_k^j$ is computed as

$$\mu_k^j = \mathbf{f}_k^n(X_{k-1}^{n,j}) + \mathbf{F}_{k-1}^n \hat{X}_{k-1|k-1}^{l,j} \qquad (13)$$

and the transition density $p(\mathbf{x}_k^n|X_{k-1}^{n,j})$ is given by

$$p(\mathbf{x}_k^n|X_{k-1}^{n,j}) = \mathcal{N}(\mathbf{x}_k^n;\, \hat{X}_{k|k-1}^{n,j},\, \mathbf{P}_{k|k-1}^{n,j}) \qquad (14)$$

where $\mathcal{N}(\mathbf{x}; \hat{\mathbf{x}}, \mathbf{P})$ denotes the Gaussian density function of random variable $\mathbf{x}$, with mean $\hat{\mathbf{x}}$ and covariance $\mathbf{P}$, and

$$\hat{X}_{k|k-1}^{n,j} = \mathbf{f}_k^n(X_{k-1}^{n,j}) + \mathbf{F}_{k-1}^n \hat{X}_{k-1|k-1}^{l,j} \qquad (15a)$$
$$\mathbf{P}_{k|k-1}^{n,j} = \mathbf{F}_{k-1}^n \mathbf{P}_{k-1|k-1}^{l,j} [\mathbf{F}_{k-1}^n]^T + \mathbf{G}_{k-1}^n \mathbf{Q}_{k-1}^n [\mathbf{G}_{k-1}^n]^T. \qquad (15b)$$

In the PF update step, the weight associated with each particle $X_k^{n,i}$ is computed as

$$w_k^{n,i} = \frac{p(\mathbf{z}_k|X_k^{n,i})}{p(\mathbf{z}_k|\mu_k^{j^i})}. \qquad (16)$$

Notice from (12) and (16) that both the proposal function and the particle weights require the overall observations of all nodes, and such observations appear in the form of the global likelihood. Thus, the distributed computation of the proposal function and particle weights can be implemented via the consensus calculation of the global likelihood [21]. Assume that all local observations are conditionally independent given a state. Then, the global likelihood can be factorized as the product of all local likelihoods

$$p(\mathbf{z}_k|\mathbf{x}_k^n) = \prod_{s=1}^{S} p(\mathbf{z}_{s,k}|\mathbf{x}_k^n). \qquad (17)$$

Taking the natural logarithm and then the exponential of both sides of (17), it follows that

$$p(\mathbf{z}_k|\mathbf{x}_k^n) = \exp\Big( S \cdot \underbrace{\tfrac{1}{S} \sum_{s=1}^{S} \log p(\mathbf{z}_{s,k}|\mathbf{x}_k^n)}_{A(\mathbf{z}_k, \mathbf{x}_k^n)} \Big) \qquad (18)$$

where, by setting $\psi_{s,k}(0) = \log p(\mathbf{z}_{s,k}|\mathbf{x}_k^n)$, the global average $A(\mathbf{z}_k, \mathbf{x}_k^n)$ can be asymptotically obtained by every node via the following consensus iteration [23]

$$\psi_{s,k}(\ell) = \sum_{s' \in \mathcal{N}_s \cup \{s\}} \xi_{ss'}\, \psi_{s',k}(\ell - 1) \qquad (19)$$

where $\ell$ is the consensus iteration index, and $\xi_{ss'}$ is the consensus weight between nodes $s$ and $s'$, whose setup has to guarantee the convergence of the consensus algorithm [23]. Normally, for a fixed graph, the consensus weights can be predetermined using semi-definite programming for the fastest asymptotic convergence rate [23]. For the time-varying case, it is preferable to compute the weights according to the instantaneous graph [4]. We refer the reader to [4], [23] for the design of consensus weights. In this paper, the Metropolis weights are preferred since they require only local information and are feasible for distributed implementation [23]. The consensus iteration at each node $s$ is shown in Algorithm 1. Noticeably, the definition of the local likelihood $p(\mathbf{z}_{s,k}|\mathbf{x}_k^n)$ is problem-dependent. Typically, it could be a Gaussian density determined by the observation noise statistics [21] or a multiple-hypothesis expression in speaker tracking (see Section IV-D).
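As an illustration of the consensus averaging in (18)–(19), the sketch below (our own, using an assumed adjacency-list graph representation rather than anything from the paper) computes Metropolis weights and runs a few consensus iterations on per-node log-likelihood values:

```python
import numpy as np

def metropolis_weights(neighbors):
    """neighbors: dict node -> list of neighbor nodes (undirected graph).
    Metropolis rule: xi[s][t] = 1 / (1 + max(deg(s), deg(t))); the self-weight
    makes each row sum to one."""
    xi = {s: {} for s in neighbors}
    for s in neighbors:
        for t in neighbors[s]:
            xi[s][t] = 1.0 / (1 + max(len(neighbors[s]), len(neighbors[t])))
        xi[s][s] = 1.0 - sum(xi[s].values())
    return xi

def consensus_average(psi0, neighbors, num_iter):
    """psi0: dict node -> local value (e.g., log p(z_{s,k}|x)); iterates Eq. (19)."""
    xi = metropolis_weights(neighbors)
    psi = dict(psi0)
    for _ in range(num_iter):
        psi = {s: sum(xi[s][t] * psi[t] for t in list(neighbors[s]) + [s]) for s in psi}
    return psi  # each entry approaches the network-wide average A(z_k, x_k^n)

# Toy example: 4 nodes in a ring; every node converges toward the mean of psi0,
# and exp(S * value) then recovers the global likelihood of Eq. (18).
nbrs = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
psi = consensus_average({0: -1.2, 1: -0.4, 2: -2.0, 3: -0.8}, nbrs, num_iter=10)
print(psi)  # values close to the true average -1.1
```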
Algorithm 1: Average Consensus Filter.
Input: $\psi_{s,k}(\ell-1)$
Output: $\psi_{s,k}(\ell)$
1: Send $\psi_{s,k}(\ell-1)$ to all its neighbors $s' \in \mathcal{N}_s$.
2: Receive $\psi_{s',k}(\ell-1)$ from all its neighbors $s' \in \mathcal{N}_s$.
3: Update $\psi_{s,k}(\ell)$ with (19).

B. Conditional Posterior Estimation of Linear Variable $\mathbf{x}_k^l$

The conditional posterior $p(\mathbf{x}_k^l|\mathbf{x}_k^n, j, \mathbf{z}_{1:k})$ of the linear variable $\mathbf{x}_k^l$ can be obtained with the distributed KF as follows. In the KF prediction step, when $X_k^{n,i}$ and $j^i$ are available, the predicted density $p(\mathbf{x}_k^l|X_k^{n,i}, j^i, \mathbf{z}_{1:k-1})$ of $\mathbf{x}_k^l$ can be computed. Based upon (9a) and (9b), it is given by $\mathcal{N}(\mathbf{x}_k^l;\, \hat{X}_{k|k-1}^{l,i},\, \mathbf{P}_{k|k-1}^{l,i})$, where [20]

$$\hat{X}_{k|k-1}^{l,i} = \mathbf{f}_k^l(X_{k-1}^{n,j^i}) + \mathbf{F}_{k-1}^l \hat{X}_{k-1|k-1}^{l,j^i} + \mathbf{L}_k^i \big(X_k^{n,i} - \mathbf{f}_k^n(X_{k-1}^{n,j^i}) - \mathbf{F}_{k-1}^n \hat{X}_{k-1|k-1}^{l,j^i}\big) \qquad (20a)$$
$$\mathbf{P}_{k|k-1}^{l,i} = \mathbf{F}_{k-1}^l \mathbf{P}_{k-1|k-1}^{l,j^i} [\mathbf{F}_{k-1}^l]^T + \mathbf{G}_{k-1}^l \mathbf{Q}_{k-1}^l [\mathbf{G}_{k-1}^l]^T - \mathbf{L}_k^i \mathbf{M}_k^i [\mathbf{L}_k^i]^T \qquad (20b)$$
$$\mathbf{L}_k^i = \mathbf{F}_{k-1}^l \mathbf{P}_{k-1|k-1}^{l,j^i} [\mathbf{F}_{k-1}^n]^T [\mathbf{M}_k^i]^{-1} \qquad (20c)$$
$$\mathbf{M}_k^i = \mathbf{F}_{k-1}^n \mathbf{P}_{k-1|k-1}^{l,j^i} [\mathbf{F}_{k-1}^n]^T + \mathbf{G}_{k-1}^n \mathbf{Q}_{k-1}^n [\mathbf{G}_{k-1}^n]^T. \qquad (20d)$$

In the KF update step, when the observation $\mathbf{z}_{s,k}$ is available for node $s$, the local posterior $p(\mathbf{x}_k^l|X_k^{n,i}, j^i, \mathbf{z}_{1:k-1}, \mathbf{z}_{s,k})$, given by $\mathcal{N}(\mathbf{x}_k^l;\, \hat{X}_{s,k|k}^{l,i},\, \mathbf{P}_{s,k|k}^{l,i})$, can be obtained according to (9b) and (9c). For such posterior estimation, the update equations of the information filter (mathematically equivalent to the KF) are herein preferred due to their superiority over the KF in distributed data fusion [26]

$$[\mathbf{P}_{s,k|k}^{l,i}]^{-1} = [\mathbf{P}_{k|k-1}^{l,i}]^{-1} + \mathbf{H}_{s,k}^T \mathbf{R}_{s,k}^{-1} \mathbf{H}_{s,k} \qquad (21a)$$
$$[\mathbf{P}_{s,k|k}^{l,i}]^{-1} \hat{X}_{s,k|k}^{l,i} = [\mathbf{P}_{k|k-1}^{l,i}]^{-1} \hat{X}_{k|k-1}^{l,i} + \mathbf{H}_{s,k}^T \mathbf{R}_{s,k}^{-1} \tilde{\mathbf{z}}_{s,k} \qquad (21b)$$

where $\tilde{\mathbf{z}}_{s,k} = \mathbf{z}_{s,k} - \mathbf{h}_{s,k}(X_k^{n,i})$.

Employing similar update equations for the global posterior $p(\mathbf{x}_k^l|X_k^{n,i}, j^i, \mathbf{z}_{1:k})$, given by $\mathcal{N}(\mathbf{x}_k^l;\, \hat{X}_{k|k}^{l,i},\, \mathbf{P}_{k|k}^{l,i})$, we have

$$[\mathbf{P}_{k|k}^{l,i}]^{-1} = [\mathbf{P}_{k|k-1}^{l,i}]^{-1} + \mathbf{H}_k^T \mathbf{R}_k^{-1} \mathbf{H}_k \qquad (22a)$$
$$[\mathbf{P}_{k|k}^{l,i}]^{-1} \hat{X}_{k|k}^{l,i} = [\mathbf{P}_{k|k-1}^{l,i}]^{-1} \hat{X}_{k|k-1}^{l,i} + \mathbf{H}_k^T \mathbf{R}_k^{-1} \tilde{\mathbf{z}}_k \qquad (22b)$$

where $\mathbf{H}_k = [\mathbf{H}_{1,k}^T, \mathbf{H}_{2,k}^T, \ldots, \mathbf{H}_{S,k}^T]^T$, $\tilde{\mathbf{z}}_k = [\tilde{\mathbf{z}}_{1,k}^T, \tilde{\mathbf{z}}_{2,k}^T, \ldots, \tilde{\mathbf{z}}_{S,k}^T]^T$, and $\mathbf{R}_k = E\{\mathbf{v}_k \mathbf{v}_k^T\}$ with $\mathbf{v}_k = [\mathbf{v}_{1,k}^T, \mathbf{v}_{2,k}^T, \ldots, \mathbf{v}_{S,k}^T]^T$. According to the assumption in (10), it follows that

$$\mathbf{H}_k^T \mathbf{R}_k^{-1} \mathbf{H}_k = \sum_{s=1}^{S} \mathbf{H}_{s,k}^T \mathbf{R}_{s,k}^{-1} \mathbf{H}_{s,k} \qquad (23a)$$
$$\mathbf{H}_k^T \mathbf{R}_k^{-1} \tilde{\mathbf{z}}_k = \sum_{s=1}^{S} \mathbf{H}_{s,k}^T \mathbf{R}_{s,k}^{-1} \tilde{\mathbf{z}}_{s,k}. \qquad (23b)$$

To circumvent the matrix inversion in data fusion, based on (21)–(23), we obtain the following fusion rule

$$\mathbf{P}_{k|k}^{l,i} = \Big[ [\mathbf{P}_{k|k-1}^{l,i}]^{-1} + \underbrace{\sum_{s=1}^{S} \mathbf{H}_{s,k}^T \mathbf{R}_{s,k}^{-1} \mathbf{H}_{s,k}}_{\text{infomx}} \Big]^{-1} \qquad (24a)$$
$$[\mathbf{P}_{k|k}^{l,i}]^{-1} \hat{X}_{k|k}^{l,i} = [\mathbf{P}_{k|k-1}^{l,i}]^{-1} \hat{X}_{k|k-1}^{l,i} + \underbrace{\sum_{s=1}^{S} \mathbf{H}_{s,k}^T \mathbf{R}_{s,k}^{-1} \tilde{\mathbf{z}}_{s,k}}_{\text{info}}. \qquad (24b)$$

For convenience, we define

$$\mathbf{U}_{s,k} = \mathbf{H}_{s,k}^T \mathbf{R}_{s,k}^{-1} \mathbf{H}_{s,k}, \qquad \mathbf{V}_{s,k} = \mathbf{H}_{s,k}^T \mathbf{R}_{s,k}^{-1} \tilde{\mathbf{z}}_{s,k}. \qquad (25)$$

Set $\psi_{s,k}(0) = \mathbf{U}_{s,k}$, and the term "infomx" can be achieved by all nodes via Algorithm 1. Likewise, the term "info" can be obtained with Algorithm 1 by setting $\psi_{s,k}(0) = \mathbf{V}_{s,k}$. Note from (20b)–(20d) and (24a) that the prediction and update of the covariance $\mathbf{P}_k^l$ are independent of the particles, and thus

$$\mathbf{P}_{k|k}^{l,i} = \mathbf{P}_{k|k}^l, \quad \mathbf{L}_k^i = \mathbf{L}_k, \quad \mathbf{M}_k^i = \mathbf{M}_k, \quad i = 1, 2, \ldots, N. \qquad (26)$$
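The information-form fusion in (24)–(26) reduces, at each node, to summing the local information terms of (25) and performing a single inversion. The following numpy sketch is our own illustration of this fusion step for the general model (9); the consensus sums are assumed to be already available (in the DMAPF they would be produced by Algorithm 1), and the toy observation model is an assumption of the example:

```python
import numpy as np

def kf_information_fusion(x_pred, P_pred, U_list, V_list):
    """Global KF update of Eq. (24) from local information terms of Eq. (25).
    x_pred, P_pred: predicted linear state and covariance (Eq. (20));
    U_list[s] = H_s^T R_s^{-1} H_s, V_list[s] = H_s^T R_s^{-1} z_tilde_s."""
    info_mx = sum(U_list)                                    # "infomx" term of (24a)
    info = sum(V_list)                                       # "info" term of (24b)
    P_inv_pred = np.linalg.inv(P_pred)
    P_upd = np.linalg.inv(P_inv_pred + info_mx)              # Eq. (24a)
    x_upd = P_upd @ (P_inv_pred @ x_pred + info)             # Eq. (24b)
    return x_upd, P_upd

# Toy usage: two nodes observing a 2-D linear state through H = I with noise R.
H = np.eye(2); R = 0.1 * np.eye(2)
z_tilde = [np.array([0.1, -0.2]), np.array([0.05, -0.1])]
U = [H.T @ np.linalg.inv(R) @ H for _ in z_tilde]
V = [H.T @ np.linalg.inv(R) @ z for z in z_tilde]
x, P = kf_information_fusion(np.zeros(2), np.eye(2), U, V)
```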
To summarize, the proposed DMAPF is depicted in Algorithm 2. The DMAPF provides a general distributed framework for tracking problems where a linear substructure exists in the state-space model. It combines the merits of the MPF, the APF, and distributed estimation. Both the proposal function and the particle weights incorporate the overall information of all nodes, indicating good estimation accuracy. Moreover, the DMAPF requires no fusion center, needs only local communications among neighboring nodes, and is scalable for target tracking.

Algorithm 2: Distributed Marginalized Auxiliary Particle Filter.
Initialization: $X_0^{n,i} \sim p(\mathbf{x}_0^n)$, $w_0^{n,i} = 1/N$, $\hat{X}_{0|0}^{l,i} = X_0^l$, $\mathbf{P}_{0|0}^l = \mathbf{P}_0^l$, $i = 1, 2, \ldots, N$.
Input: $X_{k-1}^{n,i}$, $w_{k-1}^{n,i}$, $\hat{X}_{k-1|k-1}^{l,i}$, $\mathbf{P}_{k-1|k-1}^l$
Output: $X_k^{n,i}$, $w_k^{n,i}$, $\hat{X}_{k|k}^{l,i}$, $\mathbf{P}_{k|k}^l$
Iteration at each node $s$, $\forall s \in \mathcal{V}$
1: PF prediction: Draw $(X_k^{n,i}, j^i)$ from the proposal function $q(\mathbf{x}_k^n, j|\mathbf{z}_{1:k})$
  • Compute the variable $\mu_k^j$ with (13).
  • Calculate the local likelihood $p(\mathbf{z}_{s,k}|\mu_k^j)$.
  • Compute $A(\mathbf{z}_{s,k}, \mu_k^j)$ with Algorithm 1.
  • Compute the global likelihood $p(\mathbf{z}_k|\mu_k^j)$ with (18).
  • Compute the weights: $\eta_k^j = w_{k-1}^{n,j}\, p(\mathbf{z}_k|\mu_k^j)$.
  • Normalize the weights: $\tilde{\eta}_k^j = \eta_k^j / \sum_{i=1}^{N} \eta_k^i$.
  • Choose an index $j^i$ from $\{j\}_{j=1}^N$ according to $\tilde{\eta}_k^j$.
  • Sample the particle $X_k^{n,i}$ from the transition prior $p(\mathbf{x}_k^n|X_{k-1}^{n,j^i})$ given by (14).
2: KF prediction: Compute $\hat{X}_{k|k-1}^{l,i}$ and $\mathbf{P}_{k|k-1}^l$ with (20).
3: PF update: Update the particle weights
  • Calculate the local likelihood $p(\mathbf{z}_{s,k}|X_k^{n,i})$.
  • Compute $A(\mathbf{z}_{s,k}, X_k^{n,i})$ with Algorithm 1.
  • Compute the global likelihood $p(\mathbf{z}_k|X_k^{n,i})$ with (18).
  • Calculate the particle weight $w_k^{n,i}$ with (16).
  • Normalize the weights: $\tilde{w}_k^{n,i} = w_k^{n,i} / \sum_{j=1}^{N} w_k^{n,j}$.
4: KF update: Update $\hat{X}_{k|k}^{l,i}$ and $\mathbf{P}_{k|k}^l$ based on $X_k^{n,i}$ and $\mathbf{z}_k$
  • Compute $\mathbf{U}_{s,k}$ and $\mathbf{V}_{s,k}$ with (25).
  • Set $\psi_{s,k}(0) = \mathbf{U}_{s,k}$, and estimate the term "infomx" with Algorithm 1.
  • Set $\psi_{s,k}(0) = \mathbf{V}_{s,k}$, and estimate the term "info" with Algorithm 1.
  • Compute $\hat{X}_{k|k}^{l,i}$ and $\mathbf{P}_{k|k}^l$ with (24).

IV. DMAPF-BASED SPEAKER TRACKING

In this section, the DMAPF algorithm is applied to speaker tracking in distributed microphone networks. First, the marginalization of the speaker state space is described. To overcome the adverse effects from noise and reverberation, a TDOA selection scheme is presented to construct the local observation vector, and then a multiple-hypothesis model is used as the likelihood of the DMAPF. Finally, the DMAPF-based speaker tracking method is formulated.

A. Marginalization for Speaker State Space

Usually, the height of a speaker is fixed during a conversation for a reasonable period. Without loss of generality, we consider the two-dimensional (2-D) tracking scenario. The speaker state at time $k$ is defined as $\mathbf{x}_k = [x_k, y_k, \dot{x}_k, \dot{y}_k]^T$, where $(x_k, y_k)$ and $(\dot{x}_k, \dot{y}_k)$ are the position and velocity, respectively. The speaker dynamics is described by the well-known Langevin model [12]

$$\mathbf{x}_k = \begin{bmatrix} \mathbf{I}_2 & a\Delta T \otimes \mathbf{I}_2 \\ \mathbf{0} & a \otimes \mathbf{I}_2 \end{bmatrix} \mathbf{x}_{k-1} + \begin{bmatrix} b\Delta T \otimes \mathbf{I}_2 & \mathbf{0} \\ \mathbf{0} & b \otimes \mathbf{I}_2 \end{bmatrix} \mathbf{u}_{k-1} \qquad (27)$$

where $\mathbf{I}_\rho$ denotes the $\rho$-order identity matrix, $\otimes$ is the Kronecker product, $\Delta T$ is the time interval between two successive position estimates, and $a$ and $b$ are given by $a = e^{-\beta \Delta T}$ and $b = \bar{\upsilon}\sqrt{1 - a^2}$, where $\beta$ and $\bar{\upsilon}$ are the rate constant and the steady velocity parameter, respectively.

Define $\mathbf{x}_k^n$ and $\mathbf{x}_k^l$ as the speaker's position and velocity, i.e., $\mathbf{x}_k^n = [x_k, y_k]^T$ and $\mathbf{x}_k^l = [\dot{x}_k, \dot{y}_k]^T$, respectively; the speaker state can then be split as $\mathbf{x}_k = [(\mathbf{x}_k^n)^T, (\mathbf{x}_k^l)^T]^T$. Substituting this partition into the dynamical model (27) and using the TDOA as the local observation, the speaker state-space model can be expressed as

$$\mathbf{x}_k^n = \mathbf{x}_{k-1}^n + a\Delta T\, \mathbf{x}_{k-1}^l + b\Delta T\, \mathbf{u}_{k-1}^n \qquad (28a)$$
$$\mathbf{x}_k^l = a\, \mathbf{x}_{k-1}^l + b\, \mathbf{u}_{k-1}^l \qquad (28b)$$
$$z_{s,k} = c^{-1}\big(\|\mathbf{r}_k^n - \mathbf{r}_s^1\| - \|\mathbf{r}_k^n - \mathbf{r}_s^2\|\big) + v_{s,k}, \quad \forall s \in \mathcal{V} \qquad (28c)$$

where $c$ is the sound propagation speed, $\|\cdot\|$ denotes the Euclidean norm, $\mathbf{r}_s^1$ and $\mathbf{r}_s^2$ are the positions of the two microphones at node $s$, and $\mathbf{r}_k^n = [(\mathbf{x}_k^n)^T, Z_1]^T$ is the 3-D coordinate related to $\mathbf{x}_k^n$, where $Z_1$ is the height of the speaker. Note from (28) that the speaker's position is subject to nonlinear dynamics, whereas its velocity has conditionally linear dynamics. Thus, we can marginalize the velocity out of the speaker state, and then introduce the DMAPF algorithm into speaker tracking. Actually, (28) is a simple case of (9), with

$$\mathbf{f}_k^n(\mathbf{x}_{k-1}^n) = \mathbf{x}_{k-1}^n, \;\; \mathbf{f}_k^l(\mathbf{x}_{k-1}^n) = \mathbf{0}, \;\; \mathbf{F}_{k-1}^n = a\Delta T, \;\; \mathbf{F}_{k-1}^l = a, \;\; \mathbf{G}_{k-1}^n = b\Delta T, \;\; \mathbf{G}_{k-1}^l = b, \;\; \mathbf{H}_{s,k} = \mathbf{0}, \;\; \mathbf{Q}_k^n = \mathbf{Q}_k^l = \mathbf{I}_2. \qquad (29)$$

The application of the DMAPF algorithm to speaker tracking is given as follows.

B. Local MAPFs for Speaker Tracking

In the PF prediction step, for the global proposal function $q(\mathbf{x}_k^n, j|\mathbf{z}_{1:k})$, based on (29), the variable $\mu_k^j$ in (13) can be rewritten as

$$\mu_k^j = X_{k-1}^{n,j} + a\Delta T\, \hat{X}_{k-1|k-1}^{l,j} \qquad (30)$$

and the transition density $p(\mathbf{x}_k^n|X_{k-1}^{n,j})$ is again given by (14), whereas the parameters $\hat{X}_{k|k-1}^{n,j}$ and $\mathbf{P}_{k|k-1}^{n,j}$ become

$$\hat{X}_{k|k-1}^{n,j} = X_{k-1}^{n,j} + a\Delta T\, \hat{X}_{k-1|k-1}^{l,j} \qquad (31a)$$
$$\mathbf{P}_{k|k-1}^{n,j} = a^2 \Delta T^2\, \mathbf{P}_{k-1|k-1}^l + b^2 \Delta T^2\, \mathbf{I}_2. \qquad (31b)$$

In the KF prediction step, based upon (20) and (29), the conditional posterior $p(\mathbf{x}_k^l|X_k^{n,i}, j^i, \mathbf{z}_{1:k-1})$ is given by $\mathcal{N}(\mathbf{x}_k^l;\, \hat{X}_{k|k-1}^{l,i},\, \mathbf{P}_{k|k-1}^l)$, with

$$\hat{X}_{k|k-1}^{l,i} = a \hat{X}_{k-1|k-1}^{l,j^i} + \mathbf{L}_k \big(X_k^{n,i} - X_{k-1}^{n,j^i} - a\Delta T\, \hat{X}_{k-1|k-1}^{l,j^i}\big) \qquad (32a)$$
$$\mathbf{P}_{k|k-1}^l = a^2 \mathbf{P}_{k-1|k-1}^l + b^2 \mathbf{I}_2 - \mathbf{L}_k \mathbf{M}_k \mathbf{L}_k^T \qquad (32b)$$
$$\mathbf{L}_k = a^2 \Delta T\, \mathbf{P}_{k-1|k-1}^l \mathbf{M}_k^{-1} \qquad (32c)$$
$$\mathbf{M}_k = a^2 \Delta T^2\, \mathbf{P}_{k-1|k-1}^l + b^2 \Delta T^2\, \mathbf{I}_2. \qquad (32d)$$

In the PF update step, each particle weight in (16) can be obtained via the consensus computation of the global likelihood (see Section IV-D). Noticeably, in order to cope with noise and reverberation in realistic environments, a multiple-hypothesis model is later employed as the local likelihood (see Section IV-D).
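A compact sketch of the per-particle prediction equations (30)–(32) for the marginalized speaker model (position as the nonlinear variable, velocity as the linear variable) is given below; it is our own illustration, with the Langevin parameter values taken from the simulation setup described later in the paper, not the authors' code:

```python
import numpy as np

def marginalized_prediction(xn_prev, xl_hat_prev, P_prev, a, b, dT, rng):
    """One DMAPF prediction for the speaker model of Eq. (28).
    xn_prev: previous position particle; xl_hat_prev: its KF velocity mean;
    P_prev: velocity covariance (shared by all particles, Eq. (26))."""
    mu = xn_prev + a * dT * xl_hat_prev                       # Eq. (30) / (31a)
    Pn = (a * dT) ** 2 * P_prev + (b * dT) ** 2 * np.eye(2)   # Eq. (31b)
    xn_new = rng.multivariate_normal(mu, Pn)                  # sample from Eq. (14)
    # KF prediction of the velocity, Eq. (32)
    M = (a * dT) ** 2 * P_prev + (b * dT) ** 2 * np.eye(2)    # Eq. (32d)
    L = a ** 2 * dT * P_prev @ np.linalg.inv(M)               # Eq. (32c)
    xl_hat = a * xl_hat_prev + L @ (xn_new - xn_prev - a * dT * xl_hat_prev)  # (32a)
    P_pred = a ** 2 * P_prev + b ** 2 * np.eye(2) - L @ M @ L.T               # (32b)
    return xn_new, xl_hat, P_pred

# Example with beta = 10 1/s, v_bar = 1 m/s and a 120 ms frame interval.
rng = np.random.default_rng(0)
dT, beta, v_bar = 0.12, 10.0, 1.0
a = np.exp(-beta * dT); b = v_bar * np.sqrt(1.0 - a ** 2)
xn, xl, P = marginalized_prediction(np.array([1.22, 2.73]), np.array([0.05, 0.05]),
                                    0.0025 * np.eye(2), a, b, dT, rng)
```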
In the KF update step, consider the update of the velocity variable $\mathbf{x}_k^l$. According to (24), (26), and (29), we have

$$\hat{X}_{k|k}^{l,i} = \hat{X}_{k|k-1}^{l,i} \qquad (33a)$$
$$\mathbf{P}_{k|k}^l = \mathbf{P}_{k|k-1}^l. \qquad (33b)$$

In fact, the observation model (28c) involves no information about $\mathbf{x}_k^l$, so no innovation is introduced into its update.

C. TDOA Selection at Local Nodes

The GCC function $R_{s,k}(\tau)$ between the microphone pair signals at node $s$ can be computed as [27]

$$R_{s,k}(\tau) = \int \frac{Y_s^1(f)\, Y_s^{2*}(f)}{|Y_s^1(f)\, Y_s^{2*}(f)|}\, e^{j2\pi f \tau}\, df \qquad (34)$$

where $Y_s^1(f)$ and $Y_s^2(f)$ represent the frequency-domain counterparts of $\mathbf{y}_s^1(k)$ and $\mathbf{y}_s^2(k)$, respectively, and $*$ denotes complex conjugation. The peaks of the GCC have been exploited successfully for TDOA estimation [6], [7], [12], [13]. It has been shown that the TDOAs from the large GCC peaks are more reliable than others [6], [28]. Traditionally, the TDOA is estimated as the time lag related to the largest GCC peak [11], [28]. In real environments, spurious peaks due to noise or reverberation may even mask that of the true source [12], [13]. To address this problem, Canclini et al. [6] developed an energy ratio to determine the reliability of TDOAs. However, by choosing only the maximum peak of the GCC for the TDOA, these methods are vulnerable to noise and reverberation and can yield ambiguous estimates. Alternatively, extracting multiple TDOA candidates from some local maxima of the GCC turns out to be effective in speaker tracking [12]–[15]. Inspired by the works of [6] and [12], a TDOA selection scheme is herein presented for the local observation, which is described as follows.

1) Step (i): Select the $P$ delays corresponding to the largest local maxima of $R_{s,k}(\tau)$ to form the set $Z_0$,

$$Z_0 = \{\hat{\tau}_{s,k}^{(1)}, \hat{\tau}_{s,k}^{(2)}, \ldots, \hat{\tau}_{s,k}^{(P)}\} \qquad (35)$$

where $\hat{\tau}_{s,k}^{(p)} \in [-\tau_s^{\max}, \tau_s^{\max}]$ $(p = 1, 2, \ldots, P)$ is the delay related to the $p$th largest local maximum of $R_{s,k}(\tau)$, and $\tau_s^{\max} = c^{-1}\|\mathbf{r}_s^1 - \mathbf{r}_s^2\|$ is the maximum possible TDOA of node $s$.

2) Step (ii): Calculate the energy ratio $\eta_{s,k}^{(p)}$ for each delay $\hat{\tau}_{s,k}^{(p)}$ in $Z_0$,

$$\eta_{s,k}^{(p)} = \frac{\sum_{\iota \in D_{s,k}^{(p)}} [R_{s,k}(\iota)]^2}{\sum_{\iota \notin D_{s,k}^{(p)}} [R_{s,k}(\iota)]^2} \qquad (36)$$

where $\iota$ is the discrete time tag of $\tau$, $\iota_{s,k}^{(p)}$ is the tag of $\hat{\tau}_{s,k}^{(p)}$, and $D_{s,k}^{(p)} = [\iota_{s,k}^{(p)} - p_0,\, \iota_{s,k}^{(p)} + p_0]$ is an interval centered at $\iota_{s,k}^{(p)}$, where $p_0$ is a positive integer.

3) Step (iii): Select the $N_c$ delays with the largest energy ratios from $Z_0$ to construct the local observation vector

$$\mathbf{z}_{s,k} = \big[\hat{\tau}_{s,k}^{(\iota_1)}, \hat{\tau}_{s,k}^{(\iota_2)}, \ldots, \hat{\tau}_{s,k}^{(\iota_{N_c})}\big]^T \qquad (37)$$

where $\hat{\tau}_{s,k}^{(\iota_p)}$ $(p = 1, 2, \ldots, N_c)$ is the delay related to the $p$th largest energy ratio $\eta_{s,k}^{(p)}$, and it is deemed a candidate for the TDOA. $N_c$ is determined empirically, where there is a tradeoff between performance and computational cost.

D. Multiple-Hypothesis Model for Local Likelihoods

Consider the local likelihood $p(\mathbf{z}_{s,k}|\mathbf{x}_k^n)$. Generally, among all the TDOA candidates in the local observation vector $\mathbf{z}_{s,k}$, at most one is associated with the true source (speaker), whereas the others stem from spurious sources due to noise or reverberation [12], [13]. Hence, the multiple-hypothesis model is adopted as the likelihood [12]

$$p(\mathbf{z}_{s,k}|\mathbf{x}_k^n) = \kappa \Big( \frac{q_0}{2\tau_s^{\max}} + \sum_{p=1}^{N_c} q_p\, \mathcal{N}\big(\hat{\tau}_{s,k}^{(\iota_p)};\, \tau_{s,k}(\mathbf{x}_k^n),\, \sigma^2\big) \Big) \qquad (38)$$

where $\kappa = (2\tau_s^{\max})^{1-N_c}$, $q_0$ denotes the probability that none of the TDOA candidates is attributed to the true source, $q_p$ $(p = 1, 2, \ldots, N_c)$ is the probability that the $\iota_p$th TDOA candidate $\hat{\tau}_{s,k}^{(\iota_p)}$ is from the true source while the rest are from spurious sources, $\sigma$ is the observation noise standard deviation, and

$$\tau_{s,k}(\mathbf{x}_k^n) = c^{-1}\big(\|\mathbf{r}_k^n - \mathbf{r}_s^1\| - \|\mathbf{r}_k^n - \mathbf{r}_s^2\|\big) \qquad (39)$$

is the theoretical TDOA of node $s$ for $\mathbf{x}_k^n$.

Based upon the discussion above, the DMAPF-based speaker tracking method is given in Algorithm 3, where $K$ is the number of frames undergoing position estimation. Through marginalizing the speaker state space, particle filtering is performed in a lower-dimensional state space. The proposed method also exploits the current observations of all nodes in the proposal function for particle sampling. Besides, by using the TDOA selection scheme, potentially unreliable observations are eliminated. Furthermore, the proposed method is implemented in a distributed fashion via the consensus algorithm, so it requires only local communications among neighboring nodes and is scalable for speaker tracking.
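The TDOA selection of (34)–(37) and the multiple-hypothesis likelihood (38) can be sketched as follows. This is our own illustration: the FFT-based GCC-PHAT, the scipy peak picking, and the default parameter values (taken from the simulation setup in Section VI) are assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.signal import find_peaks

def gcc_tdoa_candidates(y1, y2, fs, tau_max, P=10, p0=4, Nc=4):
    """Return Nc TDOA candidates (in seconds) per Eqs. (34)-(37)."""
    n = 2 * len(y1)
    Y1, Y2 = np.fft.rfft(y1, n), np.fft.rfft(y2, n)
    cross = Y1 * np.conj(Y2)
    r = np.fft.fftshift(np.fft.irfft(cross / (np.abs(cross) + 1e-12), n))  # Eq. (34), PHAT
    lags = np.arange(-n // 2, n // 2) / fs
    valid = np.abs(lags) <= tau_max                     # restrict to |tau| <= tau_s^max
    r, lags = r[valid], lags[valid]
    peaks, _ = find_peaks(r)
    top = peaks[np.argsort(r[peaks])[-P:]]              # set Z_0 of Eq. (35)
    total = np.sum(r ** 2)
    ratios = []
    for i in top:                                       # energy ratio of Eq. (36)
        lo, hi = max(0, i - p0), min(len(r), i + p0 + 1)
        inside = np.sum(r[lo:hi] ** 2)
        ratios.append(inside / (total - inside))
    sel = top[np.argsort(ratios)[-Nc:][::-1]]           # observation vector, Eq. (37)
    return lags[sel]

def multi_hypothesis_likelihood(z, tau_model, tau_max, q0=0.25, qp=0.1825, sigma=50e-6):
    """Local likelihood of Eq. (38); z holds the Nc TDOA candidates of one node,
    tau_model is the theoretical TDOA of Eq. (39) for a candidate position."""
    Nc = len(z)
    kappa = (2 * tau_max) ** (1 - Nc)
    gauss = np.exp(-0.5 * ((z - tau_model) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return kappa * (q0 / (2 * tau_max) + qp * np.sum(gauss))
```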
Algorithm 3: DMAPF-Based Speaker Tracking.
Initialization: $X_0^{n,i} \sim p(\mathbf{x}_0^n)$, $w_0^{n,i} = 1/N$, $\{\hat{X}_{0|0}^{l,i}, \mathbf{P}_{0|0}^l\} = \{X_0^l, \mathbf{P}_0^l\}$, $i = 1, 2, \ldots, N$.
Inputs: $X_{k-1}^{n,i}$, $w_{k-1}^{n,i}$, $\hat{X}_{k-1|k-1}^{l,i}$, $\mathbf{P}_{k-1|k-1}^l$, $\mathbf{y}_s^1(k)$, $\mathbf{y}_s^2(k)$.
Output: Estimate of the current speaker position $\hat{\mathbf{x}}_k^n$.
Iteration at each node $s$, $\forall s \in \mathcal{V}$, for $k = 1, 2, \ldots, K$
1: Compute the GCC function $R_{s,k}(\tau)$ with (34).
2: Construct the local observation vector $\mathbf{z}_{s,k}$ with (37).
3: Compute the variable $\mu_k^j$ with (30).
4: Calculate the local likelihood $p(\mathbf{z}_{s,k}|\mu_k^j)$ with (38).
5: Compute $A(\mathbf{z}_{s,k}, \mu_k^j)$ with Algorithm 1.
6: Compute the global likelihood $p(\mathbf{z}_k|\mu_k^j)$ with (18).
7: Compute the weights: $\eta_k^j = w_{k-1}^{n,j}\, p(\mathbf{z}_k|\mu_k^j)$.
8: Normalize the weights: $\tilde{\eta}_k^j = \eta_k^j / \sum_{i=1}^{N} \eta_k^i$.
9: Choose an index $j^i$ from $\{j\}_{j=1}^N$ according to $\tilde{\eta}_k^j$.
10: Sample the particle $X_k^{n,i}$ from the transition prior $p(\mathbf{x}_k^n|X_{k-1}^{n,j^i})$ given by (31).
11: Compute $\hat{X}_{k|k-1}^{l,i}$ and $\mathbf{P}_{k|k-1}^l$ with (32).
12: Calculate the local likelihood $p(\mathbf{z}_{s,k}|X_k^{n,i})$ with (38).
13: Compute $A(\mathbf{z}_{s,k}, X_k^{n,i})$ with Algorithm 1.
14: Compute the global likelihood $p(\mathbf{z}_k|X_k^{n,i})$ with (18).
15: Calculate the particle weight $w_k^{n,i}$ with (16), and normalize it.
16: Update $\hat{X}_{k|k}^{l,i}$ and $\mathbf{P}_{k|k}^l$ with (33).
17: Estimate the current speaker position $\hat{\mathbf{x}}_k^n$ with (7).

V. COMPUTATIONAL COMPLEXITY ANALYSIS

In this section, the computational complexity analysis is made for the proposed DMAPF as well as several existing algorithms, including the SIR-PF [13], the EKPF [16], the centralized APF (CAPF) [17], and the centralized MAPF (CMAPF) [20]. Since all the microphone signals are processed in frames, the computational cost per frame is analyzed. Noticeably, the DMAPF is a distributed estimation method in which all nodes perform similarly, so the computational cost at each node is considered. Let $n$, $l$, and $n_z$ denote the dimensions of $\mathbf{x}_k^n$, $\mathbf{x}_k^l$, and $\mathbf{z}_{s,k}$, respectively, and $C$ the number of consensus iterations. Assume that one addition, multiplication, or division of two floating-point numbers is defined as a basic floating-point operation (Flop) and that they have the same cost. The notation for other operations involved in the computational costs is explained in Table I.

TABLE I
EXPLANATIONS OF RELATED NOTATIONS

Notation | Basic computational cost of related operation
Φ1 | Calculation of a general likelihood
Φ2 | Logarithmic computation of a scalar
Φ3 | Natural exponential computation of a scalar
Φ4 | Sampling an integer with a certain probability
Φ5 | Sampling from a Gaussian density
Ψ1, Ψ2, Ψ3 | Computations of $\mathbf{f}_k^n(\cdot)$, $\mathbf{f}_k^l(\cdot)$, and $\mathbf{h}_{s,k}(\cdot)$, respectively
Ψ4, Ψ5, Ψ6 | Computations of the Jacobian matrices of $\mathbf{f}_k^n(\cdot)$, $\mathbf{f}_k^l(\cdot)$, and $\mathbf{h}_{s,k}(\cdot)$ with respect to $\mathbf{x}_k^n$, respectively
TABLE II
COMPUTATIONAL COST COMPARISONS FOR THE GENERAL STATE-SPACE MODEL (9)

Algorithm | Flops | Others
DMAPF | O(N(6l³ + 2CN_s l + 8nl + 2n_z l) + 2(n+l)³ + 2CN_s l² + 2n_z l² + 4n_z² l) | O(N(2Φ1 + 2Φ2 + 2Φ3 + Φ4 + Φ5 + 2Ψ1 + Ψ2 + Ψ3))
CMAPF | O(N(6l² + 2Sn_z l + 6nl) + 2(n+l)³ + 2Sn_z² l + Sn_z l²) | O(N(2SΦ1 + Φ4 + Φ5 + 2Ψ1 + Ψ2 + SΨ3))
CAPF | O(N(2n² + 4l² + 2nl + 2S) + 3(n+l)³) | O(N(2SΦ1 + Φ4 + Φ5 + 2Ψ1 + 2Ψ2))
EKPF | O(N((n + l + Sn_z)³ + 3(n+l)² Sn_z + (n+l)S² n_z²) + 2(n+l)³) | O(N(SΦ1 + Φ4 + Φ5 + Ψ4 + Ψ5 + SΨ6))
SIR-PF | O(N(2l² + 2nl + S) + 2(n+l)³) | O(N(SΦ1 + Φ4 + Φ5 + Ψ1 + Ψ2))
TABLE III
COMPUTATIONAL COST COMPARISONS FOR THE SPEAKER TRACKING MODEL (28) (n = l = 2, n_z = 1)

Algorithm | Flops | Others
DMAPF | O(N(4CN_s + 4N_c + 32)) | O(N(2Φ2 + 2Φ3 + Φ4 + Φ5 + 2Ψ3))
CMAPF | O(N(4SN_c + 4S + 30)) | O(N(Φ4 + Φ5 + 2SΨ3))
CAPF | O(N(4SN_c + 4S + 18)) | O(N(Φ4 + Φ5 + 2SΨ3))
EKPF | O(N(16S² + 2SN_c + 102S)) | O(N(Φ4 + Φ5 + SΨ3 + SΨ6))
SIR-PF | O(N(2SN_c + 2S + 14)) | O(N(Φ4 + Φ5 + SΨ3))
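To give a rough feel for the leading Flops terms in Table III, the small script below (our own, order-of-magnitude illustration) plugs in the parameter values used in Section VI; the number of neighbors N_s ≈ 4 is an assumption of the example:

```python
# Rough per-frame Flop counts from the leading terms of Table III, assuming
# N = 500 particles, S = 12 nodes, C = 10 consensus iterations, N_c = 4 TDOA
# candidates, and roughly N_s = 4 neighbors per node.
N, S, C, Nc, Ns = 500, 12, 10, 4, 4
flops = {
    "DMAPF":  N * (4 * C * Ns + 4 * Nc + 32),
    "CMAPF":  N * (4 * S * Nc + 4 * S + 30),
    "CAPF":   N * (4 * S * Nc + 4 * S + 18),
    "EKPF":   N * (16 * S ** 2 + 2 * S * Nc + 102 * S),
    "SIR-PF": N * (2 * S * Nc + 2 * S + 14),
}
for name, f in flops.items():
    print(f"{name:7s} ~ {f:,} Flops per frame")
```

With these values the DMAPF is of the same order as the centralized CMAPF, CAPF, and SIR-PF, while the EKPF is roughly an order of magnitude more expensive, consistent with the discussion below.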
For the state-space model (9), the overall computational costs in terms of Flops for all the aforementioned methods are given in Table II. The results show that the computational costs of all these methods depend on the dimensions of the related variables and the number of particles. In addition, for the centralized methods (CMAPF, CAPF, EKPF, and SIR-PF), the costs also depend on the number of nodes, because the overall observations of all nodes are processed at the central processor. For the DMAPF, each node acts as a central processor, and its computational cost depends on the number of consensus iterations and the number of its neighbors, which are determined by the consensus weights and the communication graph [23], respectively.

In particular, for the speaker tracking model (28), the computational costs of the preceding methods are also analyzed, and the results are given in Table III. It can be observed that the computational costs in terms of Flops for the CMAPF, CAPF, and SIR-PF are close to each other, and they are lower than that of the EKPF. Normally, the values of N_c and N_s are relatively small. When S is comparable to C, the Flops count of the DMAPF is comparable to those of the CMAPF, CAPF, and SIR-PF. As the number of nodes S becomes larger than C, the computational costs of the centralized methods, i.e., the CMAPF, CAPF, EKPF, and SIR-PF, increase dramatically and become much larger than that of the DMAPF.

For the TDOA selection, the computational costs at each node are summarized as follows:
1) Multiplications: 2N_f log2(2N_f) + 12N_f + 2PN_f;
2) Additions: 4N_f log2(2N_f) + 6N_f;
3) Divisions: P.
Such computational costs mainly rely on the frame length N_f and the number P of candidate time delays, both of which are relatively small in realistic tracking scenarios.

VI. EXPERIMENTS AND RESULT DISCUSSIONS

To evaluate the validity of the proposed speaker tracking method, comparative experiments are carried out and the results are discussed in this section.

A. Simulation Setup

The simulated environment is a typical room of size 5 m × 5 m × 3 m, where a total of S = 12 randomly deployed microphone pairs comprise a distributed network, and
Fig. 1. Layout of microphone positions (circles) and the speaker trajectory (semi-circle curve).
Fig. 3. Speaker tracking results when SNR = 10 dB and T60 = 0.1 s: (a) X-dimension, (b) Y-dimension.
Fig. 2. Communication graph G1 of the distributed microphone network (circles represent nodes).
the inter-microphone distance of each pair is 0.6 m. For simplicity, the microphone network is deployed on a plane at a height of Z0 = 1.7 m, as depicted in Fig. 1. Fig. 2 shows a typical communication graph G1 of the network with a communication radius of R = 1.8 m. The speaker is simulated to move along a curve from (1.22, 2.73) m to (3.60, 4.02) m at a height of Z1 = 1.5 m, as shown in Fig. 1. Considering that a moving speaker within a room is likely to be slow-paced, the speaker's speed is set to 0.5 m/s, which is about one third of a regular pedestrian walking speed [16]. The room impulse responses between the speaker and the microphones are simulated with the image method [29], where different values of the reverberation time T60 are set to simulate diverse reverberant conditions. The sound propagation speed is c = 343 m/s. The original speaker signal is a 9.6 s male speech utterance sampled at fs = 16 kHz, and it is segmented into K = 80 frames along the speaker trajectory with a frame length of ΔT = 120 ms. These signal frames are first convolved with the corresponding room impulse responses and then corrupted by
Fig. 4. Averaged RMSE results for different methods versus T60 with SNR = 20 dB.
additive white Gaussian noise, yielding the noisy and reverberant microphone signals. Different covariances of the white Gaussian noise determine a set of signal-to-noise ratio (SNR) values, reflecting diverse background noise conditions. The parameters for the TDOA selection are P = 10, p0 = 4, and Nc = 4. The number of particles in particle filtering is N = 500. The initial prior of the speaker position could
Fig. 5. Averaged RMSE results for different methods versus SNR with T60 = 0.1 s.
be estimated via some localization methods beforehand [6], [7], whereas this paper mainly focuses on speaker tracking only. Thus, for all the methods, the initial prior p(x_0^n) of the speaker position is set as a Gaussian density with mean μ_0^n = [1.22, 2.73]^T and covariance Σ_0^n = diag([0.05, 0.05]), and the initial prior p(x_0^l) of the speaker velocity is set as a Gaussian density with mean μ_0^l = [0.05, 0.05]^T and covariance Σ_0^l = diag([0.0025, 0.0025]). In the speaker dynamics, the rate constant and the steady velocity are β = 10 s^-1 and ῡ = 1 m/s, respectively. For the computation of the global likelihood, the number of consensus iterations is C = 10. In the multi-hypothesis model, the prior probabilities for the hypotheses are q0 = 0.25 and qp = 0.1825, p = 1, 2, ..., Nc, and the standard deviation of the TDOA errors is σ = 50 μs.

B. Tracking Performance Metrics

The root mean square error (RMSE) has been widely used to evaluate the tracking performance. Let $\mathbf{r}_k$ and $\hat{\mathbf{r}}_k$ be the ground truth and the estimate of the speaker position at time $k$, respectively. The RMSE is defined as [13]

$$\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \|\mathbf{r}_k - \hat{\mathbf{r}}_k\|^2}. \qquad (40)$$
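A direct computation of the RMSE in (40) over a tracked trajectory might look as follows (illustrative only; the constant-offset example is an assumption):

```python
import numpy as np

def rmse(truth, estimate):
    """Eq. (40): truth and estimate are (K, 2) arrays of ground-truth and
    estimated speaker positions; returns the root mean square position error."""
    truth, estimate = np.asarray(truth), np.asarray(estimate)
    return np.sqrt(np.mean(np.sum((truth - estimate) ** 2, axis=1)))

# Example: a constant 5 cm offset in both coordinates over K = 80 frames
# gives an RMSE of about 0.07 m.
K = 80
truth = np.column_stack([np.linspace(1.22, 3.60, K), np.linspace(2.73, 4.02, K)])
print(rmse(truth, truth + 0.05))
```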
Essentially, the RMSE indicates how much the estimated position deviates from the ground truth: the smaller the RMSE, the more accurate the tracking results. To evaluate the DMAPF-based speaker tracking method (referred to as DMAPF-S), comparative experiments are performed with the existing tracking methods, i.e., the SIR-PF [13], the EKPF [16], the CAPF [17], and the CMAPF [20]. The SIR-PF, EKPF, and CAPF are all conducted without the TDOA selection scheme. The proposed method without the TDOA selection is referred to as the DMAPF. Likewise, the CMAPF with and without the TDOA selection are referred to as the CMAPF-S and CMAPF, respectively. All these methods are evaluated in
Fig. 6. Communication graphs of the microphone network: (a) G2 (R = 1.7 m), (b) G3 (R = 2.1 m).
terms of RMSE, and the results are averaged over 50 Monte Carlo trials.

C. Simulation Results

Fig. 3 depicts an example of the tracking results with the proposed DMAPF-S, the CMAPF, the CAPF, the SIR-PF, and the EKPF methods. It illustrates that the proposed method is capable of tracking the speaker accurately.

1) Effect of Reverberation Time T60: To investigate the influence of reverberation on the tracking performance, experiments were conducted for a set of different reverberation times, i.e., T60 = {0.10, 0.15, 0.20, 0.25, 0.30} s. Fig. 4 shows that the RMSE of all the methods becomes larger as T60 increases, indicating that the tracking accuracy degrades as the reverberation time increases. Besides, it can be observed that the RMSE of the DMAPF-S is comparable to that of the CMAPF-S. Similar results can be found for the DMAPF and the CMAPF. From the RMSE results of the DMAPF-S versus the DMAPF, and of the CMAPF-S versus the CMAPF, we find that the tracking accuracies with and without TDOA selection are very close to each other, which indicates that the TDOA selection scheme does not provide sufficient robustness to reverberation. Moreover, it can be seen that both the DMAPF-S and the
Fig. 7. Communication graphs of the microphone network with node failures: (a) G4 , (b) G5 , (c) G6 , and (d) G7 (bold circles represent failed nodes, and dash line denotes failed links between nodes).
TABLE IV
RMSE RESULTS OF THE DMAPF-BASED SPEAKER TRACKING METHOD WITH DIFFERENT GRAPHS
(G1–G3: without node failures; G4–G7: with node failures)

Iterations (C) | G1 | G2 | G3 | G4 | G5 | G6 | G7
6  | 0.1444 | 0.1502 | 0.1412 | 0.1613 | 0.3346 | 0.2650 | 0.1783
8  | 0.1424 | 0.1499 | 0.1407 | 0.1609 | 0.3191 | 0.2628 | 0.1760
10 | 0.1405 | 0.1478 | 0.1386 | 0.1579 | 0.3180 | 0.2625 | 0.1745
15 | 0.1368 | 0.1390 | 0.1369 | 0.1532 | 0.3177 | 0.2597 | 0.1696
20 | 0.1366 | 0.1368 | 0.1353 | 0.1501 | 0.3001 | 0.2592 | 0.1639
30 | 0.1353 | 0.1357 | 0.1350 | 0.1484 | 0.2963 | 0.2507 | 0.1613
DMAPF perform better than the CAPF method as T60 increases. This is because through marginalization the linear state variable is estimated with the optimal filter, which indicates better tracking accuracy. Furthermore, we can observe that the DMAPF-S performs much better than the SIR-PF and EKPF methods. The reason is that the current observations are incorporated into the proposal for particle sampling via the APF, and that through marginalization the linear state variable is estimated by the optimal filter to improve the estimation accuracy. 2) Effect of SNR Conditions: The impact of background noise on the tracking performance is also evaluated for a set of
different SNR values, i.e., SNR = {0, 5, 10, 15, 20, 25, 30} dB. Fig. 5 shows that all the methods perform better as the SNR increases. Noticeably, the DMAPF-S and DMAPF obtain tracking accuracies similar to those of the CMAPF-S and CMAPF, respectively. From Fig. 5, it can also be observed that the tracking accuracies of all the methods are very close to each other for SNR ≥ 15 dB. In contrast, the DMAPF-S and CMAPF-S perform better than the others for SNR < 15 dB, and this advantage becomes more remarkable as the SNR decreases. From the RMSE results for SNR < 15 dB, we can observe that the DMAPF-S obtains better tracking accuracy than the DMAPF. This is because, through the TDOA selection, more reliable observations are chosen by the DMAPF-S to improve its estimation accuracy. This advantage of the DMAPF-S over the DMAPF becomes more remarkable as the SNR decreases, which indicates that by using the TDOA selection scheme the DMAPF-S is more robust against white Gaussian noise. Furthermore, by comparing the performance differences between the DMAPF-S and DMAPF in Fig. 4 with those in Fig. 5, it is observable that through the TDOA selection scheme the DMAPF-S obtains better robustness against white Gaussian noise than against reverberation.

3) Effect of Communication Graphs: To further evaluate the proposed speaker tracking method, different communication graphs for the microphone network are tested under the
Fig. 8. Speaker tracking results with failed nodes: (a) X-dimension, (b) Y-dimension.
condition of T60 = 0.1 s and SNR = 10 dB. Fig. 6 depicts the graphs with communication radii of R = 1.7 m and R = 2.1 m, denoted as G2 and G3, respectively. Besides, we also consider the graphs obtained when some nodes of the network fail during the tracking procedure. Herein, we say "a node fails" if it can neither provide observations nor communicate with its neighbors, which, in terms of the communication graph, means that the edges associated with this node are removed from the graph. Fig. 7 depicts the graphs with different numbers of failed nodes for R = 1.8 m, denoted as G4, G5, G6, and G7, respectively. With other factors (such as the ambient environment, the speaker trajectory, and the locations of nodes) fixed, different graphs only affect the data exchange among neighboring nodes during the consensus computation of the DMAPF-S, which in turn affects the tracking performance. Table IV shows the RMSE results of the DMAPF-S versus different graphs and consensus iteration numbers C. We can observe that, for a given graph, the RMSE results become almost invariant as C increases. This is because each node asymptotically obtains the global likelihood as the number of consensus iterations increases, and the RMSE results then reach a steady value. When R is large, there are more communication links among nodes for data exchange,
Fig. 9. (a) Real office environment. (b) Layout for the microphones and speaker trajectory in real-world experiments.
thus fewer consensus iterations are usually required to reach this steady result. Compared with the graphs G1, G2, and G3, the RMSE values of G4, G5, G6, and G7 are somewhat larger. In fact, when some nodes fail, fewer observations are available for estimation, which somewhat degrades the tracking performance. Nevertheless, such degradation is acceptable and the DMAPF-S still obtains proper estimates, indicating its robustness against node failures. Fig. 8 illustrates the estimated speaker trajectory of the DMAPF-S when node failures occur, and it also shows that the proposed method is capable of tracking the speaker properly when some nodes fail. From Table IV, we can also observe that the RMSE values of the graphs with more failed nodes are not always larger than those with fewer failed nodes. For instance, the RMSE values of G7 are slightly larger than those of G4, but lower than those of G5 and G6. Though they have identical numbers of failed nodes, the RMSE values of G5 are much higher than those of G4. In fact, the failure of nodes
changes not only the communication graph of the network, but also its geometry. It has been proved that the geometry of the network can affect the localization accuracy [30], [31]. Usually, the nodes are arranged to span an area enclosing the source for desirable estimation accuracy [31]. Note from Figs. 7(b)–(c) that a part of the speaker trajectory goes out of the bounds of graphs G5 and G6, leading to the degradation of their estimation accuracy, but the proposed method can still track the speaker properly, as shown in Table IV. Though it has more failed nodes, the remaining nodes of G7 still enclose the speaker trajectory properly, yielding better tracking accuracy than G5 and G6. The problem of geometry optimization for the minimum estimation error has been discussed in [31] and the references therein. For distributed networks with a given graph and geometry, the problem of adaptive sensor selection for the best tracking performance is comprehensively discussed in [32].

D. Real-World Experiments

The validity of the proposed method is also investigated in a typical office room of size 10 m × 7 m × 3 m, where a rectangular area of size 4.8 m × 3.6 m is used to carry out the real-world experiments, as shown in Fig. 9(a). The distributed network comprises ten pairs of microphones (Model: KZ-800) at a height of 1.45 m, where the inter-microphone distance per pair is 0.6 m, and the speaker trajectory is similar to the simulation setup, as shown in Fig. 9(b). The sound source is a spherical loudspeaker (Model: BSWA-OS002) driven by an amplifier (Model: FEILE USB-180M), emitting a four-second clean female utterance recording (sampled at 16 kHz and digitized at 16 bits), and it is moved from (1.25, 2.03) m to (3.56, 2.46) m at a velocity of around 0.9 m/s, generating 40 position points. The reverberation time T60 is about 310 ms, measured as the 60 dB decay time of the energy of a high-level white noise signal emitted by a loudspeaker (Model: BSWA-OS002) after it is switched off. The ambient noise comes primarily from the cooling fans of the computers and the amplifier, and its level measured by a sound pressure meter (Model: TES1375) is 44 dB (A-weighted).

Fig. 10. Speaker tracking results in real-world experiments: (a) X-dimension, (b) Y-dimension.

The tracking results of the real-world experiments are depicted in Fig. 10. It can be observed that the proposed DMAPF-S method tracks the moving speaker successfully. Since the SNR level is high, the tracking results of all the methods, including the proposed method, are close to each other.

VII. CONCLUSION
In this paper, we propose a DMAPF and then apply it to speaker tracking in distributed microphone networks. Through marginalizing the state-space model, the speaker's position and velocity are estimated with the distributed APF and the distributed KF, respectively. To cope with the adverse effects of noise and reverberation, a TDOA selection scheme is presented to construct the local observation vector. Besides, a multiple-hypothesis model is used as the likelihood of the DMAPF. With the DMAPF-based speaker tracking method, each node has access to a global estimate of the speaker position. The proposed method possesses the advantages of the MPF and the APF. Moreover, as a distributed estimation approach, it requires only local communications among neighboring nodes and is scalable for speaker tracking. Comparative experiments with existing methods show that the proposed method has good robustness against noise and reverberation, and it can also obtain proper tracking accuracy under node failures. The proposed method is mainly suitable for tracking problems with a linear and Gaussian substructure. Thus, it is infeasible for scenarios where the process or observation noise is non-Gaussian and both the process and observation models are nonlinear.

REFERENCES

[1] B. W. Chen, C. Y. Chen, and J. F. Wang, "Smart homecare surveillance system: Behavior identification based on state-transition support vector machines and sound directivity pattern analysis," IEEE Trans. Syst., Man, Cybern. A, Syst., vol. 43, no. 6, pp. 1279–1289, Nov. 2013.
[2] M. Cobos, J. J. Perez-Solano, S. Felici-Castell, J. Segura, and J. M. Navarro, "Cumulative-sum-based localization of sound events in low-cost wireless acoustic sensor networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1792–1802, Jan. 2014.
[3] T. P. Spexard, M. Hanheide, and G. Sagerer, "Human-oriented interaction with an anthropomorphic robot," IEEE Trans. Robot., vol. 23, no. 5, pp. 852–862, Oct. 2007.
REFERENCES
[1] B. W. Chen, C. Y. Chen, and J. F. Wang, “Smart homecare surveillance system: Behavior identification based on state-transition support vector machines and sound directivity pattern analysis,” IEEE Trans. Syst., Man, Cybern. A, Syst., vol. 43, no. 6, pp. 1279–1289, Nov. 2013.
[2] M. Cobos, J. J. Perez-Solano, S. Felici-Castell, J. Segura, and J. M. Navarro, “Cumulative-sum-based localization of sound events in low-cost wireless acoustic sensor networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1792–1802, Jan. 2014.
[3] T. P. Spexard, M. Hanheide, and G. Sagerer, “Human-oriented interaction with an anthropomorphic robot,” IEEE Trans. Robot., vol. 23, no. 5, pp. 852–862, Oct. 2007.
[4] A. Bertrand, “Signal processing algorithms for wireless acoustic sensor networks,” Ph.D. dissertation, Faculty Eng., Katholieke Univ. Leuven, Leuven, Belgium, May 2011.
[5] M. Taseska and E. A. P. Habets, “Informed spatial filtering for sound extraction using distributed microphone arrays,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 7, pp. 1195–1207, May 2014.
[6] A. Canclini, F. Antonacci, A. Sarti, and S. Tubaro, “Acoustic source localization with distributed asynchronous microphone networks,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 2, pp. 439–443, Feb. 2013.
[7] A. Canclini, P. Bestagini, A. Sarti, and S. Tubaro, “A robust and low-complexity source localization algorithm for asynchronous distributed microphone networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 10, pp. 1563–1575, Jun. 2015.
[8] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” IEEE Trans. Signal Process., vol. 50, no. 2, pp. 174–188, Feb. 2002.
[9] P. Stano, Z. Lendek, J. Braaksma, R. Babuska, C. de Keizer, and A. J. den Dekker, “Parametric Bayesian filters for nonlinear stochastic dynamical systems: A survey,” IEEE Trans. Cybern., vol. 43, no. 6, pp. 1607–1624, Dec. 2013.
[10] Y. Tian, Z. Chen, and F. Yin, “Distributed Kalman filter-based speaker tracking in microphone array network,” Appl. Acoust., vol. 89, pp. 71–77, 2015.
[11] Y. Tian, Z. Chen, and F. Yin, “Distributed IMM-unscented Kalman filter for speaker tracking in microphone array networks,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 10, pp. 1637–1647, Oct. 2015.
[12] J. Vermaak and A. Blake, “Nonlinear filtering for speaker tracking in noisy and reverberant environments,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Salt Lake City, UT, USA, May 2001, pp. 3021–3024.
[13] D. B. Ward, E. A. Lehmann, and R. C. Williamson, “Particle filtering algorithms for tracking an acoustic source in a reverberant environment,” IEEE Trans. Audio, Speech, Lang. Process., vol. 11, no. 6, pp. 826–836, Nov. 2003.
[14] F. Talantzis, “An acoustic source localization and tracking framework using particle filtering and information theory,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1806–1817, Sep. 2010.
[15] A. Levy, S. Gannot, and E. A. P. Habets, “Multiple-hypothesis extended particle filter for acoustic source localization in reverberant environments,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 6, pp. 1540–1555, Aug. 2011.
[16] X. Zhong and J. R. Hopgood, “Particle filtering for TDOA based acoustic source tracking: Nonconcurrent multiple talkers,” Signal Process., vol. 96, pp. 382–394, Mar. 2014.
[17] M. K. Pitt and N. Shephard, “Filtering via simulation: Auxiliary particle filters,” J. Amer. Statist. Assoc., vol. 94, no. 446, pp. 590–599, Jun. 1999.
[18] T. Schon, F. Gustafsson, and P. J. Nordlund, “Marginalized particle filters for mixed linear/nonlinear state-space models,” IEEE Trans. Signal Process., vol. 53, no. 7, pp. 2279–2289, Jul. 2005.
[19] X. Zhong and J. R. Hopgood, “A time-frequency masking based random finite set particle filtering method for multiple acoustic source detection and tracking,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 12, pp. 2356–2370, Dec. 2015.
[20] C. Fritsche, T. B. Schon, and A. Klein, “The marginalized auxiliary particle filter,” in Proc. 3rd IEEE Int. Workshop Comput. Adv. Multi-Sensor Adaptive Process., Aruba, Dutch Antilles, Dec. 2009, pp. 289–292.
[21] O. Hlinka, F. Hlawatsch, and P. M. Djuric, “Distributed particle filtering in agent networks: A survey, classification, and comparison,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 61–81, Jan. 2013.
[22] D. Üstebay, M. Coates, and M. Rabbat, “Distributed auxiliary particle filters using selective gossip,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Prague, Czech Republic, May 2011, pp. 3296–3299.
[23] L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,” Syst. Control Lett., vol. 53, no. 1, pp. 65–78, Dec. 2004.
[24] S. Khanal, H. F. Silverman, and R. R. Shakya, “A free-source method (FrSM) for calibrating a large-aperture microphone array,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 8, pp. 1632–1639, Aug. 2013.
[25] R. Karlsson, T. Schon, and F. Gustafsson, “Complexity analysis of the marginalized particle filter,” IEEE Trans. Signal Process., vol. 53, no. 11, pp. 4408–4411, Nov. 2005.
[26] D. J. Lee, “Nonlinear estimation and multiple sensor fusion using unscented information filtering,” IEEE Signal Process. Lett., vol. 15, no. 8, pp. 861–864, Dec. 2008.
[27] C. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Trans. Acoust., Speech, Signal Process., vol. ASSP-24, no. 4, pp. 320–327, Aug. 1976.
[28] D. Bechler and K. Kroschel, “Three different reliability criteria for time delay estimates,” in Proc. IEEE 12th Eur. Signal Process. Conf., Vienna, Austria, Sep. 2004, pp. 1987–1990.
[29] E. A. Lehmann, A. M. Johansson, and S. Nordholm, “Reverberation-time prediction method for room impulse responses simulated with the image-source model,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., New Paltz, NY, USA, Oct. 2007, pp. 159–162.
[30] B. Yang, “Different sensor placement strategies for TDOA based localization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., Honolulu, HI, USA, Apr. 2007, pp. 1093–1096.
[31] K. C. Ho and L. M. Vicente, “Sensor allocation for source localization with decoupled range and bearing estimation,” IEEE Trans. Signal Process., vol. 56, no. 12, pp. 5773–5789, Dec. 2008.
[32] A. Mohammadi and A. Asif, “Consensus-based distributed dynamic sensor selection in decentralised sensor networks using the posterior Cramér–Rao lower bound,” Signal Process., vol. 108, pp. 558–575, Mar. 2015.
Qiaoling Zhang is currently working toward the Ph.D. degree in signal and information processing in the School of Information and Communication Engineering, Dalian University of Technology, Dalian, China. Her research interests include speech processing, object localization, and tracking.
Zhe Chen (S’99–M’03–SM’12) received the B.S. degree in electronics engineering and the M.S. and Ph.D. degrees in signal and information processing from Dalian University of Technology (DUT), Dalian, China, in 1996, 1999, and 2003, respectively. In 2002, he joined the Department of Electronics Engineering, DUT, as a Lecturer, and became an Associate Professor in 2006. His research interests include speech processing, image processing, and wideband wireless communication.
Fuliang Yin was born in Fushun City, Liaoning Province, China, in 1962. He received the B.S. degree in electronic engineering and the M.S. degree in communications and electronic systems from Dalian University of Technology (DUT), Dalian, China, in 1984 and 1987, respectively. In 1987, he joined the Department of Electronic Engineering, DUT, as a Lecturer, and became an Associate Professor in 1991. Since 1994, he has been a Professor at DUT, where he also served as the Dean of the School of Electronic and Information Engineering from 2000 to 2009. His research interests include digital signal processing, speech processing, image processing, and broadband wireless communication.