A Speaker Tracking Algorithm Based on Audio and Visual Information Fusion Using Particle Filter

Xin Li¹, Luo Sun¹, Linmi Tao¹, Guangyou Xu¹, and Ying Jia²

¹ Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
{x-li02, sunluo00}@mails.tsinghua.edu.cn, {linmi, xgy-dcs}@tsinghua.edu.cn
² Intel China Research Center, Raycom Infotech Park A, Beijing 100080, China
[email protected]
Abstract. Object tracking by sensor fusion has become an active research area in recent years, but how to fuse multiple sources of information efficiently and robustly remains an open problem. This paper presents a new algorithm for tracking a speaker based on audio and visual information fusion using a particle filter. A closed-loop architecture in which each individual tracker carries a reliability measure is adopted, and a new method for data fusion and reliability adjustment is proposed. Experiments show that the new algorithm fuses information efficiently and is robust to noise.
1 Introduction
Intelligent environments such as distributed meetings and smart classrooms have gained significant attention during the past few years [1][2]. One of the key technologies in these systems is a reliable speaker tracking module, since the speaker often needs to be emphasized. Tracking methods exist based on audio information (sound source localization, SSL) [3] as well as visual information [11]. As methods using only audio or only visual tracking are not robust, researchers are paying more and more attention to the fusion of audio and visual information.

In general, there are two paradigms for audio and visual fusion: bottom-up and top-down. Both paradigms have a fuser (a module that fuses information) and multiple sensors. The bottom-up paradigm starts from the sensors. Each sensor uses a tracker to estimate the unknown object state (e.g. object location and orientation), i.e. to solve the inverse problem based on the sensory data. Once individual tracking results are available, distributed sensor networks [4] or graphical models [5] are used to fuse them together to generate a more accurate and robust result. To make the inverse problem tractable, assumptions are typically made in the trackers and the fuser, e.g. system linearity and Gaussianity are assumed in the Kalman
tracker [6] and the fuser [4]. However, these simplifying assumptions inherently limit the robustness of the tracking system.

The top-down paradigm, on the other hand, emphasizes the fuser. It has an intelligent fuser but rather simple sensors [7][8], and it tries to achieve tracking by solving the forward problem. First, the fuser generates a set of hypotheses (also called particles; we use the two terms interchangeably in this paper) to explore the possible state space. Sensory data are then used to compute the likelihood/weight of each hypothesis. These weighted hypotheses are then used by the fuser to estimate the distribution of the object state. As it is usually much easier to verify a given hypothesis than to solve the inverse tracking problem (as in the bottom-up paradigm), more complex and accurate models can be used in the top-down paradigm, which in turn results in more robust tracking. However, because the sensors use verifiers instead of trackers, they do not help the fuser generate good hypotheses: the hypotheses are semi-blindly generated from the motion prediction [7]. Thus, when the possible state space is large, a great number of particles is needed, which results in heavy computational cost.

Recently, Y. Chen and Y. Rui proposed a fusion framework that integrates the two paradigms in a principled way [10]. It uses a closed-loop architecture in which the fuser and multiple sensors interact to exchange information by evaluating the reliability of the various trackers. However, due to the different characteristics of the visual and audio trackers, this method occasionally suppresses the information provided by the audio tracker and is therefore not robust under some conditions.

In this paper, we propose a new fusion and tracker-reliability adjustment method. Built on a closed-loop architecture, the new method emphasizes the audio information by making the visual tracker and the audio tracker more symmetric. The tracking system thus becomes more efficient in information fusion and robust to many kinds of noise.

The rest of the paper is organized as follows. In Section 2 we discuss our system framework and the individual trackers. Section 3 describes our method of fusion and reliability adjustment in detail. Section 4 gives experimental results, and Section 5 draws a brief conclusion.
2 System Framework and Individual Trackers
Our system uses an architecture similar to that in [10]. The individual trackers are first used to estimate the target's position; this information is then sent to a fuser to obtain a more accurate and robust tracking result, and the fuser computes and adjusts the reliability of each individual tracker. The whole process can be summarized as follows:

1. Estimate the target's position with the individual trackers.
2. Fuse the information to get the final result.
3. Adjust the reliability of the individual trackers.
4. Go to step 1 and process the data of the next frame.

We use a vision-based color tracker and an audio-based SSL tracker as the individual trackers. The color tracker is used to track the speaker's head, which is
modeled as an ellipse and initialized by a face detector. The SSL tracker is used to locate the speaker. A sketch of this per-frame loop is given below.
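To make the closed-loop process concrete, the following Python sketch shows how the pieces could be wired together. The individual trackers and the particle-filter fuser (described in Sections 2.1, 2.2 and 3) are injected as callables; the class, method and parameter names are illustrative assumptions, not part of the original system.

class SpeakerTrackingSystem:
    """Hypothetical skeleton of the closed-loop framework described above."""

    def __init__(self, color_tracker, ssl_tracker, fuser):
        self.color_tracker = color_tracker   # frame -> estimated head position X_c
        self.ssl_tracker = ssl_tracker       # audio block -> (position X_s, steadiness lambda_s)
        self.fuser = fuser                   # particle-filter fuser of Section 3.2
        self.lambda_v = 0.5                  # reliability of the visual (color) tracker
        self.lambda_a = 0.5                  # reliability of the audio (SSL) tracker

    def process_frame(self, frame, audio_block):
        # 1. Individual trackers estimate the target's position.
        x_c = self.color_tracker(frame)
        x_s, lambda_s = self.ssl_tracker(audio_block)

        # 2. Fuse the information to get the final result (Section 3.2); the
        #    fuser also returns the likelihoods of the two candidate estimates.
        x_t, p_v, p_a = self.fuser(frame, x_c, x_s,
                                   self.lambda_v, self.lambda_a, lambda_s)

        # 3. Adjust the reliability of the individual trackers (Eqs. 19-20).
        total = p_v + p_a
        if total > 0:
            self.lambda_v, self.lambda_a = p_v / total, p_a / total

        # 4. Return the fused estimate; the caller then processes the next frame.
        return x_t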
2.1 Audio Tracker (SSL Tracker)
Audio SSL is used to locate the position of the speaker. In our particular application of a smart classroom, the system cares most about the horizontal location of the speaker. Suppose there are two microphones, A and B. Let s(t) be the speaker's source signal, and x1(t) and x2(t) the signals received by the two microphones. Then

x1(t) = s(t − D) + h1(t) ∗ s(t) + n1(t)
x2(t) = s(t) + h2(t) ∗ s(t) + n2(t) ,        (1)

where D is the time delay between the two microphones, h1(t) and h2(t) represent reverberation, and n1(t) and n2(t) are additive noise. Assuming the signal and noise are uncorrelated, D can be estimated by finding the maximum of the cross correlation between x1(t) and x2(t), i.e. D = argmax_τ R̂_{x1x2}(τ) [3].

Suppose the middle point between the two microphones is position O and the source is at location S. We can estimate φ = ∠SOB by φ = arccos(D · v / |AB|) when |OS| ≫ |AB|, as shown in [3], where v is the speed of sound. This process can be generalized to a microphone array. For simplicity, we placed the camera and the microphone array so that their center points coincide in the horizontal plane. Given the parameters of the camera, namely the focal length f, the horizontal resolution kv, and the horizontal middle point in the pixel coordinate v0, the angle estimated by the microphone array can be converted into the camera's horizontal pixel coordinate:

xs = v0 − f · kv / tan(φ) .        (2)
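As an illustration of how Eqs. (1)-(2) are used, the following Python sketch estimates the delay D by a plain cross correlation and converts it into a horizontal pixel coordinate. The actual system uses the more robust delay estimator of [3]; the sampling rate fs and the camera/array parameters here are placeholders.

import numpy as np

SPEED_OF_SOUND = 343.0  # v, in m/s

def estimate_delay(x1, x2, fs):
    """Estimate D (in seconds) by maximizing the cross correlation of the two
    microphone signals; a plain correlation is shown here for illustration."""
    corr = np.correlate(x1, x2, mode="full")
    lag = np.argmax(corr) - (len(x2) - 1)      # lag in samples
    return lag / fs

def delay_to_pixel(delay, mic_distance, f, k_v, v0):
    """Convert the delay D into the angle phi and then into the camera's
    horizontal pixel coordinate x_s (Eq. 2)."""
    cos_phi = np.clip(delay * SPEED_OF_SOUND / mic_distance, -1.0, 1.0)
    phi = np.arccos(cos_phi)                   # phi = arccos(D * v / |AB|)
    return v0 - f * k_v / np.tan(phi)          # x_s = v0 - f * k_v / tan(phi)

Under these assumptions, a call such as delay_to_pixel(estimate_delay(x1, x2, 16000), 0.2, f, k_v, v0) would produce one SSL measurement per audio block.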
A reliability factor for SSL is also calculated based on the steadiness of the SSL results. We take the sound source locations in n consecutive frames, Xs1, Xs2, ..., Xsn, and compute the maximum difference between each consecutive pair:

d_max = max_{i=1,...,n−1} |Xs(i+1) − Xs(i)| .        (3)

The reliability factor is then obtained from a Gaussian model:

λs = exp(−d_max² / (2σ²)) ,        (4)

where σ is a parameter indicating the tolerance of the sound source position difference.
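Eqs. (3)-(4) translate into a few lines of Python; the window length n (the number of recent SSL positions kept) and the tolerance σ are tuning parameters whose values are not specified in the text, so they are left as arguments in this sketch.

import numpy as np

def ssl_reliability(recent_positions, sigma):
    """lambda_s from the last n SSL positions: the maximum consecutive
    difference d_max (Eq. 3) fed into a Gaussian model (Eq. 4)."""
    xs = np.asarray(recent_positions, dtype=float)
    if xs.size < 2:
        return 1.0                           # not enough history yet
    d_max = np.max(np.abs(np.diff(xs)))      # Eq. (3)
    return float(np.exp(-d_max ** 2 / (2.0 * sigma ** 2)))   # Eq. (4)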
2.2 Visual Tracker (Color Tracker)
We use a kernel-based color tracker [11] to estimate the target's position in a new frame. We assume that the color histogram of the target, hobj, is stable. To
track the object in the current frame, the previous frame's state X0 = Xc^{t−1} is used as an initial guess. The following steps are used to find the new object state Xc^t:

1. Let l index the iterations. Set l = 0.
2. Initialize the location of the target in the current frame with Xc^{t−1}, compute the color histogram hXl at Xl, and evaluate the similarity between the candidate and the target by computing the Bhattacharyya coefficient ρ[hobj, hXl] [11].
3. Derive the weights containing the gradient information [11].
4. Find the next candidate location of the target using the gradient information: XN [11].
5. Compute the color histogram hXN at the new position XN and the similarity between hXN and the target using the Bhattacharyya coefficient ρ[hobj, hXN].
6. If ρ[hobj, hXN] > ρ[hobj, hXl], go to step 7. Else, let XN = (Xl + XN)/2 and go to step 5.
7. If ||XN − Xl|| < ε, stop. Else, let l = l + 1, set Xl = XN, and go to step 2.

A simplified sketch of this iteration is given below.
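The following Python sketch illustrates the control flow only: it uses RGB histograms over a square patch rather than the elliptical, kernel-weighted region of [11], assumes an 8-bit RGB frame, and takes the gradient-based move of steps 3-4 as a supplied callable. It is not the implementation of [11].

import numpy as np

NUM_BINS = 32  # bins per channel, as in the experiments of Section 4

def color_histogram(frame, center, half_size):
    """Normalized RGB histogram of a square patch around `center`."""
    cx, cy = int(center[0]), int(center[1])
    patch = frame[max(cy - half_size, 0):cy + half_size,
                  max(cx - half_size, 0):cx + half_size]
    bins = patch.reshape(-1, 3).astype(np.int64) // (256 // NUM_BINS)
    idx = bins[:, 0] * NUM_BINS * NUM_BINS + bins[:, 1] * NUM_BINS + bins[:, 2]
    hist = np.bincount(idx, minlength=NUM_BINS ** 3).astype(float)
    return hist / max(hist.sum(), 1.0)

def bhattacharyya(h1, h2):
    """Similarity rho[h1, h2] between two normalized histograms."""
    return float(np.sum(np.sqrt(h1 * h2)))

def color_track(frame, h_obj, x_prev, mean_shift_step, half_size=32,
                eps=0.5, max_iter=20):
    """Track one frame: start from x_prev and move towards higher similarity.
    `mean_shift_step` returns the gradient-based candidate X_N (steps 3-4)."""
    x_l = np.asarray(x_prev, dtype=float)
    for _ in range(max_iter):
        rho_l = bhattacharyya(h_obj, color_histogram(frame, x_l, half_size))
        x_n = np.asarray(mean_shift_step(frame, h_obj, x_l), dtype=float)
        # Steps 5-6: bisect back towards x_l until the similarity improves.
        while bhattacharyya(h_obj, color_histogram(frame, x_n, half_size)) < rho_l:
            x_n = (x_l + x_n) / 2.0
            if np.linalg.norm(x_n - x_l) < eps:
                break
        if np.linalg.norm(x_n - x_l) < eps:   # Step 7: converged
            return x_n
        x_l = x_n
    return x_l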
3 Data Fusion and Reliability Adjustment
We first briefly review the importance particle filter (ICondensation) algorithm [9], which is used in our system, and then describe our data fusion method.

3.1 Generic Particle Filter
In the Condensation algorithm the posterior distribution of the target's position is approximated by properly weighted discrete particles:

p(X_{0:t} | Z_{1:t}) = Σ_{i=1}^{N} w_{0:t}^{(i)} · δ(X_{0:t} − X_{0:t}^{(i)}) ,        (5)

where the w_{0:t}^{(i)} are the weights of the particles and δ is the function with δ(x) = 1 for x = 0 and δ(x) = 0 otherwise. As N → ∞, this approximation gets closer and closer to the actual posterior. The target's position can then be estimated by taking the expectation of the posterior.

The ICondensation algorithm draws particles from an importance function q to concentrate the particles in the most likely part of the state space. The weights w_{0:t}^{(i)} are then calculated as

w̃_{0:t}^{(i)} = p(Z_{1:t} | X_{0:t}^{(i)}) p(X_{0:t}^{(i)} | X_{0:t−1}^{(i)}) / q(X_{0:t}^{(i)} | X_{0:t−1}^{(i)}, Z_{1:t}) ,    w_{0:t}^{(i)} = w̃_{0:t}^{(i)} / Σ_{i=1}^{N} w̃_{0:t}^{(i)} .        (6)

A recursive calculation of the weights can be obtained [12]:

w̃_t^{(i)} = w̃_{t−1}^{(i)} · p(Z_t | X_t^{(i)}) p(X_t^{(i)} | X_{t−1}^{(i)}) / q(X_{0:t}^{(i)} | X_{0:t−1}^{(i)}, Z_{1:t}) .        (7)
The importance particle filtering process can then be summarized in three steps:

1. Sampling: N particles X_t^{(i)}, i = 1, ..., N, are sampled from the proposal function q(X_t | X_{0:t−1}, Z_{1:t}).
2. Measurement: compute the particle weights by Eq. (7).
3. Output: decide the object state according to the posterior distribution.

In the ICondensation algorithm the importance function is crucial. Poor proposals (far from the true posterior) generate particles with negligible weights, which are thus wasted, while particles generated from good proposals (similar to the true posterior) are highly effective. Choosing the right proposal distribution is therefore of great importance. A sketch of one filtering step is given below.
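A generic sketch of one such filtering step, for the one-dimensional (horizontal) state used in our setting, could look as follows; the proposal sampler and the proposal, transition, and likelihood densities are supplied by the caller, and all names are illustrative.

import numpy as np

def importance_particle_filter_step(particles, weights, sample_proposal,
                                    proposal_pdf, transition_pdf, likelihood):
    """One step of the importance particle filter for a scalar state."""
    particles = np.asarray(particles, dtype=float)
    weights = np.asarray(weights, dtype=float)
    n = particles.shape[0]
    new_particles = np.empty(n)
    new_weights = np.empty(n)

    for i in range(n):
        # 1. Sampling: draw particle i from the proposal q(X_t | X_{0:t-1}, Z_{1:t}).
        x_new = sample_proposal(particles[i])
        # 2. Measurement: recursive weight update of Eq. (7).
        num = likelihood(x_new) * transition_pdf(x_new, particles[i])
        den = max(proposal_pdf(x_new, particles[i]), 1e-12)
        new_weights[i] = weights[i] * num / den
        new_particles[i] = x_new

    new_weights /= max(new_weights.sum(), 1e-12)
    # 3. Output: estimate the state as the expectation under the posterior.
    estimate = float(np.sum(new_weights * new_particles))
    return new_particles, new_weights, estimate

In practice the proposal q is where the individual trackers enter, which is the subject of the next subsection.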
3.2 Fusion and Adjustment
As in [10], we use the individual trackers discussed above to generate hypotheses and verifiers (observation models) to calculate weights. However, in [10] the contour tracker and the color tracker both use the tracking result of the previous frame as an initialization, so the reliability of these trackers will usually be high, as their proposals will not be far from the posterior. The SSL tracker, on the other hand, does not depend on previous tracking results and its result is not always accurate, so its reliability may sometimes become very low even though it provides valuable information about the speaker. We found in an experiment that when a person passes in front of the speaker while the SSL result is slightly off the speaker due to some inaccuracy, tracking is lost for a while and this decreases the reliability of the SSL tracker. The audio information is then suppressed and few particles are drawn from it, which keeps the tracking result lost; in turn the reliability of the SSL tracker continues to drop and the lost track may never recover.

To overcome this defect, we develop a new fusion and reliability adjustment method which emphasizes the audio information by making the visual and audio trackers more symmetric. Rather than forming a joint prior from the audio and visual trackers as in [10], we treat the visual tracker (color tracker) and the audio tracker (SSL tracker) separately. We assign each of the two trackers a reliability factor, λv and λa, where λv + λa = 1. Note that λa is different from λs. The particle filter then proceeds as follows:

1. Generate prior distributions: we generate two prior distributions,

   qv(Xt) = N(Xc, Σc)(Xt)        (8)
   qa(Xt) = N(Xs, Σs)(Xt) ,        (9)

   where N denotes a normal distribution, and Xc and Xs are the expectations, i.e. the object positions estimated by the color tracker and the SSL tracker. Σc and Σs are the covariance matrices (in one dimension, the variances) of the two distributions, indicating the uncertainty of the two trackers.
2. Generate particles and calculate weights: particles are drawn from the two distributions qv(Xt) and qa(Xt) respectively. We then define visual and audio observation models to calculate the weights. The visual observation model is defined as the similarity between the color histograms of the candidate and the target [11]:

   pc(Zt | Xt) = ρ[hobj, hXt] .        (10)

   The audio observation model is defined as the ratio of the correlation value at this position to the highest correlation value [3]:

   pa(Zt | Xt) = R̂_{x1x2}(D_X) / R̂_{x1x2}(D) ,
   D_X = |AB|/v · sin(arctan(f · kv / (v0 − Xt))) .        (11)

   Assuming independence between the two observation models, the likelihood is then calculated as

   p(Zt | Xt) = pc(Zt | Xt) · pa(Zt | Xt) .        (12)
   The weights of the particles are then calculated using Eq. (7).

3. Decide the final target position: after the previous two steps we obtain two posterior distributions for the target, pv(Xt | Zt) and pa(Xt | Zt). Two estimates of the target's position are then obtained:

   Xvt = E_{pv(Xt | Zt)}[Xt]        (13)
   Xat = E_{pa(Xt | Zt)}[Xt] ,        (14)

   where E denotes the expectation. The likelihoods of these two estimates are

   p(Zt | Xvt) = pc(Zt | Xvt) · pa(Zt | Xvt)        (15)
   p(Zt | Xat) = pc(Zt | Xat) · pa(Zt | Xat) .        (16)

   By applying the reliabilities of the visual and audio trackers, we get

   L(Xvt) = λv · p(Zt | Xvt)        (17)
   L(Xat) = λs · λa · p(Zt | Xat) .        (18)
   Finally, the larger of L(Xvt) and L(Xat) decides whether the target's position is taken to be Xvt or Xat.

4. Reliability adjustment: in this step we tune the reliability factors. λs has already been calculated above; here we adjust λv and λa according to the values of p(Zt | Xvt) and p(Zt | Xat) obtained earlier, which indicate how likely each estimated position is to be the true position. So we define

   λv = p(Zt | Xvt) / (p(Zt | Xvt) + p(Zt | Xat))        (19)
   λa = p(Zt | Xat) / (p(Zt | Xvt) + p(Zt | Xat)) .        (20)

An illustrative sketch of this fusion and adjustment step is given after this list.
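The following Python sketch puts steps 1-4 together for a one-dimensional horizontal state, under one plausible reading of the above: particles are drawn from the two Gaussian priors of Eqs. (8)-(9), weighted directly by the joint likelihood of Eq. (12) (rather than through the full recursion of Eq. (7)), and the two posterior means compete through Eqs. (17)-(18). The observation models p_color and p_audio, the per-prior particle count, and the random generator are supplied by the caller; all names are illustrative.

import numpy as np

def fuse_and_adjust(x_c, sigma_c, x_s, sigma_s,
                    lambda_v, lambda_a, lambda_s,
                    p_color, p_audio, n_particles, rng):
    # 1. Priors from the two trackers (Eqs. 8-9) and particles drawn from them.
    particles_v = rng.normal(x_c, sigma_c, n_particles)
    particles_a = rng.normal(x_s, sigma_s, n_particles)

    def posterior_mean(particles):
        # 2. Joint likelihood p(Z_t | X_t) = p_c * p_a (Eq. 12) as particle weights.
        w = np.array([p_color(x) * p_audio(x) for x in particles])
        w /= max(w.sum(), 1e-12)
        return float(np.sum(w * particles))

    # 3. Two candidate estimates (Eqs. 13-14) and their likelihoods (Eqs. 15-16).
    x_v = posterior_mean(particles_v)
    x_a = posterior_mean(particles_a)
    p_v = p_color(x_v) * p_audio(x_v)
    p_a = p_color(x_a) * p_audio(x_a)

    # Reliability-weighted scores (Eqs. 17-18) decide the final position.
    x_final = x_v if lambda_v * p_v >= lambda_s * lambda_a * p_a else x_a

    # 4. Reliability adjustment for the next frame (Eqs. 19-20).
    total = p_v + p_a
    lambda_v_new = p_v / total if total > 0 else 0.5
    lambda_a_new = 1.0 - lambda_v_new

    return x_final, lambda_v_new, lambda_a_new

A caller would construct rng = np.random.default_rng() and build p_color and p_audio from the histogram similarity of Eq. (10) and the normalized correlation of Eq. (11).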
4 Experimental Results
Our algorithm has the following advantages. First, with a finite (especially small) number of particles, it exploits the audio information sufficiently by drawing a comparatively large number of particles from it, which enhances the robustness of tracking. Even if the tracker fails for a moment, for example when a person crosses in front of the speaker, the system later recovers from the error by using the audio information (obtaining a larger L(Xat) than L(Xvt)). Second, our algorithm is also robust to audio noise: when audio noise occurs, the sound source localization results become unsteady, resulting in a small λs, which decreases the influence of the noise.

We now give some of our experimental results. In all frames the red rectangle represents the tracking result and the green line represents the result of SSL. The color tracker uses 32*32*32 bins for the quantization of the color space, and 300 particles are used in ICondensation. The system runs on a standard AMD 1.8 GHz CPU while processing the 15 frames/sec video sequence. Figure 1 shows that our fused tracker is more robust than a single vision-based tracker.
Fig. 1. Single vision-based tracker vs. our fused tracker. Upper row (left to right): tracking by a single vision-based color tracker; lower row (left to right): tracking by our fused tracker.
The single vision-based tracker (upper three frames) loses track, while the fused tracker (lower three frames) does not. Figure 2 shows that our new algorithm is more robust than the algorithm used in [10] when both use a color tracker and an SSL tracker. The upper three frames show tracking by the joint prior and reliability adjustment method of [10]: tracking is lost and cannot recover because the reliability of the audio tracker decreases rapidly. The lower three frames show tracking by our algorithm: tracking recovers after the two persons cross each other.
Fig. 2. Comparison with the method in [10]. Upper row (left to right): tracking by the fusion and adjustment method in [10]; tracking is lost. Lower row (left to right): tracking by our fusion and adjustment method; tracking recovers.
Fig. 3. Test against noise (left to right, top to bottom), including light changes (turning lights on/off), background changes, persons crossing each other, and audio noise.
Figure 3 shows that our algorithm is robust to noise. In this sequence the algorithm is tested against light changes (turning lights on/off), background changes, persons crossing each other, and the audio noise in the room (computer fans, the noise of TV monitors, etc.).
5 Conclusion
In this paper, we presented a speaker tracking algorithm based on fusing audio and visual information. On top of a closed-loop architecture, we proposed a new fusion and tracker-reliability adjustment method which better exploits the symmetry between the visual and audio information. Individual trackers are first used to track the speaker, and a particle filter is then used to fuse the information. Experiments show that with the proposed method the system is efficient in fusing
information and robust to many kinds of noise. In future work, other trackers, such as a contour tracker, can be included in the algorithm (as another visual tracker, for example) to further enhance robustness.
References

[1] R. Cutler, Y. Rui, A. Gupta, J. J. Cadiz, I. Tashev, L. He, A. Colburn, Z. Zhang, Z. Liu, and S. Silverberg, Distributed meetings: A meeting capture and broadcasting system, in Proc. ACM Conf. on Multimedia, 2002, pp. 123-132.
[2] Y. Rui, L. He, A. Gupta, and Q. Liu, Building an intelligent camera management system, in Proc. ACM Conf. on Multimedia, 2001, pp. 2-11.
[3] Y. Rui and D. Florencio, Time delay estimation in the presence of correlated noise and reverberation, Technical Report MSR-TR-2003-01, Microsoft Research Redmond, 2003.
[4] K. C. Chang, C. Y. Chong, and Y. Bar-Shalom, Joint probabilistic data association in distributed sensor networks, IEEE Trans. Automat. Contr., vol. 31, no. 10, pp. 889-897, 1986.
[5] J. Sherrah and S. Gong, Continuous global evidence-based Bayesian modality fusion for simultaneous tracking of multiple objects, in Proc. IEEE Int'l Conf. on Computer Vision, 2001, pp. 42-49.
[6] B. Anderson and J. Moore, Optimal Filtering, Englewood Cliffs, NJ: Prentice-Hall, 1979.
[7] J. Vermaak, A. Blake, M. Gangnet, and P. Perez, Sequential Monte Carlo fusion of sound and vision for speaker tracking, in Proc. IEEE Int'l Conf. on Computer Vision, 2001, pp. 741-746.
[8] G. Loy, L. Fletcher, N. Apostoloff, and A. Zelinsky, An adaptive fusion architecture for target tracking, in Proc. Int'l Conf. on Automatic Face and Gesture Recognition, 2002, pp. 261-266.
[9] M. Isard and A. Blake, ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework, in Proc. European Conf. on Computer Vision, 1998, pp. 767-781.
[10] Y. Chen and Y. Rui, Speaker detection using particle filter sensor fusion, in Proc. Asian Conf. on Computer Vision, 2004.
[11] D. Comaniciu and P. Meer, Kernel-based object tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, May 2003.
[12] R. Merwe, A. Doucet, N. Freitas, and E. Wan, The unscented particle filter, Technical Report CUED/F-INFENG/TR 380, Cambridge University Engineering Department, 2000.