An MCMC-based Particle Filter for Multiple Person Tracking

I. Zuriarrain†‡, F. Lerasle‡¶, N. Arana†, M. Devy‡
† University of Mondragon, Goi Eskola Politeknikoa, Spain
‡ CNRS; LAAS; 7 avenue du Colonel Roche, 31077 Toulouse Cedex 4, France
¶ Université de Toulouse; UPS; 118 route de Narbonne, 31062 Toulouse Cedex 9, France
{izuriarrain, narana}@eps.mondragon.edu, {lerasle, devy}@laas.fr
Abstract

This paper presents a Markov Chain Monte Carlo (MCMC) based particle filter for tracking multiple persons, dedicated to video surveillance applications. This hybrid tracker, devoted to networked intelligent cameras, benefits from the best properties of both the MCMC and the joint particle filter approaches. A saliency map-based proposal distribution is shown to limit the well-known burst in the number of particles and MCMC iterations. Qualitative and quantitative results on real-world video data are presented.
1. Introduction and framework

Visual multiple-target tracking (MTT) has received tremendous attention in the vision community due to its video surveillance applications, e.g. to help extend independent living for the elderly in their own homes. Deploying a network of ceiling-mounted cameras is a deep challenge: the system should be easy to install by a non-expert user, while the cameras should have onboard CPU resources so that only high-level data (such as positions and characteristics of the targeted persons) are exchanged over the network. Since no intelligent cameras dedicated to human tracking (as opposed to human detection) are currently available off-the-shelf, our tracker is devoted to an intelligent camera based on FPGA boards, in order to execute parts of the algorithm in parallel (figure 1). Besides this broader technological aim, the traditional challenge with MTT is to simultaneously track persons who may a priori enter, exit, pass close to one another, or merge in the scene. The classical MTT literature addresses these difficulties through either a decentralized [8, 12] or a centralized [3, 4, 13] solution. The former, based on (possibly interacting) distributed filters, i.e. one filter per target, suffers from "data association errors" whenever targets pass close to one another.
Figure 1. Delta Technologies wireless camera and FPGA board close-up [1].
Besides, the centralized approach estimates a joint state which concatenates all of the targets' states, and so estimates both discrete variables (number of targets) and continuous ones (positions). By characterizing all possible associations between targets and observations, this formulation deals more appropriately with the joint data association problem. The literature proposes two major stochastic centralized approaches to MTT: Markov Chain Monte Carlo (MCMC) and the joint particle filter (PF). The former [4, 9, 13] is based on a sequential and iterative sampling of the posterior probability and properly handles discrete events such as jump dynamics (target appearance/disappearance). Its drawbacks are twofold: (1) it cannot be parallelised on clusters, and (2) the speed of the Markov chain strongly depends on the proposal probabilities and on the initial state. On the other hand, the joint PF [3, 5] samples the diffusion dynamics, i.e. the continuous variables, more efficiently and is easily parallelised. The main reservation remains the exponential increase in the number of particles with the number of tracked targets, even if an efficient proposal distribution can limit this phenomenon. From these insights, we design a hybrid tracker which encapsulates the best properties of MCMC and PF.
Moreover, a proposal distribution, dependent on both the diffusion dynamics and the current observations, limits the well-known burst in the number of particles and MCMC iterations. This informed proposal based on saliency maps has, as far as we are aware, never been used in a PF framework. The paper is organized as follows. Section 2 describes the proposed hybrid PF framework. Its implementation and associated experimental results on real-world video sequences are presented in section 3. Finally, in section 4, a brief summary and future works are discussed.
2. A hybrid PF algorithm

The principle of our tracker is depicted in algorithm (1). It is based on the genuine ICONDENSATION framework [2], which enjoys the nice property of sampling from both visual detectors and target dynamics.

Algorithm 1 Hybrid MCMC/PF algorithm at frame $k$.
1: Generate detection saliency maps
2: Generate dynamic model saliency map
3: Generate unified saliency map $S$
4: for $i = 0$ to $N_p$ do
5:   repeat
6:     Draw position for particle $x^i_k$
7:     Draw threshold $\alpha_r$
8:   until $S(x^i_k) > \alpha_r$
9:   for $j = 0$ to $N_i$ do
10:     Draw new state $x'$ and threshold $\alpha_m$
11:     Evaluate proposal probability for $x'$
12:     if proposal probability $\geq \alpha_m$ then
13:       $x^i_k = x'$
14:     else
15:       keep $x^i_k$ unchanged
16:     end if
17:   end for
18: end for
19: Calculate particle weights
20: Select most probable target configuration
21: Calculate the mean position of each target using a weighted average of the particles corresponding to the selected configuration

This importance function is based on saliency maps which encode information about target dynamics and visual detectors. These maps are then merged (step 3) into a single saliency map that shows all high-probability areas for particle placement. All these saliency maps (except for the final merged map, for obvious reasons) are completely independent and so can be computed in parallel.
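The fusion rule itself is left unspecified beyond the merging of the individual maps; the following is a minimal sketch, assuming a simple weighted sum of normalized maps (the weights and normalization are illustrative, not taken from the paper):

```python
import numpy as np

def unified_saliency_map(maps, weights=None):
    """Merge independent saliency maps (e.g. motion detector, human
    detector, dynamic model) into a single map S of high-probability
    areas.  A weighted sum is an assumption: the paper only states
    that the maps are merged."""
    maps = [m / (m.sum() + 1e-12) for m in maps]   # normalize each map
    if weights is None:
        weights = [1.0 / len(maps)] * len(maps)
    S = sum(w * m for w, m in zip(weights, maps))
    return S / (S.max() + 1e-12)                   # rescale to [0, 1]
```

Since each input map depends only on its own detector, the per-map computations can run in parallel, with only the final sum done sequentially, which matches the parallelisation argument above.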
Besides the parallelisation aspect, data fusion in the importance function brings an important advantage: it allows us to better place the $N_p$ particles in high-probability areas, and so to drastically limit the burst in terms of particles. The particle sampling is done by rejection sampling (step 5). The position of the $i$-th particle $x^i_k$ at instant $k$ is randomly drawn from a uniform distribution, and the value at that point in the saliency map, $S(x^i_k)$, is then compared to a threshold $\alpha_r$, also randomly drawn. If the value exceeds the threshold, the position is accepted; else it is rejected and the process is repeated. As this process only involves looking up a value in the saliency map, it is computationally very cheap, as long as repetitions are limited to a certain number of iterations (noted $N_i$). It can also be done in parallel, assuming the data for the saliency map has been stored in such a way that it can be accessed in parallel. The result of this process is shown in Figure 2.
Figure 2. Saliency map for sampling the particles.
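A minimal sketch of this rejection-sampling step (steps 5-8 of algorithm 1), assuming $S$ has been rescaled to [0, 1]; the retry cap is an illustrative safeguard, not a value from the paper:

```python
import numpy as np

def draw_particle_position(S, rng, max_tries=25):
    """Draw one particle position by rejection sampling on the
    unified saliency map S (values assumed in [0, 1])."""
    h, w = S.shape
    for _ in range(max_tries):
        u, v = rng.integers(0, w), rng.integers(0, h)  # uniform position draw
        alpha_r = rng.random()                         # random threshold
        if S[v, u] > alpha_r:                          # keep high-saliency draws
            return u, v
    return u, v  # fall back to the last draw after max_tries

# Usage: rng = np.random.default_rng(); u, v = draw_particle_position(S, rng)
```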
This combination of saliency maps and rejection sampling ensures that the particles are placed in the relevant areas of the state space. The process so far deals only with the changes in the continuous parameters (targets' positions), but assumes the number and identities of the targets remain constant. In order to manage such discrete variables, we have introduced a Markov chain (step 9). In this step, we propose changes to the configuration of the target set for each particle, which are accepted or rejected based on their proposal probabilities. The set of possible changes and the proposal probabilities we have used are defined in section 3. Traditionally, an MCMC process requires a high number of burn-in iterations, as the Markov chain must usually, given an initial state, move between the continuous and discrete subspaces corresponding respectively to the diffusion and jump dynamics. In our case, the iteration number $N_i$ can be reduced drastically because: (1) the particle set introduces diversity in the jump dynamics, and (2) the continuous parameters are handled by the importance sampling. Given the particles sampled in the previous step, the initial state configuration is usually close enough to the final one that we can practically consider a much reduced number of iterations.

In the weighting phase (step 19), the likelihood of each particle is measured and a weight is assigned to it. These measurements are detailed in section 3; for now, suffice it to say that we make use of both color and motion cues. Finally, we must select the most probable state vector configuration (steps 20, 21). For this, we first select the configuration that has the highest cumulative probability, and then compute a weighted average of the continuous parameters over the particles that correspond to the selected configuration.
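As an illustration of steps 20-21, assuming each particle carries a discrete configuration label, per-target continuous states, and a weight (this data layout is hypothetical, chosen only for the sketch):

```python
import numpy as np
from collections import defaultdict

def estimate_targets(particles):
    """particles: list of (config, states, weight), where config is a
    hashable label of the target set and states maps target IDs to
    (u, v, s) tuples.  The layout is an assumption for illustration."""
    # Step 20: pick the configuration with the highest cumulative weight.
    mass = defaultdict(float)
    for config, _, w in particles:
        mass[config] += w
    best = max(mass, key=mass.get)
    # Step 21: weighted average of the continuous parameters over the
    # particles that carry the selected configuration.
    chosen = [(states, w) for config, states, w in particles if config == best]
    total = sum(w for _, w in chosen)
    estimates = {
        tid: sum(w * np.asarray(states[tid]) for states, w in chosen) / total
        for tid in chosen[0][0]
    }
    return best, estimates
```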
3. Implementation issues and experiments

3.1 Generalities
Our algorithm uses a single state vector for all the targets. This results in a variable-length state vector, parametrized at frame $k$ as

$$x_k = \{\{k_1, \{u_1, v_1, s_1\}\}, \ldots, \{k_n, \{u_n, v_n, s_n\}\}\}_k, \quad (1)$$

where $k_n$ is the ID of the $n$-th target, $u_n$ and $v_n$ its image position, and $s_n$ its scale (the index $i$ of the particles is omitted here). With regard to the dynamics model $p(x_k|x_{k-1})$, the image motions of observed people are difficult to characterize over time. This weak knowledge is formalized by assuming that the continuous entries evolve according to mutually independent Gaussian random walk models, viz. $p(x_k|x_{k-1}) = \mathcal{N}(x_k|x_{k-1}, \Sigma)$, where $\mathcal{N}(.|\mu, \Sigma)$ is a Gaussian distribution with mean $\mu$ and covariance $\Sigma = \mathrm{diag}(\sigma_u^2, \sigma_v^2, \sigma_s^2)$. Our saliency map is based on a motion detector and a human detector. The motion detector is based on a multiple-Gaussian-mixture background subtraction algorithm [10], with some added morphological classification of movement blobs to remove false positives. The human detection is based on an AdaBoost cascade of Haar-like features, pioneered for face detection by Viola et al. [11] and then extended to upright whole human bodies [6].
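A minimal sketch of this Gaussian random-walk prediction; the standard deviations are illustrative, not the paper's values:

```python
import numpy as np

def predict(targets, rng, sigmas=(4.0, 4.0, 0.05)):
    """Propagate each target's continuous entries (u, v, s) with
    mutually independent Gaussian random walks,
    p(x_k | x_{k-1}) = N(x_k | x_{k-1}, diag(sigma_u^2, sigma_v^2, sigma_s^2)).
    The sigma values are illustrative."""
    out = []
    for target_id, (u, v, s) in targets:
        du, dv, ds = rng.normal(0.0, sigmas)  # one independent draw per component
        out.append((target_id, (u + du, v + dv, s + ds)))
    return out
```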
3.2 Markov Chain Monte Carlo events
Our Markov chain uses four events to describe changes in the target configuration: Entry and Exit, which represent actual entries and exits through either the borders of the image or entrances into the room (we keep a pre-generated model of the background entries and exits for this), and Appearance and Disappearance, which correspond to situations where we have to add or remove a target without having a proper entry or exit (for example, for a mislabelled target). Each of these events has an associated proposal probability $q(\theta|\theta')$, where $\theta'$ is the new proposed state caused by the addition or removal of a target $x$. This probability is calculated as shown hereafter, where $P(x)$ is the probability that there is a target in a given location, $P(\emptyset)$ is the probability that there is no target, and $P(x|\emptyset)$ and $P(\emptyset|x)$ are the probabilities that a target has appeared or disappeared, respectively.

Entry/Appearance event:

$$q(\theta|\theta') = \frac{P(x|\emptyset)\,\bigl(1 - P(\emptyset)\bigr)}{P(\emptyset)\,\bigl(1 - P(x|\emptyset)\bigr)}.$$

Exit/Disappearance event:

$$q(\theta|\theta') = \frac{P(\emptyset|x)\,\bigl(1 - P(x)\bigr)}{P(x)\,\bigl(1 - P(\emptyset|x)\bigr)}.$$
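As a sketch, the two ratios could be evaluated as below; how the input probabilities are obtained (e.g. from the saliency maps and the background model) is left open by the paper, so these functions only encode the formulas:

```python
def entry_proposal(p_appear, p_empty):
    """Entry/Appearance: q = P(x|0)(1 - P(0)) / (P(0)(1 - P(x|0))),
    writing 0 for the empty hypothesis."""
    return p_appear * (1.0 - p_empty) / (p_empty * (1.0 - p_appear))

def exit_proposal(p_disappear, p_target):
    """Exit/Disappearance: q = P(0|x)(1 - P(x)) / (P(x)(1 - P(0|x)))."""
    return p_disappear * (1.0 - p_target) / (p_target * (1.0 - p_disappear))
```

In step 12 of algorithm 1, the proposed jump is then accepted whenever this proposal probability exceeds the randomly drawn threshold $\alpha_m$.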
3.3 Particle weight calculation
The weighting of the particles uses both motion and color distribution cues. For the motion cue, we calculate the intensity distribution of the difference image obtained by subtracting the previous image from the current one, and compare it to a uniform distribution using the Bhattacharyya distance [7]. The color cue is slightly more complicated, as it has to deal with new and existing targets differently. For new targets, the measured color distribution is compared to the color distribution expected of the background at that point: the greater the distance, the greater the probability that a person is there. For existing targets, we compare with a reference distribution for that target, taken when the target first enters the scene. All these comparisons also use the Bhattacharyya distance [7]. The above measurements are assumed mutually independent, i.e. only weak correlation exists between motion and color, so the unified measurement in step 19 (algorithm (1)) factorizes simply as their product.
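A minimal sketch of the fused weight for an existing target, assuming normalized histograms as distributions; the $\exp(-\lambda d^2)$ distance-to-likelihood mapping follows the data-fusion scheme of [7], $\lambda$ is an illustrative parameter, and cues for which a large distance is the evidence (e.g. a new target's color against the background) would invert the mapping accordingly:

```python
import numpy as np

def bhattacharyya_distance(p, q):
    """Distance between two normalized histograms p and q."""
    bc = np.sum(np.sqrt(p * q))           # Bhattacharyya coefficient
    return np.sqrt(max(1.0 - bc, 0.0))

def particle_weight(motion_hist, color_hist, ref_color_hist, lam=20.0):
    """Fused weight as the product of independent motion and color
    likelihoods (step 19 of algorithm 1).  exp(-lam * d^2) as the
    distance-to-likelihood mapping follows [7]; lam is illustrative."""
    uniform = np.full_like(motion_hist, 1.0 / motion_hist.size)
    d_motion = bhattacharyya_distance(motion_hist, uniform)
    d_color = bhattacharyya_distance(color_hist, ref_color_hist)
    return np.exp(-lam * d_motion**2) * np.exp(-lam * d_color**2)
```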
3.4 Experimental results
We have evaluated our approach by tracking through video sequences of a human-centered environment, namely our robotics hall. The test sequence set is composed of images at a resolution of 320 × 240 pixels, taken at 15 fps with a ceiling-mounted camera. We have taken numerous sequences (a total of 1842 images) involving multiple persons who undergo partial or complete occlusions. The tracker runs at 1 fps, for $N_p = 150$ particles and $N_i = 3$ iterations in the Markov chain, on a P4 2.7 GHz PC with un-optimized C++ code.
Figure 3 illustrates a run on a sequence involving three persons. The entire video and more illustrations are available at http://www.laas.fr/~izuriarr/. On average, the algorithm tracks the targets correctly in a no-event situation (i.e., no target leaves or enters the scene) during 95% of the frames (the 5% of failures accounts for cases where, as in Figures 3(c) and 3(d), targets are lost due to occlusion). As for event detection, the tracker correctly detects new entrances in their first frame of appearance in 90% of cases, and in 100% by the third frame. Exits tend to take longer, with only 40% detected in their first exit frame, and 90% by the fourth frame. Appearances and disappearances show similar results, with a high percentage of appearance events detected early on (around 95% by the second frame after the person becomes visible again), while disappearances take around five frames. Acceptance of false events is uncommon, with an average of one occurrence per 150 frames, and is generally corrected within a single frame.
(a) Frame 11: A target enters the scene
(b) Frame 39: A second target enters the scene
(c) Frame 108: Target 2 occludes Target 3, so the tracker is confused
(d) Frame 112: A few frames after the occlusion has cleared, target 3 has recovered
Figure 3. Snapshots of a test sequence involving three people.
4. Conclusion and future works

In this paper we have proposed a novel PF algorithm in which particles are placed in high-probability areas via the use of saliency maps and a rejection sampling algorithm. The particles are then run through a short MCMC process in order to manage entries and exits in a probabilistic fashion. Current investigations concern extending the algorithm by using knowledge of the camera model and the assumption that motion occurs on a known plane; this allows us to make inferences in 3D and to account for changes in the image due to perspective effects. Our mid-term research goal concerns the implementation of the algorithm in a manner suitable for an FPGA-based intelligent camera.
References

[1] Delta Technologies Sud Ouest (DTSO), Toulouse. www.delta-technologies.fr.
[2] M. Isard and A. Blake. ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework. In European Conf. on Computer Vision (ECCV'98), pages 893-908, London, UK, June 1998.
[3] M. Isard and J. MacCormick. BraMBLe: a Bayesian multiple blob tracker. In Int. Conf. on Computer Vision (ICCV'01), volume 1, pages 34-41, Vancouver, Canada, July 2001.
[4] Z. Khan, T. Balch, and F. Dellaert. An MCMC-based particle filter for tracking multiple interacting targets. In European Conf. on Computer Vision (ECCV'04), pages 279-290, Prague, Czech Republic, May 2004.
[5] J. MacCormick and A. Blake. A probabilistic exclusion principle for tracking multiple objects. Int. Journal of Computer Vision, 39(1):57-71, 2000.
[6] G. Monteiro, P. Peixoto, and U. Nunes. Vision-based pedestrian detection using Haar-like features. In 6th National Festival of Robotics, Scientific Meeting (ROBOTICA), 2006.
[7] P. Pérez, J. Vermaak, and A. Blake. Data fusion for visual tracking with particles. Proceedings of the IEEE, 92(3):495-513, 2004.
[8] W. Qu, D. Schonfeld, and M. Mohamed. Distributed Bayesian multiple-target tracking in crowded environments using multiple collaborative cameras. EURASIP Journal on Advances in Signal Processing, 2007.
[9] K. Smith, D. Gatica-Perez, and J. Odobez. Using particles to track varying numbers of interacting people. In Computer Vision and Pattern Recognition (CVPR'05), pages 962-969, San Diego, USA, June 2005.
[10] C. Stauffer and W. Grimson. Adaptive background mixture models for real-time tracking. In Computer Vision and Pattern Recognition (CVPR'99), volume 2, pages 22-46, Fort Collins, USA, June 1999.
[11] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition (CVPR'01), Hawaii, USA, December 2001.
[12] T. Yu and Y. Wu. Collaborative tracking of multiple targets. In Computer Vision and Pattern Recognition (CVPR'04), pages 834-841, Washington, USA, June 2004.
[13] T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. In Computer Vision and Pattern Recognition (CVPR'04), Washington, USA, June 2004.