Int J Comput Vis DOI 10.1007/s11263-010-0373-3
Multi-Camera Tracking with Adaptive Resource Allocation Bohyung Han · Seong-Wook Joo · Larry S. Davis
Received: 19 July 2009 / Accepted: 1 July 2010 © Springer Science+Business Media, LLC 2010
Abstract Sensor fusion for object tracking is attractive since the integration of multiple sensors and/or algorithms with different characteristics can improve performance. However, there exist several critical limitations to sensor fusion techniques: (1) the measurement cost typically increases in proportion to the number of sensors, (2) it is not straightforward to measure the confidence of each source and give it a proper weight for state estimation, and (3) there is no principled dynamic resource allocation algorithm for better performance and efficiency. We describe a method to fuse information from multiple sensors and estimate the current tracker state by using a mixture of sequential Bayesian filters (e.g., particle filters), one filter for each sensor, where each filter makes a different level of contribution to estimate the combined posterior in a reliable manner. In this framework, multiple sensors interact to determine an appropriate sensor for each particle dynamically; each particle is allocated to only one of the sensors for measurement, and a different number of particles is assigned to each sensor. The level of the contribution of each sensor changes dynamically based on its prior information and relative measurement confidence. We apply this technique to visual tracking with multiple cameras, and demonstrate its effectiveness through tracking results in videos.

Keywords Object tracking · Resource allocation · Multi-camera tracking · Sensor fusion · Kernel-based Bayesian filtering · Mixture model

B. Han: Dept. of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), Pohang, Korea. e-mail: [email protected]
S.-W. Joo: Google Inc., Mountain View, CA, USA. e-mail: [email protected]
L.S. Davis: Dept. of Computer Science, University of Maryland, College Park, MD, USA. e-mail: [email protected]
1 Introduction

Recent advances in video technology and the reduction of sensor prices enable many computer vision systems, such as visual surveillance, video conferencing, and virtual reality, to employ multiple sensors for the development of new functions and the improvement of system performance. For tracking, the integration of multiple sensors and/or tracking algorithms has the potential advantage of fusing complementary characteristics of different sensors and algorithms. However, the integration process is not straightforward in general, and requires additional cost for measurement and computation. Above all, it is not clear how to allocate finite resources to each sensor (or algorithm) and how much each source should be relied on for final state estimation, especially in large-scale systems.

There are various kinds of sensor fusion algorithms for visual tracking. Fusion in the measurement step is the most typical method, where a single posterior is obtained by integrating multiple cues. Some tracking algorithms with sensor fusion are based on ad-hoc merge processes. For example, edge and color features are integrated to track elliptical objects in Birchfield (1998). Multiple cues—motion, color, shape, etc.—are integrated heuristically to overcome the limitations of the individual modalities in Spengler and Schiele (2003), Triesch and von der Malsburg (2001). Simple sequential Bayesian
filtering is utilized for fusion-based tracking (Azoz et al. 1998), where color, motion and shape features are combined using a variation of the Extended Kalman Filter (EKF). An effective tool for the fusion-based tracking task is the particle filter, where a pre-defined number of particles (samples) are drawn and the final likelihood of each sample is typically computed based on information obtained from all sensors. Yang et al. (2005) combine color rectangle and edge features to improve observation quality, and the fusion of video and audio sensors for object tracking is described in Rui and Chen (2001), Vermaak et al. (2001). Isard and Blake (1998) integrated skin color detection to obtain a better proposal distribution for contour tracking. In Perez et al. (2004), generic importance sampling mechanisms for data fusion are introduced, and Chen and Rui (2004) propose a combination of top-down and bottom-up approaches to fuse multiple sensing modalities such as color, sound, and contour.

Even though particle filters have been usefully applied to tracking with sensor fusion, their implementations have been mostly limited to
– combining measurements from multiple sensors using the simple product of likelihoods obtained from the same sample, and/or
– allocating a fixed number of particles to each sensor regardless of its reliability.
More robust observations would be expected from such integration of multiple sensors, but the cost of the observation also increases in proportion to the number of sensors. Moreover, assigning a fixed number of particles to each sensor, regardless of the reliability of the sensors, leads to a potential waste or shortage of samples. This problem is aggravated in large-scale systems, where many sensors are involved, so an intelligent resource allocation algorithm is an important issue. Also, the blind integration of multiple sensors may corrupt the entire observation if some non-discriminative and/or noisy sensors are involved in the measurement process.

More sophisticated inference techniques based on graphical models have also been proposed. The co-inference of tracker state based on shape and color is optimized jointly in Wu and Huang (2001). Cue dependency is defined with a graphical model and cue integration is performed by Bayesian inference in Sherrah and Gong (2001), Zhong et al. (2006). Sherrah and Gong (2001) proposed heuristics to estimate the reliability of each modality based on the relations between modalities. However, the relations are subjective and hard to generalize, and there is no discussion of their performance. The graphical model employed in Zhong et al. (2006) might not be usable in systems with many sensors due to its complexity.

Another class of sensor fusion methods operates at the algorithm level, where trackers run independently and there
is a post-processing step to merge the results for the estimation of the target state. People tracking results produced by multiple algorithms are merged by a heuristic in Siebel and Maybank (2002), and feature motions observed independently are combined by classification between inliers and outliers and cross-validation between trackers in Mccane et al. (2002). Also, Leichter et al. (2006) proposed a technique for fusing multiple tracking algorithms within a Bayesian framework.

Recently, the mixture particle filter has been proposed to maintain multi-modality in particle filters by modeling the posterior distribution as a non-parametric mixture model (Vermaak et al. 2003). This technique has been successfully applied to multi-object tracking in a single camera setting (Okuma et al. 2004; Vermaak et al. 2003). The mixture Kalman filter (Chen and Liu 2000) and the Interacting Multiple Model (IMM) methods (Musicki 2008) have similar ideas, but a continuous density function—a Gaussian mixture—is used instead of discrete ones.

Although previous fusion-based tracking algorithms attempted to integrate multiple cues with a heuristic or probabilistic framework, they do not provide good solutions for measuring the reliability of each cue or for utilizing that reliability for target state estimation and resource allocation. To address these problems, we propose the mixture kernel-based Bayesian filter, where a mixture of the posteriors is propagated in a sequential Bayesian framework and an appropriate sensor is selected probabilistically for measurement. The mixture kernel-based Bayesian filtering (mKBF) was first proposed for sensor fusion in Han et al. (2007), and this paper is based on a new analysis of the update step for the fusion process and resource allocation. The important features of our technique are described below.
– Our method is close to algorithm-level fusion, as the fusion process is performed in the update step. The combined posterior is constructed as a mixture of individual posteriors, which are continuous functions—mixtures of Gaussians.
– The individual posteriors contribute different weights to the combined posterior in proportion to the prior and measurement confidence. By adopting a weighted mixture model for the posterior instead of a single probability density function, the posterior estimation is more accurate; it is effective for representing multi-modal densities and gives more weight to reliable sensors for robust state estimation. Therefore, tracker performance can be improved, especially in the presence of clutter and occlusion.
– There is significant interaction among sensors in the measurement step, and not all sensors are necessarily used for the measurement of each sample. Instead, the sensor for which the actual observation is made is determined probabilistically based on the expected likelihood
for each sample. The proposal distribution is constructed from prior knowledge, as well as partial observations from each sensor. The sensor selection provides a framework to allocate an adaptive number of particles to each sensor based on its reliability.¹ This cannot easily be done in conventional particle filters based on discrete distributions, since the probability for an arbitrary location in the state space is not available, so the expected likelihoods cannot be obtained; it is possible in kernel-based Bayesian filtering, where all the relevant density functions are represented with a mixture of Gaussians.

¹ Sample depletion in a sensor does not happen since a minimum number of particles is always allocated to each sensor, and the sensor reliability can be obtained effectively with that minimum number of particles in our framework.

This approach is applied to visual tracking with multiple cameras. Also, tracking in the presence of sensor failures is evaluated; one of the cameras sometimes provides completely noisy signals, which is handled by dynamic sensor weighting and adaptive particle allocation within the mixture kernel-based Bayesian filtering framework.

The rest of this paper is organized as follows. In Sect. 2, kernel-based Bayesian filtering (Han et al. 2009) is reviewed, and our sensor fusion technique is described in Sect. 3. The application to visual tracking and its performance are demonstrated in Sect. 4.
2 Background

In this section, we provide a brief summary of Kernel-Based Bayesian Filtering (KBF), introduced in Han et al. (2009). Kernel-based Bayesian filtering is a state estimation technique that propagates a Gaussian mixture density function in the sequential Bayesian filtering framework. The main difference between this technique and the (extended) Kalman filter is that the posterior is no longer Gaussian, and the multi-modality in the state estimation can be effectively modeled using a Gaussian mixture density function.

2.1 Overview

The state variable x_t (t = 0, ..., T) in sequential Bayesian filtering is characterized by its probability density function conditioned on the history of measurements z_t (t = 1, ..., T). The conditional density function of the state variable given the measurements is propagated through prediction and update stages in a Bayesian framework:

p(x_t | z_{1:t-1}) = \int p(x_t | x_{t-1}) p(x_{t-1} | z_{1:t-1}) dx_{t-1},   (1)

p(x_t | z_{1:t}) = \frac{1}{C} p(z_t | x_t) p(x_t | z_{1:t-1}),   (2)

where C = \int p(z_t | x_t) p(x_t | z_{1:t-1}) dx_t is a normalization constant independent of x_t. The posterior probability at time step t, p(x_t | z_{1:t}), is used as a prior in the next step. When the prior is represented as a weighted mixture of Gaussians, the same mixture representation is propagated to the next time step through the prediction and update steps.

2.2 Kernel-Based Bayesian Filtering

Denote by x_t^i (i = 1, ..., n_t) a set of mean vectors in R^d and by P_t^i the corresponding covariance matrices at time step t. Let each Gaussian have a weight \omega_t^i with \sum_{i=1}^{n_t} \omega_t^i = 1, and let the prior density function be given by

p(x_{t-1} | z_{1:t-1}) = \sum_{i=1}^{n_{t-1}} \omega_{t-1}^i N(x_{t-1}^i, P_{t-1}^i),   (3)

where N(m, \Sigma) represents a normal distribution with mean m and covariance \Sigma. In the prediction step, the Unscented Transformation (UT) (Julier and Uhlmann 1997; van der Merwe et al. 2000) is applied to each mode in the prior so that non-linear process models can be handled. Using the UT, the prior is transformed to another mixture of Gaussians as follows:

p(x_t | z_{1:t-1}) = \sum_{i=1}^{n_{t-1}} \hat{\omega}_t^i N(\hat{x}_t^i, \hat{P}_t^i),   (4)

where \hat{\omega}_t^i = \omega_{t-1}^i, and \hat{x}_t^i and \hat{P}_t^i are the transformed mean and covariance, respectively. This non-linear transformation is accurate up to second order.

Density interpolation based on the Non-Negative Least Squares (NNLS) method (Adlers 2000; Cantarella and Piatek 2004; Lawson and Hanson 1974) is incorporated to parametrize the measurement density with a Gaussian mixture, and the measurement function with m_t Gaussians at time t is given by

p(z_t | x_t) = \sum_{i=1}^{m_t} \tau_t^i N(x_t^i, R_t^i),   (5)

where \tau_t^i is the unnormalized weight from the solution of the NNLS problem and R_t^i is the covariance associated with the mean x_t^i (i = 1, ..., m_t).

In the update step, the posterior is obtained from the products of the Gaussian pairs between the prediction and measurement densities ((4) and (5)). Even though the derived density function is also a weighted Gaussian mixture, the exponential increase of Gaussian components in the mixture during the propagation makes the whole procedure intractable. In order to avoid this problem, the kernel density approximation technique (Han et al. 2008) is applied. It allows us to maintain a compact and accurate density representation even after
many stages of density propagation. Alternative methods for reducing the number of components in a Gaussian mixture can be found in Salmond (1990), Williams and Maybeck (2003). After the update step, the final posterior distribution is given by

p(x_t | z_{1:t}) = \sum_{i=1}^{n_t} \omega_t^i N(x_t^i, P_t^i),   (6)

where n_t is the number of modes at time step t and the sum of the \omega_t^i is equal to 1.

2.3 Discussion of Kernel-Based Bayesian Filtering

Kernel-based Bayesian filtering has an advantage over conventional particle filters based on discrete density functions. It is generally known that a continuous proposal distribution can improve the quality of sampling (Doucet et al. 2001), so a filtering algorithm based on continuous density functions naturally ameliorates degeneracy and the loss-of-diversity problem. The kernel-based Bayesian filter achieves accuracy equivalent to conventional particle filters with a smaller number of samples (Han et al. 2009).

There is another important characteristic of kernel-based Bayesian filtering. Unlike particle filters based on discrete probability density functions, the probability at an arbitrary location in the state space can be computed analytically in the kernel-based Bayesian filter. This property plays an important role in computing the expected likelihood of each sample before the "real" observation, and will be utilized in our sensor fusion framework. In the next section, we explain how the kernel-based Bayesian filtering framework is employed for sensor fusion.
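Before moving on, a minimal sketch of this property may be helpful. The example below (an illustrative sketch rather than the authors' implementation; the class name and all numerical values are ours) represents a density as a weighted Gaussian mixture and evaluates it in closed form at an arbitrary state, which is exactly what a discrete particle set cannot provide.

```python
import numpy as np
from scipy.stats import multivariate_normal

class GaussianMixture:
    """Weighted Gaussian mixture p(x) = sum_i w_i N(x; m_i, P_i)."""
    def __init__(self, weights, means, covs):
        self.w = np.asarray(weights, dtype=float)
        self.w /= self.w.sum()                       # normalize mixture weights
        self.m = [np.asarray(m, dtype=float) for m in means]
        self.P = [np.asarray(P, dtype=float) for P in covs]

    def pdf(self, x):
        """Density at an arbitrary state x, available in closed form."""
        return sum(wi * multivariate_normal.pdf(x, mean=mi, cov=Pi)
                   for wi, mi, Pi in zip(self.w, self.m, self.P))

    def sample(self, n, rng=None):
        """Draw n samples: pick a component, then sample from it."""
        rng = np.random.default_rng() if rng is None else rng
        idx = rng.choice(len(self.w), size=n, p=self.w)
        return np.array([rng.multivariate_normal(self.m[i], self.P[i]) for i in idx])

# A bimodal posterior over a 2D state (x, y) on the ground plane.
posterior = GaussianMixture(weights=[0.7, 0.3],
                            means=[[1.0, 2.0], [4.0, -1.0]],
                            covs=[np.eye(2) * 0.5, np.eye(2) * 0.8])
print(posterior.pdf([1.2, 1.8]))   # density at an arbitrary location
```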
3 Fusion Tracking by Mixture KBF

Suppose that we have K sensors and want to fuse data from those sensors. If the mixture weights of the sensors are given by \pi_{t-1}^k (k = 1, ..., K) at time t-1, the posterior at time step t-1—also the prior at time t—is defined as

p(x_{t-1} | z_{1:t-1}) = \sum_{k=1}^{K} \pi_{t-1}^k p_k(x_{t-1} | z_{1:t-1}),   (7)

where p_k(x_{t-1} | z_{1:t-1}) is the posterior of an individual sensor at time t-1, which is a mixture of Gaussians.

Mixture models for density propagation in the sequential Bayesian filtering framework have been previously used to handle multi-modality. Interacting Multiple Model (IMM) filters are employed to model multiple dynamics (Blom and Bar-Shalom 1988; Jia et al. 2008; Mazor et al. 1998; Musicki 2008; Musicki and La Scala 2008) or multiple measurements (Hoffmann and Dang 2009) effectively. Also, the mixture particle filter shows superior performance in preserving and tracking multiple modes in the posterior density function (Vermaak et al. 2003). Explicit mixture modeling is often helpful in practice due to its capability of modeling the multi-modal characteristics observed in multi-sensor tracking.

In our framework, each mixture component in (7), p_k(x_{t-1} | z_{1:t-1}), represents an independent dynamic system² that has a separate measurement model. Note that the combined posterior induced by the weighted sum is also a reasonable alternative for estimating the target state with multiple sensors, as in Blom and Bar-Shalom (1988), Hoffmann and Dang (2009), Jia et al. (2008), Mazor et al. (1998), Musicki (2008), Musicki and La Scala (2008), although it is not the standard way of combining the measurements from multiple sensors in a Bayesian manner—the product of likelihood densities. Our purpose is to preserve the mixture representation through the iterations of sequential Bayesian filtering. The overall procedure for an individual Bayesian filter is similar to the description in Sect. 2, and we next explain how to combine the information from multiple sensors and how sensors interact with each other.

² It is not completely independent since there are significant interactions in the sampling and measurement steps, but the posterior density function is propagated independently.

3.1 Prediction Step and Proposal Distribution

We make a prediction for each individual Bayesian filter independently using the unscented transformation as described in Sect. 2.2, and the predicted density function is then given by

p(x_t | z_{1:t-1}) = \sum_{k=1}^{K} \pi_{t-1}^k \int p_k(x_t | x_{t-1}) p_k(x_{t-1} | z_{1:t-1}) dx_{t-1}
                  = \sum_{k=1}^{K} \pi_{t-1}^k p_k(x_t | z_{1:t-1}).   (8)

Since our method selects a sensor for observation probabilistically, the proposal distribution is very important to overall performance. In the particle filter framework, there are several techniques to improve the proposal distribution, such as the use of an auxiliary tracker with different features (Isard and Blake 1998), the unscented particle filter (van der Merwe et al. 2000; Rui and Chen 2001), and multi-stage sampling (Han et al. 2009; Okuma et al. 2004). We combine the prior and the partial observation distribution from each individual filter through a 2-stage sampling scheme to construct the proposal distribution, which improves the effectiveness of the particles. Since the posterior in
the previous step in (7) is based on combining information from all of the sensors, it should be more reliable than the individual posteriors. So, the initial proposal distribution, denoted by q^1(x_t | x_{t-1}, z_{1:t}), is common for every sensor and is equal to the predicted distribution in (8):

q^1(x_t | x_{t-1}, z_{1:t}) = p(x_t | z_{1:t-1}).   (9)

In the second stage, the proposal distribution for each sensor, q_k^2(x_t | x_{t-1}, z_{1:t}), is determined by the combination of the initial proposal distribution and the partial observations from each sensor as follows:

q_k^2(x_t | x_{t-1}, z_{1:t}) = (1 - \alpha) q^1(x_t | x_{t-1}, z_{1:t}) + \alpha p_k^1(z_t | x_t),   (10)

where p_k^1(z_t | x_t) is the initial normalized measurement density and \alpha is a constant in [0, 1]. The combined proposal distribution is given by

q^2(x_t | x_{t-1}, z_{1:t}) = \sum_{k=1}^{K} \pi_{t-1}^k q_k^2(x_t | x_{t-1}, z_{1:t})
                           = \sum_{k=1}^{K} \pi_{t-1}^k ((1 - \alpha) q^1(x_t | x_{t-1}, z_{1:t}) + \alpha p_k^1(z_t | x_t)).   (11)

This 2-stage sampling strategy improves the sampling quality and can reduce the number of samples required, since the proposal distribution combines the priors of all sensors and the partial observations in the current step.

3.2 Measurement Step

The measurement step is also composed of two stages, in accordance with the 2-stage sampling. The main purpose of the 2-stage sampling is to improve the proposal distribution in a progressive manner. By assigning a fixed number of particles to each sensor in the first stage, the degeneracy problem—the situation in which no particle is drawn from one or more sensors and the measurement density does not become available—can be avoided. This situation may happen when only a couple of sensors dominate the posterior due to their strong measurement likelihoods and the rest of the sensors have negligible mixture weights.

In the first stage, the samples are drawn from the common proposal distribution q^1(x_t | x_{t-1}, z_{1:t}), and the observations are made in all sensors at the same locations in the state space. Then, the initial observation resulting in each sensor, p_k^1(z_t | x_t) (k = 1, ..., K), is reflected in the proposal distribution for the next stage, as shown in (11), from which samples are drawn. In the second stage, each sample is assigned to only one sensor, probabilistically, for observation by considering the prior and the likelihood expectation. The probability that the k-th sensor is selected is given by

p(sel(i) = k) = \frac{\pi_{t-1}^k (\beta p_k(x_t^{(i)} | z_{1:t-1}) + (1 - \beta) p_k^1(z_t | x_t^{(i)}))}{\sum_{j=1}^{K} \pi_{t-1}^j (\beta p_j(x_t^{(i)} | z_{1:t-1}) + (1 - \beta) p_j^1(z_t | x_t^{(i)}))},   (12)

where sel(i) is the selected sensor number for the i-th sample, \beta is a constant in [0, 1], p_k(x_t^{(i)} | z_{1:t-1}) is the prediction probability of the i-th sample in the k-th sensor, and p_k^1(z_t | x_t^{(i)}) is the likelihood of the i-th sample given the initial measurement density. The sensor selection for the i-th sample is given by

sel(i) = \arg\min_s \left\{ \sum_{k=1}^{s} p(sel(i) = k) > r_i \right\},   (13)

where r_i is a random number from a uniform distribution on [0, 1). This procedure is similar to the E-step of the EM algorithm, and cannot easily be done in conventional particle filters based on discrete density functions, since it is difficult to obtain probabilities at arbitrary locations. The sensor expected to produce the highest likelihood is prioritized for observation, and is given more samples to improve the robustness of the measurement density. The sampling and measurement procedure in the second stage is illustrated in Fig. 1.

The multi-stage measurement performed in each individual filter is identical to that of the kernel-based Bayesian filter (Han et al. 2009), where a density interpolation technique based on the non-negative least squares method is used to obtain measurement density functions. The un-normalized measurement density function of the k-th sensor at time step t, \tilde{p}_k(z_t | x_t), is given by

\tilde{p}_k(z_t | x_t) = \sum_{i=1}^{m_{t,k}} \kappa_{t,k}^i N(x_{t,k}^i, R_{t,k}^i),   (14)

where m_{t,k} is the number of components, \kappa_{t,k}^i is the un-normalized weight of each Gaussian component, and x_{t,k}^i and R_{t,k}^i are the mean and covariance in the k-th measurement density, respectively.
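A minimal sketch of the second-stage sensor selection in (12) and (13) is given below; it assumes each sensor's predicted density p_k(x_t | z_{1:t-1}) and first-stage measurement density p_k^1(z_t | x_t) are available as objects with a pdf() method (for instance, Gaussian mixtures as in Sect. 2). The function and variable names are our own, not the authors' implementation.

```python
import numpy as np

def select_sensor(x_i, mix_weights, predicted, initial_meas, beta, rng):
    """Pick the sensor that observes sample x_i, following (12)-(13).

    mix_weights  : previous mixture weights pi_{t-1}^k, one per sensor
    predicted    : list of K densities p_k(x_t | z_{1:t-1}) with a .pdf() method
    initial_meas : list of K normalized first-stage densities p_k^1(z_t | x_t)
    beta         : constant in [0, 1] balancing prior and expected likelihood
    """
    scores = np.array([
        pi_k * (beta * pred.pdf(x_i) + (1.0 - beta) * meas.pdf(x_i))
        for pi_k, pred, meas in zip(mix_weights, predicted, initial_meas)
    ])
    probs = scores / scores.sum()             # p(sel(i) = k) in (12)
    r = rng.uniform(0.0, 1.0)                 # r_i ~ U[0, 1)
    # smallest s whose cumulative probability exceeds r, as in (13)
    return int(np.searchsorted(np.cumsum(probs), r, side='right'))

# Second-stage allocation: each drawn sample is measured by exactly one sensor,
# so reliable sensors end up with more particles on average, e.g.
# counts = np.bincount([select_sensor(x, pi, preds, meas1, 0.5, rng)
#                       for x in samples], minlength=len(pi))
```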
Fig. 1 An example of the sampling and measurement procedure in the second stage. The proposal distribution q_k^2(x_t | x_{t-1}, z_{1:t}) is constructed from the prior and the partial measurement density function of the k-th sensor, and q^2(x_t | x_{t-1}, z_{1:t}) is the mixture of the q_k^2 (k = 1, 2). (Top) The samples such that p(sel(i) = 1) ≥ p(sel(i) = 2) are represented with red (shaded) circles, and the rest are represented with blue (hollow) circles. The sensor selection for each sample is performed by (13). (Bottom) Because the sensor selection for each particle is probabilistic, red and blue particles are mixed in each sensor. Based on the measurements of each sensor, the final measurement density functions are constructed by density interpolation.

3.3 Update Step

In the update step, the prior and the measurement information are combined to construct the posterior for each sensor, and the individual posteriors are combined to derive the overall posterior probability density function. Recall (7), where the overall prior is defined as a mixture of normalized individual priors. After one more step of sequential Bayesian filtering, the un-normalized posterior of each sensor models the relative confidence in the target state estimation that originates only from the current time step. Therefore, the sum of the products of the mixture weights at the previous time step and the un-normalized posteriors provides the fusion-based posterior. From this form of the posterior, we need to obtain the same representation as (7). Suppose that \tilde{p}_k(x_t | z_{1:t}) and \tilde{p}_k(z_t | x_t) are the un-normalized posterior and measurement density for the k-th sensor at time step t, respectively; then the overall posterior is given by

p(x_t | z_{1:t}) ≡ \frac{1}{C} \sum_{k=1}^{K} \pi_{t-1}^k \tilde{p}_k(x_t | z_{1:t})
               = \frac{1}{C} \sum_{k=1}^{K} \pi_{t-1}^k \tilde{p}_k(z_t | x_t) p_k(x_t | z_{1:t-1})
               = \frac{1}{C} \sum_{k=1}^{K} \pi_{t-1}^k \psi_t^k p_k(z_t | x_t) p_k(x_t | z_{1:t-1})
               = \frac{1}{C} \sum_{k=1}^{K} \pi_{t-1}^k \psi_t^k \left( \int p_k(z_t | x_t) p_k(x_t | z_{1:t-1}) dx_t \right) \times \frac{p_k(z_t | x_t) p_k(x_t | z_{1:t-1})}{\int p_k(z_t | x_t) p_k(x_t | z_{1:t-1}) dx_t}
               = \frac{1}{C} \sum_{k=1}^{K} \pi_{t-1}^k \psi_t^k \left( \int p_k(z_t | x_t) p_k(x_t | z_{1:t-1}) dx_t \right) p_k(x_t | z_{1:t})
               = \sum_{k=1}^{K} \pi_t^k p_k(x_t | z_{1:t}),   (15)

where

C = \sum_{k=1}^{K} \pi_{t-1}^k \int \tilde{p}_k(x_t | z_{1:t}) dx_t   (16)

is the normalization constant,

\psi_t^k = \int \tilde{p}_k(z_t | x_t) dx_t = \sum_{i=1}^{m_{t,k}} \kappa_{t,k}^i   (17)

is the measurement confidence for each sensor, and

\pi_t^k = \frac{1}{C} \pi_{t-1}^k \psi_t^k \int p_k(z_t | x_t) p_k(x_t | z_{1:t-1}) dx_t   (18)
is the new mixture weight for the k-th component at time t.

Note that the measurement density function \tilde{p}_k(z_t | x_t) and the measurement confidence \psi_t^k are hardly affected by the number of samples, as illustrated in Fig. 2. This is because the measurement density function can be reconstructed from a small number of control points based on the non-negative least squares method, and the measurement density at the unsampled locations can be interpolated algebraically; the extra samples improve the details of the density function, but its overall shape is determined by the small number of samples with high likelihoods. This is a very important property since we allocate a different number of samples to each sensor, and the confidence of the sensor—the integral of its measurement density function—needs to be invariant to the number of samples for accurate weight estimation; otherwise, a sensor with more samples would consistently receive more weight, or we would need to normalize the confidence of each sensor by its number of samples, which is not stable enough according to our simulations.

Fig. 2 Comparison of measurement density functions with different numbers of samples. As illustrated, the measurement density functions obtained by density interpolation based on non-negative least squares are almost invariant to the number of samples. The measurement confidences, \psi, are approximately 1.4 with 100 samples and 1.3 with 25 samples on average for both cases.

The new mixture weight \pi_t^k is proportional to the previous mixture weight, the measurement confidence, and an integration term. The integration term, \int p_k(z_t | x_t) p_k(x_t | z_{1:t-1}) dx_t, models the consistency between the predicted and measurement density functions. Since both density functions are Gaussian mixtures, their product is also a mixture of Gaussians, and the integral is equal to the sum of the weights of the Gaussians in the new mixture. Let p_k(x_t | z_{1:t-1}) = \sum_{i=1}^{n_{t-1}} \omega_{t,k}^i N(x_{t,k}^i, P_{t,k}^i) and p_k(z_t | x_t) = \sum_{j=1}^{m_t} \tau_t^j N(x_{t,k}^j, R_{t,k}^j) denote the predicted and measurement density functions, respectively. Then, the product of the two density functions is given by

\sum_{i=1}^{n_{t-1}} \sum_{j=1}^{m_t} \omega_t^{ij} N(m_t^{ij}, \Sigma_t^{ij}),   (19)

where

\omega_t^{ij} = \omega_{t,k}^i \tau_t^j N(x_{t,k}^i; x_{t,k}^j, P_{t,k}^i + R_{t,k}^j),   (20)
m_t^{ij} = \Sigma_t^{ij} ((P_{t,k}^i)^{-1} x_{t,k}^i + (R_{t,k}^j)^{-1} x_{t,k}^j),   (21)
\Sigma_t^{ij} = ((P_{t,k}^i)^{-1} + (R_{t,k}^j)^{-1})^{-1},   (22)

and N(x; m, \Sigma) denotes the Gaussian density with mean m and covariance \Sigma evaluated at x. Therefore, the integral is given by

\int p_k(z_t | x_t) p_k(x_t | z_{1:t-1}) dx_t = \sum_{i} \sum_{j} \omega_{t,k}^i \tau_t^j N(x_{t,k}^i; x_{t,k}^j, P_{t,k}^i + R_{t,k}^j),   (23)

which will be larger when the two density functions are more similar to each other.
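The measurement confidence (17), the consistency integral (23), and the weight update (18) reduce to simple sums over Gaussian components. The sketch below is a hedged illustration under the assumption that each mixture is stored as a list of (weight, mean, covariance) triples; the function names are ours and this is not the authors' code.

```python
import numpy as np
from scipy.stats import multivariate_normal

def consistency_integral(pred, meas):
    """Integral of p_k(z_t|x_t) p_k(x_t|z_{1:t-1}) dx_t, as in (23).

    pred : list of (omega_i, x_i, P_i) for the predicted mixture
    meas : list of (tau_j,  x_j, R_j) for the measurement mixture
    """
    total = 0.0
    for w_i, m_i, P_i in pred:
        for w_j, m_j, R_j in meas:
            # the product of two Gaussians integrates to N(m_i; m_j, P_i + R_j)
            total += w_i * w_j * multivariate_normal.pdf(m_i, mean=m_j, cov=P_i + R_j)
    return total

def update_mixture_weights(prev_pi, confidences, integrals):
    """New weights pi_t^k from (18): proportional to the previous weight,
    the measurement confidence psi_t^k, and the consistency integral."""
    unnorm = np.asarray(prev_pi) * np.asarray(confidences) * np.asarray(integrals)
    return unnorm / unnorm.sum()   # normalization plays the role of C in (15)-(16)

# confidences[k] is psi_t^k, i.e. the sum of the un-normalized weights kappa in (17).
```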
4 Experiments

The proposed sensor fusion technique is applied to the visual tracking problem with multiple cameras, where the weighted mixture density function is propagated in the framework of mixture KBF.

4.1 Implementation Issues

Multi-camera tracking is advantageous mainly because critical challenges to single-camera trackers, such as occlusions, can be reduced. There has been a significant amount of prior work on tracking using multiple cameras (Black and Ellis 2006; Dockstader and Tekalp 2001; Khan and Shah 2006; Kim and Davis 2006; Mittal and Davis 2003), but few attempts have been made to control the degree of contribution of each of the cameras. Our sensor fusion technique provides a methodology to dynamically adjust the degree of contribution of each camera based on the observation history and to improve performance by adaptive resource allocation. In this section, we describe how to implement information fusion from each camera using our probabilistic framework, and demonstrate tracking results.

We assume that objects are moving on a ground plane and that all cameras have some common field of view of those objects. The common state space is defined as the 2D location (x, y) in the canonical top view, and the state vector is transformed into each view for an observation using the ground plane homography. Even though the cameras are static, no background subtraction information is used for tracking.
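For concreteness, a minimal sketch of mapping the common top-view state into one camera view with a 3 × 3 ground-plane homography is shown below. The matrix values are placeholders for illustration only; the calibration of each homography is outside the scope of the paper.

```python
import numpy as np

def to_camera_view(x_top, H_k):
    """Map a top-view ground-plane point (x, y) into camera k's image plane
    using the 3x3 ground-plane homography H_k."""
    p = H_k @ np.array([x_top[0], x_top[1], 1.0])   # homogeneous coordinates
    return p[:2] / p[2]                             # perspective division

# H_k would be calibrated offline, e.g. from four or more ground-plane correspondences.
H1 = np.array([[1.2, 0.1, 30.0],
               [0.0, 1.5, 12.0],
               [0.0, 0.001, 1.0]])   # placeholder values for illustration
print(to_camera_view((5.0, 3.0), H1))
```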
The process model is a random walk, and the likelihood of each sample is based on the similarity of the RGB color histograms of the target and the candidate. The measure for histogram comparison is the Bhattacharyya distance. The measurement process is performed in each camera independently, so the observation for camera k is given by

p_k(z_t | x_t) = p_k(z_t^k | T_k(x_t)),   (24)

where z_t^k represents the observation data in camera k and T_k(·) denotes the transformation of the common state into the corresponding view.

A non-trivial problem in the measurement with multiple cameras is that the absolute values of the likelihoods are not normalized properly across cameras. Therefore, the measurement confidence \psi_t^k may have significantly different values due to the different characteristics of the cameras, which makes the direct use of likelihood values inappropriate. So, instead of computing distances between target and candidate histograms directly, the likelihood for each sample is obtained by computing the ratio of the candidate-target distance to the candidate-uniform distribution distance. The likelihood of the i-th sample, p_k(z_t | x_t^{(i)}), is defined by

p_k(z_t | x_t^{(i)}) \propto \exp\left(-\lambda \frac{D^2(q, p_i)}{D^2(q, u)}\right),   (25)

where D^2(·, ·) is the squared Bhattacharyya distance between two histograms. Also, q, p, and u are the normalized target, candidate, and uniform histograms, respectively, and \lambda is a constant. The denominator in (25) can be quite different in each camera, especially when the color characteristics of the camera sensors are different. This method is practically effective for normalizing the likelihoods from different cameras. Throughout our experiments, the RGB color histograms are constructed with 16 × 16 × 16 bins and we used consistent parameter values: \alpha = \beta = 0.5 and \lambda = 30. According to our experiments, small changes to these parameter values have negligible effects on the tracking results; the performance is slightly worse with \alpha = \beta = 0.3, but is essentially unaffected over variations of \lambda between 15 and 30.
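The sketch below illustrates the normalized likelihood in (25) for one camera, assuming the 16 × 16 × 16 RGB histograms have already been extracted and normalized to sum to one; the helper names are ours and not part of the original implementation, and the denominator follows the form of (25).

```python
import numpy as np

def bhattacharyya_sq(p, q):
    """Squared Bhattacharyya distance D^2 between two normalized histograms."""
    bc = np.sum(np.sqrt(p * q))          # Bhattacharyya coefficient
    return 1.0 - bc

def sample_likelihood(cand_hist, target_hist, lam=30.0):
    """Likelihood of one sample as in (25), with lambda = 30 as in the experiments.
    The distance is divided by the distance to a uniform histogram so that
    likelihood values remain comparable across cameras."""
    u = np.full_like(target_hist, 1.0 / target_hist.size)   # uniform histogram
    num = bhattacharyya_sq(target_hist, cand_hist)           # D^2(q, p_i)
    den = bhattacharyya_sq(target_hist, u)                   # D^2(q, u), as in (25)
    return np.exp(-lam * num / den)
```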
4.2 Results

We tested our method on a sequence captured by two cameras in which walking people are tracked. The appearance model for a person is constructed from two separate histograms—one for the upper and the other for the lower body—and we compute the joint likelihood.

Figure 3 illustrates the result of tracking two persons using two cameras in an indoor environment.³ In this example, 50 measurements are used—5 samples are given to each camera in the first stage and 40 samples are dynamically allocated to the two cameras in the second stage. Even with frequent occlusion and clutter, the person is successfully tracked throughout the sequence by the adaptive collaboration of the two cameras. Also, the mixture weights and the number of particles assigned to each camera are updated at each frame depending on the visibility and the distinctiveness of the target in each view, which are illustrated in Fig. 4.

³ We measured the height of each person approximately, and used it for the measurement and visualization.

Fig. 3 A tracking example. Results in camera 1 (top) and 2 (bottom) are presented at 3 different time steps. Person 1 and 2—blue and green bounding box, respectively—are successfully tracked in the presence of multiple dynamic occlusions.

Fig. 4 Mixture weights and particle allocations for each person and each sensor in each frame. Blue and red lines represent camera 1 and camera 2, respectively. Note that the mixture weights and particle assignments mostly correspond to the visibility of the targets.

We also compared the tracking performance of our method and a conventional product-of-likelihood fusion algorithm based on the particle filter, which is presented in Fig. 5. The sequence for this experiment is similar to the one used for Fig. 3, but there are many severe dynamic occlusions between two people whose appearances are very similar because both are wearing white T-shirts. The same number of measurements (50 altogether) are used for both methods; in the case of our method, 5 samples are drawn at the first stage of the measurement step, and 40 samples are then dynamically allocated to both cameras. After the first occlusion, both tracking algorithms recovered from short-term failures, but the conventional fusion method based on particle filtering lost the target after the second occlusion. On the other hand, our method succeeded in tracking the target even after the second occlusion. In our method, the mixture weight and the number of observations in camera 2 during the occlusion (around t = 60) are consistently much larger than those in camera 1, as illustrated in Fig. 6. This suggests that tracking by mixture KBF is successful because the utilization of the more reliable sensor (camera 2) is prioritized by the dynamic sample allocation.

In many computer vision systems, it is common that some sensor data is temporarily missing or is totally unreliable due to sensor noise, occlusion, hardware/software errors, etc. Suppose that one of the cameras fails temporarily as in Table 1, which is simulated by replacing the original image with completely noisy signals. The performance of our sensor fusion tracking algorithm, mKBF, is also tested in
the presence of sensor failures and compared with the other tracking algorithms based on sensor fusion as follows:

KBF: Tracking by KBF based on the standard sensor fusion technique (same number and locations of samples in each sensor, product-of-likelihood fusion)
PF: Tracking by Particle Filter (PF) based on the standard sensor fusion technique (same number and locations of samples in each sensor, product-of-likelihood fusion)
mKBFe: Tracking by mKBF without adaptive resource allocation (same number of samples in each sensor, but different locations, sum-of-posterior fusion)

Table 1 Frames with camera failures

Camera    Frames with camera failures
1         581 ∼ 670
2         501 ∼ 575
3         366 ∼ 415

To track people with our algorithm, 10 particles are used at the first stage of measurement and 30 particles are distributed to the 3 cameras, resulting in a total of 60 observations. On the other hand, 20 particles are distributed evenly to each camera in the other three algorithms, so the total number of measurements is the same as for our method.

Figure 7 illustrates the result of multi-object tracking using three cameras in an outdoor scene. In spite of frequent
occlusions amongst the group of people and temporary sensor failures, tracking by mKBF with adaptive resource allocation is successful for the entire 900 frames. The tracking results of the three other algorithms are less stable than those of the proposed technique. The mixture weight for each camera is presented in Fig. 8. As observed, the mixture weight for the failed sensor was set to a negligible number, so that only the minimum number of particles was allocated to failed sensors. Figure 9 illustrates the trajectories of the four people, where numerous occlusions can be observed.

Fig. 5 Comparison between mixture KBF and conventional fusion by particle filter. The results at time t = 18, 54, 67 are presented for each algorithm in (a) and (b), where the first and second rows represent results in camera 1 and camera 2, respectively. Note that the target is lost after the second occlusion around t = 60 in (b).

Fig. 6 Mixture weights and particle assignments for each sensor in each frame. Blue and red lines are for camera 1 and camera 2, respectively. Note that the mixture weight and the number of particles are significantly larger in camera 2 during the second occlusion around t = 60.

We also performed a quantitative comparison of the four different algorithms, where the ground truths are created manually and the error is measured as the Euclidean distance between the ground truth and the tracking results computed in the canonical top-view plane. The quantitative comparison results are illustrated in Fig. 10; they are the average of 10 independent runs for each algorithm. They show that the performance of our method is mostly better than that of the other algorithms with the same number of measurements. The tracking errors using KBF have high variations although the tracking results are sometimes very accurate, and the errors and error variations of PF are consistently higher than those of our method. The performance of mKBFe is close to our algorithm, but it has noticeably higher errors in tracking person 3.

Now, we show the benefit of mKBF by comparing the tracking errors of mKBF with the errors of KBF and PF given more observations; tracking with 60, 90, and 120 measurements is performed 10 times and averaged, where 20, 30, and 40 samples are given to each sensor, respectively. The performance of mKBF with 60 measurements is almost equivalent to that of KBF and PF with 120 measurements, although KBF is slightly better than PF; the error variance of mKBF with 60 measurements is lower than that of KBF and PF with 120 observations by more than 30%. These results are presented in Fig. 11.
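As a side note, the evaluation protocol above amounts to a per-frame Euclidean error in the top-view plane, averaged over independent runs; the minimal sketch below is our own illustration of that computation, not the evaluation code used for the reported numbers.

```python
import numpy as np

def tracking_error(estimates, ground_truth):
    """Per-frame Euclidean error in the canonical top-view plane.

    estimates, ground_truth : arrays of shape (num_frames, 2) holding (x, y).
    Returns the mean error and its variance over the sequence."""
    err = np.linalg.norm(np.asarray(estimates) - np.asarray(ground_truth), axis=1)
    return err.mean(), err.var()

# For the comparisons reported here, these per-sequence errors are additionally
# averaged over 10 independent runs of each algorithm.
```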
Fig. 7 Comparison of people tracking in the presence of temporary sensor failures. (Col1) mKBF, (Col2) KBF, (Col3) PF, (Col4) mKBF with even sample distribution. (Top) camera 1, (middle) camera 2, (bottom) camera 3. The errors are significant for persons 2 (green) and 4 (yellow) at t = 529 and person 3 (magenta) at t = 685. Note that the signal from camera 2 is completely noisy at t = 529, but we present a normal image to show the tracking performance effectively.
5 Conclusion
We presented a probabilistic sensor fusion technique based on mixture kernel-based Bayesian filtering. This framework provides a methodology to select effective sensors for measurements in a probabilistic manner and to maintain the multi-modality of the combined posterior density function. By assigning particles to a sensor based on its reliability, observations become more robust and the effectiveness of the particles is improved. We applied our algorithm to various sensor fusion cases in multi-camera tracking scenarios, and presented tracking results in the presence of severe occlusions, clutter, and sensor failures. Our experiments show that tracking by mKBF is better than other sensor fusion techniques such as KBF, PF, and mKBFe, both qualitatively and quantitatively.
Fig. 8 Mixture weight for each sensor in the people tracking sequence. Blue, red, and green areas denote camera 1, 2, and 3, respectively. The mixture weights change dynamically for various reasons, such as target visibility, appearance changes, and so on.
Fig. 10 Comparison of errors and error variations of the four different sensor fusion algorithms for each person. The labels on the x-axis denote person IDs (the colors of the bounding boxes in Fig. 7).
Fig. 9 Trajectories for four people in space-time
Acknowledgements This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0003496).
Appendix: Logical Sensor Fusion

We also applied the fusion technique to object tracking using multiple logical sensors; the features (sensors) used are (1) color, (2) gradient, (3) template, and (4) contour. For the color and the gradient sensors, the target appearance is modeled using histograms, and the Bhattacharyya distance is used to compute likelihoods. The template sensor measures the mean squared difference of the color pixels in a smoothed image template. Finally, the contour sensor uses the magnitude of the gradients along the normal directions around the perimeter of an ellipse. For each sensor, tracking is performed independently by KBF, but all the sensors interact and compete for dynamic particle allocation as explained in Sect. 3. The object is tracked in a 4D state space consisting of image location (x, y), in-plane rotation, and scale, and a random walk is chosen as the process model.

Figure 12 presents the results of tracking with four different sensors by the mKBF. Our algorithm tracked the target successfully under significant pose variations and severe appearance changes for the entire 500 frames. The number of samples drawn is 90 altogether—10 in the first stage and 80 in the second stage—so the total number of observations in all sensors combined is 120. The mixture weights and sample allocations for each sensor are presented in Fig. 13, where the dynamic changes of the mixture weights and the number of samples can be observed; the mixture weight for the contour feature is significantly high when the back of the woman's head is shown, and the gradient feature has high weights occasionally.

Fig. 11 Comparison of errors between mKBF, KBF, and PF with a varying number of measurements. The error bars for KBF and PF are obtained for the 3 different numbers of measurements, and the error for mKBF with 60 measurements is illustrated with a blue dotted line.

Fig. 12 Tracking results by logical sensor fusion. (Top) Results in the color images. (Bottom) Results in the gradient images. The gradients in the x and y directions are mapped to the R and G spaces in the gradient images, respectively.

Fig. 13 Mixture weights and resource allocation results in each frame. (Blue) Color, (red) gradient, (green) template, (magenta) contour. The mixture weight for the contour feature is significantly high when the back of the woman's head is shown, around t = 90 ∼ 110 and t = 180 ∼ 240.
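As one concrete example of a logical sensor described in this appendix, the sketch below shows how the template sensor's measurement could be computed: a smoothed-template mean squared difference mapped to a likelihood. The smoothing scale and the constant in the exponent are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def template_likelihood(candidate_patch, template, lam=20.0, sigma=1.0):
    """Template sensor: mean squared difference of color pixels between the
    smoothed candidate patch and the stored template, mapped to a likelihood.
    Pixel values are scaled to [0, 1]; lam and sigma are illustrative choices."""
    cand = gaussian_filter(candidate_patch.astype(float) / 255.0,
                           sigma=(sigma, sigma, 0))        # smooth spatially, not across channels
    msd = np.mean((cand - template.astype(float) / 255.0) ** 2)
    return np.exp(-lam * msd)
```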
References

Adlers, M. (2000). Topics in sparse least squares problems. PhD thesis, Linköpings Universitet, Sweden. Available at http://www.math.liu.se/~milun/thesis.
Azoz, Y., Devi, L., & Sharma, R. (1998). Reliable tracking of human arm dynamics by multiple cue integration and constraint fusion. In Proc. IEEE conf. on computer vision and pattern recognition, Santa Barbara, CA.
Birchfield, S. (1998). Elliptical head tracking using intensity gradients and color histograms. In Proc. IEEE conf. on computer vision and pattern recognition, Santa Barbara, CA (pp. 232–237).
Black, J., & Ellis, T. (2006). Multi camera image tracking. Image and Vision Computing Journal, 24(11), 1256–1267.
Blom, H., & Bar-Shalom, Y. (1988). The interacting multiple model algorithm for systems with Markovian switching coefficients. IEEE Transactions on Automatic Control, 33(8), 780–783.
Cantarella, J., & Piatek, M. (2004). tsnnls: A solver for large sparse least squares problems with non-negative variables. Preprint, available at http://www.cs.duq.edu/~piatek/tsnnls/.
Chen, R., & Liu, J. (2000). Mixture Kalman filters. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 62(3), 493–508.
Chen, Y., & Rui, Y. (2004). Real-time speaker tracking using particle filter sensor fusion. Proceedings of the IEEE, 92(3), 485–494.
Dockstader, S. L., & Tekalp, A. M. (2001). Multiple camera tracking of interacting and occluded human motion. Proceedings of the IEEE, 89(10), 1441–1455.
Doucet, A., de Freitas, N., & Gordon, N. (2001). Sequential Monte Carlo methods in practice. Berlin: Springer.
Han, B., Joo, S.-W., & Davis, L. (2007). Probabilistic fusion tracking using mixture kernel-based Bayesian filtering. In Proc. 11th intl. conf. on computer vision, Rio de Janeiro, Brazil.
Han, B., Comaniciu, D., Zhu, Y., & Davis, L. (2008). Sequential kernel density approximation and its application to real-time visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7), 1186–1197.
Han, B., Zhu, Y., Comaniciu, D., & Davis, L. (2009). Visual tracking by continuous density propagation in sequential Bayesian filtering framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 919–930.
Hoffmann, C., & Dang, T. (2009). Cheap joint probabilistic data association filters in an interacting multiple model design. Robotics and Autonomous Systems, 57(3), 268–278.
Isard, M., & Blake, A. (1998). ICONDENSATION: Unifying low-level and high-level tracking in a stochastic framework. In Proc. European conf. on computer vision, Freiburg, Germany (pp. 893–908).
Jia, Z., Balasuriya, A., & Challa, S. (2008). Vision based data fusion for autonomous vehicles target tracking using interacting multiple dynamic models. Computer Vision and Image Understanding, 109(1), 1–21.
Julier, S., & Uhlmann, J. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings SPIE (Vol. 3068, pp. 182–193).
Khan, S., & Shah, M. (2006). A multiview approach to tracking people in crowded scenes using a planar homography constraint. In Proc. European conf. on computer vision, Graz, Austria (Vol. IV, pp. 133–146).
Kim, K., & Davis, L. (2006). Multi-camera tracking and segmentation of occluded people on ground plane using search-guided particle filtering. In Proc. European conf. on computer vision, Graz, Austria (Vol. III, pp. 98–109).
Lawson, C. L., & Hanson, B. J. (1974). Solving least squares problems. New York: Prentice-Hall.
Leichter, I., Lindenbaum, M., & Rivlin, E. (2006). A probabilistic framework for combining tracking algorithms. In Proc. IEEE conf. on computer vision and pattern recognition, Washington DC (pp. 445–451).
Mazor, E., Averbuch, A., Bar-Shalom, Y., & Dayan, J. (1998). Interacting multiple model methods in target tracking: a survey. IEEE Transactions on Aerospace and Electronic Systems, 34(1), 103–123.
Mccane, B., Galvin, B., & Novins, K. (2002). Algorithmic fusion for more robust feature tracking. International Journal of Computer Vision, 49(1), 79–89.
Mittal, A., & Davis, L. S. (2003). M2tracker: a multi-view approach to segmenting and tracking people in a cluttered scene. International Journal of Computer Vision, 51(3), 189–203.
Musicki, D. (2008). Bearings only multi-sensor maneuvering target tracking. Systems and Control Letters, 57(3), 216–221.
Musicki, D., & La Scala, B. L. (2008). Multi-target tracking in clutter without measurement assignment. IEEE Transactions on Aerospace and Electronic Systems, 44(3).
Okuma, K., Taleghani, A., de Freitas, N., Little, J., & Lowe, D. (2004). A boosted particle filter: multitarget detection and tracking. In Proc. European conf. on computer vision, Prague, Czech Republic, May 2004.
Perez, P., Vermaak, J., & Blake, A. (2004). Data fusion for visual tracking with particle filter. Proceedings of the IEEE, 92(3), 495–513.
Rui, Y., & Chen, Y. (2001). Better proposal distributions: object tracking using unscented particle filter. In Proc. IEEE conf. on computer vision and pattern recognition, Kauai, Hawaii (Vol. II, pp. 786–793).
Salmond, D. (1990). Mixture reduction algorithms for target tracking in clutter. In SPIE signal and data processing of small targets, Orlando, FL (Vol. 1305, pp. 434–445).
Sherrah, J., & Gong, S. (2001). Continuous global evidence-based Bayesian modality fusion for simultaneous tracking of multiple objects. In Proc. 8th intl. conf. on computer vision, Vancouver, Canada.
Siebel, N. T., & Maybank, S. J. (2002). Fusion of multiple tracking algorithms for robust people tracking. In Proc. European conf. on computer vision, Copenhagen, Denmark (Vol. IV, pp. 373–387).
Spengler, M., & Schiele, B. (2003). Towards robust multi-cue integration for visual tracking. Machine Vision and Applications, 14(1), 50–58.
Triesch, J., & von der Malsburg, C. (2001). Democratic integration: self-organized integration of adaptive cues. Neural Computation, 13(9), 2049–2074.
van der Merwe, R., Doucet, A., Freitas, N., & Wan, E. (2000). The unscented particle filter (Technical Report CUED/F-INFENG/TR 380). Cambridge University Engineering Department.
Vermaak, J., Gangnet, M., Blake, A., & Perez, P. (2001). Sequential Monte Carlo fusion of sound and vision for speaker tracking. In Proc. 8th intl. conf. on computer vision, Vancouver, Canada (Vol. I, pp. 741–746).
Vermaak, J., Doucet, A., & Perez, P. (2003). Maintaining multimodality through mixture tracking. In Proc. 9th intl. conf. on computer vision, Nice, France (Vol. II).
Williams, J., & Maybeck, P. (2003). Cost-function-based Gaussian mixture reduction for target tracking. In Int. conf. information fusion (Vol. 2, pp. 1047–1054).
Wu, Y., & Huang, T. (2001). A co-inference approach to robust visual tracking. In Proc. 8th intl. conf. on computer vision, Vancouver, Canada (Vol. II, pp. 26–33).
Yang, C., Duraiswami, R., & Davis, L. (2005). Fast multiple object tracking via a hierarchical particle filter. In Proc. 10th intl. conf. on computer vision, Beijing, China (Vol. I, pp. 212–219).
Zhong, X., Xue, J., & Zheng, N. (2006). Graphical model based cue integration strategy for head tracking. In Proc. British machine vision conference, Edinburgh, UK.