Adaptive Multiple Object Tracking Using Colour and Segmentation Cues
Pankaj Kumar, Michael J. Brooks, and Anthony Dick
University of Adelaide, School of Computer Science, South Australia 5005
[email protected],
[email protected],
[email protected]
Abstract. We consider the problem of reliably tracking multiple objects in video, such as people moving through a shopping mall or airport. In order to mitigate difficulties arising as a result of object occlusions, mergers and changes in appearance, we adopt an integrative approach in which multiple cues are exploited. Object tracking is formulated as a Bayesian parameter estimation problem. The object model used in computing the likelihood function is incrementally updated. Key to the approach is the use of a background subtraction process to deliver foreground segmentations. This enables the object colour model to be constructed using weights derived from a distance transform operating over foreground regions. Results from foreground segmentation are also used to gain improved localisation of the object within a particle filter framework. We demonstrate the effectiveness of the approach by tracking multiple objects through videos obtained from the CAVIAR dataset.
1 Introduction
Reliably tracking multiple objects in video remains a highly challenging and unsolved problem. If, for example, we aim to track several people in an airport or shopping mall, we face difficulties associated with appearance and scale changes as each person moves around. Compounding this are occlusion problems that can arise when people meet or pass by each other. This paper is concerned with improving the reliability of multiple object tracking in surveillance video. Visual tracking of multiple objects is formulated in this work as a parameter estimation problem. Parameters describing the state of the object are estimated using a Bayesian technique in which the constraints of Gaussianity and linearity do not apply. In Bayesian estimation, the posterior probability density function (pdf) p(X_t | Z^T) of the state vector X_t, given a set of observations Z^T obtained from the camera, is computed at every step as new observations become available. Many tracking algorithms with a fixed object model have already been designed [1], [2]. However, trackers with a fixed object model are typically unable to track objects for long because of changes in lighting conditions, pose, scale and viewpoint, and also due to camera noise.
One way of improving object tracking has been to update the object model with the observation data. Nummiaro et al. [3] developed an adaptive particle filter tracker, updating the object model by taking a weighted average of the current and a new histogram of the object. Zhou et al. [4] adapted the appearance model, motion model, noise variance and number of particles during tracking. Ross et al. [5] proposed an adaptive probabilistic real-time tracker that updates the model through an incremental update of a so-called eigenbasis. Another way to improve the tracking of an object in video is to use multiple cues such as colour, texture, motion and shape. Brasnett et al. [6] integrated colour and texture cues in a particle filter framework for tracking an object. Wu and Huang [7] investigated the relationship amongst different modalities for robust visual tracking and identified efficient ways to facilitate tracking with simultaneous use of different modalities. Spengler and Schiele [8] integrated skin colour and intensity change cues using CONDENSATION [2] for tracking multiple human faces. Perez et al. [9] proposed a multiple cue tracker for tracking objects in front of a webcam. They introduced a generic importance sampling mechanism for data fusion and applied it to fuse various subsets of colour, motion, and stereo sound for tele-conferencing and surveillance using fixed cameras. Appearance update is not factored into their approach. Shao et al. [10] improved a multiple cue particle filter by using a motion model comprising background and foreground motion parameters. Collins et al. [11] and Han and Davis [12] presented methods for online selection of the most discriminative features for tracking objects. In these methods multiple feature spaces are evaluated and adjusted while tracking, in order to improve tracking performance. The hypothesis is that the features that best discriminate between object and background are also best for tracking the object. In this paper we utilise multiple cues and object model adaptation to achieve improved robustness and accuracy in tracking. We make use of two object description cues: a colour histogram capturing appearance, and spatial dimensions (location and size) obtained from background-foreground segmentation. Object model adaptation is implemented via an autoregressive update using the region of the current frame in which the mode of the particles of the object's state vector lies.
2 Proposed Scheme
A particle filter is a special case of a Bayesian estimation process (see [13] for a tutorial on particle filters for real-time, nonlinear, non-Gaussian Bayesian tracking). The key idea of a particle filter is to approximate the probability distribution of the state X_t of the object with a set of N_s particles/hypotheses and associated weights,

\{X_t^i, w_t^i\}_{i=1}^{N_s}.    (1)

Each particle is a hypothetical state of the object, and the weight/belief for each hypothesis is computed using a likelihood function.
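As a concrete (hypothetical) illustration, the particle set of eq. (1) can be held as a small array structure. The sketch below is not from the paper: the class name ParticleSet, the four-dimensional state layout and the Gaussian spread used for initialisation are illustrative assumptions.

import numpy as np

class ParticleSet:
    """Minimal sketch of the particle representation {X_t^i, w_t^i}, i = 1..N_s.

    Each row of `states` is one hypothesis X_t^i; `weights` holds w_t^i.
    """

    def __init__(self, initial_state, n_particles=20, spread=5.0, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        # Scatter hypotheses around the initial state (assumed Gaussian spread).
        self.states = initial_state + spread * rng.standard_normal(
            (n_particles, len(initial_state)))
        # Start with uniform weights; they are recomputed from the likelihood each frame.
        self.weights = np.full(n_particles, 1.0 / n_particles)

    def normalise(self):
        total = self.weights.sum()
        if total > 0:
            self.weights /= total

# Example: 20 particles around a hypothetical state [x_c, y_c, W, H].
# particles = ParticleSet(np.array([120.0, 80.0, 30.0, 60.0]), n_particles=20)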
Particle filter based tracking algorithms have four main components, namely: object representation, observation representation, hypothesis generation and hypothesis evaluation. This paper proposes improvements in (a) the object and observation representation, using information obtained from background-foreground segmentation, and (b) the hypothesis evaluation methodology. Background-foreground segmentation is a well-developed technology and many real-time tracking systems use it to detect moving objects. Algorithms exist to detect moving foreground objects even when the camera is gradually moving [14]. Figure 1 presents a schematic of the approach taken in this paper. The image frame obtained from the video stream is processed by background subtraction using the method presented in [15]. Each foreground blob is measured as a rectangular region, specified by centroid, width and height. A data association and merge-split analysis is carried out between the objects and measurements using the method presented in [16]. A distance transform [17], [18] is applied to the foreground segmentation result. The foreground pixel intensity obtained from the distance transform is used to weight the pixel's contribution when building the object's histogram model. Our contention is that this gives a better object and candidate representation than that obtained using other kernel functions. The hypothesis of the object's state is also evaluated using the measurement of an object obtained from the foreground segmentation process. The beliefs from the two hypothesis evaluation processes are combined to compute the weights of the particles. The mode of the particles is then evaluated, and the state at the mode is used to update the object model in an auto-regressive formulation. The object model update is suspended for objects which have undergone a merge.
2.1 Hypothesis Generation
An object state is given by X_t = [x_c, y_c, W, H]^T, where x_c, y_c are the co-ordinates of the centroid of the object and W, H are the width and height of the object in the image frame. The hypothesis generation process is also known as the prediction step, and is denoted p(X_{t+1} | X_t). New particles are generated using a proposal function q(.), called an importance density, and the object dynamics. Using the predicted particles and the hypothesis evaluation from the observation, the posterior probability distribution of the object state is computed. We use a random walk for the object dynamics for the following reasons:
1. Using constant-velocity or constant-acceleration object dynamics instead would increase the dimensionality of the state space, which in turn exponentially increases the number of particles needed to track the object with similar accuracy.
2. In real-life situations, especially with humans walking and interacting with other objects in the scene, it is very difficult to know the object dynamics beforehand. Different people have different dynamics, and human motions and their interactions are relatively unpredictable.
Fig. 1. This schematic highlights the flow of information in the proposed multi-cue, adaptive object model tracking method
The particles are predicted using the update

X_{t+1} = X_t + v_t,    (2)

where v_t is independent, identically distributed, zero-mean Gaussian noise. The importance density is chosen to be the prior:

q(X_{t+1} \mid X_t^i, Z_{t+1}) = p(X_{t+1} \mid X_t^i).    (3)

The result of using this importance density is that, after resampling, the particles of the current instance are used to generate the particles for the next iteration.
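A minimal sketch of the prediction step of eq. (2), together with the resampling implied by the choice of prior as importance density in eq. (3), might look as follows. The per-dimension noise scales and the systematic resampling scheme are illustrative assumptions rather than details taken from the paper.

import numpy as np

def predict(states, noise_std=(2.0, 2.0, 1.0, 1.0), rng=None):
    """Random-walk prediction X_{t+1} = X_t + v_t (eq. 2).

    `states` is an (N_s, 4) array of particles [x_c, y_c, W, H];
    `noise_std` gives the assumed standard deviation of v_t per dimension.
    """
    rng = np.random.default_rng() if rng is None else rng
    return states + rng.normal(0.0, noise_std, size=states.shape)

def systematic_resample(states, weights, rng=None):
    """Resample particles in proportion to their (normalised) weights.

    With the prior used as importance density (eq. 3), the resampled particles
    from the current step seed the prediction for the next iteration.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    indices = np.searchsorted(np.cumsum(weights), positions)
    indices = np.minimum(indices, n - 1)  # guard against floating-point overrun
    return states[indices], np.full(n, 1.0 / n)

In practice the resampled set replaces the old particles before the next call to predict, matching the paper's statement that the current particles generate those of the next iteration.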
2.2 Object Representation
An object is represented by its (previously specified) state X_t and a colour model. The non-parametric representation of the colour histogram of the object is P = \{p^{(u)}\}_{u=1,...,m}, where m is the number of bins in the histogram. It has been argued in previous works [3], [1] that not all pixels contribute equally to the object or candidate model. Thus, for example, pixels on the boundary of a region are
typically more prone to errors than the pixels in the interior of the region. A common strategy for overcoming this problem has been to use a kernel function, such as the Epanechnikov kernel [19], to weight the pixels' contribution to the histogram. The same kernel function is applied irrespective of the position of the region. Our contention is that blind application of a kernel function can lead to (a) a drift problem when the object model is updated and (b) poor localisation of the object during a merge. Small errors can accumulate and ultimately the target model can become completely different from the actual object. Our strategy in building the object and candidate histograms is to weight a pixel's contribution by taking into account background-foreground segmentation information. To achieve this, the foreground segmentation result is first cleaned up using morphological operations. The Manhattan distance transform [18], [17] is then applied to obtain the weights of the pixels for their contribution to the object/candidate histogram. In a binary image, the distance transform replaces the intensity of each foreground pixel with the distance of that pixel to its nearest background pixel. Thus, centrally located pixels (in the sense of being further from the background) receive greater weight, and pixels on the boundary separating foreground and background receive small weights. The distance transform appears to be better suited for this purpose than more traditional kernel functions. The scores p^{(u)} of the bins of the histogram model of the object, P = \{p^{(u)}\}_{u=1,...,m}, are computed using the following equation:

p^{(u)} = \sum_{x_j \in \text{Foreground Region}} w(x_j)\, \delta(g(x_j) - u),    (4)
where δ is the Kronecker delta function, g(x_j) assigns a bin in the histogram to the colour at location x_j, and w(x_j) is the weight of the pixel at location x_j obtained by applying the distance transform to the foreground segmented region. The weights for background pixels are almost zero, which makes it very unlikely that the tracker will shift to background regions of the scene. When two or more objects merge, this is detected using a merge-split algorithm [16], and the updating of the object model is temporarily halted.
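A possible realisation of eq. (4) is sketched below. The paper does not specify the colour quantisation or the software used; the 8x8x8 RGB binning and the use of scipy's taxicab (Manhattan) distance transform are assumptions made for illustration.

import numpy as np
from scipy.ndimage import distance_transform_cdt

def weighted_colour_histogram(image, fg_mask, bins_per_channel=8):
    """Build the object/candidate histogram of eq. (4).

    `image` is an (H, W, 3) uint8 RGB frame, `fg_mask` a boolean foreground
    mask for the object's region. Each foreground pixel's vote is weighted by
    its Manhattan distance to the nearest background pixel, so interior pixels
    count more than boundary pixels.
    """
    # w(x_j): taxicab distance transform of the (cleaned-up) foreground mask.
    weights = distance_transform_cdt(fg_mask.astype(np.uint8),
                                     metric='taxicab').astype(float)

    # g(x_j): quantise each pixel's colour to a single bin index.
    step = 256 // bins_per_channel
    q = (image // step).reshape(-1, 3).astype(np.int64)
    bin_idx = q[:, 0] * bins_per_channel**2 + q[:, 1] * bins_per_channel + q[:, 2]

    # p^(u): sum the distance-transform weights falling into each bin.
    hist = np.bincount(bin_idx, weights=weights.reshape(-1),
                       minlength=bins_per_channel**3)
    return hist / (hist.sum() + 1e-12)  # normalise so histograms are comparable

Note that background pixels receive a distance-transform weight of zero, so they contribute nothing to the histogram, consistent with the paper's observation above.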
2.3 Observation Representation
To estimate the posterior probability of the state of the object, N_s hypotheses of the object are maintained (recall eq. (1)). Each hypothesis gives rise to an observation representation which is used to evaluate the likelihood of it being the tracked object. The histogram for a hypothesised region in the current image frame, Q = \{q^{(u)}\}_{u=1,...,m}, where m is the number of bins in the histogram, is defined as

q^{(u)} = \sum_{x_j \in \text{Foreground Region}} w(x_j)\, \delta(g(x_j) - u),    (5)

analogously to eq. (4). The observations from foreground segmentation are the centroid, width and height of the different foreground blobs in the current frame.
Nearest-neighbour data association is used to associate a measurement (x_c^m, y_c^m, W^m, H^m) with an object. In the event that objects have merged, the centroid used to evaluate a hypothesis is computed as the weighted mean of the foreground pixels in the region defined by the hypothesis/particle. The weights used are those from the distance transform applied to the foreground segmentation result.
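A simple sketch of nearest-neighbour data association between tracked objects and blob measurements is given below; the greedy matching order, the centroid-only distance and the gating threshold are illustrative assumptions, not details from the paper or from [16].

import numpy as np

def associate_measurements(object_states, measurements, gate=50.0):
    """Greedy nearest-neighbour association of blob measurements to objects.

    `object_states` and `measurements` are (N, 4) / (M, 4) arrays of
    [x_c, y_c, W, H]. Returns a dict mapping object index -> measurement index;
    objects with no measurement inside the gate are left out (e.g. merges).
    """
    assignments = {}
    taken = set()
    for i, obj in enumerate(object_states):
        # Centroid distance to every measurement, ignoring already-assigned ones.
        d = np.linalg.norm(measurements[:, :2] - obj[:2], axis=1)
        d[list(taken)] = np.inf
        j = int(np.argmin(d)) if len(d) else -1
        if j >= 0 and d[j] < gate:
            assignments[i] = j
            taken.add(j)
    return assignments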
2.4 Hypothesis Evaluation
Each hypothesis of the object state is evaluated using colour information and foreground information. A likelihood function [6] is used to compute the weight of each particle, integrating the colour and foreground cues as follows:

L(Z_t \mid X_t^i) = L_{\text{colour}}(Z_{c,t} \mid X_t^i) \times L_{fg}(Z_{fg,t} \mid X_t^i),    (6)

where Z_{c,t} is the current frame and Z_{fg,t} is the measurement from the current frame, after foreground segmentation and eight-connected component analysis, associated with the object. Here,

L_{\text{colour}}(Z_{c,t} \mid X_t^i) = \exp(-d_c(P_t, Q_t)/\sigma_{Z_c}),    (7)

where d_c(P, Q) = \sqrt{1 - \rho(P, Q)} is the Bhattacharyya distance based on the Bhattacharyya coefficient \rho(P, Q) = \sum_{i=1}^{m} \sqrt{p^{(i)} q^{(i)}}, and \sigma_{Z_c} is the zero-mean Gaussian noise associated with the colour observation. The term L_{fg}(Z_{fg,t} \mid X_t^i) is the likelihood based on the foreground segmentation measurement and is given by

L_{fg}(Z_{fg,t} \mid X_t^i) = \exp(-d_{fg}(X_t^i, X_t^m)/\sigma_{Z_{fg}}),    (8)

where d_{fg}(X_t^i, X_t^m) = 1 - \exp(-\lambda) and

\lambda = \frac{(x_c^i - x_c^m)^2 + (y_c^i - y_c^m)^2 + (W^i - W^m)^2 + (H^i - H^m)^2}{W^i \times H^i}    (9)

when there is a match for the object by data association. In the case of a merge, d_{fg}(X_t^i, X_t^m) = 1 - \exp(-[((x_c^i - x_c^M)^2 + (y_c^i - y_c^M)^2)/(W^i \times H^i)]), where (x_c^M, y_c^M) is the weighted centroid of the foreground pixels in the region defined by particle X_t^i. For a meaningful, balanced integration of cues, the functions d_c and d_{fg} should have similar behaviour. To test this, we plotted d_c against (1 - \rho(P, Q)) over [0, 1], where \rho = 1 means the best match and \rho = 0 the worst match of histograms P and Q. We also plotted d_{fg} for \lambda in [0, 2], where zero means a good match and two a bad match. Figure 2 shows that the plots of the two distance functions are very similar.
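Assuming the square-root forms of the Bhattacharyya coefficient and distance, eqs. (6)-(9) could be implemented roughly as in the following sketch; the values of sigma_c and sigma_fg are illustrative tuning parameters, not values from the paper.

import numpy as np

def colour_likelihood(p, q, sigma_c=0.2):
    """L_colour of eq. (7): Bhattacharyya distance between histograms P and Q."""
    rho = np.sum(np.sqrt(p * q))            # Bhattacharyya coefficient
    d_c = np.sqrt(max(1.0 - rho, 0.0))      # Bhattacharyya distance
    return np.exp(-d_c / sigma_c)

def foreground_likelihood(particle, measurement, sigma_fg=0.2):
    """L_fg of eqs. (8)-(9): agreement of a particle with the associated blob."""
    xc, yc, w, h = particle
    xm, ym, wm, hm = measurement
    lam = ((xc - xm)**2 + (yc - ym)**2 + (w - wm)**2 + (h - hm)**2) / (w * h)
    d_fg = 1.0 - np.exp(-lam)
    return np.exp(-d_fg / sigma_fg)

def particle_weight(p_model, q_candidate, particle, measurement):
    """Combined likelihood of eq. (6), used as the (unnormalised) particle weight."""
    return (colour_likelihood(p_model, q_candidate) *
            foreground_likelihood(particle, measurement))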
2.5 Model Update
To handle the appearance change of the object due to variation in illumination, pose, distance from the camera, etc., the object model is updated using the auto-regressive learning process

P_{t+1} = (1 - \alpha) P_t + \alpha P_t^{est}.    (10)
Fig. 2. Left: the plot of d_c against (1 - ρ(P, Q)) on the x-axis, for matching colour observations. Right: the plot of d_fg against λ on the x-axis. Both plots are quite similar.
Here P_t^{est} is the histogram of the region defined by the mode of the particles used in tracking the object, and α is the learning rate. The higher the value of α, the faster the object model is updated towards the new region. The model update is applied only when the likelihood of the current estimate of the object state X_t^{est}, with respect to the current measurement Z_t, given by

L(Z_t \mid X_t^{est}) = L_{\text{colour}}(Z_{c,t} \mid X_t^{est}) \times L_{fg}(Z_{fg,t} \mid X_t^{est}),    (11)

is greater than an empirical threshold.
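The gated auto-regressive update of eqs. (10) and (11) might be sketched as follows; the learning rate and the threshold value are illustrative assumptions, and suspending the update during a merge follows Section 2.2.

def update_object_model(p_model, p_estimate, likelihood_at_estimate,
                        alpha=0.1, threshold=0.6, merged=False):
    """Auto-regressive model update of eq. (10), gated by eq. (11).

    The histogram model is blended towards the histogram of the region at the
    mode of the particles only when that region's combined likelihood exceeds
    an empirical threshold, and never while the object is part of a merge.
    """
    if merged or likelihood_at_estimate <= threshold:
        return p_model                      # keep the old model unchanged
    return (1.0 - alpha) * p_model + alpha * p_estimate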
3 Results
Figure 3 shows the tracking result for a video in the CAVIAR data set. The person on the left of the frame undergoes significant scale and illumination change. As the person walks past the shop window the illumination on the person changes, and hence there is a significant change in appearance. This is evident from the model histogram plots of the object at different instances of time in Figure 4, which show the colour model for the person tracked with the dashed bounding box in Figure 3. In such a case an ordinary colour tracker will suffer large errors in localisation, and an adaptive colour tracker is likely to drift to other parts of the scene. The tracker proposed in this paper tracks the object accurately throughout the duration of the video. Figure 5 shows the improvement in localisation of two targets when the targets overlap. Figure 5a shows the tracking result using the colour cue alone. Since the colours of the two targets are different, the mode of the particles is precluded from converging to positions that include parts of the other object. Under such circumstances, if the tracker is adaptive then it is quite possible that it will drift to parts of the scene other than the object of interest. Incorporating cues from foreground segmentation gives better localisation of the targets in the case of overlap, as is evident from Figures 5b and 6. Figures 6 and 7 show some more tracking results. These two sequences are particularly difficult because there are instances of long and complete occlusions of targets by each other. In the former sequence there is an occlusion which lasts for 280 frames. In the latter sequence the sizes of the objects are small, there are partial occlusions from background objects, and the noise level is high. The complete tracking results can be downloaded from http://www.cs.adelaide.edu.au/~vision/projects/accv07-tracking/.
Fig. 3. These frames show successful tracking of three objects in a video from the CAVIAR data set
Fig. 4. The images show the RGB histogram model of the person on the left, for three different instances as tracking progresses. Because of the change in illumination due to the shop windows, there is a significant change in appearance and hence in the object model.
Fig. 5. The left image shows the poor localisation of the object when only the colour cue is used for tracking. The right image shows the improved localisation of the object when both colour and segmentation cues are integrated for tracking.
The proposed approach is more reliable for tracking objects when there are changes in scale, pose, illumination, and occlusion. In our experiments we have been able to track objects with as few as 20 particles. However, two drawbacks were observed in the proposed method: (1) when shadows are detected as foreground, the localisation of the object is less accurate; this can be improved by using shadow-removal methods.
Fig. 6. These frames show successful tracking of objects in spite of almost complete occlusion
Fig. 7. These frames show successful tracking of people in spite of poor illumination, small size, and several occlusions. The left and middle images show tracking of two persons. The right image shows tracking of the three persons present simultaneously in the scene.
(2) During almost complete occlusions there are errors in localisation, but correct tracking resumes when the objects separate after the occlusion. Correct localisation of an occluded target during complete occlusion with a single sensor is a very difficult problem. Given the unconstrained environment of the real-life situations in the CAVIAR dataset, quite good tracking results are obtained by the scheme presented in this paper.
4 Conclusion
An enhanced scheme for tracking multiple objects in video has been proposed and demonstrated. Novel contributions of this work include a new weight function for the construction of the object and candidate models. The measurement obtained from foreground segmentation is integrated with a colour cue to achieve better localisation of the object. Sometimes there are errors in segmentation and sometimes the colour cue is not reliable, but integration of the two cues gives a better result. The proposed method improves the handling of object models undergoing change, rendering the system less susceptible to the drift problem. Furthermore, the tracker can follow an object with as few as 20 particles. The method can be extended to moving cameras by using optical flow, mosaicking, or epipolar constraint techniques to segment the moving foreground objects.
References
1. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–577 (2003)
2. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998)
3. Nummiaro, K., Koller-Meier, E., Van Gool, L.: Object tracking with an adaptive color-based particle filter. In: Proceedings of the 24th DAGM Symposium on Pattern Recognition, pp. 353–360. Springer, Heidelberg (2002)
4. Zhou, S.K., Chellappa, R., Moghaddam, B.: Visual tracking and recognition using appearance-adaptive models in particle filters. IEEE Transactions on Image Processing 13(11), 1434–1456 (2004)
5. Ross, D., Lim, J., Yang, M.H.: Adaptive probabilistic visual tracking with incremental subspace update. In: Pajdla, T., Matas, J. (eds.) ECCV 2004. LNCS, vol. 3022, pp. 470–482. Springer, Heidelberg (2004)
6. Brasnett, P.A., Mihaylova, L., Canagarajah, N., Bull, D.: Particle filtering with multiple cues for object tracking in video sequences. In: Proceedings of SPIE, Image and Video Communications and Processing, vol. 5685, pp. 430–441 (2005)
7. Wu, Y., Huang, T.S.: Robust visual tracking by integrating multiple cues based on co-inference learning. International Journal of Computer Vision 58(1), 55–71 (2004)
8. Spengler, M., Schiele, B.: Towards robust multi-cue integration for visual tracking. Machine Vision and Applications 14, 50–58 (2003)
9. Perez, P., Vermaak, J., Blake, A.: Data fusion for visual tracking with particles. Proceedings of the IEEE 92(3), 495–513 (2004)
10. Shao, J., Zhou, S.K., Chellappa, R.: Tracking algorithm using background-foreground motion models and multiple cues. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2005), vol. 2, pp. 233–236 (2005)
11. Collins, R.T., Liu, Y., Leordeanu, M.: Online selection of discriminative tracking features. IEEE Transactions on Pattern Analysis and Machine Intelligence 27(10), 1631–1643 (2005)
12. Han, B., Davis, L.: Object tracking by adaptive feature extraction. In: ICIP 2004. International Conference on Image Processing, vol. 3, pp. 1501–1504 (2004)
13. Arulampalam, S., Maskell, S., Gordon, N., Clapp, T.: A tutorial on particle filters for on-line non-linear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50(2), 174–188 (2002)
14. Kang, J., Cohen, I., Medioni, G., Yuan, C.: Detection and tracking of moving objects from a moving platform in presence of strong parallax. In: Proceedings of the Tenth International Conference on Computer Vision, Beijing, China, vol. 1, pp. 10–17 (2005)
15. Kumar, P., Ranganath, S., Huang, W.: Queue based fast background modelling and fast hysteresis thresholding for better foreground segmentation. In: Proceedings of the 2003 Joint Conference of the Fourth International Conference on Information, Communications and Signal Processing and the Fourth Pacific Rim Conference on Multimedia, vol. 2, pp. 743–747 (2003)
16. Kumar, P., Ranganath, S., Sengupta, K., Huang, W.: Cooperative multitarget tracking with efficient split and merge handling. IEEE Transactions on Circuits and Systems for Video Technology 16(12), 1477–1490 (2006)
17. Jain, A.K.: Fundamentals of Digital Image Processing. Prentice Hall International, Englewood Cliffs (1989)
18. Rosenfeld, A., Pfaltz, J.: Distance functions in digital pictures. Pattern Recognition 1, 33–61 (1968)
19. Comaniciu, D., Ramesh, V., Meer, P.: Real-time tracking of non-rigid objects using mean shift. In: IEEE Conference on Computer Vision and Pattern Recognition, Hilton Head Island, SC, USA, vol. 2, pp. 142–149 (2000)