Joint IEEE Workshop on Object Tracking and Classification Beyond the Visible Spectrum (OTCBVS’04) In conjunction with IEEE Computer Vision and Pattern Recognition Conference (CVPR’2004)
Detection and Tracking of Moving Objects from Overlapping EO and IR Sensors Jinman Kang, Kalpitkumar Gajera, Isaac Cohen and Gérard Medioni IRIS, Computer Vision Group, University of Southern California Los Angeles, CA 90089-0273 {jinmanka|gajera|icohen|medioni}@iris.usc.edu
Abstract
We present an approach for tracking moving objects observed by EO and IR sensors on a moving platform. Our approach detects and tracks the moving objects after accurately recovering the geometric relationship between the different sensors. We address the tracking problem by separately modeling the appearance and motion of the moving regions using stochastic models. The appearance of the detected blobs is described by multiple spatial distribution models of the blobs' colors and edges from the different sensors. This representation is invariant to 2D rigid and scale transformations. It provides a rich description of the object being tracked and produces an accurate blob similarity measure for tracking, especially when one of the sensors fails to provide reliable information. The motion model is obtained using a Kalman Filter (KF) process, which predicts the position of the moving objects while taking into account the camera motion. Tracking is performed by maximizing a joint probability model reflecting appearance and motion. The novelty of our approach consists in defining a Joint Probability Data Association Filter (JPDAF) for integrating multiple cues from multiple sensors; it provides a unified framework for fusing information from different types of sensors. The proposed method tracks multiple moving objects through partial and total occlusions under various illumination conditions. We demonstrate the performance of the system on several real video surveillance sequences.
1. Introduction
Video surveillance is one of the most popular applications of video processing research. Its fundamental elements are object detection and tracking. Reliable surveillance performance requires a simultaneous and consistent solution to the detection and tracking of moving objects. In the real world, due to varying environmental conditions such as changing illumination, shadows, reflections and highlights, a single solution based on a single modality tends to fail in most cases. In addition, the use of a moving platform makes the problem even more challenging, as we have to
keep track of the camera's motion in order to ensure an accurate background model and consistent tracking. To handle different environmental conditions, researchers have proposed the use of multiple sensor modalities, such as Electro-Optical (EO) and Infra-Red (IR) sensors. While multiple sensors provide better sensing in various conditions, they require an efficient fusion of the data for a better understanding of the activity. The crucial elements for tracking across multiple sensors are the registration of the sensors and the integration of multiple cues across them. In the following sections, we review related work and present an overview of the proposed approach. Section 2 presents the registration of multiple sensors. In Section 3, we introduce the stochastic models used for tracking moving objects. Results are presented and discussed in Section 4. Finally, Section 5 concludes the paper with a discussion of future work.
1.1. Previous Work
The registration of images acquired by different sensors is a challenging task, as it requires space and time registration of each camera with respect to the motions of the sensors [9][10]. Several methods have been proposed for the registration of image sequences acquired by various sensors [4][11][13][17]. These approaches vary in their use of contour features for an invariant image representation [4], iterative optimization relying on a local invariant representation [11], a similarity measure relying on mutual information [13], and the global maximization of an averaged local correlation measure using Fisher's Z-transformation [17]. Although multi-sensor registration has been studied by many researchers, tracking across sensors of various modalities has not received the same attention. Several issues must be addressed to achieve simultaneous tracking of multiple moving objects. Partial or total occlusions of moving objects are a hard problem: the tracked object may temporarily disappear from the scene, either due to its own motion or to that of the camera. To propagate information across different sensor models, a fusion method is required for integrating multiple cues simultaneously. A large number of papers have been published on video tracking [8][2]. The main limitation of these approaches is
the lack of an adaptive and invariant object description able to cope with, for example, scene dynamics and occlusions of short or long duration. To address this problem, each object has to be represented by a persistent object appearance model. An object appearance model is represented by a set of distinctive features such as color [5][1], shape [16], or texture. Methods based on active contours [16] usually require initializing the contour manually, and only handle small non-rigid motions. The color-based methods proposed so far are usually not invariant to arbitrary rigid motions, or require a segmentation of the detected blob into well-known regions: for humans, a segmentation into regions corresponding to the head, torso, and legs. Appearance changes are expected while tracking moving people in the scene. Therefore, object appearance models have to be continuous, in the sense that a small localized change of the object's color and shape should create a small localized variation in its signature. The object description should also be invariant to 2D rigid transformations and scale changes in order to accommodate small changes of perspective. In [15], the authors proposed to use a set of discriminative features for tracking by finding an optimal feature for distinguishing foreground from background. Although this method provides the optimal discriminative tracking feature, it is limited to tracking a single object. In the case of multiple moving objects, the background model is biased and the selected features are not optimal. Tracking moving objects from multiple cameras is usually based on a prior registration of the cameras using common scene features or tracked moving objects. Image and video registration have been extensively studied; a summary can be found in [12] and [14] respectively.

1.2. Overview of the Proposed Approach
In this paper, we propose a novel approach for simultaneously tracking moving objects across a heterogeneous set of sensors. A Joint Probability Data Association Filter (JPDAF) based tracking of moving objects is proposed to provide persistent object tracking across multiple moving sensors by encoding the objects' appearance and motion as well as the motion of the sensors. The detection from the moving camera is performed using the approach proposed in [10], which derives a background model while controlling the accumulated residual errors through a sliding window approach. The appearance of the moving blobs detected by the various sensors is encoded using a radial distribution invariant to 2D rigid transformations and to scale variations. This distribution encodes the color or edge properties of the blob, or both, depending on the availability of the information. An appearance probability model is defined based on a similarity function (Kullback-Leibler distance) measuring the likelihood between two distributions. The motion probability model is inferred from a Kalman Filter (KF). This model is calculated by a Gaussian distribution between the predicted bounding box position and the bounding box position of the observed blobs, by encoding the camera motion in the KF.

2. Registration of multi-sensor sequences
The registration of multi-sensor image sequences requires selecting features that are invariant across sensors for a simultaneous alignment in the spatial and temporal domains. We use gradient information for extracting consistent features across the different sensors (see Figure 1).

Figure 1. Example of consistent features across different sensors. (a) A selected frame from the EO sensor. (b) A corresponding frame from the IR sensor. (c) and (d) Edge maps of the selected EO and IR frames respectively.

In most cases, the temporal constraint is assumed to be known, and the spatial constraint is assumed to be constant. However, in this study we did not have access to the calibration information of the rig, and we had to register the sensors geometrically using image features. Although the sensors are tightly mounted on a rigid structure (e.g. a camera rig), spatial variations can be observed frame by frame due to small errors in the estimated perspective projection registering the views. The geometric registration of the sensors is performed by a combination of perspective and affine transformations using the set of extracted invariant features. The perspective transformation is estimated from the first frame of each sensor and is used as an initial registration. The affine transformation is estimated from pairs of frames across sensors to compensate for the spatial drift between sensors. The optimal transformation is selected by measuring the registration errors of both transformations.
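To make the registration strategy concrete, the following is a minimal Python sketch (not the authors' implementation) of the perspective-plus-affine selection described above. It assumes OpenCV is available, that matched feature points between the EO and IR frames are already given, and it uses edge maps as a stand-in for the gradient-based invariant features; the edge-map difference used as registration error is an illustrative choice.

```python
# Sketch under stated assumptions: a fixed perspective transform (from the first
# frame pair) plus a per-frame affine refinement; keep whichever registers better.
import cv2
import numpy as np

def edge_map(gray, lo=50, hi=150):
    """Edges are the modality-invariant cue shared by EO and IR frames (8-bit input)."""
    return cv2.Canny(gray, lo, hi)

def registration_error(H, src_gray, dst_gray):
    """Mean absolute difference between the warped EO edge map and the IR edge map."""
    h, w = dst_gray.shape
    warped = cv2.warpPerspective(edge_map(src_gray), H, (w, h))
    return float(np.mean(np.abs(warped.astype(np.float32) -
                                edge_map(dst_gray).astype(np.float32))))

def register_pair(eo_gray, ir_gray, H_persp, eo_pts, ir_pts):
    """Choose between the initial perspective transform and a per-frame affine
    refinement estimated from matched points (Nx2 arrays, assumed given)."""
    A, _ = cv2.estimateAffine2D(eo_pts, ir_pts, method=cv2.RANSAC)
    H_aff = np.vstack([A, [0.0, 0.0, 1.0]]) if A is not None else H_persp
    candidates = [H_persp, H_aff]
    return min(candidates, key=lambda H: registration_error(H, eo_gray, ir_gray))
```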
3. Tracking using JPDAF
Tracking is formulated as a joint maximization of a set of probability models. Two types of probabilistic models are defined, representing the objects' appearance and velocity. The appearance model should be able to describe the object's appearance in the various sensors simultaneously. In the case of EO, the distinctive components for describing an object's appearance are color and edge information. In the IR domain, however, due to the lack of chromatic information, edge information is the dominant cue. The combination of EO and IR appearance allows measuring the similarity of the blobs' appearance more accurately. The motion model uses the KF for modeling the kinematics of the moving objects and the camera motion simultaneously.

3.1. Appearance Model
The color distribution model is obtained by mapping the blob into multiple polar representations. Several shape or color distribution models using a polar representation have been proposed [6][7][10]. In [6], the proposed approach focuses on the object's shape description (edges) instead of its appearance (color), and it is limited to representing local shape properties. In [7], the proposed model measures the color distribution using a similar polar representation, but focuses on characterizing a global appearance signature of the object. That model is not 2D rotation-invariant, and we propose here to use the shape description model proposed in [10] to guarantee invariance to 2D rigid transformations and scale changes (see Figure 2).

Figure 2. Mapping of the proposed 2D rigid motion invariant polar representation.

Given a detected moving blob, we define a reference circle $C_R$ corresponding to the smallest circle containing the blob. This circle is uniformly sampled into a set of control points $P_i$. For each control point $P_i$, a set of concentric circles of various radii defines the bins of the appearance model. As illustrated in Figure 2, we sample the reference circle with 8 control points. A large number of control points along the reference circle makes the model rotation invariant. The defined model is also translation invariant. Finally, normalizing the reference circle to the unit circle guarantees scale invariance. Inside each bin, a Gaussian color model is computed for modeling the color properties of the overlapping pixels of the detected blob. Therefore, for a given control point $P_i$ we have a one-dimensional distribution $\gamma_i(P_i)$. The normalized combination of the distributions obtained from each control point $P_i$ defines the appearance model of the detected blob: $\Lambda_i = \sum_i \gamma_i(P_i)$.

The shape-based description of moving blobs detected in the EO sensor is obtained similarly by counting, in each bin, the number of edge pixels belonging to the moving blob. The 2D shape description is obtained by collecting and normalizing the corresponding edge points for each bin as follows:

$$E_{EO}(r) = \frac{\sum_i E_r(P_i)}{\max_r \left( \sum_i E_r(P_i) \right)} \qquad (1)$$

where $E_r(P_i)$ is the number of edge points in the $r$-th radial bin of the $i$-th control point, and $E_{EO}(r)$ is the edge distribution of each radial bin for the EO sequence. The shape-based description of the blob from the IR sensor is obtained similarly from equation (1), using $E_{IR}(r)$, the edge distribution of each radial bin of the IR sequence.
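As a concrete illustration, the sketch below builds a simplified version of this radial descriptor: control points on the enclosing circle, concentric distance bins per control point, per-bin Gaussian color statistics, and the normalized edge distribution of equation (1). The number of bins, the centroid-based circle estimate, and the array layout are illustrative assumptions rather than the paper's exact choices.

```python
# Simplified radial appearance descriptor (sketch, not the authors' code).
import numpy as np

def radial_descriptor(pixels_xy, colors, edge_mask, n_ctrl=8, n_bins=6):
    """pixels_xy: (N,2) blob pixel coordinates; colors: (N,3) RGB values;
    edge_mask: (N,) bool, True where the pixel lies on an edge."""
    center = pixels_xy.mean(axis=0)
    radius = np.linalg.norm(pixels_xy - center, axis=1).max() + 1e-6  # reference circle
    # Control points sampled uniformly along the reference circle.
    angles = 2 * np.pi * np.arange(n_ctrl) / n_ctrl
    ctrl = center + radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)

    color_model = np.zeros((n_ctrl, n_bins, 2, 3))   # per-bin color mean and variance
    edge_hist = np.zeros((n_ctrl, n_bins))           # per-bin edge counts
    for i, p in enumerate(ctrl):
        # Distances are normalized by the circle diameter, which gives scale invariance.
        d = np.linalg.norm(pixels_xy - p, axis=1) / (2 * radius)
        bins = np.minimum((d * n_bins).astype(int), n_bins - 1)
        for b in range(n_bins):
            sel = bins == b
            if sel.any():
                color_model[i, b, 0] = colors[sel].mean(axis=0)
                color_model[i, b, 1] = colors[sel].var(axis=0)
            edge_hist[i, b] = edge_mask[sel].sum()
    # Equation (1): sum edge counts over control points, normalize by the largest bin.
    edge_dist = edge_hist.sum(axis=0)
    edge_dist = edge_dist / max(edge_dist.max(), 1.0)
    return color_model, edge_dist
```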
Figure 3. Examples of the proposed appearance model.

As one can observe from Figure 3, the combination of the various sensor models provides a more accurate description of the moving object. In particular, the edge information obtained from the IR sequence provides a more detailed description where the EO sequence cannot, due to the illumination conditions (e.g. twilight). In the following we describe the use of the proposed appearance descriptor for deriving an appearance probability model for tracking moving objects.

The appearance probability model is defined as a similarity measure among detected blobs in successive frames. The proposed appearance model describes blobs through a distribution function, and we employ the Kullback-Leibler distance for measuring the similarity of the computed appearance models. Because the distribution models differ (a Gaussian distribution for the color model and a uniform distribution for the shape model), the similarity measurements are computed separately. The similarity function of the color model can be expressed in terms of the mean and variance of the Gaussian model in each bin. Given the estimated Gaussian models in each bin, we define the likelihood ratio of equation (2):

$$\Gamma_{Color} = \frac{1}{2 N_{rgb}} \sum_{N_{rgb}} \left[ (\mu_t - \mu_{t+1})^2 \left( \frac{1}{\sigma_t^2} + \frac{1}{\sigma_{t+1}^2} \right) + \frac{\sigma_t^2}{\sigma_{t+1}^2} + \frac{\sigma_{t+1}^2}{\sigma_t^2} \right] \qquad (2)$$

where $\mu_t$ and $\sigma_t^2$ are respectively the mean and the variance of the color component in the considered bin, $N_{rgb}$ is the total number of bins of the color component, and $1 \le \Gamma_{Color} \le \infty$.
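A minimal sketch of equation (2) follows, assuming the per-bin means and variances produced by the descriptor sketch above; the variance floor is an added safeguard, not part of the paper.

```python
# Symmetric KL-style color similarity of equation (2) (sketch).
import numpy as np

def color_similarity(model_t, model_t1, eps=1e-6):
    """model_*: arrays of shape (n_ctrl, n_bins, 2, 3) holding per-bin mean and
    variance for each color channel (as built by radial_descriptor above)."""
    mu_t,  var_t  = model_t[..., 0, :],  np.maximum(model_t[..., 1, :],  eps)
    mu_t1, var_t1 = model_t1[..., 0, :], np.maximum(model_t1[..., 1, :], eps)
    n_rgb = mu_t.size  # total number of color bins
    term = ((mu_t - mu_t1) ** 2) * (1.0 / var_t + 1.0 / var_t1) \
           + var_t / var_t1 + var_t1 / var_t
    # Gamma_Color >= 1, with equality for identical per-bin Gaussians.
    return float(term.sum() / (2.0 * n_rgb))
```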
The similarity measurement of the shape model is obtained by equation (3), where $E_{EO}(r)_t$ is the shape probability distribution of each bin of the EO sequence at time $t$, and $E_{IR}(r)_t$ is the shape probability distribution of each bin of the IR sequence:

$$Dist_{EO\_Shape} = \frac{1}{2} \sum_r \left( E_{EO}(r)_t - E_{EO}(r)_{t+1} \right) \log \frac{E_{EO}(r)_t}{E_{EO}(r)_{t+1}}$$
$$Dist_{IR\_Shape} = \frac{1}{2} \sum_r \left( E_{IR}(r)_t - E_{IR}(r)_{t+1} \right) \log \frac{E_{IR}(r)_t}{E_{IR}(r)_{t+1}} \qquad (3)$$

The probability of the appearance model is obtained by equation (4), combining both descriptions:

$$P_{app} = \frac{1}{(\Gamma_{Color})^2 + (Dist_{EO\_Shape})^2 + (Dist_{IR\_Shape})^2} \qquad (4)$$
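The shape distance of equation (3) and the combined appearance probability of equation (4) can be sketched as follows; the small epsilon guard against empty bins is an added assumption.

```python
# Symmetrized KL shape distance (eq. 3) and combined appearance probability (eq. 4).
import numpy as np

def shape_distance(e_t, e_t1, eps=1e-6):
    """Equation (3) for one modality: e_t, e_t1 are radial edge distributions."""
    p, q = np.maximum(e_t, eps), np.maximum(e_t1, eps)
    return 0.5 * float(np.sum((p - q) * np.log(p / q)))

def appearance_probability(gamma_color, eo_t, eo_t1, ir_t, ir_t1):
    """Equation (4): combine the color term with the EO and IR shape terms."""
    d_eo = shape_distance(eo_t, eo_t1)
    d_ir = shape_distance(ir_t, ir_t1)
    return 1.0 / (gamma_color ** 2 + d_eo ** 2 + d_ir ** 2)
```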
3.2. Motion Model
We proposed in [10] a unified KF framework which can simultaneously handle the motion of the moving object and the camera motion. The KF model uses an approximation of the camera motion estimated by an affine transformation, and also provides automatic camera hand-off and tracking of objects occluded for short periods of time. A detailed description of this motion model is given in [10]. The KF framework is used for calculating the motion probability model of the detected moving objects. The state vector considered in this paper is defined as follows:

$$x^i = \left( x^i_{top},\ y^i_{top},\ v^i_{top},\ u^i_{top},\ x^i_{bottom},\ y^i_{bottom},\ v^i_{bottom},\ u^i_{bottom} \right) \qquad (5)$$

where $(x^i_{top}, y^i_{top})$ is the top-left corner of the detected bounding box, $(x^i_{bottom}, y^i_{bottom})$ is its bottom-right corner, $(v^i_{top}, u^i_{top})$ is the 2D velocity of $(x^i_{top}, y^i_{top})$, and $(v^i_{bottom}, u^i_{bottom})$ is the 2D velocity of $(x^i_{bottom}, y^i_{bottom})$.

The motion probability model $\tilde{P}_{motion}$ is calculated by a multivariate Gaussian (normal) distribution of the motion estimates using the following equation:

$$\tilde{P}_{motion} = \left( (2\pi)^{N_s} \det(P_t) \right)^{-1/2} \exp\!\left( -\frac{1}{2} (x_t - \bar{x}_t)^T P_t^{-1} (x_t - \bar{x}_t) \right) \qquad (6)$$

where $N_s$ is the number of variables in the object's state vector, $\bar{x}_t$ is the predicted state mean obtained from the KF, $x_t$ is the observed position, and $P_t$ is the covariance matrix obtained from the KF. Each sensor provides a separate motion probability model. The overall motion probability is obtained by combining the estimated motion probabilities from each sensor as follows:

$$\tilde{P}_{motion} = \frac{\tilde{P}_{EO\_motion} + \tilde{P}_{IR\_motion}}{2} \qquad (7)$$

where $\tilde{P}_{EO\_motion}$ is the motion probability model derived from the EO sensor and $\tilde{P}_{IR\_motion}$ is the motion probability model computed from the IR sequence.
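A sketch of equations (6) and (7) follows, assuming the KF prediction (state mean and covariance, which in the paper also absorbs the camera motion) is available for each sensor.

```python
# Multivariate Gaussian motion probability (eq. 6) and per-sensor fusion (eq. 7).
import numpy as np

def motion_probability(x_obs, x_pred, P_pred):
    """x_obs, x_pred: 8-vectors (corners and their velocities); P_pred: 8x8 covariance."""
    ns = x_obs.shape[0]
    diff = x_obs - x_pred
    norm = ((2.0 * np.pi) ** ns * np.linalg.det(P_pred)) ** -0.5
    return float(norm * np.exp(-0.5 * diff @ np.linalg.inv(P_pred) @ diff))

def fused_motion_probability(obs_eo, pred_eo, P_eo, obs_ir, pred_ir, P_ir):
    """Equation (7): average the EO and IR motion probabilities."""
    return 0.5 * (motion_probability(obs_eo, pred_eo, P_eo) +
                  motion_probability(obs_ir, pred_ir, P_ir))
```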
3.3. Joint Probability Model
In [10], a JPDAF-based tracking approach was proposed by formulating the tracking problem as finding the optimal position $\hat{X}$ of the moving object by maximizing both the appearance and motion models. The optimal position at each time step depends on the current observation as well as on the past estimated state vectors. The joint probability of the state vector at time $t$ is briefly summarized by the following equation:

$$P_{total}(X_t) = P(A_t, X_t)\, P(X_t \mid \hat{X}_{t-1}, \ldots, \hat{X}_1, \hat{X}_0)\, P_{total}(\hat{X}_{t-1}) = P_{app}(X_t)\, P_{motion}(X_t)\, P_{total}(\hat{X}_{t-1}) \qquad (8)$$

where $X_t$ denotes a position observation at time $t$, and $A_t$ denotes the appearance observation at time $t$. In order to simultaneously fuse information from the EO and IR sensors, the joint probability model is rewritten as follows:

$$P_{total}(X^t_{EO}, X^t_{IR}) = P(C_t, E^t_{EO}, E^t_{IR} \mid X^t_{EO}, X^t_{IR})\, P(X^t_{EO}, X^t_{IR} \mid \hat{X}^{t-1}_{EO}, \hat{X}^{t-1}_{IR}, \ldots, \hat{X}^0_{EO}, \hat{X}^0_{IR})\, P_{total}(\hat{X}^{t-1}_{EO}, \hat{X}^{t-1}_{IR})$$
$$= P_{app}(X^t_{EO}, X^t_{IR})\, P_{motion}(X^t_{EO}, X^t_{IR})\, P_{total}(\hat{X}^{t-1}_{EO}, \hat{X}^{t-1}_{IR}) \qquad (9)$$

where $C_t$ denotes a color observation at time $t$ in the EO sensor, and $E^t_{EO}$ and $E^t_{IR}$ denote respectively the edge observations in the EO and IR sensors. Similarly, $X^t_{EO}$ and $X^t_{IR}$ denote respectively the position observations at time $t$ in the EO and IR sequences. In order to avoid accumulating products of probabilities, we consider the sum of the log-probabilities as the joint probability. Furthermore, to ensure a stable computation, we discard old measurements from the estimation process. This shortens the memory of the KF and tolerates variations in speed and color similarities.
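The following sketch illustrates the resulting association rule in log form: each track accumulates log-probabilities over a short sliding window and is matched to the candidate blob maximizing the appearance-plus-motion likelihood. The window length and the greedy per-track selection are simplifying assumptions; they stand in for, rather than reproduce, the full JPDAF formulation.

```python
# Log-probability association sketch (assumptions noted in the comments).
import math
from collections import deque

class Track:
    """One tracked object; keeps a sliding window of past log-probabilities,
    mirroring the discarding of old measurements described above."""
    def __init__(self, window=10):           # window length is an assumption
        self.log_history = deque(maxlen=window)

    def associate(self, candidates, eps=1e-12):
        """candidates: list of (blob_id, p_app, p_motion) for the current frame.
        Returns the blob id maximizing the joint (log) probability, or None."""
        if not candidates:
            return None
        past = sum(self.log_history)          # log P_total of previous estimates
        def joint(c):
            _, p_app, p_motion = c
            return past + math.log(max(p_app, eps)) + math.log(max(p_motion, eps))
        best = max(candidates, key=joint)
        _, p_app, p_motion = best
        self.log_history.append(math.log(max(p_app, eps)) + math.log(max(p_motion, eps)))
        return best[0]
```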
4. Experimental results
In this section, we present results obtained from real sequences, illustrating the continuous tracking of moving objects across EO and IR sensors. We first start with a pair of ground-view sequences in Figure 4. As one can observe from the figure, the chromatic and edge information from the EO sequence is very poor due to the illumination conditions. In this example, the IR information provides better shape information, and the combination of the sensors improves the tracking performance. In Figure 4.a, we illustrate object reacquisition after the tracked object was temporarily outside of the camera's FOV. The consistent labelling of the tracked objects is depicted by the use of similar colors for the bounding boxes. Occlusions by stationary or moving objects are also correctly handled by the tracker. In Figure 4.b we show several frames illustrating the capability of handling stationary and dynamic occlusions. In Figure 5 we present a second example illustrating the ability to improve tracking results by fusing information from different sensors. We use two video streams acquired by a UAV.
(a)
(b)

Figure 4. Simultaneous acquisition of EO and IR data from a hand-held rig. (a) Continuous tracking of a moving object when the tracked object is temporarily out of the FOV. (b) Tracking of multiple moving objects with stationary and dynamic occlusions.

The EO and IR sensors are mounted on a gimbal. Neither sensor provides a very good detection of the moving objects. This is due to various factors, such as the color similarity (Figure 5.a) and the lack of significant temperature variation (Figure 5.b) between the foreground and the background regions. Tracking the moving objects from the EO sensor alone does not provide good results, as shown in Figure 5.c. The last frame in Figure 5.c shows the inability of the EO-only tracker to correctly estimate the shape of the moving objects due to the lack of variation with the background. Using the proposed approach for fusing the two sensors, we can robustly track the two people (see Figure 5.d). We have evaluated the proposed approach by manually extracting the color and the location of the bounding box of each object, and comparing them to the estimated positions provided by the method. We have observed that the moving objects are correctly reacquired and their positions are accurately estimated.

5. Conclusion
We have presented a novel approach for continuously tracking multiple objects across multiple sensors using a joint probability model that encodes the objects' appearance and motion. The appearance model is invariant to 2D rigid transformations and scaling, and it encodes both the EO and IR appearance of the object. It is used to accurately measure the appearance similarity regardless of the blobs' rigid motion. As depicted in the experimental results, moving objects are consistently tracked by integrating IR information when the EO sensor fails to provide a reasonable estimation. The motion model encodes not only the objects' motion, but also the cameras' motion: it tracks moving objects over different types of occlusions. Other issues remain to be addressed, such as the integration of 3D information (e.g. 3D ground plane, 3D trajectories and 3D structure) for improving both detection and tracking performance. The integration of other types of sensors (e.g. LIDAR) could also be investigated to assess the performance improvement provided by 3D data.

Acknowledgements
This research was partially funded by the Advanced Research and Development Activity of the U.S. Government under contract MDA-904-03-C1786.

References
[1] A. Elgammal and L. S. Davis, "Probabilistic Framework for Segmenting People Under Occlusion", IEEE ICCV, 2001.
[2] C. Wren, A. Azarbayejani, T. Darrell and A. Pentland, "Pfinder: Real-time tracking of the human body", IEEE Trans. on PAMI, Vol. 19(7), pp. 780-785, 1997.
[3] G. Stein, "Tracking from Multiple View Points: Self-calibration of Space and Time", IEEE CVPR, 1999.
[4] H. Li, B.S. Manjunath and S.K. Mitra, "A contour based approach to multisensor image registration", IEEE Trans. on Image Processing, 1995.
[5] H. Roh, S. Kang and S. Lee, "Multiple People Tracking Using an Appearance Model Based on Temporal Color", IEEE ICPR, 2000.
[6] H. Zhang and J. Malik, "Learning a discriminative classifier using shape context distance", IEEE CVPR, 2003.
[7] I. Cohen and H. Li, "Inference of Human Postures by Classification of 3D Human Body Shape", IEEE IWAMFG, 2003.
[8] I. Haritaoglu, D. Harwood and L. S. Davis, "Hydra: Multiple people detection and tracking using silhouettes", 2nd IEEE VS Workshop, 1999.
[9] J. Kang, I. Cohen and G. Medioni, "Continuous Multi-Views Tracking using Tensor Voting", IEEE Motion Workshop, 2002.
[10] J. Kang, I. Cohen and G. Medioni, "Continuous Tracking Within and Across Camera Streams", IEEE CVPR, 2003.
[11] M. Irani and P. Anandan, "Robust Multi-Sensor Image Alignment", IEEE ICCV, 1998.
[12] M. Irani and P. Anandan, "About direct methods", Vision Algorithms Workshop, 1999.
[13] P. Viola and W. M. Wells, "Alignment by maximization of mutual information", IEEE ICCV, 1995.
[14] Q. Cai and J. Aggarwal, "Automatic Tracking of Human Motion in Indoor Scenes Across Multiple Synchronized Video Streams", IEEE ICCV, 1998.
[15] R. T. Collins and Y. Liu, "On-Line Selection of Discriminative Tracking Features", IEEE ICCV, 2003.
[16] Y. Rui and Y. Chen, "Better Proposal Distributions: Object Tracking Using Unscented Particle Filter", IEEE CVPR, 2001.
[17] Y. Sheikh and M. Shah, "Aligning Dissimilar Images Directly", ACCV, 2004.
(a)
(b)
(c)
(d)

Figure 5. Images from EO and IR sensors mounted on a gimbal and acquired by a UAV (yellow circle: region of interest). (a) Detection results from the EO sensor. (b) Detection results from the IR sensor. (c) Tracking results from the EO sensor. (d) Tracking obtained by the proposed method, fusing EO and IR object characteristics.