"Wallflower: Principles and Practice of Background Maintenance", In. ICCV, pp. 255-261 1999. (a) Two views of the 2D+t boxes (366). (b) Filtered boxes (288).
A 2D+t Tensor Voting based approach for Tracking

Pierre Kornprobst and Gerard Medioni
IRIS, USC Computer Vision Group, University of Southern California
Los Angeles, CA 90089-0273
{kornprob,medioni}@iris.usc.edu
Abstract

This paper presents a new approach to tracking objects in motion observed by a fixed camera, in the presence of severe occlusions, merging/splitting objects, and defects in the detection. We first detect regions corresponding to moving objects in each frame, then establish their trajectories. The method is based on a spatiotemporal (2D+t) representation of the moving regions, and uses the Tensor Voting methodology to enforce smoothness of the tracked regions in space and time. We demonstrate the performance of the system on several real sequences.
1 Introduction
This paper deals with the tracking of moving regions in the case of a fixed observer. Classically, two steps are involved: detection and tracking. Detection is a fairly well understood, though hard, problem, and many approaches have been developed, often based on background modeling and foreground segmentation [18][10][12][14]. Tracking is still an active field of research (see for instance [15]) due to the variety of problems that may arise in real sequences: partial or total occlusions, noise, shadows and reflections. Two main classes of approaches may be distinguished. We call the first one model-based, when some a priori knowledge is available [11][3][6]. The second one is image-based: correspondences between regions are established from intensity information (corners, edges, color histograms, ...) [1][2][4][5][9]. In the absence of a model, it becomes necessary to impose other constraints, such as smoothness. The difficulty lies in the implementation of such a generic constraint. Global optimization (e.g. [16]) often leads to combinatorial explosion. The Extended Kalman Filter, which works very well for ballistic trajectories, may fail on discontinuous ones. We focus in this article on the tracking problem for a sequence acquired by a static camera, with a given temporal size. To detect the objects in motion, we use the approach developed in [12] in the context of noisy sequences. It is a partial differential equation (PDE) based approach which restores the sequence and segments it in a coupled way; a theoretically justified optimization problem is proposed. Once we have these regions, the aim is to correct detection defects due to occlusions, low contrast and noise, and to extract the trajectories. This is done using a perceptual grouping approach, the Tensor Voting methodology. It is based on a tensorial description of the moving regions, which encodes several different aspects: the estimated velocity (as the principal orientation of the tensor), the number of neighbors involved in this estimation, and their coherence (called the saliency). Besides introducing a new methodology for tracking, we show how the classical Tensor Voting methodology can take non-geometric information into account. The paper is organized as follows. Section 2 presents the approach. Section 3 shows promising results obtained on real sequences. Section 4 sketches future extensions of this work.
2 Voting for tracking
2.1 A 2D+t representation

Figure 4 shows some frames from the sequence we use as a running example. The input to our system is a set of detected regions, represented by their bounding boxes, and characterized by: (D-1) the centers of the bounding boxes C_i^t = (x_i, y_i, t), written as 2D+t points, (D-2) the bounding boxes (their dimensions), (D-3) the set of moving points in the boxes. We show in Figure 4 (a) two different views of the centers C_i^t. Thanks to this representation, the notion of "3D" trajectories clearly appears. The problem is then to infer some property of continuity for the position and the size of the boxes. The originality of this work is to turn the tracking problem into a perceptual grouping problem, in space and time.
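To make this 2D+t representation concrete, here is a minimal Python sketch of the data (D-1)-(D-3) attached to each detected region. It is not the authors' code; the class name Box2Dt and its field names are illustrative only.

```python
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Box2Dt:
    """One detected moving region in the 2D+t volume (names are illustrative)."""
    x: float                 # box center, image column
    y: float                 # box center, image row
    t: int                   # frame index (the temporal coordinate)
    width: float             # (D-2) bounding-box dimensions
    height: float
    moving_points: np.ndarray = field(default_factory=lambda: np.empty((0, 2)))  # (D-3)

    def center(self) -> np.ndarray:
        """(D-1) the 2D+t point C_i^t = (x_i, y_i, t)."""
        return np.array([self.x, self.y, float(self.t)])


# The whole sequence is then simply a cloud of such points in (x, y, t), e.g.
# boxes = [Box2Dt(120.0, 64.0, 0, 18.0, 40.0), Box2Dt(123.0, 65.0, 1, 18.0, 41.0), ...]
```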
2.2 Tensor Voting framework

The goal of perceptual organization is to be able to describe what we perceive, even with noise or missing data, when classical image-based approaches fail. Many approaches have been developed (see [13] for a discussion) and they all try to encode the smoothness constraint enunciated by Marr. Recently, the Tensor Voting methodology, a unified framework for many problems, has shown some promise. Although this problem can be seen as a direct application of the generic 3D system described in [17], it is indeed quite different. We are only interested in curves (and not surfaces or junctions). The 2D+t space is anisotropic (time plays a special role). Not only is the position of the points (boxes) important, but so are their associated moving regions. The positions may also be biased because of occlusions. We explain below the general framework for this problem. Two main issues merit discussion.

(A) Description: the position and velocity of a box are represented as 3D tensors. At the beginning, a box, having no velocity estimate, is encoded by the identity tensor, called the ball tensor in reference to its geometrical representation. If we know the velocity of a box, v = (v_x, v_y, 1) (written in homogeneous coordinates), it is represented by the stick tensor:

$$ v v^T = \begin{pmatrix} v_x^2 & v_x v_y & v_x \\ v_x v_y & v_y^2 & v_y \\ v_x & v_y & 1 \end{pmatrix}, \qquad (1) $$

which has only one non-zero eigenvalue, with v as the associated eigenvector. More generally, any velocity will be encoded by a 3D symmetric positive definite tensor with three eigenvalues λ_max ≥ λ_mid ≥ λ_min ≥ 0, and represented by an ellipsoid. Its "shape" encodes the estimated velocity (as the eigenvector associated with λ_max), the number of neighbors involved in this estimation (λ_max), and their coherence, called the saliency. The saliency is defined by:

$$ S = \lambda_{max} - \lambda_{mid}. \qquad (2) $$

Notice that a ball has zero saliency. As for the size of a box, it is simply described by two scalars.

(B) Communication: we then need to know how these tensors communicate and propagate their information to their neighbors. This is called the voting process. The site (here the box) which propagates its information is the voting token, and the point that receives it, the receiver. The influence of a voting token M ∈ ℜ³ is defined by a tensor field F: D_M ⊂ ℜ³ → ϒ, where D_M is the domain of definition of F, and ϒ is the space of symmetric positive definite tensors. If M′ ∈ D_M is a receiver, then we add to the existing tensor at M′ the tensor F_M(M′). The key point is then the choice of the tensor fields. Their role is to infer the desired continuity properties, such as acceleration consistency. The domain of definition of the tensor fields is oriented along the time direction: a voting token at time t (a box) will not vote for points in the same frame, since these boxes move along time. To define the fields more precisely, two issues have to be considered. The first one is the orientation, which is the velocity inferred by the voting token at a receiver (Figure 1). Two models have been considered: without acceleration, or with a constant acceleration, if we know the velocity of the voting token.

Figure 1. Tensor fields and trajectories (voting token, receiver, inferred velocity). Left: linear trajectory. Right: parabolic trajectory.

The second issue is related to the strength, the weight of a vote. It may depend on the following criteria, which are combined in a multiplicative way: (W-1) the temporal distance between the voting token and the receiver, (W-2) the saliency of the voting token, (W-3) the acceleration in the case of parabolic trajectories, (W-4) the overlap of the region associated with the voting token with the regions around the receiver. Given this general framework, the next section sketches the different steps involved and mentions which criteria are used.
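As an illustration of the tensor encoding above, the following NumPy sketch builds the ball and stick tensors and recovers the velocity direction and saliency of Eqs. (1)-(2) from an accumulated tensor. It is a minimal reconstruction of the description, not the authors' implementation, and the function names are assumptions.

```python
import numpy as np


def ball_tensor() -> np.ndarray:
    """Initial encoding of a box with no velocity estimate (identity tensor)."""
    return np.eye(3)


def stick_tensor(vx: float, vy: float) -> np.ndarray:
    """Stick tensor for the homogeneous velocity v = (vx, vy, 1), Eq. (1)."""
    v = np.array([vx, vy, 1.0])
    return np.outer(v, v)        # rank one: a single non-zero eigenvalue, eigenvector v


def decode(tensor: np.ndarray):
    """Recover (estimated velocity direction, saliency) from an accumulated tensor."""
    eigvals, eigvecs = np.linalg.eigh(tensor)     # eigenvalues in ascending order
    lam_min, lam_mid, lam_max = eigvals
    velocity_dir = eigvecs[:, 2]                  # eigenvector associated with lambda_max
    saliency = lam_max - lam_mid                  # Eq. (2); zero for a ball tensor
    return velocity_dir, saliency


# Example: a receiver starts as a ball tensor and accumulates two compatible stick votes.
T = ball_tensor()
T += stick_tensor(1.0, 0.5)
T += stick_tensor(1.1, 0.45)
direction, S = decode(T)     # direction is roughly aligned with (1, 0.5, 1); S > 0
```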
2.3 2D+t Tensor Voting based approach

We present in Figure 2 an overview of the approach. For clarity, only the centers of the boxes are represented.

Figure 2. Synoptic of the approach: the detected moving regions are turned into the 2D+t representation, then go through tensor-based filtering, velocity estimation, densification and trajectory extraction, producing the tracked regions. The temporal axis is enhanced.

(A) The tensor-based filter. The description of a moving region by the center of its bounding box is compact but very sensitive to noise and occlusions, which may bias the positions (see arrows in Figure 4 (a)). When small boxes appear (possibly coming from a detection error or an occlusion), the idea consists in associating to their center an uncertainty region according to the boxes in the temporal neighborhood (Figure 3). A voting process is then done in these regions, and for each of them, the most salient box is kept. (W-1) and (W-4) are used. This pre-processing is necessary especially when occlusions occur, and may result in an important redefinition of the boxes (Figure 4 (b)). Notice that, hereafter, not only the positions but also the sizes of the boxes are computed, using a scalar vote.

Figure 3. The uncertainty region (defined between time t and time t′).

(B) Velocity estimation. Two steps are involved: create a first approximation using a linear trajectory hypothesis and (W-1)-(W-4); this estimation is then improved by a second vote assuming constant acceleration trajectories and (W-1)-(W-2)-(W-3)-(W-4).

(C) Densification and trajectory extraction. We first vote at every point in the 2D+t domain to obtain a dense map of 3D tensors. Constant acceleration trajectories and (W-1)-(W-2)-(W-3)-(W-4) are used. (W-4) is used only in case of high overlaps; otherwise, we ignore it and privilege the direction (W-3), in order to handle occlusion or disappearance situations. The last step is the trajectory extraction, by "linking" the most salient features. We refer to [17] for more details and mention that simplifications can be made thanks to the anisotropy of the "3D" space that we are considering.
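The paper states that the vote strength combines (W-1)-(W-4) multiplicatively but does not give closed-form weights, so the sketch below fills them in with plausible choices: a Gaussian temporal decay for (W-1) and a bounding-box overlap ratio for (W-4). All function names and parameters are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def temporal_weight(dt: float, sigma_t: float = 2.0) -> float:
    """(W-1) decay with the temporal distance between token and receiver (assumed Gaussian)."""
    return float(np.exp(-(dt ** 2) / (sigma_t ** 2)))


def box_overlap(box_a, box_b) -> float:
    """(W-4) overlap ratio of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    return (ix * iy) / area_a if area_a > 0 else 0.0


def vote_weight(dt: float, saliency: float, accel_penalty: float, overlap: float) -> float:
    """Multiplicative combination of (W-1), (W-2), (W-3) and (W-4)."""
    return temporal_weight(dt) * saliency * accel_penalty * overlap


def cast_vote(receiver_tensor: np.ndarray,
              inferred_velocity: np.ndarray,
              weight: float) -> np.ndarray:
    """Add a weighted stick vote, oriented along the inferred velocity, to the receiver's tensor."""
    # Tokens never vote inside their own frame, so the time component is non-zero.
    v = inferred_velocity / inferred_velocity[2]
    return receiver_tensor + weight * np.outer(v, v)
```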
3 Results
Figure 4 and Figure 5 show the results. More results are available at http://www-sop.inria.fr/robotvis/personnel/pkornp/pkornp-eng.html. Let us give an idea of the actual computational times for each step (experiments are done on a Sun Ultra 1, Model 170) for the sequence described in Figure 5 (55 images of size 160 by 128 pixels). The segmentation gives 115 boxes (23s). The filter is the most time consuming step (292s); the reason is that the votes have to be collected in uncertainty regions that may be large. The velocity estimation is very fast since it is a sparse vote (2s). Finally, we can notice that the densification and the curve extraction steps can be done together quite efficiently (28s).
Figure 4. Running example. (a) Two views of the 2D+t boxes (366). (b) Filtered boxes (288). (c) Inferred trajectories. Top: connected components of the moving regions (represented by their bounding boxes). Middle: illustration of the different steps of the 2D+t approach. Bottom: tracking result.

Figure 5. The "Walking in Sweden" sequence. Reflections on the ground create some instabilities.

4 Conclusion

We have presented an original approach for tracking based on the Tensor Voting methodology. As shown in the examples, this approach permits efficient tracking of the objects, even with strong occlusions or starting with biased detections. Future developments include speed improvement, video-stream analysis, and moving-camera sequences.

References

[1] B. Bascle, P. Bouthemy, R. Deriche and F. Meyer, "Tracking complex primitives in an image sequence", In ICPR-A, pp. 426-431, Oct. 1994.
[2] A. Baumberg and D. Hogg, "An adaptive eigenshape model", Proc. British Machine Vision Conf., Birmingham, Sept. 1995.
[3] F. Brill, T. Olson and C. Tseng, "Event Recognition and Reliability Improvements for the Autonomous Video Surveillance System", Proc. DARPA IUW98, pp. 267-283, 1998.
[4] I. Cox and S. Hingorani, "An efficient implementation of Reid's multiple hypothesis tracking algorithm and its evaluation for the purpose of visual tracking", IEEE Trans. Pattern Anal. Machine Intell., Vol. 18, No. 2, Sept. 1996.
[5] I. Cohen and G. Medioni, "Detecting and tracking moving objects for video surveillance", In CVPR, Vol. 2, pp. 319-325, 1999.
[6] L. Davis, R. Chellappa, A. Rosenfeld, D. Harwood, I. Haritaoglu and R. Cutler, "Visual Surveillance and Monitoring", Proc. DARPA98, pp. 73-76, 1998.
[7] G. Guy and G. Medioni, "Inferring global perceptual contours from local features", Int. J. Computer Vision, Vol. 20, No. 1, pp. 113-133, 1996.
[8] R.J. Howarth and H. Buxton, "Visual Surveillance Monitoring and Watching", In ECCV96, Vol. II, pp. 321-334, 1996.
[9] D.P. Huttenlocher, J.J. Noh and W.J. Rucklidge, "Tracking nonrigid objects in complex scenes", In ICCV, pp. 93-101, 1993.
[10] S. Jehan, E. Debreuve, M. Barlaud and G. Aubert, "Space-time segmentation of moving objects in a sequence of images using active contours", to appear in RFIA 2000.
[11] D. Koller, K. Danilidis and H. Nagel, "Model-based object tracking in monocular image sequences of road traffic scenes", Int. J. Computer Vision, Vol. 10, No. 3, pp. 257-281, 1993.
[12] P. Kornprobst, R. Deriche and G. Aubert, "Image Sequence Analysis via Partial Differential Equations", Journal of Mathematical Imaging and Vision, Vol. 11, No. 1, pp. 5-26, September 1999.
[13] G. Medioni, M.S. Lee and C.K. Tang, "A computational framework for early vision", Elsevier, Dec. 1999.
[14] N. Paragios and R. Deriche, "A PDE-based Level Set Approach for Detection and Tracking of Moving Objects", INRIA Research Report No. 3173, May 1997.
[15] PETS 2000, Proceedings of the First Workshop on Performance Evaluation of Tracking and Surveillance, Grenoble, March 2000.
[16] I.K. Sethi and R.C. Jain, "Finding Trajectories of Feature Points in a Monocular Image Sequence", IEEE PAMI, Vol. 9, No. 1, pp. 56-73, Jan. 1987.
[17] C.-K. Tang and G. Medioni, "Inference of integrated surface, curve, and junction descriptions from sparse 3D data", IEEE PAMI, Vol. 20, No. 11, pp. 1206-1223, 1998.
[18] K. Toyama, J. Krumm, B. Brumitt and B. Meyers, "Wallflower: Principles and Practice of Background Maintenance", In ICCV, pp. 255-261, 1999.